Edinburgh International Data Facility: open for public data

12 September 2024

Operated by EPCC, the Edinburgh International Data Facility (EIDF) provides a comprehensive set of data services through a large private cloud service. The EIDF team has now launched a suite of services to support collaborative data science.

Simple Storage Service 

EIDF now supports Simple Storage Service (S3) [1], which makes it easy to bring data into EIDF and take data out again. S3 also gives EIDF users a simple protocol for moving data between its heterogeneous compute services.

For example, you can fill an S3 bucket from another cloud provider, process data from that bucket in a Jupyter Notebook, and then run an intensive GPU-accelerated machine learning task over the same data.

Sharing data  

You can now share data with other researchers and innovators as part of applying for a project on the EIDF Portal [2]. The EIDF S3 service is a critical component of this, as donated data sets will be made available as S3 buckets. Researchers can discover your data sets in the newly launched EIDF Data Catalogue [3].

Data donation 

The first data donation came from Prof. Henry Thompson of the University of Edinburgh School of Informatics. Prof. Thompson has augmented the Common Crawl dataset by adding timestamps to its index files.

Common Crawl is a multi-petabyte longitudinal dataset containing over 100 billion web pages. It is widely used as a source of language data for sequence model training and in web science research.

Easy access to donated data 

Making use of donated data is as easy as referencing its S3 bucket in a compute service, either on EIDF or on another system. For example, suppose we want to see a description and distribution of certain variables in Henry Thompson’s data. We can use the newly launched Jupyter Notebook service [4] in a web browser: executing a few lines of Python gives descriptive statistics and a plot of the distribution of one of the variables.
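The kind of summary described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook shown in the screencast: the function names describe and histogram are hypothetical, the sample values are made up rather than read from the donated data, and a text histogram stands in for the plot so that only the standard library is needed.

```python
import statistics
from collections import Counter


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def histogram(values, bin_width):
    """Count how many values fall into each bin of width bin_width."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up placeholder values, NOT taken from the donated data set.
sample = [3, 7, 8, 12, 15, 15, 21, 28]
width = 10

summary = describe(sample)
bins = histogram(sample, width)

print(summary)
for start, n in bins.items():
    # One '#' per value: a text-only stand-in for the plot.
    print(f"{start:>3} to {start + width - 1:<3} {'#' * n}")
```

In a real notebook, the sample list would be replaced by a column read from the S3 bucket, and a plotting library would typically draw the distribution.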

EIDF Notebook does not require you to log in to a terminal or virtual desktop. For a demo of this Jupyter Notebook on EIDF, see the screencast [5]. If you want to access the data yourself, all you need is the data set's unique EIDF S3 URL [6], which is listed on its EIDF Catalogue page [7].
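S3 client tools usually ask for an endpoint and a bucket name rather than a single URL. The sketch below splits the S3 URL [6] into those two parts using only the Python standard library. It assumes the URL uses path-style addressing, with the bucket as the first path segment; the helper name split_s3_url is hypothetical.

```python
from urllib.parse import urlsplit


def split_s3_url(url):
    """Split a path-style S3 URL into (endpoint, bucket, key_prefix)."""
    parts = urlsplit(url)
    endpoint = f"{parts.scheme}://{parts.netloc}"
    # Assumption: the first path segment is the bucket name and
    # anything after it is a key prefix inside that bucket.
    bucket, _, prefix = parts.path.lstrip("/").partition("/")
    return endpoint, bucket, prefix


endpoint, bucket, prefix = split_s3_url(
    "https://s3.eidf.ac.uk/eidf125-cc-main-2019-35-augmented-index"
)
print(endpoint)
print(bucket)
```

The endpoint can then be passed to an S3 client, for example as the endpoint_url argument of boto3.client("s3", ...), together with the bucket name. Whether credentials are needed for a given bucket is not covered here; check its Catalogue page.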

Supporting collaboration 

Next, EIDF will support greater collaboration. We will launch an EIDF GitLab, which will enable teams to develop code and documentation together and then access them from EIDF compute services such as Jupyter Notebooks.

To explore how EIDF can support your research, email us at: eidf@epcc.ed.ac.uk

Links

[1] https://s3.eidf.ac.uk 

[2] https://portal.eidf.ac.uk 

[3] https://catalogue.eidf.ac.uk

[4] https://notebook.eidf.ac.uk

[5] https://media.ed.ac.uk/media/t/1_o5psqjph

[6] https://s3.eidf.ac.uk/eidf125-cc-main-2019-35-augmented-index 

[7] https://catalogue.eidf.ac.uk/dataset/eidf125-common-crawl-url-index-for-august-2019-with-last-modified-timestamps/resource/7e485f0c-

Edinburgh International Data Facility website

Data-Driven Innovation Initiative

Author

Dr Jano van Hemert