Building EPCC Safe Haven Services 2.0

22 February 2024

In 2023 EPCC began building a new Trusted Research Environment to provide tenants of Safe Havens with more flexibility in the applications they can run on sensitive data. 

SHS 2.0 Service Core

The 2.0 Trusted Research Environment (TRE) service is a key element in delivering Safe Haven services. It differs substantially from the previous version, incorporating significant changes and innovations in TRE service delivery compared with how Safe Havens were delivered during the Covid years.

Users of the old TRE found the application environment too restricted to perform data science, and in Safe Haven 2.0 we have addressed this as follows.

Private Project Zones (PPZs)

Whilst continuity in security controls has remained important, the design of the Safe Haven service (SHS) 2.0 has been informed more by external requirements and trends in data science practice. The new TRE infrastructure is organised as a set of isolated tenant areas known as Private Project Zones (PPZs). A PPZ is a virtual infrastructure built directly on a bare metal server and consists of a virtual firewall appliance and three subnets. The configuration convention is to use one subnet for the provision of engineering data to research projects, one subnet for hosting the research projects, and one subnet for managing connections to resources outside the PPZ. 
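The PPZ layout described above can be sketched as a small data model. This is an illustration only, not EPCC's provisioning code: the class, the field names, the firewall name and the address ranges are all hypothetical, and the only facts taken from the text are that a PPZ has one virtual firewall appliance and three subnets with the roles listed.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network


@dataclass(frozen=True)
class PrivateProjectZone:
    """Illustrative model of one PPZ: a virtual firewall plus three subnets."""

    tenant: str
    firewall: str                 # hypothetical name for the virtual firewall appliance
    data_subnet: IPv4Network      # provision of data to research projects
    project_subnet: IPv4Network   # hosting the research projects
    access_subnet: IPv4Network    # connections to resources outside the PPZ

    def subnets(self):
        """Return the three subnets in a fixed order."""
        return (self.data_subnet, self.project_subnet, self.access_subnet)

    def validate(self):
        """Check that no two of the three subnets overlap."""
        nets = self.subnets()
        for i, first in enumerate(nets):
            for second in nets[i + 1:]:
                if first.overlaps(second):
                    raise ValueError(f"{first} overlaps {second}")
        return True


# Hypothetical tenant and address ranges, chosen only for the example.
ppz = PrivateProjectZone(
    tenant="example-tenant",
    firewall="ppz01-fw",
    data_subnet=IPv4Network("10.10.1.0/24"),
    project_subnet=IPv4Network("10.10.2.0/24"),
    access_subnet=IPv4Network("10.10.3.0/24"),
)
ppz.validate()
```

Keeping the three roles as separate named fields, rather than a generic list of networks, mirrors the configuration convention: each subnet has exactly one purpose.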

The PPZ is a building block for a Safe Haven service within the TRE. Tenants are allocated at least one PPZ to create their Safe Haven service, and service capacity is expanded by adding further PPZs as required. This refactoring of the design is an important step because it gives tenants the option of hosting their service in an entirely private space, the PPZ, in which they can apply their own information governance independently, with no shared physical infrastructure and only minimal shared virtual infrastructure and support services from the TRE. It also creates the potential for tenants to operate cyber security controls on their PPZ’s physical and virtual infrastructure that are more restrictive than those operated for the TRE as a whole.

Shared compute platforms 

The PPZ design is an effective mechanism for isolating tenants, but on its own it rules out the use of compute and storage resources that can only be provided at scale on platforms that are shared and therefore multi-tenant.

The TRE core network connects the shared, TRE-level services and the PPZs, and includes a large-memory HPC system with an integrated, high-performance parallel file system, and a Kubernetes GPU cluster. These shared compute platforms scale well beyond the standard desktop server VMs provisioned in the PPZs. They offer tenants the option of significantly greater compute and data-intensive processing power, with the added risk of using shared services that are outside their PPZ yet still within the TRE.

Managing risk

The segmentation of the TRE into Private Project Zones supports the trend towards more permissive software policies within academic TREs. Data science is essentially a community-powered activity enabled by internet access and shared practices. Researchers are highly dependent on data processing pipelines, analysis workflows, and pre-trained ML models that are built and improved by expert special interest groups in a global community. Sufficiently open access to pull code from community repositories like CRAN and PyPI is now expected, and this additional risk, and each tenant's appetite for it, can be managed to a large degree through the PPZ model. One tenant may have a default open access policy granting all projects access to CRAN and PyPI, whilst another may not permit access to either.
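The per-tenant choice described above amounts to an allowlist decision made at the PPZ boundary. The following is a minimal sketch of that idea under stated assumptions: `TenantRepoPolicy`, the tenant names and the check itself are hypothetical, and in practice such a policy would be enforced by the PPZ firewall or a proxy rather than in application code.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class TenantRepoPolicy:
    """Hypothetical per-tenant allowlist of external repository hosts."""

    tenant: str
    allowed_hosts: set = field(default_factory=set)

    def permits(self, url: str) -> bool:
        """Return True only if the URL's host is on this tenant's allowlist."""
        host = urlparse(url).hostname  # lower-cased by urlparse, None if absent
        return host is not None and host in self.allowed_hosts


# One tenant opens CRAN and PyPI to all of its projects ...
open_tenant = TenantRepoPolicy(
    "tenant-a",
    {"cran.r-project.org", "pypi.org", "files.pythonhosted.org"},
)
# ... while another permits no external repositories at all.
closed_tenant = TenantRepoPolicy("tenant-b")

print(open_tenant.permits("https://pypi.org/simple/numpy/"))
print(closed_tenant.permits("https://pypi.org/simple/numpy/"))
```

Because the default allowlist is empty, a tenant that configures nothing gets the most restrictive behaviour, which matches the idea that openness is an explicit tenant decision.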

Container import approach

The use of containers is another approach to enabling open data science in TREs without the security risks of user access to external community repositories directly from Safe Havens. 

The container import approach is a key innovation of the OpenSAFELY TRE model, which was introduced during the pandemic to simplify access to NHS England patient data. Researchers assemble and package software in containers, and import these into the TRE with the entire pre-built software stack required by the project. This has the benefit of removing software development processes and tools from the TRE, eliminating the need for TRE access to community software repositories. Container import and execution is an important addition to the EPCC TRE in Safe Haven services 2.0.

Benefits of batch job processing

The OpenSAFELY TRE model has other significant benefits for data risk management. It promotes batch job processing over interactive data analysis, which provides more detailed oversight of user activity than is possible in interactive environments. The batch job workflow also gives tenants the option to provide access to sensitive data whilst preventing researchers from ever actually seeing it, which substantially strengthens data privacy. Once again, the use of OpenSAFELY workflows in the TRE, with or without researcher visibility of sensitive data, is an option for individual tenants enabled by the new PPZ model.
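The batch job idea can be illustrated with a short sketch. It is not OpenSAFELY's implementation or API: `run_batch_job`, `AUDIT_LOG`, the job and the sample rows are all hypothetical. It shows only the two properties named above, that every run leaves an audit record and that the researcher receives the job's output rather than the underlying rows.

```python
import datetime
import statistics

AUDIT_LOG = []  # hypothetical in-memory audit trail; a real service would persist this


def run_batch_job(user, job_name, job, sensitive_rows):
    """Run `job` over the data on the researcher's behalf and return only its output."""
    result = job(sensitive_rows)
    AUDIT_LOG.append({
        "user": user,
        "job": job_name,
        "rows_read": len(sensitive_rows),
        "finished": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return result


def mean_age(rows):
    """Example job: reduce the rows to a single aggregate value."""
    return statistics.mean(row["age"] for row in rows)


# Made-up rows standing in for sensitive data held inside the TRE.
rows = [{"id": 1, "age": 34}, {"id": 2, "age": 51}, {"id": 3, "age": 47}]

summary = run_batch_job("researcher-1", "mean_age", mean_age, rows)
print(summary)
print(AUDIT_LOG[-1]["user"], AUDIT_LOG[-1]["rows_read"])
```

Note that the sketch does not stop a job from returning raw rows as its result, so a real workflow would still need some form of output checking before results are released.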

Demand-led service

SHS 2.0 is a transition from Safe Havens as an infrastructure-focused service to a secure data science service that is more open and more application- and data-centric. This is demand-led and reflects a change in priorities that seeks to maximise the value of sensitive data to research whilst retaining a strong security posture for the data that is shared in the TRE.

Author

Donald Scobbie