Powering down computer rooms at the ACF

18 June 2024

The Advanced Computing Facility recently underwent a week of major works to ensure it continues to operate at peak efficiency. Calum Muir, HPC Systems Data Manager, describes the careful planning required by this complex project.

Above: Inspection, testing and cleaning of Plant Room B Electrical Section Board.

Critical data centres rely on their Mechanical and Electrical (M&E) infrastructure to support their computer systems. Regular inspection, testing and maintenance of this infrastructure are required to ensure the highest level of system resilience is provided.

However, finding a suitable time slot to carry out this level of maintenance can be difficult, or almost impossible, because most of these Distributed Control Systems (DCS) are in constant use.

The Advanced Computing Facility (ACF) is the high performance computing data centre of EPCC. When recently carrying out maintenance on one of the ACF site's Uninterruptible Power Supply (UPS) systems, we discovered that due to its age and lack of necessary spare parts we would need to replace it. Because this UPS supports most of the mechanical services providing cooling to two of our computer rooms, there was no way to carry out the replacement without closing down both of these rooms and all of the computer services they contain. 

Although it is not overly difficult to close down all of these mechanical and electrical systems, it is complex and time consuming for EPCC staff to liaise with all the different stakeholders who use the computer systems to enable them to close down simultaneously.

Above: High voltage specialist removing an 11,000 volt circuit breaker for inspection and maintenance.

Preparing for shutdown

Some months ago EPCC staff started discussions to agree a suitable time to close down all systems in Computer Rooms 1 and 2, which was required to provide the necessary four days of access to replace the ageing UPS equipment. 

ACF staff commenced work with our University of Edinburgh Estates colleagues and design consultants to design, cost and award a contract to carry out the UPS replacement project. We also looked to update the UPS capability by selecting a modern modular UPS system, which would provide a more operationally flexible system while also improving energy efficiency.

Above: Copper busbars inside section board being re-torqued to ensure suitably tight electrical connections.

Because a large section of the ACF building was to be completely isolated, we suggested to University of Edinburgh Estates that this would be a golden chance for them to inspect and maintain most, if not all, of the mechanical and electrical equipment in Plant Rooms A and B. There are also statutory and regulatory reasons to inspect and maintain much of this equipment.

Initial discussions started at the end of 2023 to plan all of the above, with discussions starting in earnest early in 2024 with stakeholders, University of Edinburgh Estates, contractors and designers to plan the requirements and decide on a suitable four-day window to carry out this complex and challenging project. 

A date was then set to commence works in April 2024.

Above: Electrical section board showing 3200 amp Air Circuit Breakers (ACBs) padlocked off for safety ahead of planned in-depth maintenance and repair.

Task list

While EPCC was in discussion with the computer system stakeholders about requirements to bring down the many systems in the computer rooms, I commenced a review of what planned maintenance could be fitted into this four-day window while the UPS was being removed, rewired and replaced. The list became quite extensive, with main tasks including: 

  • Opening up all main electrical section boards, Power Distribution Units (PDUs), and distribution boards to inspect, clean, test, remove foreign objects and re-torque bus-bar and cable connections. This is required as any loose connection is a potential fire hazard and/or cause of loss of output power supply, and objects inside these enclosures can present great danger to operators.

Above: Computer Room Electrical Distribution Boards being opened up for inspection, testing and cleaning.

  • Full maintenance of section board Air Circuit Breakers (ACBs), including outstanding repairs and replacement of any components showing issues due to their age.
  • Replacement of all section board timer controls and voltage control relays. These are all constantly powered units which deteriorate over time due to internal heat and cannot be replaced without full power down and associated loss of output power. Failure of one of these units can cause a full power loss of the section board.

Above: Computer Room Power Distribution Boards being opened up for inspection, testing and cleaning.

  • Extensive deep clean and maintenance of Plant A UPS, which cannot be done fully unless the units are completely de-powered. This work removes internal dust and dirt, which improves the resilience of the system. It also makes switching operations much safer because it reduces the risk of arc flash.
  • Once the above tasks were completed and before power was restored to the Computer Rooms, the stand-by generators, UPS systems, ACBs, and the interlocks for these systems that control generator operation could be fully tested and proven in a simulated mains power failure for Plant Rooms A and B. It was reassuring to carry out this testing and prove the correct and safe operation of these complex and critical systems.
  • While all of the above was being carried out, the power down also allowed our Estates colleagues to carry out the necessary Electrical Installation Condition Report (EICR) inspection and testing of the fixed electrical installation at this part of the site. This is a regulatory requirement for electrical installations and is known to be difficult to carry out in critical buildings such as data centres because of the limited opportunity to disconnect the power supplies.

Above: Electrical Section Board showing isolators padlocked off for safety.

Temporary power supplies

Arranging, programming and carrying out all of the above maintenance was always going to be a difficult project for all concerned. But it was made even trickier because several of the power supplies which would be temporarily lost served not only the two computer rooms, but also two very important telecommunication rooms, the switches which connect the whole site to the outside world, the site security systems, and the staff office accommodation.

A separate preliminary project was therefore carried out to install several large temporary power supplies for all of these areas and connect them, while ensuring that no systems were lost during the power swaps ahead of the main project. Close liaison and work with our main electrical contractor ensured this project was successful.

Above: New UPS batteries and DC isolator.

Successful conclusion

Now that this week of power downs, UPS installation, and planned work is over and the previous months of preparation are in the past, it is good to reflect on this very successful project.

The ACF building is now in a more assured status in terms of M&E system operation; we have a new, more efficient and flexible UPS serving Plant B; we have completed our periodic regulatory compliance checks for these areas of the site, and our ACF colleagues not only managed to work throughout the disruption, but also carried out some great work in the computer rooms in preparation for future developments.

Above: Inside view of Building Management System (BMS) control panel showing outstations which control automation of mechanical and electrical plant.

This multi-faceted project was complex, time-consuming and in some parts difficult to achieve, and was only successfully accomplished through detailed forward planning, liaison between all stakeholders and contractors before and during the works, and – most importantly – because everyone involved pulled together, worked long hours and applied a can-do attitude throughout. 

Thank you all!

Further information

Advanced Computing Facility: 

https://www.epcc.ed.ac.uk/hpc-services/advanced-computing-facility

Author

Calum D. Muir