Winchester Exchange Outage
Incident Report for M12 Solutions and Giganet
Postmortem

Firstly, I would like to apologise to all customers affected by the incident we experienced yesterday morning in our equipment rack at the Winchester exchange. All Giganet Local connectivity services were temporarily and unexpectedly taken offline due to a fault within our rack.

This is the first such outage we have experienced in any exchange since we began offering Giganet Local services just over a year ago.

It is also the first issue caused by a loss of power anywhere within our core network since we started running our own network!

Lessons will be learnt from this incident, and we hope never to experience this kind of problem again. However, as much as I would like to guarantee no further outages across any of our services, it would be naïve of me to make such a promise, as sh*t happens!

Tl;dr

Our UPS battery backup system failed with an ‘inverter error’. The UPS protects and powers our networking equipment in the exchange in the event of a mains power failure, before the exchange’s generators kick in. While our on-site engineer was restoring services, and after the UPS had been re-attached to the mains supply, the power to our equipment failed again. This caused our network switch to partially lose its config, as it was in the middle of starting up. We then had to rebuild the config and firmware on this switch to bring services back online, which took extra time; otherwise services could have been restored sooner. The UPS has now been removed and bypassed, and our equipment in the exchange is currently ‘at-risk’ of a power outage until we replace the UPS, which we plan to do next week. We will escalate to the UPS manufacturer to find out why this happened.

Timeline

The timeline of events is broadly covered by our incident page; however, I will supplement it here:

1. 09:45 - Outage detected by our core monitoring system, which reported that our Winchester exchange equipment was offline.
2. 09:50 - Internal technical teams discuss the course of action for the outage.
3. 10:00 - Engineer dispatched to the exchange to investigate on-site.
4. 10:30 - Engineer arrives at the exchange.
5. 10:35 - Engineer finds no power to the equipment in our equipment rack.
6. 10:35 - Engineer troubleshoots why there is no power.
7. 10:40 - The Uninterruptible Power Supply (UPS) is found to be faulty and displays an “Inverter Error” code. It is not providing power to any downstream equipment.
8. 10:45 - UPS is bypassed and equipment is connected directly to the exchange’s generator-protected supply via our rack’s 6-way gang mains sockets. However, no power is available on this gang either.
9. 10:55 - Equipment is then connected to a test socket in our rack, which is powered.
10. 10:57 - The UPS is restarted (following the UPS’s instructions for this error code) and connected to the test socket. The RCD for this test socket trips and power is lost.
11. 11:00 - We assume the UPS must be faulty and caused this outage in the first place, so we remove and isolate it. The RCD is reset and the equipment is powered on once more. However, our main switch doesn’t re-establish its connection to our network.
12. 11:10 - We call BT support to troubleshoot there being no power on the main power sockets in our rack. They advise that a fuse may have blown. We replace the fuse and bring these sockets back online.
13. 11:15 - Work continues to re-establish connectivity between our network and the network switch that provides downstream and upstream connectivity to customers. Because this switch suffered a second unexpected loss of mains power whilst it was booting up with the UPS bypassed, its configuration is corrupted and needs reloading. Our engineer proceeds to reload the firmware and a backup config onto the switch.
14. 11:50 - The network switch is reloaded and operational. All downstream customers are back online.

Duration of outage

09:45 - 11:50.

2 hours and 5 minutes.
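
For anyone wishing to check the figure, the duration follows directly from the detection and restoration times in the timeline. The snippet below is only a small sketch of that arithmetic; the variable and function names are illustrative, and the split into "time to site" and "on-site repair" is my own way of slicing the timeline rather than an official metric.

```python
from datetime import datetime, timedelta

FMT = "%H:%M"

# Times taken from the timeline (all on 18 January 2019).
detected = datetime.strptime("09:45", FMT)          # outage detected by monitoring
engineer_on_site = datetime.strptime("10:30", FMT)  # engineer arrives at the exchange
restored = datetime.strptime("11:50", FMT)          # switch reloaded, customers online


def minutes(delta: timedelta) -> int:
    """Whole minutes in a timedelta."""
    return int(delta.total_seconds() // 60)


outage = restored - detected
time_to_site = engineer_on_site - detected
on_site_repair = restored - engineer_on_site

print(f"Total outage:   {minutes(outage)} min")
print(f"Time to site:   {minutes(time_to_site)} min")
print(f"On-site repair: {minutes(on_site_repair)} min")
```

The total comes to 125 minutes, i.e. 2 hours and 5 minutes, of which 45 minutes passed before the engineer was on-site.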

Reason for outage

The UPS battery backup system, which protects our equipment in the exchange in the event of a mains power failure, brownout or surge, failed with an “Inverter Error” code.

This caused a loss of mains power to all our equipment, so any customer connections routed via this equipment lost connectivity.

The UPS also blew a fuse in the electrical distribution gear in our rack, which is the distribution that feeds the UPS.

When we attempted to restart the UPS, following the instructions in the UPS manual for this type of fault, it tripped the power again.

We then isolated and bypassed the UPS entirely.

The second outage, caused by the UPS tripping the breaker, appears to have corrupted the network switch’s configuration/firmware file. This extended the time taken to restore service.

'At-risk'

Our Winchester Exchange Point of Presence (POP) is 'at-risk' because the equipment is currently bypassing the UPS and so has no battery backup.

The Winchester Exchange and our equipment rack are protected by a generator in case of extended local power loss. However, during any interim period between the power loss and the generator starting up and providing power, our equipment will lose power and stop functioning.
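
To illustrate why the rack is 'at-risk', the reasoning can be sketched as a simple comparison between how long a battery can carry the load and how long the generator takes to start. This is only an illustration: the function name and both timing figures below are hypothetical and are not measurements from the Winchester exchange or from our UPS.

```python
def equipment_stays_up(ups_runtime_s: float, generator_start_delay_s: float) -> bool:
    """True if battery backup can bridge the gap until the generator supplies power."""
    return ups_runtime_s >= generator_start_delay_s


# Hypothetical figures, for illustration only.
GENERATOR_START_DELAY_S = 30.0  # assumed gap between mains loss and generator power
UPS_RUNTIME_S = 600.0           # assumed battery runtime with a healthy UPS fitted
BYPASSED_RUNTIME_S = 0.0        # no UPS in circuit, so no battery runtime at all

with_ups = equipment_stays_up(UPS_RUNTIME_S, GENERATOR_START_DELAY_S)
bypassed = equipment_stays_up(BYPASSED_RUNTIME_S, GENERATOR_START_DELAY_S)

print(f"With UPS fitted: stays up = {with_ups}")
print(f"UPS bypassed:    stays up = {bypassed}")
```

With the UPS bypassed there is no battery runtime, so any gap before the generator takes over, however short, would drop the equipment.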

We will announce the maintenance activity for installing the replacement UPS via this Status Page once it has been planned.

Follow up actions

We shall be ordering a replacement UPS, to be installed next week, and escalating the failure of the original UPS to the manufacturer for further comment.

What can customers do for added resilience?

Although I would never wish for there to be any problems with our services, a fault-free service is simply a utopia, for us as for any other provider, as nothing can be guaranteed to be fault-free. We do our best, however, to ensure that our services are as resilient as possible and to resolve issues quickly when they do occur. We back this up with open and transparent updates to customers, like this one.

That being said, there are ways for customers to increase the resilience of their connectivity at times such as this. We can provide a backup broadband or leased line service alongside any of our Giganet Local services, and such a backup would have been unaffected during this incident. With a Giganet fully managed backup service, we can automatically fail over your connectivity in the event of an issue with the primary service, whilst retaining your IP addresses. When the primary connectivity is restored, we then automatically fail the connectivity back. Please speak to your account manager for more information if you do not currently have a managed backup broadband connection.
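
For the technically curious, the failover and failback behaviour described above can be sketched as a small state machine driven by health checks on the primary service. This is not our production implementation: the class name, the thresholds and the idea of counting consecutive checks are all assumptions made for illustration, and the retention of IP addresses is not modelled here.

```python
from dataclasses import dataclass

PRIMARY = "primary"
BACKUP = "backup"


@dataclass
class FailoverController:
    """Hypothetical controller that decides which link carries traffic."""
    fail_after: int = 3      # assumed: consecutive failed checks before failing over
    recover_after: int = 3   # assumed: consecutive good checks before failing back
    active: str = PRIMARY
    _streak: int = 0         # consecutive checks that argue for switching links

    def update(self, primary_healthy: bool) -> str:
        """Feed in one health-check result and return the link now in use."""
        if self.active == PRIMARY:
            # Count consecutive failures; a good check resets the count.
            self._streak = 0 if primary_healthy else self._streak + 1
            if self._streak >= self.fail_after:
                self.active, self._streak = BACKUP, 0
        else:
            # On backup: count consecutive good checks before failing back.
            self._streak = self._streak + 1 if primary_healthy else 0
            if self._streak >= self.recover_after:
                self.active, self._streak = PRIMARY, 0
        return self.active


controller = FailoverController(fail_after=2, recover_after=2)
# Primary up, then down for two checks (outage), then up again for two checks.
checks = [True, False, False, True, True]
history = [controller.update(result) for result in checks]
print(history)
```

Requiring several consecutive results before switching in either direction is one common way to avoid flapping between links when the primary is only briefly unstable.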

Matthew Skipsey, CTO. 19/01/19.

Posted 5 months ago. Jan 19, 2019 - 11:22 GMT

Resolved
This incident has been resolved.
Posted 5 months ago. Jan 18, 2019 - 14:39 GMT
Monitoring
The fix was implemented at 11:54 and all affected customers should have had their connectivity restored from this time.

We are currently monitoring the situation.

A reason for outage will be provided in due course.
Posted 5 months ago. Jan 18, 2019 - 12:33 GMT
Update
We apologise for the continuing disruption. It is taking slightly longer to restore the service.
Posted 5 months ago. Jan 18, 2019 - 11:30 GMT
Identified
Our engineer has arrived at the exchange and is in the process of resolving the problem.

We anticipate service being restored within the next 15 minutes.
Posted 5 months ago. Jan 18, 2019 - 10:52 GMT
Investigating
We are currently investigating a problem affecting our Winchester Exchange point of presence (POP) since 09.45.
Any customers whose Internet circuits route via this POP will be affected.

We suspect this to be power-related, as we have also lost remote management of our equipment, which routes via a diverse circuit.

A Giganet engineer has been dispatched to the exchange for further investigations.
His current ETA is 10.30.

We shall provide further updates as we learn more.

We apologise for any inconvenience this may cause.
Posted 5 months ago. Jan 18, 2019 - 10:00 GMT
This incident affected: M12 Giganet - Data Centres & Points of Presence (Giganet Local - Winchester Exchange).