Thursday, August 5, 2010

Singapore DBS bank services outage

The Straits Times headlines on the investigation into the recent DBS bank outage pin all the blame on two engineers and a faulty cable. The two field support engineers did not replace the cable through the maintenance interface but followed the support center's instructions instead. IBM claimed that if the correct procedure had been used, the system would have automatically taken precautionary steps to preserve full data integrity.

As this is a serious outage that hurts not only the image of DBS but also Singapore's local banking brand as a whole, MAS is taking it seriously and has imposed prompt sanctions. The bank has been asked to set aside a further SGD 230 million in regulatory capital for operational risk. That is great. Nothing is more important to a bank than money, so a monetary penalty is an effective approach. It also sends a clear signal to the rest of the banks in Singapore to look into the same issue. I am sure it is not an issue confined to DBS.

However, I feel sorry for the engineers involved. They did what most good field support engineers would do. They took logical steps to diagnose the issue, divided and conquered the problems, escalated to their supervisors at the right time when they couldn't solve them, and made an appropriate suggestion to change the cable in the wee hours. For goodness' sake, they performed the cable change solely on the instructions of the support center. Telling them "it's all your fault because you didn't use the right procedure to change the cable", without understanding the root cause, is cruel.

You can't blame the faulty cable either. We all know that electronic products have a limited lifespan. They will fail sooner or later, won't they?

Then why does mainstream media such as The Straits Times point all its fingers at "two engineers and a cable"? The report jointly issued by DBS and IBM included a section on corrective actions. IBM went further and mentioned that it has disciplined the personnel directly involved. I assume they are the field support engineer and his counterpart in the IBM regional support center who gave him the instructions.

I feel the two engineers are the victims of a bigger issue within IBM: the support teams lack training on up-to-date recovery procedures, and the change management in place is not sufficient to avoid risk. For example, why did the regional support center fail to give the right procedure in the first place? Why didn't IBM dispatch two field support engineers, instead of one, to the data center to investigate (or did they)? Given the criticality of the data infrastructure, the fundamental question for IBM is why a scenario as simple as a cable fault was not covered by its quality control policies in the first place.

MAS has escalated the issue to a "lack of robust technology risk management framework", "not sufficient oversight of maintenance, functional and operation practices" and "not following MAS guidelines". However, doesn't this issue also show that MAS itself has a problem in its own monitoring and control?

Engineers: always the first to be blamed and the last to be appreciated.
