Business continuity is about disaster-proofing the business to ensure it keeps running 24x7. To adhere to regulations like Basel II or new standards like BS25999, financial institutions have to prove they have robust best practices in place.
High-profile disasters like the 7 July London bombings and the events of 11 September, or potential epidemics like SARS or Bird Flu, have brought business continuity out of a dark corner and into the boardroom. But it’s not just big headline problems that affect the ability of a firm to run its business: the best-laid plans can also be disrupted by events that seem benign at first, like the 2012 Olympic Games, which are effectively capping electricity supplies to Canary Wharf and limiting the data centre expansion much needed in the financial sector.
Business continuity, simply, is the ability to roll with the punches and stay up and running no matter what. Financial firms are well aware of the threat to their business caused by an outage – client trust in the reputation, brand and business of a financial institution can waver. HSBC is one bank that has had recent experiences of outages.
Since the beginning of the year, HSBC’s Secure ePayments service has gone down three times – a fairly significant outage in January followed by more significant outages in March and the beginning of April. Merchants clamoured for compensation because they couldn’t process their payments, thereby losing out on business themselves. HSBC’s UK press office had not responded to Banking Technology’s questions regarding the outages or its business continuity plans by the time we closed for press.
To its corporate customers, HSBC appeared to be without a contingency plan. Said one client: “One doesn’t expect a major international bank to be in a position where it has no continuity arrangements in place such that, whatever it is that goes wrong, it doesn’t take the bank out of business for 48 to 72 hours. Also it shouldn’t put its customer service in the position where they can only say ‘we also don’t know what is going on – keep trying every 15 minutes’. If you run an e-commerce service, you have in place a rollover so that if your main system goes down your backup comes online – that is a pretty basic part of business continuity planning.”
David Porter, head of security and risk at technology consultancy firm Detica, dismisses the idea that a modern bank would be without a business continuity plan, but argues that the bigger issue is probably the question of when that plan was last dusted off, refreshed and simulated.
“In the old days you put your BCP in place and then you could all go down the pub and say job done,” says Porter. “But now banks need to dust off their plans and really re-assess them in light of today’s risks. Ten years ago, being deprived of the internet for 24 hours across all employees may not have been such a big deal, but I wonder how the average organisation today would cope if their email or internet access went down even for a few hours.”
Porter points out that the way data gets linked together – the soft human and also the hard data links – means that BCP practitioners should keep in mind that very small changes in one part of this massively linked network can have sudden and unforeseen implications for another, seemingly unconnected, part. He uses the example of the Buncefield oil depot explosion in the UK, which led to a number of employees at various companies not getting their 2005 Christmas payroll – all because of an unforeseen series of links between the explosion and a computer system nearby.
To cope with disasters hitting a specific location, most financial institutions’ best practice has been to move from locally oriented concepts, like mirroring data across distances of 10 or 30 kilometres, to more sophisticated schemes of having a third data centre in a different country or even a different continent. “Typically customers, large banking or financial institutions, would have a dual site setup where they do synchronous mirroring across mostly fibre-optic links within distances up to 30 kilometres,” says Matthias Werner, secretary and co-chair, events committee at the Storage Networking Industry Association Europe.
“In order to comply with the need for extended-distance backup or disaster recovery sites, most of these customers would have a third site where they do asynchronous copy of most or all of their data thousands of kilometres away. To do asynchronous copying, you don’t have any physical limitations because typically these are remote sites and they would lag a couple of seconds or even minutes behind real-time data centres.” Werner believes that the three-site concept – having two synchronous sites and one asynchronous remote site – is cutting-edge technology.
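The three-site arrangement Werner describes can be illustrated with a small model. The sketch below is not any vendor's implementation: the class and function names are hypothetical, sites are reduced to in-memory lists, and the figure of roughly 200,000 kilometres per second for signals in optical fibre is a rule-of-thumb assumption. It shows why a synchronous write has to wait for the nearby mirror, while the remote copy is allowed to lag.

```python
from collections import deque

# Assumption: signals in optical fibre travel at roughly 200,000 km/s.
FIBRE_KM_PER_SECOND = 200_000


def round_trip_ms(distance_km):
    """Approximate fibre round-trip delay in milliseconds for one link."""
    return 2 * distance_km / FIBRE_KM_PER_SECOND * 1000


class Site:
    """A data centre reduced to an ordered log of writes (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.log = []


class ThreeSiteStore:
    """Two synchronous sites plus one asynchronous remote site."""

    def __init__(self):
        self.primary = Site("primary")
        self.mirror = Site("mirror")   # nearby, synchronous
        self.remote = Site("remote")   # distant, asynchronous
        self._pending = deque()        # writes not yet shipped to remote

    def write(self, record):
        # Synchronous mirroring: the write is only acknowledged once
        # both nearby sites hold the record.
        self.primary.log.append(record)
        self.mirror.log.append(record)
        # Asynchronous copy: the record is queued and shipped later,
        # so the caller does not wait for the long-distance link.
        self._pending.append(record)
        return True

    def ship(self, max_records=None):
        """Send up to max_records queued writes to the remote site."""
        shipped = 0
        while self._pending and (max_records is None or shipped < max_records):
            self.remote.log.append(self._pending.popleft())
            shipped += 1
        return shipped

    def remote_lag(self):
        """Number of writes the remote site is behind the primary."""
        return len(self._pending)
```

Under that fibre-speed assumption, a 30-kilometre link adds a round trip of about 0.3 milliseconds to every synchronous write, while a 3,000-kilometre link would add about 30 milliseconds. That is the practical reason the distant site is fed asynchronously and allowed to trail the live data centres.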
Tim Furmidge, head of products in BT’s financial services group, has a different take on the solution. “What firms are doing is distributing their systems across the main trading floor and perhaps a backup data centre or a disaster recovery location – but the secondary or tertiary system isn’t a separate lights-out system that they are waiting to turn on if they need it. It is operating day in and day out and effectively it is a part of the day-to-day operational platform,” he says.
“If there is a flood in the trading building, the equipment that is deployed in a remote data centre carries on taking the full load instead of running a partial load. And if the traders can’t get into the trading floor they can relocate to alternative trading facilities, either regular office buildings or purpose-built ATF floors, and then they can connect into the systems from there.”
BT’s ITS voice trading system can split the physical and voice communication service over multiple data centres; it allows traders to connect to their turrets over the network so that they can connect in from a remote alternative trading location or come in over a web browser from home. And with BT’s Radianz shared market infrastructure, many firms have dual connections coming into their main trading building to their data centre, so if they lose their main building they can very easily switch their market data services down electronic feeds to alternative locations.
GoldenGate Software promotes a dual online approach and focuses solely on the continuous availability of data because, as Sami Akbay, vice president of marketing and product management, points out: “Data is somewhat unique in the sense that unlike hardware, software, wires, and cables, once you lose data you cannot really replace it. You can buy new servers, cables, racks, and all that stuff but if you have lost the data you are in deep trouble.”
With GoldenGate’s Active/Active service, both the main and the backup systems process transactions. If one becomes unavailable, the other seamlessly takes over; when the primary system comes back online, the workload is redistributed without any transaction loss (see box).
These best practices and more have been codified in a new British standard, BS25999, for which a certification service was launched last October. This allows firms to prove that they are following best practices, something for which it was previously difficult to show hard evidence. But if the standard is to succeed, it has to be a generic standard that the small guy can cope with as well as the big guys. Many of the top financial and banking organisations, however, believe that they are already equal to, or in advance of, the standards set out by BS25999.
Mike Osborne, managing director of ICM Business Continuity Services, says that these financial institutions will look at the standard as a supply chain management tool. “Most banks think they are better than that, but they are looking at their supply chain – the firms that dovetail into their technology solutions in terms of information feeds, service providers, etc. Where BS25999 will have its part to play in the banking sector is the way in which the smaller organisations are asked to comply with BS25999. I personally believe that if they don’t, the banks will say, ‘I am sorry, we are not going to renew our relationship with you because you represent too high a risk.’”
Swedbank takes Active/Active approach
Nordic retail bank Swedbank processes electronic payment requests for a number of Swedish and international banks, as well as ATM transactions and payment requests for its own customers. With its growing international presence, the bank now processes more than one billion transactions per year.
Swedbank has been a long-time user of ACI’s Base24 application running on HP NonStop servers. Initially, its business continuity plan involved operating a “hot” backup site for testing and for failover in the event of an unplanned primary system failure. However, as it continued its global expansion, the time that it took to fail over to the backup system for both planned and unplanned outages barred Swedbank from achieving true 24x7 availability for its customers. The bank realised that any type of outage has an impact on customer satisfaction and loyalty, which ultimately can affect the bank’s revenue.
In 2006 Swedbank decided to implement an Active/Active configuration with GoldenGate’s High Availability solution. “Having evaluated various data migration and availability solutions, we decided to deploy GoldenGate because of its interoperability with the ACI Base24 application, and because the solution had already been proven elsewhere at Swedbank where it was deployed by other departments,” says Magnus Kleveby, systems area manager for authorisation processing at Swedbank.
The Swedbank Active/Active system runs on two HP NonStop server nodes that are separated geographically, in line with best practice for disaster tolerance. Both databases are active and process different transactions against their own copies of the Base24 application database. Additionally, transactions are split between the databases to provide load balancing. In the event that one database fails or must be taken offline for planned hardware or software maintenance, upgrades, or migrations, all transactions are simply routed to the surviving node for processing. Thus planned downtime is eliminated, and recovery from a sudden failure takes only seconds.
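The routing behaviour described above can be sketched in a few lines. This is a simplified illustration rather than Swedbank’s or GoldenGate’s actual design: the class names are hypothetical, load balancing is assumed to be simple round-robin, and the replication that keeps the two database copies in step is left out entirely.

```python
class Node:
    """One active processing node (hypothetical stand-in for a server)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.processed = []


class ActiveActiveRouter:
    """Spreads transactions over all online nodes, skipping failed ones."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # round-robin counter (assumed balancing policy)

    def route(self, txn):
        # Only nodes that are currently online may receive work.
        live = [n for n in self.nodes if n.online]
        if not live:
            raise RuntimeError("no node available")
        node = live[self._next % len(live)]
        self._next += 1
        node.processed.append(txn)
        return node.name


# Usage: normal running, a node taken offline, then returned to service.
a, b = Node("A"), Node("B")
router = ActiveActiveRouter([a, b])
normal = [router.route(t) for t in range(4)]      # load shared by A and B
a.online = False                                  # outage or maintenance
degraded = [router.route(t) for t in range(4, 6)] # all traffic goes to B
a.online = True                                   # A rejoins
restored = [router.route(t) for t in range(6, 8)] # load is shared again
```

Because no node sits idle waiting to be switched on, taking one offline changes only where traffic is sent. In a real deployment the two database copies would also have to be kept synchronised in both directions, which is the part this sketch omits.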
“In the event of an unexpected outage, we can restore data within seconds and the system can cope with sudden peaks in demand, such as at the end of the month when most people do a lot of shopping or go online to pay bills,” says Kleveby. “GoldenGate has given us the assurance we are looking for and we can maintain our level of customer service no matter what.”
Swedbank’s Active/Active configuration was also leveraged during a migration across the HP NonStop environment when moving to the new HP Integrity platform. By taking down one server at a time, upgrading it, and then returning it to service, this major upgrade was achieved with no application downtime.