“Microsoft Azure has always been committed to providing highly reliable, available, and recoverable services to our customers. This customer-focused passion is reflected at every level in the Microsoft Azure Business Continuity Management (BCM) program. At the direction of the Microsoft board of directors, in 2007 we established a comprehensive risk program including the BCM program to address assured recoverability of services for our customers. For today’s post in our Advancing Reliability series, I’ve asked Robert Arco, our Senior Program Manager overseeing this program, to explain how we approach business continuity management in Azure, and how we’re continuing to improve the program as our platform evolves.”—Mark Russinovich, CTO, Azure
How we define a “service” for our BCM program
If you ask three people what a service is, you may get three different answers. At Microsoft, we define a service (business process or technology) as a means of delivering value to customers (first- or third-party) by facilitating outcomes customers want to achieve.
To ensure the highest level of resiliency for each of our “services,” we include the following in scope:
- People: The people who are responsible for providing the service.
- Process: The methodology used to provide the service.
- Technology: The tools used to deliver the service or the technology itself delivered as the value.
Customers see our services as product offerings that are composed of various bundled services. Each individual service is mapped in our inventory and run through the BCM program to ensure that the people, processes, and technologies for those services are resilient to a variety of failures.
Our end-to-end program identifies, prioritizes, maps, and tests every service, which goes beyond “box checking” compliance. Our focus is a broad understanding of how to provide the best service to customers who demand reliable service offerings for their business.
How the BCM program is managed in practice
Through a sophisticated set of tooling, every service (both internal and external facing) is uniquely mapped and shared with a chain of compliance tools addressing privacy, security, BCM, and more. This ensures that every service, regardless of type or criticality, carries shareable metadata for other tools.
In the context of this post, records are automatically ported to our BCM management tool. Once there, they are automatically scoped for disaster recovery (DR) requirements that meet certifiable standards and deliver on our customer promises. These records contain the most familiar elements of a BCM program, including business impact analysis, dependencies, workforce, suppliers, recovery plans, and tests. In addition, they provide insight into potential customer impacts, detection capabilities, and willingness to fail over.
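Purely as an illustration, the record elements listed above could be modeled as a small data structure with a completeness check. The field names, the example service, and the rule that every element must be non-empty are assumptions made for this sketch. They are not the schema or the logic of Microsoft's BCM management tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical element names, taken from the list in the text above.
REQUIRED_ELEMENTS = (
    "business_impact_analysis",
    "dependencies",
    "workforce",
    "suppliers",
    "recovery_plan",
    "tests",
)


@dataclass
class BcmRecord:
    """Illustrative BCM record for one service (assumed layout)."""

    service_name: str
    business_impact_analysis: str = ""
    dependencies: list[str] = field(default_factory=list)
    workforce: list[str] = field(default_factory=list)
    suppliers: list[str] = field(default_factory=list)
    recovery_plan: str = ""
    tests: list[str] = field(default_factory=list)


def missing_elements(record: BcmRecord) -> list[str]:
    """Return the names of required elements that are still empty."""
    return [name for name in REQUIRED_ELEMENTS if not getattr(record, name)]


# Hypothetical service with a partially completed record.
record = BcmRecord(
    service_name="example-storage-frontend",
    business_impact_analysis="Customer-facing; an outage blocks reads and writes.",
    dependencies=["example-identity", "example-dns"],
    recovery_plan="Fail over to the paired region.",
)

print(missing_elements(record))  # ['workforce', 'suppliers', 'tests']
```

A real program would need a way to record "none" explicitly (for example, a service with no external suppliers), which this sketch does not distinguish from "not yet filled in."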
No amount of tooling, policies, or documents can provide the same level of confidence in service recovery and sustainability as comprehensive testing. Azure services test at various levels, ranging from individual unit tests all the way to complete “region down” scenarios. Every service must show proof of testing and demonstrate that its recovery meets its stated goals, both our internal targets and what we guarantee to our end customers in the Service Level Agreements (SLAs). Tabletop testing, in which simulated emergencies are merely discussed, is not considered acceptable or compliant for our program.
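The rule described above (an executed test whose measured recovery meets the stated goals, with tabletop exercises excluded) can be sketched as a simple check. The class names, the drill labels, and the example numbers are hypothetical. Expressing the goals as a recovery time objective (RTO) and a recovery point objective (RPO) is an assumption about what "stated goals" contain.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryGoals:
    rto: timedelta  # recovery time objective: longest acceptable time to restore service
    rpo: timedelta  # recovery point objective: longest acceptable window of data loss


@dataclass
class DrillResult:
    kind: str  # hypothetical labels: "tabletop", "unit", "zone_down", "region_down"
    time_to_recover: timedelta
    data_loss_window: timedelta


def drill_meets_goals(goals: RecoveryGoals, result: DrillResult) -> bool:
    """Return True only for an executed drill whose measurements are within the goals."""
    if result.kind == "tabletop":
        # Discussion-only exercises never count as proof of testing.
        return False
    return (
        result.time_to_recover <= goals.rto
        and result.data_loss_window <= goals.rpo
    )


# Example values are invented for illustration.
goals = RecoveryGoals(rto=timedelta(hours=1), rpo=timedelta(minutes=5))
region_drill = DrillResult("region_down", timedelta(minutes=42), timedelta(minutes=2))
slow_drill = DrillResult("zone_down", timedelta(hours=2), timedelta(0))
tabletop = DrillResult("tabletop", timedelta(0), timedelta(0))

print(drill_meets_goals(goals, region_drill))  # True
print(drill_meets_goals(goals, slow_drill))    # False
print(drill_meets_goals(goals, tabletop))      # False
```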
Our most robust integrated testing takes place in our “Canary” environment, which consists of two distinct production datacenter regions: one in Eastern United States and the other in Central United States.
On a regular basis, we test service recovery with a complete zone or region shutdown (simulating a major production outage or catastrophic loss), forcing all services to invoke their recovery plans. These tests not only verify service recoverability, but also test our incident response team’s processes for managing major incidents. For Availability Zones, we test and verify the seamless continuation of service availability in the face of an entire zone loss. These are end-to-end tests that include detection, response, coordination, and recovery.
All processes, from detection to response and action, are performed as if it were a real service-impacting event. Service responders are the normal on-call engineers. We also test synthetic customer-responsible functions, such as virtual machine (VM) failover to paired regions, ensuring customer workloads can operate in large-scale failure scenarios.
Availability Zones—our highest level of seamless availability
With more Azure regions becoming zone-enabled, our customers have additional options for resiliency: the highest level of availability supported by SLA, and in-region disaster recovery without the need to fail over out of region. Advantages include:
- Customers can have the highest level of availability and transparent recovery in a zone-down situation.
- Data is synchronously replicated, so there is no data loss of the kind that asynchronous replication to another region can incur.
- No potential for latency due to secondary region distance.
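The replication point in the list above can be illustrated with a toy model. With synchronous replication a write is acknowledged only once the replica has it, while with asynchronous replication acknowledged writes can still be in flight when the primary is lost. The class and method names are hypothetical, and the model says nothing about how any Azure service actually replicates data.

```python
class ReplicatedStore:
    """Toy primary/replica pair; not a model of any real storage service."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.replica = []
        self.pending = []  # acknowledged writes that have not reached the replica

    def write(self, value: str) -> None:
        if self.synchronous:
            # The write is acknowledged only after the replica holds it.
            self.replica.append(value)
        else:
            # The write is acknowledged immediately and shipped later.
            self.pending.append(value)

    def ship_pending(self) -> None:
        """Deliver queued writes to the replica (asynchronous mode only)."""
        self.replica.extend(self.pending)
        self.pending.clear()

    def lose_primary(self) -> list:
        """Return the acknowledged writes that do not survive losing the primary."""
        lost = list(self.pending)
        self.pending.clear()
        return lost


sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False)

for store in (sync_store, async_store):
    store.write("w0")
    store.write("w1")

async_store.ship_pending()  # the replica catches up on w0 and w1

for store in (sync_store, async_store):
    store.write("w2")  # the primary is lost before this write is shipped

sync_lost = sync_store.lose_primary()
async_lost = async_store.lose_primary()

print(sync_lost)   # []
print(async_lost)  # ['w2']
```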
Customers can leverage regional high availability, multi-region remote disaster recovery, or both. This “belt and suspenders” path provides the highest level of assurance that services will be resilient regardless of impact: it couples the high availability of Availability Zones with the out-of-region option of a remote location as a failsafe against the most catastrophic regional events.
Just as we do robust testing for cross-region disaster recovery, we apply the same diligence to our zone-enabled services. Using our Canary regions, we are able to perform end-to-end zone-down drills, proving our capability to provide reliable services to our customers.
The Microsoft BCM program follows all industry and government standards, addressing identification of services, calculating impact (recovery time objective or recovery point objective), dependency mapping, concise disaster recovery plans, and testing of those plans. These plans are reviewed at every level and verified via comprehensive end-to-end testing.
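One way dependency mapping and recovery time objectives can interact is sketched below. The check flags any service whose recovery time objective is shorter than that of something it directly depends on, since such a service may not be able to meet its own target. The service names, the objectives, and the check itself are assumptions for illustration, not a description of how the Microsoft program performs gap analysis.

```python
from datetime import timedelta

# Hypothetical services, objectives, and direct dependencies.
rto = {
    "web-frontend": timedelta(hours=1),
    "auth": timedelta(minutes=30),
    "database": timedelta(hours=4),
}
depends_on = {
    "web-frontend": ["auth", "database"],
    "auth": ["database"],
    "database": [],
}


def rto_gaps(rto, depends_on):
    """Return (service, dependency) pairs where the dependency's RTO is longer."""
    gaps = []
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            if rto[dependency] > rto[service]:
                gaps.append((service, dependency))
    return gaps


print(rto_gaps(rto, depends_on))
# [('web-frontend', 'database'), ('auth', 'database')]
```

The sketch looks only at direct dependencies. A fuller analysis would also follow transitive dependencies.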
The program itself has achieved dozens of industry and government certifications, including ISO 22301, which is the highest standard a program can achieve. In fact, to date, Azure is the only cloud service provider to achieve this rating.
Azure has been able to achieve these ratings by ensuring we have the following elements in place to maintain a successful, value-adding program:
- Leadership support and awareness at every level.
- Extensive policy, standards, and training documentation.
- Dedicated BCM practitioners with experience in driving a mature program.
- Transparent reporting and gap analysis driving informed decision making.
- Comprehensive testing of services ensuring that what we measure is accurate.
- Modern tooling driving the high-volume scalability ensuring compliance in the program.
The Microsoft BCM program is one of the most mature in any industry. It has demonstrated not only a commitment to meeting regulatory and compliance requirements, but also a customer focus on ensuring highly available and reliable services. In addition, with Availability Zones in the mix, our customers can receive the highest level of transparent service availability without the more impactful region-to-region disaster recovery.
As we move forward in providing highly resilient customer solutions, our program advances in lockstep. In 2021, we increased our test frequencies and end-to-end test scopes to ensure we can capture deficiencies (if they exist) and drive program remediation. These include enhanced Availability Zone testing, as well as region-to-region failure and recovery testing.