Information technology (IT) is at the core of operations across industries, and even slight disruptions in IT services can heavily dent an enterprise’s business performance.
Uncertainty caused by events like the COVID-19 pandemic has forced business models to change urgently and has exposed IT’s challenges in scalability, security, and business continuity. Technological glitches, such as the AWS and Salesforce outages in 2021, have already shown how such failures result in revenue loss and customer dissatisfaction.
Technology-related disruptions that impact business outcomes include:
Resource unavailability
Misconfigurations
Vulnerabilities, patches, and updates
Identity and access management
Insecure services enabled
Performance issues
Dynamic scalability
The aforementioned risks are dynamic, with complex interdependencies among them. Amid such disruptions, what should enterprises do to mitigate technology-related risks and ensure business resilience?
Typically, different roles and teams under IT (such as application development, infrastructure and operations, testing and quality analysis, security, and site reliability) support and operationalize business systems. For a business to function seamlessly, these teams need to work together in harmony. In practice, we have seen these functions work in silos or with very limited interaction (for example, DevOps), and resilience, security, and performance are not considered integral to application development and delivery.
There is a need for a framework that enables collaboration between IT teams to proactively mitigate technology-related risks. The chaos engineering framework bridges this gap. It allows IT teams to incorporate nonfunctional requirements from the ground up during the application lifecycle. It enables them to test hypotheses or assumptions in the real world and check system resilience.
The framework lets teams introduce precise and measured amounts of failures and errors into the system to improve business resilience. It enables organizations to prioritize business services that will benefit from improved resilience. It helps businesses investigate vulnerabilities in the technology ecosystem and apply resilience patterns.
Its gamut of capabilities includes:
Understanding of system modes and dependencies
Monitoring, tracing, and observing behavior of IT systems
Checking effectiveness of incident response process in case of emergencies
Testing out stability patterns
Identifying weaknesses and bugs that can cause business outages
Performing blameless postmortems
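To make the idea of injecting "precise and measured amounts of failures" and testing a stability pattern more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular chaos tool: `lookup_price`, `with_faults`, `with_retry`, and `success_rate` are hypothetical names, the 20% failure rate is an arbitrary choice, and a simple retry stands in for whatever stability pattern a team actually wants to evaluate.

```python
import random


class InjectedFault(Exception):
    """Raised by the fault injector in place of a real dependency failure."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly `failure_rate` of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)
    return wrapper


def with_retry(func, attempts):
    """A simple retry stability pattern: try the call up to `attempts` times."""
    def wrapper(*args, **kwargs):
        last_error = None
        for _ in range(attempts):
            try:
                return func(*args, **kwargs)
            except InjectedFault as err:
                last_error = err
        raise last_error
    return wrapper


def success_rate(func, calls):
    """Fraction of `calls` invocations that complete without an injected fault."""
    ok = 0
    for _ in range(calls):
        try:
            func()
            ok += 1
        except InjectedFault:
            pass
    return ok / calls


def lookup_price():
    """Hypothetical stand-in for a call to a downstream business service."""
    return 42


# A seeded generator keeps the experiment repeatable between runs.
rng = random.Random(7)
flaky = with_faults(lookup_price, failure_rate=0.2, rng=rng)

baseline = success_rate(flaky, calls=1000)
resilient = success_rate(with_retry(flaky, attempts=3), calls=1000)
print(f"success rate without retry: {baseline:.3f}")
print(f"success rate with retry:    {resilient:.3f}")
```

Comparing the two success rates is one way to check whether a stability pattern actually improves behavior under a known, controlled level of failure before relying on it in production.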
Chaos engineering cannot be treated as just another function of IT delivery. It requires a change in mindset and a collaborative effort among the stakeholders involved in the application lifecycle (chaos specialists, production support, incident management, domain experts, testing experts, and DevOps teams).
Adopting chaos engineering is a journey and it involves building competency in a structured and collaborative manner. Below is a sample maturity model to adopt chaos engineering.
Organizations can start small in implementing the chaos engineering framework.
A typical chaos experiment process involves identifying, prioritizing, and defining a steady state of the business function where resilience is needed. A chaos team then identifies failure scenarios, monitors key metrics, defines the blast radius of the experiment, gets buy-in from the business, communicates with the stakeholders, and plans a game day for resilience testing.
On game day, the team conducts the experiment and then performs a blameless incident analysis and postmortem. The chaos team then identifies the action and resilience pattern required for the resolution. As a next step, they coordinate with the team concerned to test and validate the resilience pattern in a test or preproduction environment before applying the changes to the production environment, and plan for the next game day.
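The experiment steps described above (define a steady state, limit the blast radius, inject a failure, observe, and roll back) can be sketched as a small Python structure. This is a simplified illustration, not a production harness: the `ChaosExperiment` class, the in-memory `replicas` dictionary, and the `checkout-a` / `checkout-b` names are all hypothetical, and a real experiment would query monitoring systems and act on real infrastructure instead.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChaosExperiment:
    name: str
    steady_state: Callable[[], bool]  # hypothesis: True while the service is healthy
    inject: Callable[[], None]        # introduces the failure scenario
    rollback: Callable[[], None]      # restores the system after the experiment
    blast_radius: List[str]           # the only targets the experiment may touch
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        # Never start an experiment on a system that is already unhealthy.
        if not self.steady_state():
            self.log.append("aborted: steady state not met before injection")
            return False
        self.log.append(f"injecting failure into {self.blast_radius}")
        try:
            self.inject()
            held = self.steady_state()
        finally:
            # Always roll back, even if injection or the check raised an error.
            self.rollback()
        self.log.append("hypothesis held" if held else "weakness found")
        return held


# Toy system: a service is healthy while at least one replica is up.
replicas = {"checkout-a": True, "checkout-b": True}


def healthy() -> bool:
    return any(replicas.values())


def kill_one() -> None:
    replicas["checkout-a"] = False


def restore() -> None:
    replicas["checkout-a"] = True


experiment = ChaosExperiment(
    name="lose one checkout replica",
    steady_state=healthy,
    inject=kill_one,
    rollback=restore,
    blast_radius=["checkout-a"],
)
result = experiment.run()
for entry in experiment.log:
    print(entry)
```

The log entries collected during the run are the kind of raw material a blameless postmortem would start from. If the hypothesis does not hold, the finding feeds into choosing a resilience pattern to validate before the next game day.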
There are many commercial and open-source tools available to conduct chaos experiments.
Depending on the IT environment and internal capabilities, organizations can choose between commercial and open-source tools to carry out these experiments.
Businesses should leverage chaos engineering to build resilience and deliver definitive value to customers.