Foundation principles are as critical as future-state capabilities.
The recent CrowdStrike Falcon update, now widely regarded as the world's largest IT failure, has sparked widespread discussion on how organizations can better prepare for and respond to such crises. Though many initially feared a large-scale hack, the incident was neither a cyber-attack nor the result of malicious activity; it was a faulty software update from a third-party cybersecurity firm.
While much of the post-incident analysis has been cyber-focused, reflecting the significant investments in cybersecurity, there’s an underlying disappointment that this wasn’t a cyber event, akin to the anticlimax the industry experienced with Y2K.
Stripping away the outage's complexity, the core issue lies in a fundamental breakdown of IT Service Management (ITSM) processes, particularly in change and release management. This article distills the foundational lessons from the CrowdStrike experience, offering critical insights for risk organizations to fortify their enterprise risk management and operational resilience frameworks.
Regulatory expectations for the Banking, Financial Services & Insurance industry come due in 2025 in several jurisdictions.
Under the UK rules and, most recently in North America, the Office of the Superintendent of Financial Institutions (OSFI) Guideline E-21, organizations are expected to ensure Operational Resilience is embedded across the enterprise in 2025, with a key focus on:
Embedded Operational Resilience within Enterprise Risk Management
Detailed mapping of service delivery
Defined critical / important business services and validated impact tolerances
Deeper and more sophisticated scenario testing
Mechanisms for self-assessment reporting
The role of the Chief Risk Officer (CRO) has transformed considerably in response to rising global uncertainties and more intricate business environments.
Heightened regulatory scrutiny, especially in the banking, financial services, and insurance sectors as highlighted in the image above, has expanded the CRO’s focus to include areas like operational resilience, third-party risk management, cyber risk, ESG, digital transformation, risk culture, and data analytics. These responsibilities reflect the intricate and interconnected risks that industries face today.
In this dynamic landscape, the Risk Organization’s role is more critical than ever.
There is a rising expectation for Risk programs to adopt a proactive and strategic approach to risk identification and mitigation, ensuring resilience is built into services by design. Meanwhile, the Chief Resilience Officer (typically in the first line of defense) is also evolving, tasked with navigating the complexities of embedding resilience across the organization.
Change and Resilience
CrowdStrike's preliminary incident report highlights that the company has an extensive quality assurance process for regular updates; however, the update that caused the incident was a Rapid Response Content update, which goes through a different process.
This raises multiple questions about CrowdStrike’s overall testing, validation and rollout process, and serves as a clear example of why resilience must be embedded throughout an organization’s complete environment, specifically in change management processes.
Despite careful planning, changes made under IT Service Management (ITSM) change management still regularly disrupt organizations. Poorly managed changes often trigger major disruptions, especially in complex IT environments.
Studies consistently show that change-related incidents account for a large share of high-priority (P1 and P2) incidents, as even minor oversights can cascade into major outages.
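As a simple illustration of how a risk team might track this measure internally, the sketch below computes the share of high-priority incidents linked to a change. The `Incident` fields, the priority labels and the sample records are hypothetical assumptions for illustration only; they are not drawn from any real ITSM tool or from CrowdStrike data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    incident_id: str
    priority: str          # e.g. "P1", "P2", "P3" (assumed labels)
    change_related: bool   # True if the incident is linked to a change record


def change_related_share(incidents, priorities=("P1", "P2")):
    """Return the fraction of high-priority incidents linked to a change."""
    high = [i for i in incidents if i.priority in priorities]
    if not high:
        return 0.0
    return sum(i.change_related for i in high) / len(high)


# Illustrative records only; not real data.
sample = [
    Incident("INC-1", "P1", True),
    Incident("INC-2", "P2", False),
    Incident("INC-3", "P2", True),
    Incident("INC-4", "P3", True),   # ignored: not high priority
    Incident("INC-5", "P1", True),
]

share = change_related_share(sample)
print(f"{share:.0%} of P1/P2 incidents were change-related")
```

Tracked over time, a figure like this gives the risk function a basic indicator of whether change management controls are improving or degrading.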
Integrating Resilience into Foundational ITSM
Through the integration and embedding of resilience into ITSM processes, organizations can seamlessly enhance existing structures.
Resilience-focused regulatory requirements specifically call for testing ‘severe but plausible’ scenarios, with the objective of proactively identifying key service-level vulnerabilities.
Scenario testing is of course not new. In recent years much attention has gone to ‘cyber threat’ driven scenarios; although prudent, this has drawn attention away from broader scenarios that require recovery support from the wider crisis management team(s).
Operational Resilience brings a wider, service-focused dimension, and recent industry outages highlight the need for organizations to develop testing regimes that stress the most critical processes across business and technology. Some key considerations include:
Fortifying operational resilience is not just about focusing on the core organizational frameworks, but also about strengthening the broader extended organization.
Doing so helps ensure that third-party vulnerabilities do not end up disrupting your operations, as we saw in the CrowdStrike event. Hence it is extremely important to:
As organizations adopt advanced techniques like Generative AI, Natural Language Processing, and Big Data analytics for third-party risk assessments, it's essential to remember that foundational practices (such as clear contract criteria, regular communication, thorough change assessments and robust documentation) remain vital to ensuring operational resilience and mitigating disruption risk effectively.
Continuous, proactive oversight and strong third-party relationships significantly boost operational resilience. Unlike periodic assessments, ongoing monitoring offers real-time analysis of processes, controls, systems, and performance metrics.
As organizations strive to develop future-state, regulatory-aligned integrated risk and resilience programs, it is imperative not to overlook foundational principles, and to sustain them.
The CrowdStrike outage serves as a stark reminder that even the most advanced systems can falter without a strong foundation. Organizations need to balance innovation and risk investment to ensure foundational capabilities are as robust as aspirational future-state programs.
Once regulatory compliance is achieved, the approach to building ‘Integrated Resilience’ becomes critical. Integrated Resilience provides a holistic approach to managing risks and enhancing an organization's ability to withstand, adapt to and recover from disruptions.
The cross-functional approach, in line with an organization’s resilience framework, provides the capability to effectively monitor and manage service-level impact tolerances for critical / important business services spanning multiple legal entities globally.
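To make the idea of monitoring impact tolerances concrete, here is a minimal sketch of checking an outage duration against a service's tolerance. The service names, legal entities, tolerance values and the 80% warning threshold are all hypothetical assumptions chosen for illustration; real tolerances are defined and validated by each organization.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class BusinessService:
    name: str
    legal_entity: str
    impact_tolerance: timedelta  # maximum tolerable disruption (assumed unit: time)


def tolerance_status(service, outage, warn_ratio=0.8):
    """Classify an outage as 'within', 'at_risk' or 'breached'.

    warn_ratio is an assumed early-warning threshold, expressed as a
    fraction of the impact tolerance.
    """
    if outage > service.impact_tolerance:
        return "breached"
    if outage >= service.impact_tolerance * warn_ratio:
        return "at_risk"
    return "within"


# Hypothetical services across two legal entities.
payments_uk = BusinessService("Payments", "UK entity", timedelta(hours=4))
payments_ca = BusinessService("Payments", "Canada entity", timedelta(hours=4))
onboarding = BusinessService("Customer onboarding", "UK entity", timedelta(hours=24))

statuses = {
    "payments_uk": tolerance_status(payments_uk, timedelta(hours=5)),
    "payments_ca": tolerance_status(payments_ca, timedelta(hours=3, minutes=30)),
    "onboarding": tolerance_status(onboarding, timedelta(hours=2)),
}
for key, status in statuses.items():
    print(key, status)
```

Keeping the legal entity on each record reflects the point above: the same business service may carry separate tolerances, and separate reporting obligations, in each jurisdiction.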
Risk organizations must remain vigilant, ensuring that innovation and expansion are grounded in proven risk management and resilience strategies. They should embrace the integration of these strategies while safeguarding their organizations against both known and emerging threats, and ensure that foundational principles like Change Management are designed with resilience in mind.
A strong structure requires a solid foundation. Invest in what supports you today as you prepare for future growth.