Continuing the Hybrid Platforms Trends Series on Cyber Resilience, we are going to look at why the strategy for recovering the platform is not the same as recovering the business, due to the differences between a traditional DR event and a more all-encompassing ransomware attack.
Platform Recovery Is Not Business Recovery
So, what do I mean by platform recovery is not business recovery? If we consider recovery of the platform as infra, identity and compute instances (VMs, containers or even PaaS services), recovering these does not automatically mean we have restored the business function. The business function, the thing that actually allows the organisation to operate, might have reliance on specific data flows, external dependencies, or links to other systems.
How this complex set of applications operates and interacts is not always well understood, and at times the knowledge is held in teams outside of core IT operations. When we consider the complexity of a cyber recovery event, the pure act of recovering the virtual machines (and containers etc.) might not result in the recovery of that business process. If we don’t recover the business process, we will not restore the business.
Disaster Recovery vs Cyber Recovery
Most people can picture what will happen in the event of a traditional disaster. Many have lived it, and the process is well understood. Fire, flood, power failure, and hardware or software issues all result in the same outcomes. We know the specific time it happened and when the systems went down, we have a DR plan, and we can execute from the last known good restore point. Of course, it’s still a stressful time, but the variables are limited and mostly of a known quantity. We accept a level of data loss, usually measured in hours or minutes, and we can often place a timeframe on the recovery that can be communicated internally and externally.
Now consider the ransomware attack.
We can’t be sure when it started or how long systems have been compromised. The exact time of encryption will be different per system, identity services will be compromised and it’s likely that the DR site will also be compromised. Then we layer on the ransom demands, the regulators and the pressures that incident response and forensics bring. We are now talking about a recovery process that might mean multiple restore iterations per system, restoring OS and data from different time points, and the complexity of ensuring data and system integrity.
The realities of such a recovery mean we need to take that base Disaster Recovery knowledge and augment it to be cyber-ready. This will require different technology capabilities. Think about how you restore and validate 20 restore points per system when searching for the 'clean' version; if each restore takes hours the impact will be crippling.
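To put rough numbers on the restore-point problem, here is a minimal Python sketch comparing two ways of hunting for the newest clean restore point. Everything in it is an assumption for illustration: the 20 restore points come from the example above, while the 4 hours per restore, the position of the first compromised point and the `is_clean` check are made up. The bisection approach is only valid if every restore point taken after the first compromised one is also compromised, which you would need to confirm for your own environment.

```python
from typing import Callable, Optional, Tuple

HOURS_PER_RESTORE = 4   # hypothetical: restore plus validation time per attempt
NUM_POINTS = 20         # restore points per system, oldest (0) to newest (19)
FIRST_COMPROMISED = 3   # hypothetical: points 0-2 are clean, 3-19 are not


def is_clean(index: int) -> bool:
    """Stand-in for a real validation step (malware scan, integrity check)."""
    return index < FIRST_COMPROMISED


def newest_first(num_points: int, check: Callable[[int], bool]) -> Tuple[Optional[int], int]:
    """Restore the newest point first and walk backwards until one is clean."""
    restores = 0
    for index in range(num_points - 1, -1, -1):
        restores += 1
        if check(index):
            return index, restores
    return None, restores


def bisect(num_points: int, check: Callable[[int], bool]) -> Tuple[Optional[int], int]:
    """Halve the search window each time.

    ASSUMPTION: once a point is compromised, every newer point is too.
    """
    low, high = 0, num_points - 1
    last_clean = None
    restores = 0
    while low <= high:
        mid = (low + high) // 2
        restores += 1
        if check(mid):
            last_clean = mid   # clean, so look for something newer
            low = mid + 1
        else:
            high = mid - 1     # compromised, so look further back
    return last_clean, restores


linear_point, linear_restores = newest_first(NUM_POINTS, is_clean)
bisect_point, bisect_restores = bisect(NUM_POINTS, is_clean)

print(f"Newest-first: {linear_restores} restores, {linear_restores * HOURS_PER_RESTORE} hours")
print(f"Bisection:    {bisect_restores} restores, {bisect_restores * HOURS_PER_RESTORE} hours")
```

With these made-up numbers, walking back from the newest point takes 18 restores (72 hours) for a single system, while bisection takes 5 (20 hours). Either way, multiply that by the number of systems in scope and it becomes clear why tooling that can identify the clean restore point without a full restore matters so much.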
Just as important are new processes and education of users to reduce the time impact. This is why we need to consider this as a business recovery conversation, not just a platform conversation!
The Isolated Recovery Environment (IRE)
One area that has become a must-have consideration is that of the Isolated Recovery Environment (IRE). This is a deployment that is separated from the production backup infrastructure with no connection or interaction with it. This is where you send the critical data and systems that make up your minimum viable company (more on MVC later) to enable a rapid recovery operation. Consider that in many ransomware attacks the secondary/DR infrastructure will be compromised (including the hypervisor in many cases) along with the primary site.
Additionally, the forensics teams are likely to stop you from rebuilding that equipment to preserve critical information for the investigation. We are now in need of a clean room to recover into, one that we can be 100% sure was not compromised (due to a complete air gap). This is the role of IRE, providing compute, storage and networking to recover critical systems, as well as tooling to allow easy identification of the correct restore point. A good IRE will provide guided workflows to reduce the complexity of recovery as well as network isolation capabilities to avoid reinfection.
Non-Technology Tips To Survive a Breach
As we have discussed, the differences between a traditional disaster event and a cyber event present several unique challenges from a technology standpoint. The other big difference comes from the operations, people and processes that underpin a successful recovery. A couple of years ago I was working with an organisation following a bad attack. After the event, the Head of IT described it as "running around the datacentre with your hair on fire!" I thought it would be interesting to see how Generative AI turned this into an image (created with Adobe Firefly). Having been through a number of these events, it can certainly feel like this!
AI-generated image of living through a ransomware recovery
Additional Concepts To Consider for Cyber Recovery
A key thing to keep in mind is that the experience will be very different from that of a traditional recovery, and if we don’t think about some additional steps and processes it will be painful. Below are 10 concepts and tactics to keep in mind when building your cyber recovery plan. It’s not an exhaustive list, but it should help douse some of that 'fire' of a ransomware attack.
- MINIMUM VIABLE COMPANY (MVC)
- CHAIN OF COMMAND
- LEVERAGE EXPERTS
- YOU KNOW YOUR BUSINESS
- SECONDARY COMMUNICATION
- CYBER INSURANCE
- STAFF BURNOUT
- HOW DO I PUT YOU OUT OF BUSINESS?
- DON’T THINK. DO
- IDENTITY RECOVERY (ACTIVE DIRECTORY)
Minimum Viable Company (MVC)
When building your recovery plans, consider the concept of a minimum viable company: what applications, data and systems do you need to restore critical functions? Then consider how that will change depending on when the attack happens. Month end versus mid-month could change the minimum viable application set, and the same goes for the time of year or quarter. Ensuring this is clearly documented and understood as part of your recovery plan could significantly shorten recovery times and minimise the impact on revenue and/or reputation.
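One way to document a date-dependent application set is as a small lookup, sketched below in Python. The application names, the five-day month-end window and the use of calendar quarters are all hypothetical placeholders, not a recommendation; the point is simply that the MVC can be written down as data and queried for any given attack date.

```python
import calendar
from datetime import date
from typing import Set

# Hypothetical application sets; replace with your own documented MVC.
ALWAYS_CRITICAL = {"identity", "order-processing", "customer-portal"}
MONTH_END_CRITICAL = {"payroll", "finance-ledger"}
QUARTER_END_CRITICAL = {"regulatory-reporting"}

# Assumption: the last 5 days of a month count as "month end".
MONTH_END_WINDOW_DAYS = 5
# Assumption: quarters follow the calendar year.
QUARTER_END_MONTHS = (3, 6, 9, 12)


def minimum_viable_apps(attack_date: date) -> Set[str]:
    """Return the applications to restore first for an attack on this date."""
    apps = set(ALWAYS_CRITICAL)
    days_in_month = calendar.monthrange(attack_date.year, attack_date.month)[1]
    if days_in_month - attack_date.day < MONTH_END_WINDOW_DAYS:
        apps |= MONTH_END_CRITICAL
        if attack_date.month in QUARTER_END_MONTHS:
            apps |= QUARTER_END_CRITICAL
    return apps


for when in (date(2024, 5, 15), date(2024, 5, 29), date(2024, 6, 28)):
    print(when.isoformat(), sorted(minimum_viable_apps(when)))
```

Keeping a definition like this alongside the written plan, and available offline, makes it harder for the month-end or quarter-end additions to be forgotten under pressure.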
Chain of Command
Who is making the decisions? Who is communicating internally and externally (think customers, coworkers, regulators, suppliers)? Who is the liaison to the forensic teams? When the worst does happen, a lot of key decisions will need to be made and a lot of information processed and communicated. Having a clear chain of command established with key roles and responsibilities defined will be key to smooth operations. Your IT function will be under massive pressure, and they don’t need to be bombarded with information requests. Your customers will need clear and calm communication and the regulators will need to be kept informed. Make sure you define who will manage this communication flow and how, and that the right people are empowered to make the tough decisions!
Leverage Experts
If you are lucky (or unlucky) enough to have people that have lived through multiple ransomware recoveries, then you might be able to skip this one. The reality though is there is still not a lot of real-world experience inside most organisations. Leveraging the experience of experts who have gone through many such events should be a top priority to shortcut your way to a successful cyber resilience plan. From helping define processes to executing tabletop exercises and full simulations, having an expert response firm integrated into your process will give you the best chance of survival. Also, if they are embedded into the planning they can be much more effective when called upon on that dark day!
Know Your Business
Leveraging experts is key, but only you know your own business in the requisite level of detail to enact a successful recovery. Make sure you have this detail documented and accessible during the recovery process. A lack of understanding of how business processes, systems and data are integrated can make the recovery process a painful experience. As we touched on above, platform recovery is not business recovery, and you will likely be recovering to a different environment with different application communication paths. Understanding firewall rules and data pipelines will be key, along with known testing processes to validate operations. These are all things that should be documented before any possible incident but, sadly, are normally not known in enough detail.
Secondary Communication
How will you communicate with all the interested parties during your response? One of two scenarios should be considered. One; you don’t have access to any primary business communication tools: email, Teams, Zoom, Slack etc. Two; you do have access, but can you trust who might be listening in?
The first scenario is the most likely. Identity services are compromised, making access to standard enterprise communication tools challenging or impossible. Contact numbers for key people need to be available offline and established methods of communication put in place. You don’t want that key decision maker struggling to install WhatsApp or another tool in the middle of the event! Break glass accounts for key systems should also be stored in an accessible manner; email gateways can be a good place to regain some communication capabilities along with access to corporate websites to update customers.
Of course, you need to consider scenario two as well. If the attacker can, they will monitor communications, giving them valuable insights into your recovery activities that could allow them to stay one step ahead!
Cyber Insurance
There is a lot of talk about the worth of cyber insurance in the event of a breach. A deep dive into this topic is for another time, but do consider two things.
Firstly, can you accurately estimate the cost of a full recovery? Your insurance is not worth much if it covers only a fraction of the real cost. Secondly, really understand the small print of your contract. I was speaking to an organisation that had insurance cancelled because they missed a small detail in the terms. In this case, it was a simple matter of the order in which they spoke to people, easily avoided but costly!
Including your cyber insurance company in your tabletop and simulation exercises can help ensure you don’t fall foul of such clauses. Don’t just write 'contact insurance' on the plan; do it, and have them integrated into every test.
Staff Burnout
One thing I have learned from living through a few cyber attacks is the pressure that gets placed on the infrastructure teams; they are the ones with the deepest knowledge and are generally in control of the core systems like identity, platforms, backups, and communication.
During your planning stage, create a map of these skills and who can execute the different recovery steps. Where are the bottlenecks in people? Consider that you cannot push your staff to work 24 hours a day for very long before burnout sets in, at which time mistakes will be made that could make the situation much worse. Also consider the long-term impact of how you look after your key staff during this time. I witnessed one customer’s entire third-line team quit as soon as the recovery was completed due to the pressure and treatment! Not the best result for continuing cleanup and day-to-day operations.
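The skills map described above can be as simple as a table of recovery steps against the people able to execute them. Here is a minimal Python sketch, with entirely hypothetical step names and people, that flags the two kinds of bottleneck worth looking for: steps only one person can perform, and people who appear against a large share of the steps.

```python
from collections import Counter
from typing import Dict, List, Set

# Hypothetical skills map: recovery step -> people able to execute it.
SKILLS_MAP: Dict[str, Set[str]] = {
    "restore identity": {"asha"},
    "rebuild hypervisors": {"asha", "ben"},
    "restore backups": {"ben", "chris"},
    "re-establish email gateway": {"asha"},
}


def single_person_steps(skills: Dict[str, Set[str]]) -> List[str]:
    """Steps that stall completely if one named person is unavailable."""
    return sorted(step for step, people in skills.items() if len(people) == 1)


def load_per_person(skills: Dict[str, Set[str]]) -> Counter:
    """How many recovery steps each person is expected to cover."""
    return Counter(person for people in skills.values() for person in people)


print("Single points of failure:", single_person_steps(SKILLS_MAP))
for person, steps in load_per_person(SKILLS_MAP).most_common():
    print(f"{person}: {steps} steps")
```

In this made-up example, one person is the only option for two steps and is named against three of the four, which is exactly the kind of concentration that leads to 24-hour shifts and burnout. Cross-training or documented runbooks for those steps are the obvious follow-up.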
How Do I Put You out of Business?
When thinking about our recovery plans, data protection strategies and internal processes, one question I put to the room is, "How would I put you out of business in 24 hours or less?" If you look at your own organisation through this lens you might uncover critical systems, processes, technologies or people that would not normally have been considered from a technical protection perspective. This is the lens threat actors will be using, and their external perspective could make it easier to spot a weakness that you have not considered. This 'failure to imagine' is often the lesson learnt post-incident.
Don’t Think. Do
When the worst does happen, it is not the time to make decisions and come up with plans. Your teams will be stressed, tired and in some level of ‘panic mode’. Any decisions made will likely be the wrong ones, with the potential to either make things worse or at least extend recovery times. Please plan, prepare and rehearse all aspects of your recovery processes beforehand, involving all departments and your external support structures such as incident response or forensic teams. Also, don’t forget the mundane stuff like communication plans and liaisons with regulators, cyber insurance companies and your customers. Make it so you don’t have to think, and you can just Do it!
Identity Recovery (Active Directory)
Most organisations are still linked to Microsoft Active Directory for identity in some form or another. Attackers know this, and getting hooks deep into this technology is usually a key objective. Once those hooks are in place, they can be very difficult to root out and remove, leaving your organisation at risk of a second compromise. Consider that one of the first things most incident response organisations will require you to do is build a fresh Active Directory to avoid this possibility. Rebuilding AD from scratch is a scary thought, and the complexity could extend recovery times considerably. Get ahead of this and deploy dedicated solutions that can back up AD and provide proven recovery models. It’s much easier to be ahead of this than behind (and a simple Domain Controller backup is not enough).
Summary
That is the end of this trends series. Hopefully, it has put a spotlight on how organisations need to look at Cyber Resilience in a different light, and can help you be more prepared for the worst and even reduce the chances of compromise in the first place.
- Consider the tips from Part 1 and move to an organisation-wide approach to security; it’s not just a security function.
- Understand your data and put levels of protection in place based on criticality and sensitivity.
- Build a response plan that considers the differences between DR and Cyber Response.
The bad actors will continue to adapt. As we adopt technology to protect our organisations they will continue to innovate, and the commercial incentive to do so is massive. The focus on recovery over the last few years is starting to drive a change in attacker tactics from encrypt to extort, a tactic that backups and recovery plans can’t help with.
We must change the focus and harden our organisations at all levels to ensure we can withstand being compromised in the first place. At CDW our Visualise | Withstand | Survive approach is designed to help organisations on that journey and ensure you are as prepared as possible.
Contributors
- Rob Sims, Chief Technologist - Hybrid Platforms