At first glance, chaos engineering sounds similar to extreme programming in the early Agile days, when chaos meant random changes and continuously shifting requirements and application functionality. Chaos engineering isn’t about application functionality per se; it’s about the stability and functionality of the production server after a new release deploys. Chaos engineering, otherwise known as chaos testing, attempts to close the test coverage gaps between a test server and a live server with real customers, data, and transactions.
As software applications get more complex and integrated, they fail in more ways. Companies like Netflix and Amazon have frequently been victims of their own success. A thundering herd problem, where a large number of requests hit the same resource at once, can strike the system in varied ways and cause significant failures that cut customers off from the service.
Think about it outside of a retail/service environment for a moment. Having to wait to shop or stream doesn’t sound like a critical problem. But consider a complex healthcare system that functions using integrated and dependent systems including APIs, microservices, third-party software, and medical devices. What happens when the system goes down? Patients are adversely affected, providers are at risk, and physicians go back to manual processes which are slow, inaccurate, and time-consuming.
This guide describes the basic principles and benefits of chaos engineering, how it affects the QA testing team, and how it leads to higher-quality application design and function for an improved customer experience.
- What is Chaos Engineering?
- What are the benefits of Chaos Engineering?
- How to improve testing and application design using Chaos?
- Determine how the QA testing team can manage chaos engineering test design and execution.
- Discover the value of executing chaos tests on production.
- Learn the importance of a blast radius when testing in production.
What is Chaos Engineering?
Chaos engineering means testing a distributed computer system using random and unexpected failure conditions to identify weaknesses present in the system. Random and unexpected actions, failures, and conditions equal chaos.
Chaos engineering is a software development methodology that encourages testing creativity and expanded test coverage to discover and plan for system errors. These are not average system errors but catastrophic errors that take down the network and interrupt customer access for any length of time.
Netflix originally established chaos engineering while transferring its entire infrastructure to AWS. Netflix developed two principles to test against in order to prevent or minimize the impact of the move on customers.
Chaos engineering principles include:
- Systems should never have a single point of failure.
  - A single point of failure is a component whose failure alone leads to customer interruption or significant access downtime.
- Assume systems always have at least one single point of failure.
  - Software development teams must create effective tests and monitor the system to ensure there is never a single point of failure.
Chaos engineering proactively identifies errors to prevent production server outages from impacting customers. Chaos engineering is not random or undisciplined testing. It relies on the ability to monitor the production server and execute real-life test simulations to determine how the application responds to failures in integrated or connected services and systems.
Chaos engineering includes performing the following functions on the production server:
- Define a steady state, or baseline, to measure the application and server against.
- Determine if the defined steady state holds during experimental testing.
- Test with minimal impact on users by defining and implementing tests within a blast radius.
  - Defining a blast radius means chaos tests are focused on a particular area and resources are available to respond to failures immediately.
- Introduce the planned chaos events in order, contained by the defined blast radius.
- Introduce scenarios that mimic real-world failures. Example failure scenarios include:
  - Server crash
  - Hardware malfunction
  - Connection failures
  - Third-party application failures
- Monitor the tests and repeat the scenarios, being as creative with failure scenarios as possible.
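The steps above can be sketched as a small simulation. This is a minimal illustration, not a production tool: the `Service` class, the replica-based health model, the 0.99 steady-state threshold, and every function name are assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical stand-in for a production service running on replicas."""
    name: str
    replicas: int
    healthy: int = field(init=False)

    def __post_init__(self):
        self.healthy = self.replicas

    def success_rate(self) -> float:
        # Simplified model: requests succeed while at least one replica is up.
        return 1.0 if self.healthy > 0 else 0.0


def crash_replicas(service: Service, count: int) -> None:
    """Inject a 'server crash' failure by taking replicas offline."""
    service.healthy = max(0, service.healthy - count)


def restore(service: Service) -> None:
    """Roll back the injected failure."""
    service.healthy = service.replicas


def run_experiment(services, blast_radius, steady_state=0.99, crashes=1, seed=0):
    """Run one chaos experiment and report whether the steady state held."""
    # Only services inside the blast radius may be targeted.
    candidates = [s for s in services if s.name in blast_radius]
    if not candidates:
        raise ValueError("blast radius contains no known service")

    # Confirm the baseline before injecting anything.
    if min(s.success_rate() for s in services) < steady_state:
        raise RuntimeError("system is not in steady state; aborting experiment")

    target = random.Random(seed).choice(candidates)
    crash_replicas(target, crashes)
    try:
        observed = target.success_rate()
    finally:
        restore(target)  # always restore, even if monitoring raises

    return {"target": target.name, "observed": observed,
            "held": observed >= steady_state}
```

Running `run_experiment([Service("checkout", 3), Service("search", 1)], {"search"})` reports that the steady state did not hold, because the hypothetical `search` service has only one replica and is therefore a single point of failure in this model.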
Benefits of Chaos Engineering & Chaos Testing
Chaos engineering benefits an organization by identifying server and application vulnerabilities, integration failures, and system crashes before the customer experience is impacted. The goal is for the production system to continue performing as expected with each new release, regardless of the nature of the changes or updates.
Other benefits of chaos engineering include:
- Faster identification and correction of issues not captured by other QA testing efforts.
- Fewer unplanned outages and less downtime.
- Ongoing system monitoring on the production server.
- Increased test depth and coverage through controlled testing in production.
QA and Chaos Testing – Mix & Match
Chaos engineering appears similar to stress, load, and performance testing. However, what sets it apart is the chaos, or the unexpectedness of the failures it introduces. For example, in chaos engineering, the system’s optimal or baseline state is set. Then, testers consider potential weaknesses and their effects on the customer experience and create a test scenario for each. Each test is then executed with assistance from DevOps and with resources available to repair the production server when tests successfully find problems.
In other types of performance testing, the application performance is tested when running on a test or development server. Often functional application tests are transformed into performance tests based on the user workflow. In a typical performance, stress, or load test, testers execute based on known factors against an expected result, rather than trying to crash the production server or cause failures on it.
Chaos engineering also must involve IT or DevOps to manage issues on the production server. If failures are caused by testing in a blast radius, resources must be ready to reinstate the production server as needed.
It’s common for a DevOps engineer to execute chaos engineering testing. However, there’s no reason QA testers cannot also design and execute chaos engineering testing. Coordination and cooperation between QA testing and DevOps during testing are key. QA testers have the skills to break software including hardware and backend connections, but they may not have the skills to restore the production server to normal operations rapidly. Leverage the QA tester’s ability and desire to break software to the business’s advantage with chaos engineering.
Mix and match QA testing resources with DevOps to ensure optimal chaos test development, execution, and support when testing in production. Add chaos test scenarios to scheduled regression testing, even on a test server. Determine what can be tested on the test servers first, and then move into production. Adding chaos tests improves the depth and coverage of QA testing while providing business value.
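As one way to picture a chaos scenario inside a regression suite on a test server, the sketch below injects a third-party connection failure and checks that the application degrades gracefully. `FlakyPricingClient`, `price_with_fallback`, the SKU, and the prices are all hypothetical names and values invented for the example.

```python
import unittest


class FlakyPricingClient:
    """Hypothetical third-party client that fails its first `failures` calls."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def get_price(self, sku):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("injected third-party failure")
        return 19.99


def price_with_fallback(client, sku, cached_prices, retries=2):
    """Try the third-party service, then fall back to the last cached price."""
    for _ in range(retries):
        try:
            return client.get_price(sku)
        except ConnectionError:
            continue  # retry until the attempts are used up
    return cached_prices[sku]


class ChaosRegressionTest(unittest.TestCase):
    def test_recovers_after_one_failure(self):
        client = FlakyPricingClient(failures=1)
        price = price_with_fallback(client, "sku-1", {"sku-1": 18.50})
        self.assertEqual(price, 19.99)

    def test_falls_back_when_service_stays_down(self):
        client = FlakyPricingClient(failures=10)
        price = price_with_fallback(client, "sku-1", {"sku-1": 18.50})
        self.assertEqual(price, 18.50)
        self.assertEqual(client.calls, 2)  # retried, then gave up
```

Saved in a test module, these cases can run with `python -m unittest` alongside the existing regression tests, so the failure scenario is exercised on a test server before anything similar is attempted in production.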
Testing with a Blast Radius
Using a blast radius enables production-level testing without negatively impacting the production server or taking it down completely. Designate distinct blast radius zones for similar functions. Next, group test scenarios into their related zones. Executing tests by blast radius keeps failures under control and reduces the possibility of unexpectedly and completely crashing the production server.
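Grouping scenarios into zones can be sketched as follows. The zone names, the scenario descriptions, and the `inject` and `is_healthy` callbacks are assumptions for illustration; a real setup would wire those callbacks to its own failure-injection and monitoring tooling.

```python
from collections import defaultdict

# Hypothetical (zone, scenario) pairs; each zone covers similar functions.
SCENARIOS = [
    ("payments", "third-party gateway timeout"),
    ("payments", "database connection failure"),
    ("search", "index server crash"),
    ("streaming", "cache node hardware malfunction"),
]


def group_by_zone(scenarios):
    """Group test scenarios into their blast radius zones."""
    zones = defaultdict(list)
    for zone, scenario in scenarios:
        zones[zone].append(scenario)
    return dict(zones)


def run_zone(zone, scenarios, inject, is_healthy):
    """Run one zone's scenarios in order, halting once the zone is unhealthy."""
    executed = []
    for scenario in scenarios:
        inject(zone, scenario)
        executed.append(scenario)
        if not is_healthy(zone):
            break  # stop here so the failure stays inside this zone
    return executed
```

Halting at the first unhealthy check is a deliberate choice in this sketch: it trades test throughput for a smaller chance that stacked failures spread beyond the zone.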
During chaos engineering testing, expect disruption. Coordinating efforts between IT, DevOps and QA testing is critical to minimize adverse effects on the production server and the customer experience. Ensure redundancy measures are in place to keep the server operational when chaos engineering testing causes issues.
One basic blast radius worth considering is the timing of test execution. Execute tests at non-peak periods to minimize performance impact on customers.
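A timing guard like the one below is one way to keep scheduled chaos tests inside a non-peak window. The 02:00 to 05:00 default window is an assumed placeholder; the real window depends on each system’s own traffic patterns.

```python
from datetime import datetime, time


def in_off_peak_window(now, start=time(2, 0), end=time(5, 0)):
    """Return True if `now` falls inside the off-peak window [start, end)."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # The window wraps past midnight, for example 23:00 to 04:00.
    return current >= start or current < end


def maybe_run(now, run_tests):
    """Run the chaos tests only during the off-peak window."""
    if in_off_peak_window(now):
        run_tests()
        return True
    return False
```

A scheduler could call `maybe_run(datetime.now(), run_tests)` before each planned execution and skip the run when it returns `False`.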
Chaos engineering creates real-world hardware, software, and application failures in distributed systems. Chaos provides deeper testing into the vulnerabilities present in complex, integrated computer systems and the hardware they use. The purpose of chaos engineering is to ensure production server integrity.
Chaos engineering improves customer experience by reducing the number of failures and system crashes that reach production. Chaos engineering testing is executed by DevOps or QA testing teams on production servers with resources ready and able to keep production running in case of issues. The key to success is coordination and cooperation between DevOps and QA testing teams. Chaos works better by leveraging operational, test development, and defect-finding skills. Reduce downtime on production and disruptions to the customer experience by executing chaos testing frequently.