Harnessing Chaos: The Role of Chaos Engineering in Cloud Applications and Impacts on Site Reliability Engineering

© 2024 by IJCTT Journal
Volume-72 Issue-6
Year of Publication : 2024
Authors : Rahul Yadav
DOI :  10.14445/22312803/IJCTT-V72I6P104

How to Cite?

Rahul Yadav, "Harnessing Chaos: The Role of Chaos Engineering in Cloud Applications and Impacts on Site Reliability Engineering," International Journal of Computer Trends and Technology, vol. 72, no. 6, pp. 25-30, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I6P104

In cloud computing, where reliability and resilience are paramount, deliberately embracing chaos might seem counterintuitive. Yet chaos can be harnessed to ensure that systems are robust and able to withstand unexpected failures. The methodology behind this approach is Chaos Engineering: a disciplined practice of experimenting on distributed systems to build confidence in their resilience. It involves intentionally introducing controlled disruptions or failures into a system and observing how the system responds under adverse conditions. This paper examines how Chaos Engineering techniques can be integrated into cloud-based systems and how that integration affects Site Reliability Engineering (SRE) practices. By simulating real-world failures in a controlled environment, organizations can identify weaknesses, uncover hidden dependencies, and improve the overall reliability of their systems.
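The experiment loop described in the abstract (measure a steady state, inject a controlled failure, observe the response, restore) can be illustrated with a minimal sketch. It is not taken from the paper and does not use any real chaos tooling; every name in it (FlakyDependency, Service, inject_failures, run_experiment) and every number (a 50% injected failure rate, a 99% success threshold) is a hypothetical choice made for illustration.

```python
import random
from contextlib import contextmanager


class FlakyDependency:
    """Hypothetical downstream dependency whose failure rate an experiment can raise."""

    def __init__(self, rng):
        self.rng = rng
        self.failure_rate = 0.0  # healthy by default

    def call(self):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return "fresh"


class Service:
    """Hypothetical service under test; may fall back to a cached value."""

    def __init__(self, dependency, use_fallback):
        self.dependency = dependency
        self.use_fallback = use_fallback

    def handle_request(self):
        try:
            return self.dependency.call()
        except ConnectionError:
            if self.use_fallback:
                return "cached"
            raise


@contextmanager
def inject_failures(dependency, failure_rate):
    """Raise the failure rate inside the block and always restore it afterwards."""
    previous = dependency.failure_rate
    dependency.failure_rate = failure_rate
    try:
        yield
    finally:
        dependency.failure_rate = previous


def success_rate(service, requests=1000):
    """Fraction of requests the service answers without raising."""
    ok = 0
    for _ in range(requests):
        try:
            service.handle_request()
            ok += 1
        except ConnectionError:
            pass
    return ok / requests


def run_experiment(use_fallback, failure_rate=0.5, threshold=0.99, seed=0):
    """Steady-state hypothesis (assumed): success rate stays >= threshold under fault."""
    dependency = FlakyDependency(random.Random(seed))
    service = Service(dependency, use_fallback)
    baseline = success_rate(service)          # steady state before the fault
    with inject_failures(dependency, failure_rate):
        under_fault = success_rate(service)   # behavior during the disruption
    recovered = success_rate(service)         # behavior after the fault is removed
    return {
        "baseline": baseline,
        "under_fault": under_fault,
        "recovered": recovered,
        "hypothesis_holds": under_fault >= threshold,
    }


if __name__ == "__main__":
    print("with fallback:   ", run_experiment(use_fallback=True))
    print("without fallback:", run_experiment(use_fallback=False))
```

In this toy setup the variant without a fallback fails the hypothesis, which is the kind of hidden weakness such an experiment is meant to surface. A real experiment would target actual infrastructure through a dedicated tool and would limit the blast radius of the injected fault.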

Keywords: Azure Chaos Studio, Cloud technologies, Chaos Engineering, Enterprise architecture, Site Reliability Engineering.

