Introduction to resilience engineering concepts for software engineers

anobil/resilience-for-software

Resilience engineering for software: a FAQ

What is resilience engineering?

Resilience engineering can be viewed as a set of high-leverage approaches to managing failures in complex socio-technical systems -- which makes it a domain relevant to many technology companies.

Failure in complex systems is itself a complex subject. The paper How Complex Systems Fail by Richard Cook is an excellent short introduction. For a higher fidelity definition, see John Allspaw’s talk Resilience Engineering: The What and How.

Complex systems have bounded resources, and you probably have limited time. Expect resilience engineering to offer the most leverage when failures (e.g., incidents) are substantially impacting the sustainability of your systems, the happiness of your engineers, your ability to meet business needs, and/or the happiness of your customers (framing borrowed from Honeycomb).

What is the relationship between resilience engineering and DevOps/SRE?

DevOps’ approach to safety focuses on mitigating the impact of known modes of failure -- “known unknowns” like bad deploys, host failures, etc. Resilience engineering is concerned with the ability to adapt to unknown unknowns -- for example, how did your organization respond to the decades-old latent flaw that became Spectre/Meltdown?

For further insight, read the preface and conclusion chapters (more, of course, if you're inspired) of Accelerate by Nicole Forsgren PhD, Jez Humble, and Gene Kim, along with Sustainable Operations in Complex Systems with Production Excellence by Liz Fong-Jones, and compare the perspectives there with those of Cook and Allspaw expressed in the resources linked above.

I’m intrigued, what are my next steps?

For most software organizations, the low-hanging fruit of resilience engineering will be learning more from incidents by improving the postmortem process. Etsy has an excellent guide.

Beyond that, if you want to learn more about the domain and enjoy reading academic papers, see Lorin Hochstein’s paper-centric introduction. If you prefer conference talks, John Allspaw has curated a YouTube playlist. Nora Jones also runs a Slack community focused on resilience engineering, Learning From Incidents in Software.

What should I do if I have other questions?

Your options include:

  1. Opening an issue on this repo
  2. Reaching out to Lorin Hochstein, Jacob Scott, or others in the resilience engineering community on Twitter
  3. Contacting Allspaw, Cook, and Woods’ consultancy, Adaptive Capacity Labs, if you want to work with subject matter experts in a professional/contractual setting
