
Failures and Fixes: A Study of Software System Incident Response

Introduction

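By analyzing interviews and publicly available incident response documents, the study derives a set of observations about how incidents occur, how they are detected, how they are diagnosed, and how they are resolved.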

Thinking and Conclusion

How Incidents Occur

Observation 1: Test cases often fail to detect defects that lead to incidents only when (possibly rare) combinations of events or system states coincide
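
To make Observation 1 concrete, here is a minimal sketch of a defect that only appears when several states coincide. The handler, its three flags, and the "stale" result are all hypothetical and invented for illustration; they do not come from the study. Tests that vary one condition at a time pass, while enumerating every combination exposes the failing case.

```python
from itertools import product


def handle_request(cache_warm: bool, replica_lagging: bool, retry: bool) -> str:
    """Hypothetical handler whose defect needs three conditions at once."""
    if not cache_warm and replica_lagging and retry:
        # Defect: on a cache miss, a retry reads from a lagging replica.
        return "stale"
    return "ok"


def failing_combinations():
    """Enumerate all 2^3 combinations of the boolean states and keep the failures."""
    return [
        combo
        for combo in product([False, True], repeat=3)
        if handle_request(*combo) != "ok"
    ]


# Tests that flip one flag at a time from a healthy baseline all pass.
baseline = (True, False, False)
one_at_a_time = [
    tuple((not v) if i == j else v for j, v in enumerate(baseline))
    for i in range(len(baseline))
]
assert all(handle_request(*combo) == "ok" for combo in one_at_a_time)

# Only the exhaustive enumeration finds the single failing combination.
print(failing_combinations())
```

Exhaustive enumeration is only feasible for a small number of states; for larger spaces, pairwise or property-based testing is a common compromise, though neither guarantees that rare combinations are covered.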

Observation 2: Testing environments and other preproduction environments often do not capture all aspects of the production environment.

Observation 3: When scaling limits are not well known, tested, or monitored for, they are discovered when they are exceeded.
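
One way to act on Observation 3 is to record known limits explicitly and alert on remaining headroom before a limit is hit. The sketch below assumes a hypothetical `Limit` record and an 80% warning ratio; the resource names and numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Limit:
    """A known scaling limit and its current usage (hypothetical record)."""
    name: str
    current: float
    maximum: float


def limits_near_exhaustion(limits: List[Limit], warn_ratio: float = 0.8) -> List[str]:
    """Return the names of limits whose usage has reached warn_ratio of the maximum."""
    return [
        limit.name
        for limit in limits
        if limit.maximum > 0 and limit.current / limit.maximum >= warn_ratio
    ]


limits = [
    Limit("db_connections", current=450, maximum=500),
    Limit("open_file_descriptors", current=1200, maximum=65536),
]
print(limits_near_exhaustion(limits))
```

This only helps for limits that someone has already identified; the observation is precisely that unknown limits tend to be found the hard way.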

Observation 4: Configuration changes are just as risky as code changes, but are often not tested and deployed with the same care
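
Observation 4 suggests treating configuration like code, for example by validating it in the same pipeline that runs unit tests. The following is a minimal sketch under that assumption; the keys `timeout_seconds` and `max_retries` and their valid ranges are hypothetical.

```python
from typing import Any, Dict, List


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the config passed."""
    errors = []

    timeout = cfg.get("timeout_seconds")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)) or timeout <= 0:
        errors.append("timeout_seconds must be a positive number")

    retries = cfg.get("max_retries")
    if isinstance(retries, bool) or not isinstance(retries, int) or not 0 <= retries <= 10:
        errors.append("max_retries must be an integer between 0 and 10")

    return errors


good = {"timeout_seconds": 2.5, "max_retries": 3}
bad = {"timeout_seconds": 0, "max_retries": "3"}
print(validate_config(good))
print(validate_config(bad))
```

Schema validation catches malformed values but not semantically wrong ones, so a staged rollout for configuration changes is a natural complement.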

Detection

Observation 5: Generic catch-all monitoring and notification are important but tend to be trailing metrics, leading to late detection.

Observation 6: Pre-determined, threshold-based detection is fragile and incomplete.
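
The fragility in Observation 6 can be illustrated by comparing a fixed threshold with a check against recent history. The baseline check below is a swapped-in illustration, not a technique prescribed by the study, and the threshold of 500 and the sample latencies are assumed values.

```python
from statistics import mean, stdev
from typing import List


def static_alert(value: float, threshold: float = 500.0) -> bool:
    """Fire only when the value crosses a pre-determined threshold."""
    return value > threshold


def baseline_alert(history: List[float], value: float, k: float = 3.0) -> bool:
    """Fire when the value deviates more than k standard deviations from recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sigma = max(stdev(history), 1e-9)  # avoid a zero spread on flat history
    return abs(value - mu) > k * sigma


# Latency normally sits near 100 ms; a jump to 300 ms is clearly abnormal
# but still below the fixed 500 ms threshold.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
print(static_alert(300.0))
print(baseline_alert(history, 300.0))
```

A baseline check has its own failure modes, such as slowly drifting history masking a gradual regression, so it complements fixed thresholds rather than replacing them.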

Observation 7: Monitoring, notification and other support systems may not themselves be as well tested or monitored as primary systems
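
A common response to Observation 7 is to have the monitoring system emit a heartbeat that an independent checker watches, so that silence is itself treated as a signal. The sketch below assumes timestamps in seconds and a hypothetical 120-second silence limit.

```python
def monitor_is_stale(last_heartbeat_ts: float, now_ts: float,
                     max_silence_seconds: float = 120.0) -> bool:
    """Return True when the monitoring system has been silent for too long."""
    return now_ts - last_heartbeat_ts > max_silence_seconds


# The monitor last reported at t=1000; the checker runs at t=1060 and t=1200.
print(monitor_is_stale(1000.0, 1060.0))
print(monitor_is_stale(1000.0, 1200.0))
```

For this to help, the checker should run on infrastructure that does not share failure modes with the monitor it watches.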

Troubleshooting

Observation 9: Architectural complexity has an operational (including investigative) cost

Recovery

Observation 10: Addressing the original root cause is not always sufficient mitigation; the mitigation does not always cascade the way the failure did.

Observation 11: Changes made in the context of incident response have the potential to make issues worse (and are made with fewer precautions than typical).


Last update: February 13, 2023
Created: February 13, 2023