Cloud Incident Literature

Related work on incidents and failures in cloud systems, including microservices.

Study

  • How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service. Supriyo Ghosh, et al. SoCC'22 paper
  • Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. Shutian Luo, et al. SoCC'21 paper
  • What Can We Learn from Four Years of Data Center Hardware Failures? Guosai Wang, et al. DSN'17 paper
  • A Survey on Failure Analysis and Fault Injection in AI Systems. Guangba Yu, et al. paper
  • Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. Haryadi S. Gunawi, et al. SoCC'16 paper
  • What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Haryadi S. Gunawi, et al. SoCC'14 paper

Benchmarks

  • Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications. Vaastav Anand, et al. SOSP'23 paper
  • (DeathStarBench) An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. Yu Gan, et al. ASPLOS'19 paper
  • (TrainTicket) Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. Xiang Zhou, et al. IEEE Transactions on SE'18 paper
  • µSuite: A Benchmark Suite for Microservices. Akshitha Sriraman, et al. IISWC'18 paper

Monitoring & Tracing

  • Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis. Xiaofeng Guo, et al. ESEC/FSE'20 paper
  • CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. Xiao Yu, et al. ASPLOS'16 paper
  • Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. Jonathan Mace, et al. SOSP'15 paper
  • The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. Michael Chow, et al. OSDI'14 paper
  • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Benjamin H. Sigelman, et al. Technical report'10 paper
  • X-Trace: A Pervasive Network Tracing Framework. Rodrigo Fonseca, et al. NSDI'07 paper

Detection & Triage

  • (AnoFusion) Robust Multimodal Failure Detection for Microservice Systems. Chenyu Zhao, et al. KDD'23 paper
  • (MSTGAD) Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning for Microservice System. Jun Huang, et al. ASE'23 paper
  • DeepTraLog: Trace-Log Combined Microservice Anomaly Detection through Graph-based Deep Learning. Chenxi Zhang, et al. ICSE'22 paper
  • Fighting the Fog of War: Automated Incident Detection for Cloud Systems. Liqun Li, et al. ATC'21 paper
  • FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices. Haoran Qiu, et al. OSDI'20 paper
  • Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks. Ping Liu, et al. ISSRE'20 paper
  • Continuous Incident Triage for Large-Scale Online Service Systems. Junjie Chen, et al. ASE'19 paper
  • An Empirical Investigation of Incident Triage for Online Service Systems. Junjie Chen, et al. ICSE-SEIP'19 paper

Root Causing

  • Fault Diagnosis for Test Alarms in Microservices through Multi-source Data. Shenglin Zhang, et al. FSE(Industry)'24 paper
  • ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems. Guangba Yu, et al. FSE'24 paper
  • BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection. Luan Pham, et al. FSE'24 paper
  • A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization. Sungmin Kang, et al. FSE'24 paper
  • Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. Zhenhe Yao, et al. FSE'24 paper
  • Towards Better Graph Neural Network-Based Fault Localization through Enhanced Code Representation. Md Nakhla Rafi, et al. FSE'24 paper
  • Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Shenglin Zhang, et al. FSE'24 paper
  • Exploring LLM-based Agents for Root Cause Analysis. Devjeet Roy, et al. arXiv'24 paper
  • RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. Zefan Wang, et al. arXiv'23 paper
  • PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis. Shizhuo Dylan Zhang, et al. FSE(Industry)'24 paper
  • Assess and summarize: Improve outage understanding with large language models. Pengxiang Jin, et al. arXiv'23 paper
  • (RCACopilot) Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. Yinfang Chen, et al. EuroSys'24 paper
  • GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications. Gagan Somashekar, et al. WWW'24 paper
  • Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data. Guangba Yu, et al. FSE'23 paper
  • Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data. Cheryl Lee, et al. ICSE'23 paper
  • (DiagFusion) Robust Failure Diagnosis of Microservice System through Multimodal Data. Shenglin Zhang, et al. arXiv'23 paper
  • Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. Toufique Ahmed, et al. ICSE'23 paper
  • Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps. Amrita Saha, et al. arXiv'22 paper
  • Scalable Statistical Root Cause Analysis on App Telemetry. Vijayaraghavan Murali, et al. ICSE-SEIP'21 paper
  • Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices. Yu Gan, et al. ASPLOS'21 paper
  • Groot: An event-graph-based approach for root cause analysis in industrial settings. Hanzhang Wang, et al. ASE'21 paper
  • Practical Root Cause Localization for Microservice Systems via Trace Analysis. Zeyan Li, et al. IWQoS'21 paper
  • CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms. Yingying Zhang, et al. CIKM'21 paper
  • MicroHECL: high-efficient root cause localization in large-scale microservice systems. Dewei Liu, et al. ICSE-SEIP'21 paper
  • Predicting Node Failures in an Ultra-large-scale Cloud Computing Platform: an AIOps Solution. Yangguang Li, et al. ACM Transactions on Software Engineering and Methodology'20 paper
  • Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. Yu Gan, et al. ASPLOS'19 paper
  • Latent error prediction and fault localization for microservice applications by learning from system trace logs. Xiang Zhou, et al. ESEC/FSE'19 paper
  • Automated known problem diagnosis with event traces. Chun Yuan, et al. EuroSys'06 paper
  • Delta Debugging Microservice Systems. Xiang Zhou, et al. ASE'18 paper

Mitigation

  • How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems. Jiajun Jiang, et al. FSE'20 paper
  • AutoTSG: Learning and Synthesis for Incident Troubleshooting. Manish Shetty, et al. arXiv'22 paper

Fault Injection for Cloud

  • MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. Hongyang Chen, et al. TDSC'24 paper
  • Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay. Yuanliang Chen, et al. S&P'24 paper code
  • Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. Jiawei Tyler Gu, et al. SOSP'23 paper code
  • Phoenix: Detect and Locate Resilience Issues in Blockchain via Context-Sensitive Chaos. Fuchen Ma, et al. CCS'23 paper
  • Coverage Guided Fault Injection for Cloud Systems. Yu Gao, et al. ICSE'23 paper code
  • Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker. Yinfang Chen, et al. NSDI'23 paper code
  • Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems. Lilia Tang, et al. EuroSys'23 paper code
  • Automatic Reliability Testing for Cluster Management Controllers. Xudong Sun, et al. OSDI'22 paper code
  • IBIR: Bug Report driven Fault Injection. Ahmed Khanfir, et al. FSE'22 paper code
  • SlowCoach: Mutating Code to Simulate Performance Bugs. Yiqun Chen, et al. ISSRE'22 paper
  • Understanding a Program’s Resiliency Through Error Propagation. Zhimin Li, et al. PPoPP'21 paper
  • CoFI: Consistency-Guided Fault Injection for Cloud Systems. Haicheng Chen, et al. ASE'20 paper code
  • How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-level Fault Injection Method. Yong Yang, et al. ISSRE'20 paper
  • ProFIPy: Programmable Software Fault Injection as-a-Service. Roberto Natella, et al. DSN'20 paper
  • Fitness-guided Resilience Testing of Microservice-based Applications. Zhenyue Long, et al. ICWS'20 paper
  • Co-evolving Tracing and Fault Injection with Box of Pain. Daniel Bittman, et al. HotCloud'19 paper
  • Automating Failure Testing Research at Internet Scale. Peter Alvaro, et al. SoCC'16 paper