Now that we've committed to reading a few papers a week from this list, I figured we could keep a checklist of the papers we've read.
Study
How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service. Supriyo Ghosh, et al. SoCC'22 paper
Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. Shutian Luo, et al. SoCC'21 paper
What Can We Learn from Four Years of Data Center Hardware Failures? Guosai Wang, et al. DSN'17 paper
A Survey on Failure Analysis and Fault Injection in AI Systems. Guangba Yu, et al. paper
Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. Haryadi S. Gunawi, et al. SoCC'16 paper
What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Haryadi S. Gunawi, et al. SoCC'14 paper
Benchmarks
Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications. Vaastav Anand, et al. SOSP'23 paper
(DeathStarBench) An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. Yu Gan, et al. ASPLOS'19 paper
(TrainTicket) Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. Xiang Zhou, et al. IEEE Transactions on SE'18 paper
µSuite: A Benchmark Suite for Microservices. Akshitha Sriraman, et al. IISWC'18 paper
Monitoring & Tracing
Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis. Xiaofeng Guo, et al. ESEC/FSE'20 paper
CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. Xiao Yu, et al. ASPLOS'16 paper
Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. Jonathan Mace, et al. SOSP'15 paper
The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. Michael Chow, et al. OSDI'14 paper
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Benjamin H. Sigelman, et al. Technical report'10 paper
X-Trace: A Pervasive Network Tracing Framework. Rodrigo Fonseca, et al. NSDI'07 paper
Detection & Triage
(AnoFusion) Robust Multimodal Failure Detection for Microservice Systems. Chenyu Zhao, et al. KDD'23 paper
(MSTGAD) Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning for Microservice System. Jun Huang, et al. ASE'23 paper
DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning. Chenxi Zhang, et al. ICSE'22 paper
Fighting the Fog of War: Automated Incident Detection for Cloud Systems. Liqun Li, et al. ATC'21 paper
FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices. Haoran Qiu, et al. OSDI'20 paper
Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks. Ping Liu, et al. ISSRE'20 paper
Continuous Incident Triage for Large-Scale Online Service Systems. Junjie Chen, et al. ASE'19 paper
An Empirical Investigation of Incident Triage for Online Service Systems. Junjie Chen, et al. ICSE-SEIP'19 paper
Root Causing
Fault Diagnosis for Test Alarms in Microservices through Multi-source Data. Shenglin Zhang, et al. FSE(Industry)'24 paper
ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems. Guangba Yu, et al. FSE'24 paper
BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection. Luan Pham, et al. FSE'24 paper
A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization. Sungmin Kang, et al. FSE'24 paper
Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. Zhenhe Yao, et al. FSE'24 paper
Towards Better Graph Neural Network-Based Fault Localization through Enhanced Code Representation. Md Nakhla Rafi, et al. FSE'24 paper
Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Shenglin Zhang, et al. FSE'24 paper
Exploring LLM-based Agents for Root Cause Analysis. Devjeet Roy, et al. arXiv'24 paper
RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. Zefan Wang, et al. arXiv'23 paper
PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis. Shizhuo Dylan Zhang, et al. FSE(Industry)'24 paper
Assess and summarize: Improve outage understanding with large language models. Pengxiang Jin, et al. arXiv'23 paper
(RCACopilot) Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. Yinfang Chen, et al. EuroSys'24 paper
GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications. Gagan Somashekar, et al. WWW'24 paper
Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data. Guangba Yu, et al. FSE'23 paper
Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data. Cheryl Lee, et al. ICSE'23 paper
(DiagFusion) Robust Failure Diagnosis of Microservice System through Multimodal Data. Shenglin Zhang, et al. arXiv'23 paper
Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. Toufique Ahmed, et al. ICSE'23 paper
Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps. Amrita Saha, et al. arXiv'22 paper
Scalable Statistical Root Cause Analysis on App Telemetry. Vijayaraghavan Murali, et al. ICSE-SEIP'21 paper
Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices. Yu Gan, et al. ASPLOS'21 paper
Groot: An event-graph-based approach for root cause analysis in industrial settings. Hanzhang Wang, et al. ASE'21 paper
Practical Root Cause Localization for Microservice Systems via Trace Analysis. Zeyan Li, et al. IWQOS'21 paper
CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms. Yingying Zhang, et al. CIKM'21 paper
MicroHECL: high-efficient root cause localization in large-scale microservice systems. Dewei Liu, et al. ICSE-SEIP'21 paper
Predicting Node Failures in an Ultra-large-scale Cloud Computing Platform: an AIOps Solution. Yangguang Li, et al. ACM Transactions on Software Engineering and Methodology'20 paper
Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. Yu Gan, et al. ASPLOS'19 paper
Latent error prediction and fault localization for microservice applications by learning from system trace logs. Xiang Zhou, et al. ESEC/FSE'19 paper
Automated known problem diagnosis with event traces. Chun Yuan, et al. EuroSys'06 paper
Delta Debugging Microservice Systems. Xiang Zhou, et al. ASE'18 paper
Mitigation
How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems. Jiajun Jiang, et al. FSE'20 paper
AutoTSG: Learning and Synthesis for Incident Troubleshooting. Manish Shetty, et al. arXiv'22 paper
Fault Injection for Cloud
MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. Hongyang Chen, et al. TDSC'24 paper
Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay. Yuanliang Chen, et al. S&P'24 paper code
Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. Jiawei Tyler Gu, et al. SOSP'23 paper code
Phoenix: Detect and Locate Resilience Issues in Blockchain via Context-Sensitive Chaos. Fuchen Ma, et al. CCS'23 paper
Coverage Guided Fault Injection for Cloud Systems. Yu Gao, et al. ICSE'23 paper code
Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker. Yinfang Chen, et al. NSDI'23 paper code
Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems. Lilia Tang, et al. EuroSys'23 paper code
Automatic Reliability Testing for Cluster Management Controllers. Xudong Sun, et al. OSDI'22 paper code
IBIR: Bug Report driven Fault Injection. Ahmed Khanfir, et al. FSE'22 paper code
SlowCoach: Mutating Code to Simulate Performance Bugs. Yiqun Chen, et al. ISSRE'22 paper
Understanding a Program’s Resiliency Through Error Propagation. Zhimin Li, et al. PPoPP'21 paper
CoFI: Consistency-Guided Fault Injection for Cloud Systems. Haicheng Chen, et al. ASE'20 paper code
How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-level Fault Injection Method. Yong Yang, et al. ISSRE'20 paper
ProFIPy: Programmable Software Fault Injection as-a-Service. Roberto Natella, et al. DSN'20 paper
Fitness-guided Resilience Testing of Microservice-based Applications. Zhenyue Long, et al. ICWS'20 paper
Co-evolving Tracing and Fault Injection with Box of Pain. Daniel Bittman, et al. HotCloud'19 paper
Automating Failure Testing Research at Internet Scale. Peter Alvaro, et al. SoCC'16 paper
Upgrade failures are very common and potentially worth investigating and adding to the benchmark. I found this paper from SOSP which provides a good study.
AIOps Reading list