AIOps Project Reading list #2

Open
3 of 72 tasks
JacksonArthurClark opened this issue Dec 16, 2024 · 1 comment

JacksonArthurClark commented Dec 16, 2024

AIOps Reading list

Now that we've committed to reading a few papers a week from this list, I figured we could keep a checklist of the papers we've read.

Study

  • How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service. Supriyo Ghosh, et al. SoCC'22 paper
  • Characterizing Microservice Dependency and Performance: Alibaba Trace Analysis. Shutian Luo, et al. SoCC'21 paper
  • What Can We Learn from Four Years of Data Center Hardware Failures? Guosai Wang, et al. DSN'17 paper
  • A Survey on Failure Analysis and Fault Injection in AI Systems. Guangba Yu, et al. paper
  • Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. Haryadi S. Gunawi, et al. SoCC'16 paper
  • What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Haryadi S. Gunawi, et al. SoCC'14 paper

Benchmarks

  • Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications. Vaastav Anand, et al. SOSP'23 paper
  • (DeathStarBench) An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. Yu Gan, et al. ASPLOS'19 paper
  • (TrainTicket) Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study. Xiang Zhou, et al. IEEE Transactions on Software Engineering'18 paper
  • µSuite: A Benchmark Suite for Microservices. Akshitha Sriraman, et al. IISWC'18 paper

Monitoring & Tracing

  • Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis. Xiaofeng Guo, et al. ESEC/FSE'20 paper
  • CloudSeer: Workflow Monitoring of Cloud Infrastructures via Interleaved Logs. Xiao Yu, et al. ASPLOS'16 paper
  • Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems. Jonathan Mace, et al. SOSP'15 paper
  • The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. Michael Chow, et al. OSDI'14 paper
  • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Benjamin H. Sigelman, et al. Technical report'10 paper
  • X-Trace: A Pervasive Network Tracing Framework. Rodrigo Fonseca, et al. NSDI'07 paper

Detection & Triage

  • (AnoFusion) Robust Multimodal Failure Detection for Microservice Systems. Chenyu Zhao, et al. KDD'23 paper
  • (MSTGAD) Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning for Microservice System. Jun Huang, et al. ASE'23 paper
  • DeepTraLog: Trace-log combined microservice anomaly detection through graph-based deep learning. Chenxi Zhang, et al. ICSE'22 paper
  • Fighting the Fog of War: Automated Incident Detection for Cloud Systems. Liqun Li, et al. ATC'21 paper
  • FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices. Haoran Qiu, et al. OSDI'20 paper
  • Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks. Ping Liu, et al. ISSRE'20 paper
  • Continuous Incident Triage for Large-Scale Online Service Systems. Junjie Chen, et al. ASE'19 paper
  • An Empirical Investigation of Incident Triage for Online Service Systems. Junjie Chen, et al. ICSE-SEIP'19 paper

Root Causing

  • Fault Diagnosis for Test Alarms in Microservices through Multi-source Data. Shenglin Zhang, et al. FSE(Industry)'24 paper
  • ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems. Guangba Yu, et al. FSE'24 paper
  • BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection. Luan Pham, et al. FSE'24 paper
  • A Quantitative and Qualitative Evaluation of LLM-Based Explainable Fault Localization. Sungmin Kang, et al. FSE'24 paper
  • Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. Zhenhe Yao, et al. FSE'24 paper
  • Towards Better Graph Neural Network-Based Fault Localization through Enhanced Code Representation. Md Nakhla Rafi, et al. FSE'24 paper
  • Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Shenglin Zhang, et al. FSE'24 paper
  • Exploring LLM-based Agents for Root Cause Analysis. Devjeet Roy, et al. arXiv'24 paper
  • RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. Zefan Wang, et al. arXiv'23 paper
  • PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis. Shizhuo Dylan Zhang, et al. FSE(Industry)'24 paper
  • Assess and summarize: Improve outage understanding with large language models. Pengxiang Jin, et al. arXiv'23 paper
  • (RCACopilot) Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. Yinfang Chen, et al. EuroSys'24 paper
  • GAMMA: Graph Neural Network-Based Multi-Bottleneck Localization for Microservices Applications. Gagan Somashekar, et al. WWW'24 paper
  • Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data. Guangba Yu, et al. FSE'23 paper
  • Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data. Cheryl Lee, et al. ICSE'23 paper
  • (DiagFusion) Robust Failure Diagnosis of Microservice System through Multimodal Data. Shenglin Zhang, et al. arXiv'23 paper
  • Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. Toufique Ahmed, et al. ICSE'23 paper
  • Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps. Amrita Saha, et al. arXiv'22 paper
  • Scalable Statistical Root Cause Analysis on App Telemetry. Vijayaraghavan Murali, et al. ICSE-SEIP'21 paper
  • Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices. Yu Gan, et al. ASPLOS'21 paper
  • Groot: An event-graph-based approach for root cause analysis in industrial settings. Hanzhang Wang, et al. ASE'21 paper
  • Practical Root Cause Localization for Microservice Systems via Trace Analysis. Zeyan Li, et al. IWQOS'21 paper
  • CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms. Yingying Zhang, et al. CIKM'21 paper
  • MicroHECL: high-efficient root cause localization in large-scale microservice systems. Dewei Liu, et al. ICSE-SEIP'21 paper
  • Predicting Node Failures in an Ultra-large-scale Cloud Computing Platform: an AIOps Solution. Yangguang Li, et al. ACM Transactions on Software Engineering and Methodology'20 paper
  • Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. Yu Gan, et al. ASPLOS'19 paper
  • Latent error prediction and fault localization for microservice applications by learning from system trace logs. Xiang Zhou, et al. ESEC/FSE'19 paper
  • Automated known problem diagnosis with event traces. Chun Yuan, et al. EuroSys'06 paper
  • Delta Debugging Microservice Systems. Xiang Zhou, et al. ASE'18 paper

Mitigation

  • How to Mitigate the Incident? An Effective Troubleshooting Guide Recommendation Technique for Online Service Systems. Jiajun Jiang, et al. FSE'20 paper
  • AutoTSG: Learning and Synthesis for Incident Troubleshooting. Manish Shetty, et al. arXiv'22 paper

Fault Injection for Cloud

  • MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. Hongyang Chen, et al. TDSC'24 paper
  • Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay. Yuanliang Chen, et al. S&P'24 paper code
  • Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. Jiawei Tyler Gu, et al. SOSP'23 paper code
  • Phoenix: Detect and Locate Resilience Issues in Blockchain via Context-Sensitive Chaos. Fuchen Ma, et al. CCS'23 paper
  • Coverage Guided Fault Injection for Cloud Systems. Yu Gao, et al. ICSE'23 paper code
  • Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker. Yinfang Chen, et al. NSDI'23 paper code
  • Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems. Lilia Tang, et al. EuroSys'23 paper code
  • Automatic Reliability Testing for Cluster Management Controllers. Xudong Sun, et al. OSDI'22 paper code
  • IBIR: Bug Report driven Fault Injection. Ahmed Khanfir, et al. FSE'22 paper code
  • SlowCoach: Mutating Code to Simulate Performance Bugs. Yiqun Chen, et al. ISSRE'22 paper
  • Understanding a Program’s Resiliency Through Error Propagation. Zhimin Li, et al. PPoPP'21 paper
  • CoFI: Consistency-Guided Fault Injection for Cloud Systems. Haicheng Chen, et al. ASE'20 paper code
  • How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-level Fault Injection Method. Yong Yang, et al. ISSRE'20 paper
  • ProFIPy: Programmable Software Fault Injection as-a-Service. Roberto Natella, et al. DSN'20 paper
  • Fitness-guided Resilience Testing of Microservice-based Applications. Zhenyue Long, et al. ICWS'20 paper
  • Co-evolving Tracing and Fault Injection with Box of Pain. Daniel Bittman, et al. HotCloud'19 paper
  • Automating Failure Testing Research at Internet Scale. Peter Alvaro, et al. SoCC'16 paper

JacksonArthurClark (Author) commented

Upgrade failures are very common, so they may be worth investigating and adding to the benchmark. I found this paper from SOSP, which provides a good study.
