This repo contains the source code of (1) Xinda, an automated slow-fault testing pipeline, and (2) ADR, a lightweight runtime slow-fault detection library. The following sections are for building and running Xinda. More information about ADR can be found here.
Xinda is designed to be a flexible and extensible slow-fault testing pipeline for distributed systems. It automates the process of initializing a distributed cluster, running cloud benchmarks, injecting flexible and fine-grained slow faults, collecting runtime logs and stats, and analyzing the results. It can be extended to support new fault injection methods, benchmarks, and distributed systems.
(Recommended) The easiest way to configure and deploy Xinda is to use ansible-playbook on CloudLab c220g2 nodes. A more detailed guideline is available here.
To build and install Xinda manually
- OS: Ubuntu 18.04
- Hardware:
  - An SSD is required to mount the docker directory (/var/lib/docker by default).
  - As a reference, our evaluation runs on CloudLab c220g2 node type, which has two Intel E5-2660 v3 10-core CPUs at 2.60 GHz, 160GB ECC DDR4 2133 MHz memory, and a 480GB Intel DC SATA SSD plus two 1.2TB 10K RPM 6G SAS HDDs for storage.
- Software:
  - xinda-software
  - Python (==3.6.13). For data processing:
    - pandas (==2.2.2)
    - tqdm (==4.66.2)
  - Blockade (==0.4.0, for injecting network-related slow faults)
  - CharybdeFS for injecting filesystem-related slow faults:
Applying Xinda to a system involves two steps: (1) configuring Xinda and running the test experiment using main.py; (2) analyzing the test results using data-analysis/process.py. We list the detailed steps of using Xinda here.
Let's start by running a simple Xinda test on HBase as a minimal working example (MWE). More MWEs can be found here. We will inject a 1ms network delay to the regionserver for 60s:
python3 main.py \
--sys_name hbase \
--log_root_dir $HOME/workdir/data/example \
--data_dir sample_test \
--fault_type nw \
--fault_location hbase-regionserver \
--fault_duration 60 \
--fault_severity slow-1ms \
--fault_start_time 60 \
--bench_exec_time 150 \
--benchmark ycsb \
--ycsb_wkl mixed \
--iter 1
- Xinda will first set up an HBase cluster, wait until initialization finishes, and load/run a YCSB benchmark (--benchmark and --ycsb_wkl) for 150s (--bench_exec_time).
- After 60s (--fault_start_time) of the benchmark, Xinda will inject the preset slow fault (--fault_type, --fault_severity, --fault_location) and then clear it after 60s (--fault_duration).
- After the benchmark ends, Xinda will save system logs and runtime stats to $HOME/workdir/data/example/hbase/sample_test (--log_root_dir and --data_dir). Finally, Xinda will safely shut down the cluster and clean up.
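The timing described above can be checked with a small sketch before launching a test. This is not part of Xinda: the function names `fault_timeline` and `result_dir` are hypothetical, the phase labels are made up for illustration, and the check that the fault window fits inside the benchmark is this sketch's own sanity check rather than documented Xinda behavior. The directory layout (log root, then system name, then data directory) is inferred from the example path above.

```python
import posixpath


def fault_timeline(fault_start_time, fault_duration, bench_exec_time):
    """Return (phase, start_s, end_s) tuples, in seconds since benchmark start.

    Hypothetical helper: mirrors --fault_start_time, --fault_duration and
    --bench_exec_time from the example command.
    """
    fault_end = fault_start_time + fault_duration
    # Sanity check of this sketch only (not documented Xinda behavior).
    if fault_end > bench_exec_time:
        raise ValueError("fault window extends past the end of the benchmark")
    return [
        ("before_fault", 0, fault_start_time),
        ("fault_active", fault_start_time, fault_end),
        ("after_fault", fault_end, bench_exec_time),
    ]


def result_dir(log_root_dir, sys_name, data_dir):
    """Join the pieces the way the example output path suggests (assumed layout)."""
    return posixpath.join(log_root_dir, sys_name, data_dir)


timeline = fault_timeline(fault_start_time=60, fault_duration=60, bench_exec_time=150)
for phase, start, end in timeline:
    print(f"{phase}: {start}s to {end}s")

example_dir = result_dir("$HOME/workdir/data/example", "hbase", "sample_test")
print(example_dir)
```

For the example command, this gives 60s of fault-free execution, 60s with the fault active, and 30s after the fault is cleared.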
Now, let's analyze the test results (--data_dir) using process.py:
python3 $HOME/workdir/xinda/data-analysis/process.py \
--data_dir $HOME/workdir/data/example \
--output_dir $HOME/workdir/parsed_results
The parsed results will be stored in $HOME/workdir/parsed_results (--output_dir).
Running a single Xinda test usually takes minutes (depending on system init time and --bench_exec_time). In many cases, we would like to test with hundreds or even thousands of configurations. For example, say we want to test a system with 10 fault severity levels, 2 fault locations, and 3 benchmark workloads; each test repeats 10 times for generalizability. This will result in 10 × 2 × 3 × 10 = 600 tests.
Thus, we also provide scripts that can generate and execute batched tests in parallel on a cluster of test nodes. A detailed tutorial can be found here.
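To get a feel for how quickly such a test matrix grows, the example above can be enumerated in a few lines. This is only an illustrative sketch, not Xinda's batch-test scripts. Apart from `slow-1ms`, `hbase-regionserver`, and `mixed`, which appear in the MWE, every value below is a placeholder.

```python
import itertools

# Only "slow-1ms", "hbase-regionserver" and "mixed" come from the MWE above;
# all other names are placeholders, not real Xinda option values.
severities = ["slow-1ms"] + [f"severity-{i}" for i in range(2, 11)]  # 10 levels
locations = ["hbase-regionserver", "location-2"]                     # 2 locations
workloads = ["mixed", "workload-2", "workload-3"]                    # 3 workloads
repeats = 10                                                         # runs per config

# One tuple per individual test run: (severity, location, workload, repetition).
configs = list(
    itertools.product(severities, locations, workloads, range(1, repeats + 1))
)

print(len(configs))  # 10 * 2 * 3 * 10 = 600
print(configs[0])
```

Each tuple corresponds to one invocation of main.py with its own settings, which is why running them in parallel across several test nodes pays off.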
Xinda is modular and can be extended to incorporate new fault injection methods, run new benchmarks, or test new distributed systems. We provide a detailed guide on how to extend Xinda here.
Thank you for your interest in Xinda and ADR! We greatly value your feedback and contributions. If you would like to report a bug, suggest an enhancement, or ask any questions, please submit a GitHub Issue. For code contributions, feel free to open a Pull Request.
If you find Xinda and ADR useful, please consider citing our paper:
@inproceedings{SlowFaultStudy2025NSDI,
author = {Lu, Ruiming and Lu, Yunchi and Jiang, Yuxuan and Xue, Guangtao and Huang, Peng},
title = {One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems},
booktitle = {Proceedings of the 22nd USENIX Symposium on Networked Systems Design and Implementation},
series = {NSDI '25},
month = {April},
year = {2025},
location = {Philadelphia, PA, USA},
}