This is the code repository of DBAIOps, a novel database operation and maintenance (O&M) system that integrates expert experience to achieve practical anomaly detection, diagnosis, and recovery. Specifically, it conducts database O&M based on the following techniques:
- (1) O&M Maintenance Graph: a graph-based experience model to represent various possible diagnosis paths within O&M experience;
- (2) Correlation-Aware Anomaly Model: capture implicit correlations across metrics and real-world anomalies and initialize knowledge graph exploration during online diagnosis;
- (3) O&M Path Exploration: a two-stage graph evolution mechanism that adaptively explores possible diagnosis paths for the input anomaly;
- (4) Reasoning LLM Based Diagnosis: prompt reasoning LLM to reason over these diagnosis paths and generate diagnosis reports with specific recovery solutions.
Evaluation over four mainstream database systems demonstrates that DBAIOps outperforms state-of-the-art baselines, 47.85% higher in diagnosis accuracy and report interpretability. The structure is as follows:
Note that the full content about the case study of the diagnosis reports can be found at the Appedinx_Diagnosis_Report.
├── bin # DBAIOps installation and deployment scripts
├── webserver # Frontend webserver configuration
├── fstaskpkg # Task platform configurations
├── knowl # Knowledge graph base scripts
├── colscript # Python3 collection scripts (health & cib metrics)
├── pgconf # PostgreSQL initialization config for DSmart metadata
└── phantomjsconf # Frontend webserver JS plugin configurations
| Purpose | Hardware Specs | Supported OS | IP Address | Qty | Notes |
|---|---|---|---|---|---|
| Database, Task Platform, Web Server | 16C/32GB/600GB | RedHat 7.2+ (YUM configured) CentOS 7.2+ (YUM configured) SUSE 12 SP4 (zypper configured) Kylin V10 SP1/SP2/SP3 |
1 | 1 | No restrictions |
| Purpose | Hardware Specs | Supported OS | IP Address | Qty | Notes |
|---|---|---|---|---|---|
| Database Server | 16C/64GB/1TB | RedHat 7.2+ CentOS 7.2+ SUSE 12 SP4 |
3 | 1 | Physical server required |
| Web Server & Task Platform | 16C/64GB/300GB | RedHat 7.2+ CentOS 7.2+ SUSE 12 SP4 |
3 | 3 | No restrictions |
Notes:
- DBAIOps database server can be installed in any directory (only disk space required);
- Distributed installation requires root SSH passwordless login between DBAIOps servers during setup (can be disabled post-installation). Single-node installation doesn't require this.
Between Ops machine and DBAIOps servers:
- Required ports: 22, 18081, 18090, 28080
Between DBAIOps servers and monitored databases:
- Oracle: 11521, 1521
- MySQL: 3306
- PostgreSQL: 5432
- DM8: 5236
Upload installation package to DBAIOps server
Configure YUM repo:
mount /dev/sr0 /mnt
vi /etc/yum.repos.d/local.repo
Install dependencies:
cd /usr/software/bin
./DBAIOps-system-package.sh -install
cd /usr/software/bin
./DBAIOps.sh -installSelect deployment type [Default is single-node]:
- Single-node deployment
- Distributed deployment (requires manual role.cfg configuration)
Choose OS type [Select according to your OS]:
- RedHat
- CentOS
- Installation typically takes 1-1.5 hours depending on system performance
- RPM packages will be automatically installed
After successful installation, restart DBAIOps services:
/usr/software/bin/DBAIOps.sh -restart| Command | Description |
|---|---|
./DBAIOps.sh -start |
Start all DBAIOps services |
./DBAIOps.sh -stop |
Stop all DBAIOps services |
./DBAIOps.sh -status |
Check service status |
./DBAIOps.sh -restart |
Restart all services |
Web Interface:
http://<SERVER_IP>:18081/DBAIOps
Default Credentials:
Username: admin
Password: admin@123
Security Notice:
- Change the default password immediately after first login;
- Recommended to use a strong password following your organization's security policy;
- For production environments, consider implementing additional security measures (IP restrictions, 2FA, etc.).
Anomaly: LOG SYNCHRONIZATION DELAY, which detects slowdowns caused by log-writing operations. When a transaction commits or rolls back, forcing a session to wait for the log writer to flush redo logs, the system can degrade significantly if I/O capacity is insufficient.
Abnormal Metrics List:
| Metric ID | Metric Name | Min | Max | Avg |
|---|---|---|---|---|
| 2184301 | log file sync average wait time | 1.08 ms | 15.2 ms | 6.0 ms |
| 2184305 | log file parallel write average wait time | 0.78 ms | 7.09 ms | 3.0 ms |
| 000014 | Mem unavailable | 72.82 | 73.07 | 73.0 |
Root Cause:
(1) Insufficient I/O Performance of REDO Log Storage:
- Evidence Chains:
- Metric 2184301 (log file sync): max = 15.2 ms, avg = 6.0 ms (abnormal)
- Metric 2184305 (log file parallel write): max = 7.09 ms, avg = 3.0 ms (abnormal)
- Anomaly Analysis:
- According to the formula: log_file_sync / log_file_parallel_write = 6.0 / 3.0 = 2
- Ratio in range (1.5, 2] ⇒ Storage I/O performance is a primary factor
(2) Intermittent I/O Pressure Spikes During Log Writing:
- Evidence Chains:
- At 06:00, log file sync spiked to 15.2 ms (vs. hourly avg 4.54 ms)
- log file parallel write peaked at 7.09 ms (high for mechanical disks)
- OS I/O latency = 0.47 ms (normal)
- Anomaly Analysis:
- Storage likely under transient load
✅ The predefined hypothesis of the experiment is that the bottleneck lies in I/O storage.
✅ Both root causes align with this hypothesis (first evaluation criterion satisfied)✅ Evidence chains are logically consistent with Oracle/OS theory (second criterion)
✅ All data comes from actual experiments with no hallucinations (third criterion)
🎯 Final Accuracy Assessment: 100%