- This is a custom Prometheus and CloudWatch exporter that exposes Litmus Chaos metrics. To learn more about Litmus Chaos experiments and the Litmus Chaos Operator, see the Litmus Docs.
- It is typically deployed along with the chaos-operator deployment, which in turn is associated with all chaosresults in the cluster.
- Two types of metrics are exposed:
  - AggregateMetrics: These metrics are derived from all the chaosresults present inside `WATCH_NAMESPACE`. If `WATCH_NAMESPACE` is not defined, the metrics are derived from all namespaces. It exposes the total_passed_experiment, total_failed_experiment, total_awaited_experiment, experiment_run_count, and experiment_installed_count metrics.
  - ExperimentScoped: Individual experiment run status. It exposes the passed_experiment, failed_experiment, awaited_experiment, probe_success_percentage, startTime, endTime, totalDuration, and chaosInjectTime metrics.
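The scoping rule above can be sketched as follows. This is only an illustration, not the exporter's code: the helper names `metric_scope` and `aggregate_metric_name` are hypothetical, and treating an empty `WATCH_NAMESPACE` the same as an undefined one is an assumption.

```python
import os


def metric_scope(env=None):
    """Return ("namespace", ns) when WATCH_NAMESPACE is set, else ("cluster", None).

    Assumption: an empty or whitespace-only value counts as "not defined".
    """
    env = os.environ if env is None else env
    namespace = env.get("WATCH_NAMESPACE", "").strip()
    if namespace:
        return ("namespace", namespace)
    return ("cluster", None)


def aggregate_metric_name(suffix, env=None):
    """Build an aggregate metric name such as litmuschaos_cluster_scoped_passed_experiments."""
    scope, _ = metric_scope(env)
    return f"litmuschaos_{scope}_scoped_{suffix}"


if __name__ == "__main__":
    print(aggregate_metric_name("passed_experiments", {"WATCH_NAMESPACE": "litmus"}))
    print(aggregate_metric_name("passed_experiments", {}))
```

Passing a plain dictionary instead of the real environment keeps the sketch easy to try without changing any process settings.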
Metrics Name | litmuschaos_passed_experiments |
---|---|
Description | It contains the total number of passed experiments |
Source | ChaosResult |
Sample Metrics | litmuschaos_passed_experiments{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1 |
Notes | The litmuschaos_passed_experiments contains the cumulative sum of passed runs for the given ChaosResult. |
Metrics Name | litmuschaos_failed_experiments |
---|---|
Description | It contains the total number of failed experiments |
Source | ChaosResult |
Sample Metrics | litmuschaos_failed_experiments{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 0 |
Notes | The litmuschaos_failed_experiments contains the cumulative sum of failed runs for the given ChaosResult. |
Metrics Name | litmuschaos_awaited_experiments |
---|---|
Description | It contains the total number of awaited experiments |
Source | ChaosResult |
Sample Metrics | litmuschaos_awaited_experiments{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1 |
Notes | The litmuschaos_awaited_experiments denotes the queued experiments for each ChaosResult. Its value is 1 if the ChaosResult's verdict is Awaited; otherwise its value is 0. |
Metrics Name | litmuschaos_probe_success_percentage |
---|---|
Description | It contains the ProbeSuccessPercentage for the experiment |
Source | ChaosResult |
Sample Metrics | litmuschaos_probe_success_percentage{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 100 |
Notes | The litmuschaos_probe_success_percentage defines the percentage of passed probes out of total probes defined inside the ChaosEngine. |
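As a minimal sketch of the ratio described in the note above (the function names are hypothetical, and how the exporter treats an experiment with no probes is not covered here), the percentage and its six-decimal label form, as seen in the `probe_success_percentage="100.000000"` label of the litmuschaos_experiment_verdict sample, could be computed like this:

```python
def probe_success_percentage(passed_probes: int, total_probes: int) -> float:
    """Percentage of passed probes out of all probes defined in the ChaosEngine."""
    if total_probes <= 0:
        # Assumption: the no-probe case is out of scope for this sketch.
        raise ValueError("total_probes must be positive")
    if not 0 <= passed_probes <= total_probes:
        raise ValueError("passed_probes must be between 0 and total_probes")
    return 100.0 * passed_probes / total_probes


def as_label_value(percentage: float) -> str:
    """Format the percentage with six decimals, matching the sample label value."""
    return f"{percentage:.6f}"


if __name__ == "__main__":
    value = probe_success_percentage(3, 3)
    print(value, as_label_value(value))
```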
Metrics Name | litmuschaos_experiment_start_time |
---|---|
Description | It contains the start time of the experiment |
Source | ExperimentDependencyCheck event inside the ChaosEngine |
Sample Metrics | litmuschaos_experiment_start_time{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1.618425155e+09 |
Notes | The litmuschaos_experiment_start_time denotes the start time of the experiment, which is calculated from the ExperimentDependencyCheck event (created by the chaos-runner just before launching the experiment pod). |
Metrics Name | litmuschaos_experiment_end_time |
---|---|
Description | It contains the end time of the experiment |
Source | Summary event inside the ChaosEngine |
Sample Metrics | litmuschaos_experiment_end_time{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1.618425219e+09 |
Notes | The litmuschaos_experiment_end_time denotes the end time of the experiment, which is calculated from the Summary event (created by the experiment pod at the end of the experiment). |
Metrics Name | litmuschaos_experiment_chaos_injected_time |
---|---|
Description | It contains the chaos injection time of the experiment |
Source | ChaosInject event inside the ChaosEngine |
Sample Metrics | litmuschaos_experiment_chaos_injected_time{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1.618425199e+09 |
Notes | The litmuschaos_experiment_chaos_injected_time denotes the time at which chaos is actually injected, which is calculated from the ChaosInject event (created by the experiment/helper pod just before chaos injection). |
Metrics Name | litmuschaos_experiment_total_duration |
---|---|
Description | It contains the total duration of the experiment |
Source | Time difference between startTime and endTime |
Sample Metrics | litmuschaos_experiment_total_duration{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 64 |
Notes | The litmuschaos_experiment_total_duration defines the total duration of the experiment. It is the time interval between the start time and the end time. |
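The arithmetic can be checked against the sample values shown in the tables above. The sketch below assumes the three time metrics are Unix timestamps in seconds, which is consistent with the samples; the function name is hypothetical.

```python
def total_duration(start_time: float, end_time: float) -> float:
    """Time interval between the experiment start time and end time."""
    if end_time < start_time:
        raise ValueError("end_time must not be earlier than start_time")
    return end_time - start_time


if __name__ == "__main__":
    # Sample values from the start_time, end_time and chaos_injected_time tables.
    start = 1.618425155e+09
    end = 1.618425219e+09
    injected = 1.618425199e+09

    print(total_duration(start, end))  # 64.0, matching the sample total_duration
    print(injected - start)            # 44.0 seconds from start until chaos injection
```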
Metrics Name | litmuschaos_experiment_verdict |
---|---|
Description | It contains the experiment verdict details |
Source | ChaosResult |
Sample Metrics | litmuschaos_experiment_verdict{app_kind="deployment",app_label="run=nginx",app_namespace="nginx",chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus",chaosresult_verdict="Pass",probe_success_percentage="100.000000"} 1 |
Notes | The litmuschaos_experiment_verdict metric is set based on the ChaosResult verdict. For an Awaited verdict it is always set to 0; for any other verdict it is set to 1. However, if the same verdict is repeated for more than TSDB_SCRAPE_INTERVAL (passed as an ENV), it is set to 0 until the verdict changes to a different value. |
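One possible reading of this rule is sketched below. It is an interpretation, not the exporter's implementation: the class `VerdictGauge` is hypothetical, TSDB_SCRAPE_INTERVAL is assumed to be expressed in seconds, and "repeated" is taken to mean that the same verdict has been observed for longer than that interval.

```python
class VerdictGauge:
    """Tracks, per ChaosResult, how long the current verdict has been observed."""

    def __init__(self, scrape_interval_seconds: float):
        self.interval = scrape_interval_seconds
        self._state = {}  # chaosresult name -> (verdict, time first observed)

    def value(self, result_name: str, verdict: str, now: float) -> int:
        previous = self._state.get(result_name)
        if previous is None or previous[0] != verdict:
            # New verdict: remember when it was first seen.
            self._state[result_name] = (verdict, now)
            first_seen = now
        else:
            first_seen = previous[1]

        if verdict == "Awaited":
            return 0
        # Non-awaited verdicts report 1 until they have persisted past the interval.
        return 1 if now - first_seen <= self.interval else 0


if __name__ == "__main__":
    gauge = VerdictGauge(scrape_interval_seconds=10)
    for t, verdict in [(0, "Awaited"), (5, "Pass"), (12, "Pass"), (20, "Pass"), (21, "Fail")]:
        print(t, verdict, gauge.value("helloservice-pod-delete-pod-delete", verdict, t))
```

Resetting to 0 after the interval would let a time-series database record each verdict roughly once, instead of counting the same verdict on every scrape.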
Metrics Name | litmuschaos_namespace_scoped_passed_experiments |
---|---|
Description | It contains the total passed experiments count in the WATCH_NAMESPACE |
Source | Aggregated sum of all the litmuschaos_passed_experiments metrics derived from the ChaosResult present inside WATCH_NAMESPACE |
Sample Metrics | litmuschaos_namespace_scoped_passed_experiments 2 |
Notes | The litmuschaos_namespace_scoped_passed_experiments defines the total number of passed experiments in the WATCH_NAMESPACE. It is the summation of litmuschaos_passed_experiments metrics for every ChaosResult present inside the WATCH_NAMESPACE. |
Metrics Name | litmuschaos_namespace_scoped_failed_experiments |
---|---|
Description | It contains the total failed experiments count in the WATCH_NAMESPACE |
Source | Aggregated sum of all the litmuschaos_failed_experiments metrics derived from the ChaosResult present inside WATCH_NAMESPACE |
Sample Metrics | litmuschaos_namespace_scoped_failed_experiments 0 |
Notes | The litmuschaos_namespace_scoped_failed_experiments defines the total number of failed experiments in the WATCH_NAMESPACE. It is the summation of litmuschaos_failed_experiments metrics for every ChaosResult present inside the WATCH_NAMESPACE. |
Metrics Name | litmuschaos_namespace_scoped_awaited_experiments |
---|---|
Description | It contains the total awaited experiments count in the WATCH_NAMESPACE |
Source | Aggregated sum of all the litmuschaos_awaited_experiments metrics derived from the ChaosResult present inside WATCH_NAMESPACE |
Sample Metrics | litmuschaos_namespace_scoped_awaited_experiments 0 |
Notes | The litmuschaos_namespace_scoped_awaited_experiments defines the total number of awaited/queued experiments in the WATCH_NAMESPACE. It is the summation of litmuschaos_awaited_experiments metrics for every ChaosResult present inside the WATCH_NAMESPACE. |
Metrics Name | litmuschaos_namespace_scoped_experiments_run_count |
---|---|
Description | It contains the total experiments run count in the WATCH_NAMESPACE |
Source | Aggregated sum of all the experiment runs in the WATCH_NAMESPACE |
Sample Metrics | litmuschaos_namespace_scoped_experiments_run_count 2 |
Notes | The litmuschaos_namespace_scoped_experiments_run_count defines the total experiment runs in the WATCH_NAMESPACE. It is the summation of litmuschaos_passed_experiments + litmuschaos_failed_experiments + litmuschaos_awaited_experiments for every ChaosResult present inside the WATCH_NAMESPACE. |
Metrics Name | litmuschaos_namespace_scoped_experiments_installed_count |
---|---|
Description | It contains the total unique experiments installed/run in the WATCH_NAMESPACE |
Source | Total count of unique experiments in the WATCH_NAMESPACE |
Sample Metrics | litmuschaos_namespace_scoped_experiments_installed_count 1 |
Notes | The litmuschaos_namespace_scoped_experiments_installed_count defines the total unique experiments installed/run in the WATCH_NAMESPACE. It is equal to the total number of ChaosResults present inside the WATCH_NAMESPACE. |
Metrics Name | litmuschaos_cluster_scoped_passed_experiments |
---|---|
Description | It contains the total passed experiments count in all the namespaces |
Source | Aggregated sum of all the litmuschaos_passed_experiments metrics derived from the ChaosResult present inside all the namespaces |
Sample Metrics | litmuschaos_cluster_scoped_passed_experiments 2 |
Notes | The litmuschaos_cluster_scoped_passed_experiments defines the total number of passed experiments across the cluster. It is the summation of litmuschaos_passed_experiments metrics for every ChaosResult in all the namespaces. |
Metrics Name | litmuschaos_cluster_scoped_failed_experiments |
---|---|
Description | It contains the total failed experiments count in all the namespaces |
Source | Aggregated sum of all the litmuschaos_failed_experiments metrics derived from the ChaosResult present inside all the namespaces |
Sample Metrics | litmuschaos_cluster_scoped_failed_experiments 0 |
Notes | The litmuschaos_cluster_scoped_failed_experiments defines the total number of failed experiments across the cluster. It is the summation of litmuschaos_failed_experiments metrics for every ChaosResult in all the namespaces. |
Metrics Name | litmuschaos_cluster_scoped_awaited_experiments |
---|---|
Description | It contains the total awaited experiments count in all the namespaces |
Source | Aggregated sum of all the litmuschaos_awaited_experiments metrics derived from the ChaosResult present inside all the namespaces |
Sample Metrics | litmuschaos_cluster_scoped_awaited_experiments 0 |
Notes | The litmuschaos_cluster_scoped_awaited_experiments defines the total number of awaited/queued experiments across the cluster. It is the summation of litmuschaos_awaited_experiments metrics for every ChaosResult in all the namespaces. |
Metrics Name | litmuschaos_cluster_scoped_experiments_run_count |
---|---|
Description | It contains the total experiments run count in all the namespaces |
Source | Aggregated sum of all the experiment runs in all the namespaces |
Sample Metrics | litmuschaos_cluster_scoped_experiments_run_count 2 |
Notes | The litmuschaos_cluster_scoped_experiments_run_count defines the total experiment runs across the cluster. It is the summation of litmuschaos_passed_experiments + litmuschaos_failed_experiments + litmuschaos_awaited_experiments for every ChaosResult present inside all the namespaces. |
Metrics Name | litmuschaos_cluster_scoped_experiments_installed_count |
---|---|
Description | It contains the total unique experiments installed/run in all the namespaces |
Source | Total count of unique experiments in all the namespaces |
Sample Metrics | litmuschaos_cluster_scoped_experiments_installed_count 1 |
Notes | The litmuschaos_cluster_scoped_experiments_installed_count defines the total unique experiments installed/run across the cluster. It is equal to the total number of ChaosResults present inside all the namespaces. |
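The aggregate metrics in the tables above are sums over the per-ChaosResult metrics. The following sketch illustrates that arithmetic under stated assumptions: `ResultCounts` and `aggregate` are hypothetical names, and passing `watch_namespace=None` stands in for the case where all namespaces are covered.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ResultCounts:
    """Per-ChaosResult values of the passed/failed/awaited metrics."""
    namespace: str
    name: str
    passed: int
    failed: int
    awaited: int


def aggregate(results: Iterable[ResultCounts], watch_namespace: Optional[str] = None) -> dict:
    """Sum the per-ChaosResult metrics for one namespace, or for all when None."""
    selected = [r for r in results
                if watch_namespace is None or r.namespace == watch_namespace]
    passed = sum(r.passed for r in selected)
    failed = sum(r.failed for r in selected)
    awaited = sum(r.awaited for r in selected)
    return {
        "passed_experiments": passed,
        "failed_experiments": failed,
        "awaited_experiments": awaited,
        # Run count is passed + failed + awaited over every selected ChaosResult.
        "experiments_run_count": passed + failed + awaited,
        # Installed count equals the number of ChaosResults selected.
        "experiments_installed_count": len(selected),
    }


if __name__ == "__main__":
    results = [ResultCounts("litmus", "helloservice-pod-delete-pod-delete", 2, 0, 0)]
    print(aggregate(results))
    print(aggregate(results, watch_namespace="litmus"))
```

With a single ChaosResult that has two passed runs, this yields a run count of 2 and an installed count of 1.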
- Follow the steps described here to start running Litmus chaos experiments and storing chaos results. The chaos custom resources are used by the exporter to generate metrics.
- Run the exporter container (litmuschaos/chaos-exporter:ci) on the host network. It is necessary to mount the kubeconfig and override the entrypoint with `./exporter -kubeconfig <path>`.
- Execute `curl 127.0.0.1:8080/metrics` to view the metrics.
- Install the RBAC (serviceaccount, role, rolebinding) as per deploy/rbac.md.
- Deploy the chaos-exporter.yaml.
- From a cluster node, execute `curl <exporter-service-ip>:8080/metrics`.
```
# HELP litmuschaos_awaited_experiments Total number of awaited experiments
# TYPE litmuschaos_awaited_experiments gauge
litmuschaos_awaited_experiments{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 0
# HELP litmuschaos_cluster_scoped_awaited_experiments Total number of awaited experiments in all namespaces
# TYPE litmuschaos_cluster_scoped_awaited_experiments gauge
litmuschaos_cluster_scoped_awaited_experiments 0
# HELP litmuschaos_cluster_scoped_experiments_installed_count Total number of experiments in all namespaces
# TYPE litmuschaos_cluster_scoped_experiments_installed_count gauge
litmuschaos_cluster_scoped_experiments_installed_count 1
# HELP litmuschaos_cluster_scoped_experiments_run_count Total experiments run in all namespaces
# TYPE litmuschaos_cluster_scoped_experiments_run_count gauge
litmuschaos_cluster_scoped_experiments_run_count 2
# HELP litmuschaos_cluster_scoped_failed_experiments Total number of failed experiments in all namespaces
# TYPE litmuschaos_cluster_scoped_failed_experiments gauge
litmuschaos_cluster_scoped_failed_experiments 0
# HELP litmuschaos_cluster_scoped_passed_experiments Total number of passed experiments in all namespaces
# TYPE litmuschaos_cluster_scoped_passed_experiments gauge
litmuschaos_cluster_scoped_passed_experiments 2
# HELP litmuschaos_experiment_chaos_injected_time chaos injected time of the experiments
# TYPE litmuschaos_experiment_chaos_injected_time gauge
litmuschaos_experiment_chaos_injected_time{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1.618426086e+09
# HELP litmuschaos_experiment_end_time end time of the experiments
# TYPE litmuschaos_experiment_end_time gauge
litmuschaos_experiment_end_time{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1.618426108e+09
# HELP litmuschaos_experiment_start_time start time of the experiments
# TYPE litmuschaos_experiment_start_time gauge
litmuschaos_experiment_start_time{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1.618426056e+09
# HELP litmuschaos_failed_experiments Total number of failed experiments
# TYPE litmuschaos_failed_experiments gauge
litmuschaos_failed_experiments{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 0
# HELP litmuschaos_passed_experiments Total number of passed experiments
# TYPE litmuschaos_passed_experiments gauge
litmuschaos_passed_experiments{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 2
# HELP litmuschaos_probe_success_percentage ProbeSuccesPercentage for the experiments
# TYPE litmuschaos_probe_success_percentage gauge
litmuschaos_probe_success_percentage{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 100
```
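Output in this text format can be turned into structured samples with a few lines of code. The sketch below is a simplified parser written for illustration only: `parse_metrics` is a hypothetical helper, it skips `# HELP` and `# TYPE` lines, and it does not handle optional trailing timestamps or every escaping rule of the Prometheus exposition format. For production use, an established Prometheus client library is likely the safer choice.

```python
import re

# name, optional {labels}, then a single value (trailing timestamps are not handled)
_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


SAMPLE = '''\
# HELP litmuschaos_cluster_scoped_passed_experiments Total number of passed experiments in all namespaces
# TYPE litmuschaos_cluster_scoped_passed_experiments gauge
litmuschaos_cluster_scoped_passed_experiments 2
litmuschaos_experiment_start_time{chaosengine_context="test",chaosengine_name="helloservice-pod-delete",chaosresult_name="helloservice-pod-delete-pod-delete",chaosresult_namespace="litmus"} 1.618426056e+09
'''

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels.get("chaosresult_name"), value)
```

The `SAMPLE` string reuses lines from the output above; in practice the text would come from the `/metrics` endpoint queried with `curl`.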