
feat: Main Metrics Readme #278

Merged
merged 5 commits on Mar 26, 2024
Changes from 2 commits
2 changes: 1 addition & 1 deletion cmd/config/main.go
@@ -114,7 +114,7 @@ var (
// ----------------------Metrics Config----------------------- //
// ----------------------------------------------------------- //
Metrics: config.MetricsConfig{},
UpdateInterval: 1500 * time.Millisecond,
UpdateInterval: 500 * time.Millisecond,
Contributor Author: lowering the interval since dYdX block time is less than 1s

MaxPriceAge: 2 * time.Minute,
Providers: []config.ProviderConfig{
// ----------------------------------------------------------- //
9 changes: 9 additions & 0 deletions cmd/oracle/main.go
@@ -128,6 +128,12 @@
}
}

logger.Info(
"successfully read in configs",
zap.String("oracle_config_path", oracleCfgPath),
zap.String("market_config_path", marketCfgPath),
)

Codecov (codecov/patch) warning: Added lines cmd/oracle/main.go#L131-L135 were not covered by tests.

metrics := oraclemetrics.NewMetricsFromConfig(cfg.Metrics)

// Define the orchestrator and oracle options. These determine how the orchestrator and oracle are created & executed.
@@ -152,6 +158,8 @@

orchestratorOpts = append(orchestratorOpts, customOrchestratorOps...)
oracleOpts = append(oracleOpts, customOracleOpts...)
} else {
logger.Warn("no custom orchestrator or oracle options for chain; running default version")

Codecov (codecov/patch) warning: Added lines cmd/oracle/main.go#L161-L162 were not covered by tests.
}

// Create the orchestrator and start the orchestrator.
@@ -227,6 +235,7 @@
metrics oraclemetrics.Metrics,
) ([]orchestrator.Option, []oracle.Option, error) {
// dYdX uses the median index price aggregation strategy.
logger.Info("running dYdX sidecar; adding custom options for orchestrator and oracle")

Codecov (codecov/patch) warning: Added line cmd/oracle/main.go#L238 was not covered by tests.
aggregator, err := oraclemath.NewMedianAggregator(
logger,
marketCfg,
2 changes: 1 addition & 1 deletion contrib/images/slinky.sidecar.e2e.Dockerfile
@@ -16,7 +16,7 @@ EXPOSE 8080
EXPOSE 8002

COPY --from=builder /src/slinky/build/* /usr/local/bin/
COPY --from=builder /src/slinky/config/local /etc/slinky/default_config
COPY --from=builder /src/slinky/config/dydx /etc/slinky/default_config
RUN apt-get update && apt-get install ca-certificates -y

WORKDIR /usr/local/bin/
7 changes: 4 additions & 3 deletions docker-compose.yml
@@ -10,14 +10,15 @@ services:
context: .
dockerfile: contrib/images/slinky.sidecar.e2e.Dockerfile
volumes:
- ./config/local/oracle.json:/oracle/oracle.json
- ./config/local/market.json:/oracle/market.json
- ./config/dydx/oracle.json:/oracle/oracle.json
- ./config/dydx/market.json:/oracle/market.json
Comment on lines +13 to +14

Collaborator: What is the docker-compose used in again? e2e tests?

Contributor Author: nothing, just running locally

entrypoint: [
"oracle",
"--oracle-config-path", "/etc/slinky/default_config/oracle.json",
"--market-config-path", "/etc/slinky/default_config/market.json",
"--chain-id", "dydx-mainnet-1",
"--pprof-port", "6060",
"--run-pprof", "true",
"--run-pprof",
]
ports:
- "8080:8080" # main oracle port
226 changes: 226 additions & 0 deletions metrics.md
@@ -0,0 +1,226 @@
# Side-Car Metrics

This document describes the various instrumentation points that are available in the side-car. These metrics should be used to monitor the health of the side-car and the services it is proxying.

If there are any additional metrics that you would like to see, please open an issue in the [GitHub repository](https://github.com/skip-mev/slinky). Note that this document is not fully comprehensive and may be updated in the future. However, it should provide a good starting point for monitoring the side-car.

# Table of Contents

* [Metrics](#metrics)
* [Health Metrics](#health-metrics)
* [Prices Metrics](#prices-metrics)
* [Price Feed Metrics](#price-feed-metrics)
* [Aggregated Price Metrics](#aggregated-price-metrics)
* [HTTP Metrics](#http-metrics)
* [WebSocket Metrics](#websocket-metrics)

# Metrics

> **Definitions**:
Collaborator: 🔥

>
> * **Market**: A market is a pair of assets that are traded against each other. For example, the BTC-USD market is the market where Bitcoin is traded against the US Dollar.
> * **Price Feed**: A price feed is indexed by a price provider and a market. For example, the Coinbase API provides a price feed for the BTC-USD market.
> * **Price Provider**: A price provider is a service that provides price data for a given market. For example, the Coinbase API is a price provider for the Coinbase markets.
> * **Market Map Provider**: A market map provider is a service that supplies the markets that the side-car needs to fetch data for.

The side-car exposes metrics on the `/metrics` endpoint. These metrics are exposed in the Prometheus format and can be scraped by Prometheus or any other monitoring system that supports it.
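
Once Prometheus is scraping this endpoint, the built-in `up` series gives a quick liveness check for the side-car target. A minimal sketch, assuming a scrape job named `slinky-sidecar` (the job name is an assumption and depends on your scrape configuration):

```promql
# Sketch: 1 when the last scrape of the side-car target succeeded, 0 otherwise.
# The job label value is an assumption; use the name from your scrape config.
up{job="slinky-sidecar"}
```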

## Health Metrics

There are three primary health metrics that are exposed by the side-car:

* [`side_car_health_check_system_updates_total`](#side_car_health_check_system_updates_total): This metric is a counter that increments every time the side-car updates its internal state. This is a good indicator of the side-car's overall health.
* [`side_car_health_check_ticker_updates_total`](#side_car_health_check_ticker_updates_total): This metric is a counter that increments every time the side-car updates the price of a given market. This is a good indicator of the overall health of a given market.
* [`side_car_health_check_provider_updates_total`](#side_car_health_check_provider_updates_total): This metric is a counter that increments every time the side-car utilizes a given provider's market data. This is a good indicator of the health of a given provider. Note that providers may not be responsible for every market. However, the side-car correctly tracks the number of expected updates for each provider.

### `side_car_health_check_system_updates_total`

This metric should be monotonically increasing. Specifically, the rate of this metric should be inversely correlated to the configured `UpdateInterval` in the oracle side-car configuration (`oracle.json`). To check this, you can run the following query in Prometheus:

```promql
rate(side_car_health_check_system_updates_total[5m])
```

![Rate of side_car_health_check_system_updates_total](./resources/side_car_health_check_system_updates_total_rate.png)
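
With the 500ms `UpdateInterval` set in this PR, the rate above should hover around two updates per second. A minimal alert sketch along those lines (the 1 update/sec threshold is illustrative only):

```promql
# Sketch: fire if the side-car's update rate over the last 5 minutes drops
# below 1 update/sec, i.e. roughly half the ~2/sec expected with a 500ms
# UpdateInterval. The threshold is an illustrative choice.
rate(side_car_health_check_system_updates_total[5m]) < 1
```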


### `side_car_health_check_ticker_updates_total`

This should be a monotonically increasing counter for each market. Each market's counter should be relatively close to the `side_car_health_check_system_updates_total` counter.

To verify that the rate of updates for each market is as expected, you can run the following query in Prometheus:

```promql
rate(side_car_health_check_ticker_updates_total[5m])
```

![Rate of side_car_health_check_ticker_updates_total](./resources/side_car_health_check_ticker_updates_total_rate.png)

### `side_car_health_check_provider_updates_total`

This metric should be monotonically increasing for each (provider, market) pair. To verify that the rate of updates for each provider is as expected, you can run the following query in Prometheus:

```promql
rate(side_car_health_check_provider_updates_total{provider="coinbase_api", success="true"}[5m])
```

Collaborator: I'm a little confused about why the success rate over 5m is like 2/3 for all of the providers in these examples.

Contributor Author: I'm going to update so that we use the rate described above and follow up with explanations on the examples.

![Rate of side_car_health_check_provider_updates_total](./resources/side_car_health_check_provider_updates_total_rate.png)

### Health Metrics Summary

In summary, the health metrics should be monitored to ensure that the side-car is updating its internal state, updating the price of each market, and fetching data from the price providers as expected. The rate of updates for each of these metrics should be inversely correlated with the `UpdateInterval` in the oracle side-car configuration.
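
As a rough cross-check, the per-market update rate can be compared against the overall system update rate. The sketch below assumes the ticker metric carries the market under an `id` label, as the price metrics do (the label name is an assumption):

```promql
# Sketch: per-market update rate as a fraction of the system update rate.
# Values well below 1 suggest a market is lagging behind the update loop.
# The `id` label name is an assumption.
rate(side_car_health_check_ticker_updates_total[5m])
  / ignoring(id) group_left
rate(side_car_health_check_system_updates_total[5m])
```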
aljo242 marked this conversation as resolved.
Show resolved Hide resolved

## Prices Metrics

The side-car exposes various metrics related to market prices. These metrics are useful for monitoring the health of the price feeds and the aggregation process.

### Price Feed Metrics

The following price feed metrics are available to operators:

* [`side_car_provider_price`](#side_car_provider_price): The last recorded price for a given price feed.
* [`side_car_provider_last_updated_id`](#side_car_provider_last_updated_id): The UNIX timestamp of the last update for a given price feed.

#### `side_car_provider_price`

This metric represents the last recorded price for a given price feed. The metric is indexed by the provider and market. For example, if we want to check the last recorded price of the BTC-USD market from the Coinbase API, we can run the following query in Prometheus:

```promql
side_car_provider_price{provider="coinbase_api", id="btc/usd"}
```

![side_car_provider_price for Coinbase](./resources/side_car_provider_price_coinbase.png)

Alternatively, if we want to check the last recorded price of the BTC-USD market across all price providers, we can run the following query in Prometheus:

```promql
side_car_provider_price{id="btc/usd"}
```

![side_car_provider_price across all providers](./resources/side_car_provider_price_all.png)

#### `side_car_provider_last_updated_id`

This metric represents the last recorded timestamp for a given price feed. The metric is indexed by the provider and market, and the value is a UNIX timestamp. For example, if we want to check the last recorded timestamp of the BTC-USD market from the Coinbase API, we can run the following query in Prometheus:

```promql
side_car_provider_last_updated_id{provider="coinbase_api", id="btc/usd"}
```

![side_car_provider_last_updated_id](./resources/side_car_provider_last_updated_id.png)

Alerts can be configured based on the age of the last recorded price. For example, if the last recorded price is older than a certain threshold, an alert can be triggered. We recommend a threshold of 5 minutes for most use cases.
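
A minimal sketch of such an age check, assuming the metric value is a UNIX timestamp in seconds and using the 5-minute threshold suggested above:

```promql
# Sketch: fire when the Coinbase BTC-USD feed has not been updated for more
# than 5 minutes (300 seconds). Assumes the value is a UNIX timestamp in seconds.
time() - side_car_provider_last_updated_id{provider="coinbase_api", id="btc/usd"} > 300
```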

### Aggregated Price Metrics

The following aggregated price metrics are available to operators:

* [`side_car_aggregated_price`](#side_car_aggregated_price): The aggregated price for a given market. This provides the final price that can be consumed by a client.

#### `side_car_aggregated_price`

This metric represents the aggregated price for a given market. Prices are aggregated across all available price feeds for a given market. The metric includes the number of decimal places for the price, which can be used to quickly verify that the price is being aggregated correctly. For example, if we want to check the aggregated price of the BTC-USD market, we can run the following query in Prometheus:

```promql
side_car_aggregated_price{id="btc/usd"}
```

![side_car_aggregated_price for BTC-USD](./resources/side_car_aggregated_price_btc_usd.png)

This can also be graphed for a given market to visualize the price over time.

![side_car_aggregated_price over time](./resources/side_car_aggregated_price.png)

### Prices Metrics Summary

In summary, the price feed metrics should be monitored to ensure that prices look reasonable and are being updated as expected. The
`side_car_provider_price` metrics can be used to check that the `side_car_aggregated_price` is being calculated correctly. Additionally, alerts can be set up based on the age of the last recorded price to ensure that prices are being updated in a timely manner.
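
One way to automate this cross-check is to compare each provider's price against the aggregated price for the same market. The sketch below assumes both metrics share the `id` label and are reported at the same scale (both are assumptions; adjust for the decimal scaling noted above if they differ):

```promql
# Sketch: relative deviation of each provider's BTC-USD price from the
# aggregated BTC-USD price. Assumes a shared `id` label and matching scale.
# Sustained large deviations may indicate a misbehaving feed.
(
  side_car_provider_price{id="btc/usd"}
    - on (id) group_left
  side_car_aggregated_price{id="btc/usd"}
)
  / on (id) group_left
side_car_aggregated_price{id="btc/usd"}
```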

## HTTP Metrics

The side-car exposes various metrics related to the HTTP requests it makes to data providers, including the number of requests, the response time, and the status codes. These metrics can be used to monitor the health of the side-car's connections to its providers.

The following HTTP metrics are available to operators:

* [`side_car_api_http_status_code`](#side_car_api_http_status_code): The status codes of the HTTP responses received by the side-car.
* [`side_car_api_response_latency_bucket`](#side_car_api_response_latency_bucket): The response latency of the HTTP requests made by the side-car.

### `side_car_api_http_status_code`

This metric represents the status codes of the HTTP responses received by the side-car. For example, if we want to check the status codes of the responses received from the Coinbase API, we can run the following query in Prometheus:

```promql
side_car_api_http_status_code{provider="coinbase_api"}
```

![side_car_api_http_status_code for Coinbase](./resources/side_car_api_http_status_code_coinbase.png)

Simple queries and alerts can be configured based on the status codes to ensure that providers are responding as expected.
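
As a sketch of such an alert, assuming the metric is a counter and exposes the code under a `status_code` label (the label name is an assumption; substitute the label your deployment exposes), the rate of non-2xx responses could be tracked with:

```promql
# Sketch: rate of non-2xx responses from the Coinbase API over 5 minutes.
# The `status_code` label name is an assumption.
sum by (provider) (
  rate(side_car_api_http_status_code{provider="coinbase_api", status_code!~"2.."}[5m])
)
```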

### `side_car_api_response_latency_bucket`

This metric represents the response latency of the HTTP requests made by the side-car. The metric is indexed by the provider. For example, if we want to check the response latency of the HTTP requests made by the side-car for Coinbase, we can run the following query in Prometheus:

```promql
side_car_api_response_latency_bucket{provider="coinbase_api"}
```

![side_car_api_response_latency_bucket for Coinbase](./resources/side_car_api_response_latency_bucket_coinbase.png)

This can be used to monitor the response time of requests made to each provider and to set up alerts based on the response time.
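
Since this is a Prometheus histogram, percentiles can be derived from the buckets. A sketch, assuming the standard `le` bucket label:

```promql
# Sketch: approximate 95th-percentile response latency for Coinbase API
# requests, computed from the histogram buckets over the last 5 minutes.
histogram_quantile(
  0.95,
  sum by (provider, le) (
    rate(side_car_api_response_latency_bucket{provider="coinbase_api"}[5m])
  )
)
```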

### HTTP Metrics Summary

In summary, the HTTP metrics should be monitored to ensure that provider requests are succeeding as expected. The `side_car_api_http_status_code` metrics can be used to check the status codes of the HTTP responses, and the `side_car_api_response_latency_bucket` metrics can be used to monitor the response time of the HTTP requests.

## WebSocket Metrics

The side-car exposes various metrics related to WebSocket connections made by the side-car. These metrics can be used to monitor the health of the side-car's WebSocket connections. The following WebSocket metrics are available to operators:

* [`side_car_web_socket_connection_status`](#side_car_web_socket_connection_status): This includes various metrics related to the WebSocket connections made by the side-car.
* [`side_car_web_socket_data_handler_status`](#side_car_web_socket_data_handler_status): This includes various metrics related to whether WebSocket messages are being correctly handled by the side-car.
* [`side_car_web_socket_response_time_bucket`](#side_car_web_socket_response_time_bucket): This includes the response time of the WebSocket messages received by the side-car.

### `side_car_web_socket_connection_status`

This metric tracks the status of the WebSocket connections made by the side-car, including the number of reads, writes, dials, and errors for each connection. For example, if we want to check these statuses for the Coinbase WebSocket connection, we can run the following query in Prometheus:

```promql
side_car_web_socket_connection_status{provider="coinbase_ws"}
```

![side_car_web_socket_connection_status for Coinbase](./resources/side_car_web_socket_connection_status_coinbase.png)

### `side_car_web_socket_data_handler_status`

This metric tracks whether WebSocket messages are being correctly handled by the side-car, including the number of messages that were handled correctly, how many heartbeats were sent, and more. For example, if we want to check these statuses for the Coinbase WebSocket connection, we can run the following query in Prometheus:

```promql
side_car_web_socket_data_handler_status{provider="coinbase_ws"}
```

![side_car_web_socket_data_handler_status for Coinbase](./resources/side_car_web_socket_data_handler_status_coinbase.png)

The most important statuses to monitor here are `handle_message_success` and `heart_beat_success`. These metrics should be close to the total number of messages and heartbeats sent by the WebSocket connection.
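
A sketch of such a check, assuming the handler outcome is exposed under a `status` label whose values match the statuses mentioned above (the label name and values are assumptions):

```promql
# Sketch: share of Coinbase WebSocket handler events that were successful
# message handles over the last 5 minutes; values near 1 indicate healthy
# handling. The `status` label name and value are assumptions.
sum(rate(side_car_web_socket_data_handler_status{provider="coinbase_ws", status="handle_message_success"}[5m]))
  /
sum(rate(side_car_web_socket_data_handler_status{provider="coinbase_ws"}[5m]))
```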

### `side_car_web_socket_response_time_bucket`

This metric records the response time of the WebSocket messages received by the side-car; specifically, the time it takes to receive and process a new message. For example, if we want to check the response time for the Coinbase WebSocket connection, we can run the following query in Prometheus:

```promql
side_car_web_socket_response_time_bucket{provider="coinbase_ws"}
```

![side_car_web_socket_response_time_bucket for Coinbase](./resources/side_car_web_socket_response_time_bucket_coinbase.png)

This can be used to monitor the response time of the WebSocket messages received by the side-car and set up alerts based on the response time. We recommend alerts be set up if the response time exceeds a threshold of 5 minutes.
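
A sketch of such an alert expression, assuming the standard `le` bucket label and that the buckets are recorded in seconds (an assumption):

```promql
# Sketch: 99th-percentile WebSocket message processing time for Coinbase; an
# alert could fire if this exceeds the 5-minute (300s) threshold suggested
# above. Assumes the histogram buckets are in seconds.
histogram_quantile(
  0.99,
  sum by (provider, le) (
    rate(side_car_web_socket_response_time_bucket{provider="coinbase_ws"}[5m])
  )
) > 300
```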

### WebSocket Metrics Summary

In summary, the WebSocket metrics should be monitored to ensure that the side-car's WebSocket connections are functioning as expected. The `side_car_web_socket_connection_status` metrics can be used to check the number of read, write, and dial errors, the `side_car_web_socket_data_handler_status` metrics can be used to check that messages are being correctly handled, and the `side_car_web_socket_response_time_bucket` metrics can be used to monitor the response time of the WebSocket messages.

# Conclusion

This document has provided an overview of the various metrics that are available in the side-car. These metrics can be used to monitor the health of the side-car and the services it is proxying. By monitoring these metrics, operators can ensure that the side-car is functioning as expected and take action if any issues arise.


73 changes: 0 additions & 73 deletions oracle/metrics/README.md

This file was deleted.
