docs: add notes about metrics + error logging (#747)
aalu1418 committed Jun 14, 2024
1 parent 69b95ad commit 5ca1e35
`docs/relay/MetricsLogging.md` (96 additions, 0 deletions)
# Metrics And Logging

> :warning: NOTE: this document serves as a starting point for debugging and does not provide an exhaustive or definitive answer.

The relay exports metrics and chain-specific errors. This document identifies common metrics and log messages, along with potential causes of the observed behavior.

## Error Logging

[`failed to enqeue tx for simulation`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L129)

* indicates slow RPCs that are not responding quickly enough

[`original signature does not match retry signature`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L301)

* this could indicate a race condition within the relayer code (please alert developers for investigation)

[`failed to find transaction within confirm timeout`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L372)

* indicates the tx was dropped, likely due to network congestion or poor RPC performance

[`simulate: unrecognized error`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L494)

* the `result` parameter of the error usually contains additional output, for example:
  * `InsufficientFundsForRent`: sender balance too low
  * `AccountNotFound`: the sender or a referenced account does not exist (if it previously existed, it may have been garbage collected)
* additional errors and their causes can be found in the [Solana `TransactionError` definitions](https://github.com/solana-labs/solana/blob/master/sdk/src/transaction/error.rs)
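The hints above can be applied mechanically when triaging these logs. The sketch below assumes the error's `result` output is already available as a string; `explainSimError` and the hint table are hypothetical helpers, not part of chainlink-solana, and only the two variant names come from the list above. The sample input in `main` is illustrative, not a captured log.

```go
package main

import (
	"fmt"
	"regexp"
)

// simErrorHint pairs a Solana TransactionError variant name with a likely cause.
// Hypothetical helper: only the two variants described above are included.
type simErrorHint struct {
	pattern *regexp.Regexp
	name    string
	cause   string
}

// Word boundaries keep "AccountNotFound" from matching longer variant names
// such as "ProgramAccountNotFound".
var simErrorHints = []simErrorHint{
	{regexp.MustCompile(`\bInsufficientFundsForRent\b`), "InsufficientFundsForRent", "sender balance too low"},
	{regexp.MustCompile(`\bAccountNotFound\b`), "AccountNotFound", "sender or a referenced account does not exist (it may have been garbage collected)"},
}

// explainSimError returns the name and likely cause of the first known variant
// found in the error's result output, or ok=false if none matched.
func explainSimError(result string) (name, cause string, ok bool) {
	for _, h := range simErrorHints {
		if h.pattern.MatchString(result) {
			return h.name, h.cause, true
		}
	}
	return "", "", false
}

func main() {
	fmt.Println(explainSimError(`{"InsufficientFundsForRent":{"account_index":0}}`))
}
```

An ordered slice is used instead of a map so that the match order is deterministic when a result mentions more than one variant.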

[`failed to enqeue tx`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L528)

* indicates a slow RPC that cannot respond quickly enough to keep up with the incoming stream of transactions

[`error in ReadAnswer: stale answer data, polling is likely experiencing errors`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/transmissions_cache.go#L110C21-L110C98)

* indicates RPC issues (most likely the RPC is down)

[`error in ReadState: stale state data, polling is likely experiencing errors`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/state_cache.go#L114C21-L114C96)

* indicates RPC issues (most likely the RPC is down)
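The messages in this section can be matched when scanning node logs. The sketch below is a hypothetical triage helper (`classify` and `logHints` are not part of chainlink-solana) that reads log lines from stdin and maps each message substring, copied from this document including the `enqeue` spelling, to the likely cause listed above. It assumes one log entry per line.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// logHint pairs a log message substring with the likely cause described above.
type logHint struct {
	message string
	cause   string
}

// Order matters: "failed to enqeue tx" is a prefix of
// "failed to enqeue tx for simulation", so the longer message is checked first.
var logHints = []logHint{
	{"failed to enqeue tx for simulation", "slow RPC responses"},
	{"original signature does not match retry signature", "possible race condition in the relayer; alert developers"},
	{"failed to find transaction within confirm timeout", "tx dropped: network congestion or poor RPC performance"},
	{"simulate: unrecognized error", "inspect the result parameter of the error"},
	{"failed to enqeue tx", "slow RPC cannot keep up with incoming transactions"},
	{"stale answer data", "RPC issues (most likely the RPC is down)"},
	{"stale state data", "RPC issues (most likely the RPC is down)"},
}

// classify returns the cause for the first matching message, if any.
func classify(line string) (string, bool) {
	for _, h := range logHints {
		if strings.Contains(line, h.message) {
			return h.cause, true
		}
	}
	return "", false
}

// main reads log lines from stdin and prints the likely cause for each match.
func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Allow log lines up to 1 MiB; structured logs can exceed the 64 KiB default.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for scanner.Scan() {
		if cause, ok := classify(scanner.Text()); ok {
			fmt.Printf("%s\n  -> %s\n", scanner.Text(), cause)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
	}
}
```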

## Metrics

[`solana_balance`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L14)

* provides the SOL balance for keys in the keystore
* a low SOL balance will cause the CL node to stop transmitting

[`solana_cache_last_update_unix`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L18)

* tracks the time of the last update to cached data (Unix timestamp)
* updates should occur at the configured rate (default: 1s); slower updates can indicate RPC latency issues

[`solana_client_latency_ms`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L23)

* tracks the duration of each RPC request, separated by request label and RPC URL
* spikes in latency can indicate RPC issues
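A "spike" can be made concrete by comparing each sample against a baseline. A minimal sketch, assuming latency samples for a single request label and URL have already been collected in milliseconds; the median baseline and the factor of 3 are illustrative choices, not values used by the relayer.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of xs without modifying the input slice.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// latencySpikes returns the indices of samples exceeding factor times the
// median of all samples. The median is used instead of the mean so that the
// spikes themselves do not inflate the baseline.
func latencySpikes(samplesMs []float64, factor float64) []int {
	base := median(samplesMs)
	var spikes []int
	for i, v := range samplesMs {
		if v > factor*base {
			spikes = append(spikes, i)
		}
	}
	return spikes
}

func main() {
	fmt.Println(latencySpikes([]float64{120, 110, 130, 900, 115}, 3))
}
```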

[`solana_txm_tx_success`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L10)

* total number of TXs that were confirmed and successfully executed on chain
* this value should increase consistently; if it does not, this could indicate RPC latency or funding issues
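Checking that the counter "consistently increases" requires handling restarts: Prometheus counters reset to zero when the process restarts, so a drop in value is a reset, not a decrease (the same idea behind PromQL's `increase()`). A minimal sketch, assuming periodic samples of `solana_txm_tx_success` from one node; `counterIncrease` and `successStalled` are hypothetical names.

```go
package main

import "fmt"

// counterIncrease sums the increase across consecutive counter samples,
// treating any decrease as a counter reset (e.g. a node restart) and counting
// the post-reset value as new increase.
func counterIncrease(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		if d := samples[i] - samples[i-1]; d >= 0 {
			total += d
		} else {
			total += samples[i] // counter reset: it restarted from zero
		}
	}
	return total
}

// successStalled reports whether no transactions succeeded across the window.
// At least two samples are needed to observe any change.
func successStalled(samples []float64) bool {
	return len(samples) >= 2 && counterIncrease(samples) == 0
}

func main() {
	fmt.Println(successStalled([]float64{500, 500, 500, 500})) // flat counter
	fmt.Println(successStalled([]float64{500, 510, 3, 9}))     // restart, still increasing
}
```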

[`solana_txm_tx_pending`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L16)

* current number of in-flight TXs (not yet confirmed as success or error)
* this value should stay mostly constant; spikes could indicate lagging performance due to slow RPCs

[`solana_txm_tx_error`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L22)

* total number of TXs that have errored for any reason
* depending on the network configuration, this value should either stay constant or increase

[`solana_txm_tx_error_revert`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L26)

* total number of TXs that were confirmed but failed with a revert
* depending on the network configuration, this value should either stay constant or increase

[`solana_txm_tx_error_reject`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L30)

* total number of TXs that were immediately rejected by the RPC
* the value should be near zero, since TXs should not be immediately rejected by the RPC; a rising value could indicate a faulty RPC or …

[`solana_txm_tx_error_drop`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L34)

* total number of TXs that were broadcast to the network but not confirmed within the configured timeout
* an increasing value can indicate RPC latency issues or network congestion

[`solana_txm_tx_error_sim_revert`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L38)

* total number of TXs that reverted during simulation
* the value should be low and should not increase rapidly; if it does, this may indicate a misconfiguration on the CL node or on chain

[`solana_txm_tx_error_sim_other`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L38)

* total number of TXs that failed during simulation with an unrecognized error
* the value should be low and should not increase rapidly; investigating requires searching the logs for the unrecognized error and diagnosing further from there
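When `solana_txm_tx_error` rises, comparing the increases of the more specific error counters over the same window helps narrow down the cause. A minimal sketch, assuming the per-counter increase over a window has already been computed; `dominantErrors` and the hint strings are hypothetical and paraphrase the notes above. The sketch does not assume the specific counters sum exactly to `solana_txm_tx_error`.

```go
package main

import (
	"fmt"
	"sort"
)

// errorHints paraphrases the notes above for each specific error counter.
var errorHints = map[string]string{
	"solana_txm_tx_error_revert":     "confirmed on chain but reverted",
	"solana_txm_tx_error_reject":     "immediately rejected by the RPC; should be near zero",
	"solana_txm_tx_error_drop":       "not confirmed within timeout: RPC latency or network congestion",
	"solana_txm_tx_error_sim_revert": "reverted in simulation: possible CL node or onchain misconfiguration",
	"solana_txm_tx_error_sim_other":  "unrecognized simulation error: search the logs for details",
}

// dominantErrors returns the known metrics with a non-zero increase over the
// window, sorted by increase (largest first), with ties broken by name.
func dominantErrors(increases map[string]float64) []string {
	var names []string
	for name, inc := range increases {
		if _, known := errorHints[name]; known && inc > 0 {
			names = append(names, name)
		}
	}
	sort.Slice(names, func(i, j int) bool {
		if increases[names[i]] != increases[names[j]] {
			return increases[names[i]] > increases[names[j]]
		}
		return names[i] < names[j]
	})
	return names
}

func main() {
	window := map[string]float64{
		"solana_txm_tx_error_drop":       42,
		"solana_txm_tx_error_sim_revert": 3,
		"solana_txm_tx_error_reject":     0,
	}
	for _, name := range dominantErrors(window) {
		fmt.Printf("%s (+%.0f): %s\n", name, window[name], errorHints[name])
	}
}
```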
