diff --git a/docs/relay/MetricsLogging.md b/docs/relay/MetricsLogging.md
new file mode 100644
index 000000000..53566671e
--- /dev/null
+++ b/docs/relay/MetricsLogging.md
@@ -0,0 +1,96 @@
+# Metrics And Logging
+
+> :warning: NOTE: this document serves as a starting point for debugging and does not provide an exhaustive/definitive answer
+
+The relay exports metrics and chain-specific errors. This document identifies common metrics/logs and potential reasons for behavior.
+
+## Error Logging
+
+[`failed to enqeue tx for simulation`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L129)
+
+* indicates slow RPCs that are not responding quickly enough
+
+[`original signature does not match retry signature`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L301)
+
+* this could indicate a race condition within the relayer code (please alert developers for investigation)
+
+[`failed to find transaction within confirm timeout`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L372)
+
+* indicates network congestion or poor RPC performance (tx dropped)
+
+[`simulate: unrecognized error`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L494)
+
+* There is usually an additional output within the result parameter of the error:
+  * `InsufficientFundsForRent`: sender balance too low
+  * `AccountNotFound`: sender or used account does not exist (if previously existed, could have been garbage collected)
+  * Additional errors + reasons can be found here: https://github.com/solana-labs/solana/blob/master/sdk/src/transaction/error.rs
+
+[`failed to enqeue tx`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/txm/txm.go#L528)
+
+* indicates a slow RPC that does not respond quickly enough to keep up with the incoming stream of transactions
+
+[`error in ReadAnswer: stale answer data, polling is likely experiencing errors`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/transmissions_cache.go#L110C21-L110C98)
+
+* indicates RPC issues (most likely down)
+
+[`error in ReadState: stale state data, polling is likely experiencing errors`](https://github.com/smartcontractkit/chainlink-solana/blob/a2ff2b377b72d06dc85b5242d93bb2f974967145/pkg/solana/state_cache.go#L114C21-L114C96)
+
+* indicates RPC issues (most likely down)
+
+## Metrics
+
+[`solana_balance`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L14)
+
+* provides the SOL balance for keys in the keystore
+* a low SOL balance will cause the CL node to stop transmitting
+
+[`solana_cache_last_update_unix`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L18)
+
+* tracks last update to cached data (unix timestamp)
+* updates should occur at the configured rate (default: 1s); slower updates can indicate RPC latency issues
+
+[`solana_client_latency_ms`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/monitor/prom.go#L23)
+
+* tracks duration of each RPC request, separated via label + URLs
+* spikes in latency can indicate RPC issues
+
+[`solana_txm_tx_success`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L10)
+
+* total of TXs that are confirmed and successfully executed on chain
+* this value should consistently increase. If it does not, this could indicate RPC latency or funding issues.
+
+[`solana_txm_tx_pending`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L16)
+
+* current TXs that are inflight (not confirmed success or error)
+* this value should stay mostly constant - spikes could indicate lagging performance due to slow RPCs.
+
+[`solana_txm_tx_error`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L22)
+
+* sum of TXs that have errored for any reason
+* depending on the network configuration, this value should either be constant or increase
+
+[`solana_txm_tx_error_revert`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L26)
+
+* total of TXs that have been confirmed but error with a revert
+* depending on the network configuration, this value should either be constant or increase
+
+[`solana_txm_tx_error_reject`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L30)
+
+* total of TXs that have been immediately rejected by the RPC
+* value should be near zero, TXs should not be immediately rejected by the RPC. this could indicate faulty RPC or
+
+[`solana_txm_tx_error_drop`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L34)
+
+* total of TXs that have been broadcast to the network but were not confirmed within the configured timeout
+* an increasing value can indicate RPC latency issues or network congestion
+
+[`solana_txm_tx_error_sim_revert`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L38)
+
+* total of TXs that reverted during simulation
+* value should be low and should not increase rapidly; if it does, it may indicate misconfiguration on the CL node or onchain
+
+[`solana_txm_tx_error_sim_other`](https://github.com/smartcontractkit/chainlink-solana/blob/4ca9bcc8264d89c7527897e729281e13f37852f1/pkg/solana/txm/prom.go#L38)
+
+* total of TXs that failed during simulation with an unrecognized error
+* value should be low and should not increase rapidly; diagnosing it requires looking through the logs for the unrecognized error
+
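
The expectations listed under Metrics can be turned into simple checks over two consecutive scrapes. The Go sketch below is illustrative only: the `Snapshot` and `Thresholds` types, the `Diagnose` function, and every threshold value are hypothetical and not part of the relay. It also assumes the counters did not reset between scrapes (for example after a node restart).

```go
package main

import "fmt"

// Snapshot holds one scrape of a few relay metrics. The comments name the
// metric each field mirrors; the struct itself is a hypothetical helper.
type Snapshot struct {
	CacheLastUpdateUnix int64   // solana_cache_last_update_unix
	TxSuccess           float64 // solana_txm_tx_success
	TxPending           float64 // solana_txm_tx_pending
	TxErrorReject       float64 // solana_txm_tx_error_reject
}

// Thresholds are illustrative assumptions, not values taken from the relay.
type Thresholds struct {
	MaxCacheAgeSec int64   // tolerated age of the cached data, in seconds
	MaxPending     float64 // tolerated number of inflight TXs
}

// Diagnose compares two consecutive snapshots and returns one finding per
// expectation that is violated. It ignores counter resets.
func Diagnose(prev, curr Snapshot, nowUnix int64, t Thresholds) []string {
	var findings []string
	if nowUnix-curr.CacheLastUpdateUnix > t.MaxCacheAgeSec {
		findings = append(findings, "cache is stale: possible RPC latency issues")
	}
	if curr.TxSuccess <= prev.TxSuccess {
		findings = append(findings, "tx_success is not increasing: possible RPC latency or funding issues")
	}
	if curr.TxPending > t.MaxPending {
		findings = append(findings, "tx_pending is high: possible slow RPCs")
	}
	if curr.TxErrorReject > prev.TxErrorReject {
		findings = append(findings, "tx_error_reject is increasing: TXs are being rejected by the RPC")
	}
	return findings
}

func main() {
	prev := Snapshot{CacheLastUpdateUnix: 1700000000, TxSuccess: 120, TxPending: 2, TxErrorReject: 0}
	curr := Snapshot{CacheLastUpdateUnix: 1700000001, TxSuccess: 120, TxPending: 3, TxErrorReject: 1}
	limits := Thresholds{MaxCacheAgeSec: 10, MaxPending: 20}

	// The "current time" is passed in explicitly so the check is deterministic.
	for _, finding := range Diagnose(prev, curr, 1700000030, limits) {
		fmt.Println(finding)
	}
}
```

In practice the same conditions would more likely live in alerting rules on the metrics backend; the sketch only spells out the comparisons the bullets describe.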