fix: telemetry spans #1670

Merged
merged 1 commit into from
Dec 30, 2024

avilagaston9
Collaborator

Fix Telemetry Spans

Motivation

We found that our Batcher sometimes tries to cancel batches that were actually included in the network. When that happens, it calls the batcherTaskCreationFailed endpoint, which finalizes the trace and prevents the Aggregator from registering its spans in that trace.

Description

  • Stops finalizing the trace when a batcherTaskCreationFailed occurs.
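
As a rough illustration of the intended behaviour, the sketch below models a trace store in which a task-creation failure is recorded as a span but the trace stays open. This is not the Telemetry server's code: `TraceStore`, `init_trace`, `add_span`, and `finish_trace` are hypothetical names chosen for the example, and only the span name `Batcher - Task Creation Failed` comes from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory trace store, keyed by batch merkle root.
/// Each open trace is modelled as a list of span names.
#[derive(Debug, Default)]
struct TraceStore {
    traces: HashMap<[u8; 32], Vec<String>>,
}

impl TraceStore {
    /// Opens a trace for a new batch (no-op if it already exists).
    fn init_trace(&mut self, merkle_root: [u8; 32]) {
        self.traces.entry(merkle_root).or_default();
    }

    /// Registers a span on an open trace.
    /// Returns `false` if the trace was already finished or never existed.
    fn add_span(&mut self, merkle_root: &[u8; 32], name: &str) -> bool {
        match self.traces.get_mut(merkle_root) {
            Some(spans) => {
                spans.push(name.to_string());
                true
            }
            None => false,
        }
    }

    /// Modelled behaviour after this PR: the failure is recorded as a span,
    /// but the trace is NOT finalized, so later spans can still be added.
    fn batcher_task_creation_failed(&mut self, merkle_root: &[u8; 32]) -> bool {
        self.add_span(merkle_root, "Batcher - Task Creation Failed")
    }

    /// Finalizes a trace: removes it from the store and returns its spans.
    fn finish_trace(&mut self, merkle_root: &[u8; 32]) -> Option<Vec<String>> {
        self.traces.remove(merkle_root)
    }
}
```

In this model, calling `batcher_task_creation_failed` and then `add_span` for an Aggregator span both succeed. If the failure handler called `finish_trace` instead, the later `add_span` would return `false`, which mirrors the bug described in the Motivation.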

Observations

On a real batcherTaskCreationFailed, the Aggregator won't receive the new task, and the trace will remain unfinished. Furthermore, the trace metadata won't be removed from the Telemetry server store. Even so, we will still be able to visualize the orphan spans, with a warning that their parent ID is invalid.

#1477 was created to address this issue.
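
To make the "orphan span" notion above concrete, here is a minimal sketch that flags spans whose parent ID does not match any span in the same set. The `Span` struct and `orphan_spans` function are hypothetical and exist only for this example; they are not part of the Telemetry server or of Jaeger.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified span record for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    id: u64,
    parent_id: Option<u64>,
    name: String,
}

/// Returns the spans whose parent ID refers to a span that is not present.
/// Spans without a parent (roots) are not considered orphans.
fn orphan_spans(spans: &[Span]) -> Vec<&Span> {
    // Collect every known span ID once so each lookup is O(1).
    let known: HashSet<u64> = spans.iter().map(|s| s.id).collect();
    spans
        .iter()
        .filter(|s| matches!(s.parent_id, Some(p) if !known.contains(&p)))
        .collect()
}
```

With an unfinished trace, the parent span is never reported, so its children would be returned by a check like this one.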

How To Test

  1. Check that everything is working normally:

Run anvil and all Aligned components with one or more operators, then start telemetry:

make telemetry_full_start

Go to Jaeger and explore the generated traces:

image

  2. Test the scenario addressed in this PR:

Change the Batcher create_new_task_retryable function in batcher/aligned-batcher/src/retry/batcher_retryables.rs:165 to return an error after receiving the receipt:

    // Timeout to prevent a deadlock while waiting for the transaction to be included in a block.
    let _result = timeout(Duration::from_millis(transaction_wait_timeout), pending_tx)
        .await
        .map_err(|e| {
            warn!("Error while waiting for batch inclusion: {e}");
            RetryError::Permanent(BatcherError::ReceiptNotFoundError)
        })?
        .map_err(|e| {
            warn!("Error while waiting for batch inclusion: {e}");
            RetryError::Permanent(BatcherError::ReceiptNotFoundError)
        })?
        .ok_or(RetryError::Permanent(BatcherError::ReceiptNotFoundError));
    // Test-only change: return a permanent error even though the receipt was received.
    Err(RetryError::Permanent(BatcherError::ReceiptNotFoundError))

Then start all components again. You should be able to see the Aggregator spans even when the Batcher sends Batcher - Task Creation Failed:

image

  3. Test a real cancel scenario:

Remove the whole body of the Batcher create_new_task_retryable function in batcher/aligned-batcher/src/retry/batcher_retryables.rs:105 and return an error without creating any task:

pub async fn create_new_task_retryable(
    batch_merkle_root: [u8; 32],
    batch_data_pointer: String,
    proofs_submitters: Vec<Address>,
    fee_params: CreateNewTaskFeeParams,
    transaction_wait_timeout: u64,
    payment_service: &BatcherPaymentService,
    payment_service_fallback: &BatcherPaymentService,
) -> Result<TransactionReceipt, RetryError<BatcherError>> {
    Err(RetryError::Permanent(BatcherError::ReceiptNotFoundError))
}

You should only be able to see the Batcher spans:

image

Type of change

  • Bug fix

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to Github Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds compatibility for operators for both versions and does not change batcher/docs/examples
    • This PR updates batcher and docs/examples to the newer version. This requires that the operators are already updated to be compatible

@avilagaston9 avilagaston9 marked this pull request as ready for review December 20, 2024 20:33
@avilagaston9 avilagaston9 mentioned this pull request Dec 20, 2024
15 tasks
@avilagaston9 avilagaston9 self-assigned this Dec 20, 2024
Collaborator

@JulianVentura JulianVentura left a comment

Works as expected

@JuArce JuArce added this pull request to the merge queue Dec 30, 2024
Merged via the queue into staging with commit f73d455 Dec 30, 2024
1 check passed
@JuArce JuArce deleted the fix-telemetry-spans branch December 30, 2024 14:04
PatStiles pushed a commit that referenced this pull request Jan 10, 2025
PatStiles pushed a commit that referenced this pull request Jan 10, 2025
4 participants