
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 4, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Preview | Comments | Updated (UTC)
rivetkit-serverless | Ready | Preview | Comment | Nov 5, 2025 2:08am

3 Skipped Deployments

Project | Deployment | Preview | Comments | Updated (UTC)
rivet-cloud | Ignored | | | Nov 5, 2025 2:08am
rivet-inspector | Ignored | Preview | | Nov 5, 2025 2:08am
rivet-site | Ignored | Preview | | Nov 5, 2025 2:08am

Contributor Author

MasterPtato commented Nov 4, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 4, 2025

PR Review: Fix All Dashboards (#3358)

Summary

This PR makes comprehensive improvements to metrics, observability infrastructure, and Grafana dashboards across the Rivet platform. The changes include instrumentation updates, OpenTelemetry configuration, and dashboard fixes.

Code Quality & Best Practices

Strengths ✅

  1. Excellent Metric Instrumentation Design

    • Well-structured histogram buckets in packages/metrics/src/buckets.rs with appropriate ranges for different operation types
    • Comprehensive metrics coverage across API, workflow engine (gasoline), and Pegboard components
    • Good use of metric attributes for filtering (e.g., workflow_name, activity_name, error_code)
  2. Runtime-Adjustable Trace Sampling

    • The DynamicSampler implementation in packages/metrics/src/providers.rs allows runtime sampling ratio changes without restart
    • Parent-based sampling strategy follows OpenTelemetry best practices
    • Thread-safe implementation using Arc<RwLock<f64>>
  3. Proper Structured Logging

    • Consistent use of tracing with proper attributes throughout (e.g., packages/api-builder/src/middleware.rs:62-69)
    • Ray ID propagation for distributed tracing
    • Appropriate log levels (error for 5xx, info for 4xx, debug for others)
  4. Infrastructure as Code

    • Automated Grafana provisioning in docker templates
    • OpenTelemetry Collector configuration with proper batching, compression, and retry logic

Issues & Recommendations

Critical Issue ⚠️

API_REQUEST_PENDING Metric Type (packages/api-builder/src/middleware.rs:102-108, 189-195)

The current implementation uses additive counters with +1/-1:

metrics::API_REQUEST_PENDING.add(1, &[...]);  // Line 102
// ... later ...
metrics::API_REQUEST_PENDING.add(-1, &[...]);  // Line 189

Problem: This is semantically incorrect. Counters in OpenTelemetry should be monotonically increasing. The current approach can lead to:

  • Incorrect aggregation across multiple service instances
  • Loss of accuracy during metric resets
  • Confusion in dashboard visualizations

Recommendation: Change API_REQUEST_PENDING to use UpDownCounter<i64> instead of Counter<f64>. This type is specifically designed for values that go up and down (like in-flight requests).
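
For illustration, a rough sketch of what that change could look like, assuming the opentelemetry crate's meter API and the lazy_static pattern used elsewhere in the repo; the meter name, metric name, and attributes here are assumptions, not the actual code:

// Hypothetical sketch: define the pending-request metric as an UpDownCounter,
// which is designed for values that rise and fall, unlike a monotonic Counter.
use opentelemetry::global;
use opentelemetry::metrics::UpDownCounter;

lazy_static::lazy_static! {
    pub static ref API_REQUEST_PENDING: UpDownCounter<i64> = global::meter("rivet-api")
        .i64_up_down_counter("rivet_api_request_pending")
        .with_description("Number of in-flight API requests")
        .build(); // older opentelemetry versions use .init() instead of .build()
}

// At the start of request handling:
//     API_REQUEST_PENDING.add(1, &attributes);
// When the response is sent:
//     API_REQUEST_PENDING.add(-1, &attributes);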

Medium Priority Issues 🔧

  1. Low Default Sampling Ratio (packages/metrics/src/providers.rs:102)

    • Default: 0.001 (0.1%)
    • Issue: May miss important traces during debugging/troubleshooting
    • Recommendation: Increase to at least 0.01 (1%) for production, or make environment-specific
  2. Empty String for Success Cases (packages/api-builder/src/middleware.rs:197-203)

    let error_code: String = if status.is_success() {
        String::new()  // Empty string for success
    }
    • Issue: Using empty string for "no error" can complicate queries
    • Recommendation: Consider using a separate boolean attribute has_error or omit the error_code attribute entirely for successful requests
  3. Potential Cardinality Concerns

    • The path attribute in API metrics could lead to high cardinality if paths contain dynamic segments
    • Recommendation: Ensure paths are normalized (e.g., /users/:id not /users/12345)
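
For illustration, a minimal sketch of the kind of path normalization meant here; the function name and the numeric/UUID heuristics are hypothetical, not taken from the codebase:

// Hypothetical sketch: replace dynamic path segments with placeholders so the
// `path` metric attribute stays low-cardinality.
fn normalize_path(path: &str) -> String {
    path.split('/')
        .map(|segment| {
            let is_numeric = !segment.is_empty() && segment.chars().all(|c| c.is_ascii_digit());
            let looks_like_uuid =
                segment.len() == 36 && segment.chars().filter(|c| *c == '-').count() == 4;
            if is_numeric || looks_like_uuid {
                ":id"
            } else {
                segment
            }
        })
        .collect::<Vec<_>>()
        .join("/")
}

fn main() {
    assert_eq!(normalize_path("/users/12345"), "/users/:id");
    println!("{}", normalize_path("/users/550e8400-e29b-41d4-a716-446655440000/actors"));
}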

Performance Considerations

Good Decisions 🚀

  1. Batch Processing: the 5s timeout and 10k batch size in the OTEL collector balance latency against throughput
  2. Micro Buckets: Using MICRO_BUCKETS for actor allocation (sub-millisecond precision) is appropriate for fast operations
  3. Async ClickHouse Inserts: Reduces write latency impact on hot path
  4. Compression: LZ4 compression enabled for efficient storage

Optimization Opportunities

  1. Metric Export Interval: 30-second interval is reasonable, but consider making it configurable for different environments
  2. TTL Configuration: 72-hour TTL for metrics is good for dev, but consider longer retention for production

Security Concerns

Development Environment Only ⚠️

  • Anonymous admin access in Grafana (packages/template/src/services/core/grafana.ts) is acceptable for dev but ensure this is NOT deployed to production
  • OTEL collector allows unauthenticated connections on port 4317 - acceptable for internal dev environments only

Test Coverage

Observation: No test changes included in this PR

Recommendations:

  1. Add integration tests for metric collection to ensure:
    • Metrics are emitted with correct attributes
    • Histogram buckets capture expected ranges
    • Error metrics increment only for actual errors
  2. Consider adding tests for the DynamicSampler to verify runtime ratio updates work correctly

Documentation

Suggestions:

  1. Document expected metric cardinality for high-volume metrics
  2. Add examples of common Grafana queries for the new metrics
  3. Document the metric lifecycle and deprecation strategy

Specific File Comments

packages/gasoline/src/ctx/workflow.rs

Good separation of recoverable vs non-recoverable errors for metrics - prevents noise in error dashboards.

packages/pegboard/src/metrics.rs

Clean, minimal metric definitions with clear descriptions. Good use of MICRO_BUCKETS for allocation timing.

packages/metrics/src/providers.rs

Excellent implementation of dynamic sampling. Consider logging the initial ratio on startup for observability.
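
For illustration, a rough conceptual sketch of the shared-ratio pattern described above, including the startup log suggested here; this is not the actual DynamicSampler, and the type and method names are hypothetical:

// Hypothetical sketch of a runtime-adjustable sampling ratio behind Arc<RwLock<f64>>.
// The real DynamicSampler also plugs into OpenTelemetry's parent-based sampling;
// this only illustrates the shared-state idea and the startup log.
use std::sync::{Arc, RwLock};

#[derive(Clone)]
pub struct SharedRatio(Arc<RwLock<f64>>);

impl SharedRatio {
    pub fn new(initial: f64) -> Self {
        // Log the initial ratio on startup, as suggested in the review.
        tracing::info!(ratio = initial, "trace sampling ratio initialized");
        Self(Arc::new(RwLock::new(initial)))
    }

    /// Adjust the sampling ratio at runtime without restarting the process.
    pub fn set(&self, ratio: f64) {
        *self.0.write().unwrap() = ratio.clamp(0.0, 1.0);
    }

    /// Decide whether to sample, given a uniform draw in [0, 1).
    pub fn should_sample(&self, draw: f64) -> bool {
        draw < *self.0.read().unwrap()
    }
}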

Conclusion

This is a high-quality observability infrastructure improvement. The main concern is the incorrect use of Counter for the pending requests metric, which should be addressed before merge. The low default sampling ratio should also be reconsidered. Otherwise, the implementation follows OpenTelemetry best practices and establishes comprehensive monitoring across the platform.

Recommendation: Request changes to fix the API_REQUEST_PENDING metric type, then approve.


Reviewed with Claude Code

@claude

claude bot commented Nov 5, 2025

Code Review - PR #3358: Fix all dashboards

Summary

This PR fixes Grafana dashboard configurations and OpenTelemetry metrics collection. The changes update dashboard JSON files, adjust OpenTelemetry collector configuration to include metrics pipelines, and refine histogram bucket boundaries for better metrics granularity.


Positive Aspects

  1. Comprehensive Dashboard Updates: All Grafana dashboards across multiple deployment configurations have been updated consistently.

  2. Metrics Pipeline Addition: The addition of a metrics pipeline to the OTel collector configuration (lines 90-94 in otel-collector-server.ts) ensures proper metrics collection alongside logs and traces.

  3. Improved Histogram Buckets: The refinement of histogram boundaries in buckets.rs adds finer granularity at lower values (0.001, 0.0025), which is beneficial for tracking fast operations.

  4. Template-Based Generation: Changes are made to the template directory and then propagated, following the documented pattern in CLAUDE.md.


Issues and Concerns

1. Metrics Import Change (gasoline/src/ctx/workflow.rs:1)

-use rivet_metrics::KeyValue;
+use rivet_metrics::KeyValue;

The diff shows this as changed, but appears identical. This might be a whitespace issue or artifact. Please verify this is intentional.

2. Duplicate Copyright Comments (metrics/src/providers.rs:1-2)

// Based off of https://github.com/tokio-rs/tracing-opentelemetry/blob/v0.1.x/examples/opentelemetry-otlp.rs
// Based off of https://github.com/tokio-rs/tracing-opentelemetry/blob/v0.1.x/examples/opentelemetry-otlp.rs

There's a duplicate comment line that should be removed.

3. Histogram Bucket Boundary at 0.0 (metrics/src/buckets.rs:3)

pub const BUCKETS: &[f64] = &[
    // For otel
    0.0, // Added
    0.001, 0.0025,
    // ...
];

Including 0.0 as a histogram boundary is unusual and may not provide meaningful data since most operations take non-zero time. Consider if this is necessary or if it could cause issues with histogram calculations. The comment "For otel" suggests this might be an OpenTelemetry requirement - if so, please add clarification.

4. Missing Activity Name in Metrics (pegboard/src/workflows/runner.rs:1140)

metrics::ACTOR_PENDING_ALLOCATION.record(
    pending_actor_count as f64,
    &[
        KeyValue::new("namespace_id", input.namespace_id.to_string()),
        KeyValue::new("runner_name", input.name.to_string()),
    ],
);

This metrics call uses the attributes documented in pegboard/src/metrics.rs:9, which states:

/// Expected attributes: "namespace_id", "runner_name"

The attributes match, but note that this is a gauge-style metric being recorded multiple times. Ensure repeated recording is the intended behavior, rather than setting the gauge value once.

5. Histogram Bucket Consistency

The BUCKETS constant is updated but used across many metrics. Ensure that:

  • The new boundaries work well for all use cases (API requests, workflows, activities, etc.)
  • Existing dashboard queries are compatible with the new bucket boundaries
  • Historical data visualization isn't broken by these changes

6. Port Configuration (docker-compose.ts:241)

ports: ["4317:4317"],

The OTel collector server port is exposed on all datacenters. Consider if this should only be exposed on the primary datacenter like other services, or if there's a security consideration here.


Recommendations

  1. Remove duplicate comment in metrics/src/providers.rs:1-2

  2. Clarify 0.0 bucket boundary: Add a comment explaining why 0.0 is included in histogram buckets if it's an OpenTelemetry requirement.

  3. Verify metrics changes: Ensure that changing histogram buckets won't break existing Grafana dashboard queries that may reference specific bucket values.

  4. Consider OTel collector port exposure: Evaluate if port 4317 should be exposed on all DCs or just the primary.

  5. Add tests: Consider adding integration tests to verify:

    • Metrics are properly collected and exported
    • Dashboard JSON is valid
    • OTel collector configuration is parseable
  6. Document bucket rationale: Add a comment in buckets.rs explaining the reasoning behind the specific bucket values chosen (e.g., "Optimized for sub-millisecond operations common in API responses").
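
For illustration, a hedged sketch of what such documented bucket constants might look like; the values and wording below are illustrative, not the actual contents of buckets.rs:

// Hypothetical example of a documented bucket constant.
/// Histogram boundaries for operation durations, in seconds.
/// 0.0 is kept for OTEL compatibility (the exact requirement should be confirmed,
/// per the review note above). The fine steps below 10ms target fast API handlers;
/// the upper range catches slow workflows up to ~50s.
pub const BUCKETS: &[f64] = &[
    0.0, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5,
    1.0, 2.5, 5.0, 10.0, 25.0, 50.0,
];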


Security Considerations

  • ✅ No credentials or secrets exposed
  • ✅ Port exposure appears intentional for metrics collection
  • ⚠️ Consider adding authentication/authorization for OTel collector endpoints in production

Performance Considerations

  • ✅ Histogram bucket changes should improve granularity for performance monitoring
  • ✅ Metrics pipeline addition has minimal overhead
  • ✅ Batch event acknowledgment (line 209-224 in runner.rs) is efficient

Test Coverage

  • ⚠️ No test changes included. Consider adding:
    • Tests for OTel collector configuration validation
    • Tests for histogram bucket coverage of expected value ranges
    • Integration tests for metrics pipeline

Overall Assessment

This is a solid infrastructure improvement that enhances observability. The main concerns are minor (duplicate comments, bucket boundary clarification) and don't block merging. The changes follow the project's template-based configuration pattern correctly.

Recommendation: Approve with minor fixes suggested above.

🤖 Generated with Claude Code

@claude

claude bot commented Nov 5, 2025

PR Review: Fix All Dashboards

I've reviewed PR #3358 which focuses on fixing Grafana dashboards and adding OpenTelemetry metrics pipeline configuration. Here's my comprehensive feedback:

Summary

This PR makes significant changes to observability infrastructure:

  • 57 files changed (+5,140 / -5,300 lines)
  • Updates Grafana dashboard configurations across all Docker environments
  • Adds metrics pipeline configuration to OpenTelemetry collectors
  • Minor Rust code improvements for metrics consistency

✅ Code Quality & Best Practices

Rust Code Changes

The Rust changes are minimal and follow the codebase conventions well:

  1. engine/packages/pegboard/src/metrics.rs (metrics.rs:1-19) - ✅ Good

    • Simplified metrics definitions by removing redundant metrics
    • Uses MICRO_BUCKETS appropriately for microsecond-level timing
    • Follows lazy_static pattern consistently
  2. engine/packages/pegboard/src/workflows/runner.rs (runner.rs:1140-1146) - ✅ Good

    • Proper structured logging pattern: uses KeyValue::new() with attributes
    • Metric names follow conventions
    • Good use of .to_string() for ID formatting
  3. engine/packages/metrics/src/buckets.rs (buckets.rs:10-14) - ✅ Good

    • MICRO_BUCKETS definition is appropriate for fine-grained measurements
    • Bucket ranges cover 0.1ms to 50s which is reasonable for most operations
    • Comment clearly indicates OTEL requirement for 0.0 bucket
  4. engine/packages/api-builder/src/middleware.rs (middleware.rs:204-212) - ✅ Good

    • Consistent with logging patterns
    • Proper structured metrics with relevant dimensions (method, path, status, error_code)
    • Follows the guideline to use KeyValue for metrics attributes
  5. engine/packages/gasoline/src/metrics.rs (metrics.rs:1-4) - ✅ Good

    • Imports follow workspace conventions
    • Uses BUCKETS consistently across histogram definitions

Configuration Changes

  1. OpenTelemetry Collector Config - ✅ Good

    • Added metrics pipeline alongside existing logs/traces pipelines
    • Proper ClickHouse exporter configuration with:
      • Async inserts for performance
      • LZ4 compression
      • Reasonable retry strategy (5s initial, 30s max, 300s timeout)
      • 72h TTL for data retention
    • Batch processor with sensible defaults (5s timeout, 10k batch size)
  2. Grafana Configuration - ⚠️ Note

    • Changed default home dashboard from undefined to api.json
    • Dashboards updated extensively (not reviewed in detail due to size)
    • OTEL version set to 1.2.9 in datasource config

🔍 Potential Issues

Minor Concerns:

  1. Metrics Bucket Consistency (metrics/src/buckets.rs:10-14)

    • The MICRO_BUCKETS range goes up to 50 seconds, which seems unusually high for "micro" operations
    • Consider: Is this intentional? Operations taking 50s might deserve different bucketing or could indicate problems worth highlighting separately
  2. Missing Context in PR Description

    • The PR has an empty body, making it unclear what specific dashboard issues were fixed
    • Recommendation: Add a description explaining what was broken and how this fixes it
  3. Regenerated Files Without Template Changes

    • Most changes are in generated docker/dev*/ directories
    • Only 2 source template files changed: otel-collector-server.ts and grafana.ts
    • This is correct per CLAUDE.md, but worth confirming all regeneration was intentional

🔒 Security Considerations

No security issues identified:

  • Password correctly uses environment variable interpolation: ${env:CLICKHOUSE_PASSWORD}
  • No hardcoded credentials
  • Default Grafana admin credentials are documented as dev-only
  • Anonymous auth is explicitly enabled for dev environments (appropriate for local development)

⚡ Performance Considerations

Good performance practices:

  • Batch processing configured (5s timeout, 10k batch size) reduces network overhead
  • Async inserts to ClickHouse minimize blocking
  • LZ4 compression reduces storage and network transfer costs
  • TTL (72h) prevents unbounded data growth

Potential optimization:

  • Consider whether 30s metric export interval (providers.rs:129) is appropriate for your use case
    • Higher intervals reduce overhead but increase metric delay
    • Current setting seems reasonable for development

🧪 Test Coverage

⚠️ No tests included

  • Configuration changes typically don't have unit tests
  • Recommendation: Consider adding integration tests that verify:
    • OTEL collector successfully receives and exports metrics
    • Grafana can query the new metrics tables
    • Dashboards render without errors

📝 Recommendations

  1. High Priority:

    • Add a PR description explaining what dashboard issues were fixed
    • Verify that regenerating all docker configs was intentional
  2. Medium Priority:

    • Consider reviewing the MICRO_BUCKETS range - 50s seems high for micro-level operations
    • Add integration tests for the observability pipeline if not already present elsewhere
  3. Low Priority:

    • Consider documenting the OTEL metrics pipeline in your observability docs
    • The grafana.ts file has a duplicate import comment (lines 1-2) that could be cleaned up

✨ Positive Highlights

  • Follows CLAUDE.md conventions perfectly (updated template, regenerated configs)
  • Rust code follows structured logging patterns correctly
  • Metrics properly use histogram buckets and dimensions
  • Configuration is well-structured and maintainable
  • Good separation of concerns between templates and generated configs

Final Verdict

Approve with minor recommendations

The code quality is solid and follows project conventions well. The main gaps are documentation (PR description) and test coverage. The changes are safe to merge but would benefit from the documentation improvements mentioned above.

@claude

claude bot commented Nov 5, 2025

Pull Request Review

Overview

This PR titled "fix: fix all dashboards" makes changes to metrics collection and Grafana dashboard configurations. The changes span across:

  • 7 Rust source files (minimal changes)
  • 50 generated configuration files (Grafana dashboards, OTEL collector configs)

Code Quality & Best Practices

✅ Positive Findings

  1. Consistent Metrics Updates: The changes to histogram bucket configurations are consistent across all metrics files:

    • packages/metrics/src/buckets.rs: Updated bucket definitions
    • packages/gasoline/src/metrics.rs: Switched from BUCKETS to using workspace-defined buckets
    • packages/pegboard/src/metrics.rs: Switched from custom buckets to MICRO_BUCKETS
  2. Proper Separation of Concerns: Changes to the OTEL collector configuration are properly made in the template (docker/template/src/services/edge/otel-collector-server.ts) which generates the actual configs, following the CLAUDE.md guidance.

  3. Logging Best Practices: The changes in api-builder/src/middleware.rs properly use structured logging with the ? prefix for debug formatting (lines 143, 163, etc.).

⚠️ Observations & Recommendations

  1. Bucket Configuration Change (packages/metrics/src/buckets.rs:1-8)

    • The BUCKETS array was modified but the changes are subtle. Consider adding a comment explaining the rationale for these specific bucket boundaries.
    • The comment mentions "Added" for certain values but doesn't explain why these specific thresholds were chosen.
  2. OTEL Collector Configuration (docker/template/src/services/edge/otel-collector-server.ts:90-94)

    • Added metrics pipeline to OTEL collector configuration
    • Question: Was metrics collection previously not configured? This seems like a significant addition that might warrant more explanation in the PR description.
  3. Minimal PR Description

    • The PR body is empty. Given the scope of changes (especially to metrics and dashboards), it would be helpful to document:
      • What dashboard issues were being fixed
      • Why bucket configurations needed to change
      • What the metrics pipeline addition enables
      • Any before/after screenshots of the dashboards
  4. gasoline/src/ctx/workflow.rs:995-998

    • The LOOP_ITERATION_DURATION metric uses dt - dt2, which is the overhead time excluding the actual loop body execution
    • This is intentional based on the metric description, but the calculation could benefit from a comment explaining this subtraction (see the sketch after this list)
  5. providers.rs Line Duplication (packages/metrics/src/providers.rs:1-2)

    • There appear to be duplicate comment lines at the top of the file (both line 1 and 2 are identical)
    • Minor cleanup opportunity
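
For illustration, a hedged sketch of the measurement and the clarifying comment suggested in item 4; the names dt and dt2 mirror the review's description, and none of this is the actual workflow code:

// Hypothetical sketch of the loop-overhead measurement described above.
// `dt` is the total time for one iteration, `dt2` is the time spent in the loop
// body, so `dt - dt2` is the engine overhead around the body.
use std::time::Instant;

fn run_iteration(body: impl FnOnce()) -> f64 {
    let iteration_start = Instant::now();

    // ... engine bookkeeping before the body ...

    let body_start = Instant::now();
    body();
    let dt2 = body_start.elapsed().as_secs_f64();

    // ... engine bookkeeping after the body ...

    let dt = iteration_start.elapsed().as_secs_f64();

    // Record only the overhead, excluding the loop body itself.
    dt - dt2
}

fn main() {
    let overhead = run_iteration(|| std::thread::sleep(std::time::Duration::from_millis(5)));
    println!("loop overhead: {overhead:.6}s");
}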

Testing & Verification

Recommendations:

  1. Verify that all Grafana dashboards render correctly with the new bucket configurations
  2. Test that metrics are properly flowing through the new OTEL collector metrics pipeline
  3. Confirm that histogram buckets appropriately cover the expected value ranges for your metrics

Security Concerns

No security issues identified. The changes are primarily configuration and metrics-related.

Performance Considerations

Minor observations:

  • The new histogram buckets should provide better granularity for metrics tracking
  • The OTEL collector batch processing (10000 events, 5s timeout) seems reasonable but should be monitored for memory usage in production

Summary

This is a clean, focused PR that improves metrics collection and dashboard configuration. The Rust code changes follow project conventions well. The main improvement would be adding context to the PR description about what specific dashboard issues were fixed.

Recommendation: ✅ Approve with minor suggestions

The code quality is good and follows the project's patterns. The suggestions above are mostly about documentation and minor cleanup opportunities rather than blocking issues.

@claude

claude bot commented Nov 5, 2025

PR Review: Fix all dashboards

Summary

This PR updates metrics collection and dashboard configurations to improve OpenTelemetry integration. The changes primarily focus on:

  • Adding 0.0 to histogram bucket boundaries for OTEL compatibility
  • Simplifying pegboard metrics definitions
  • Updating OpenTelemetry collector configurations
  • Regenerating Grafana dashboards

Code Quality

Positive:

  • Clean metric histogram bucket updates that maintain backward compatibility
  • Proper use of workspace dependencies and existing patterns
  • Consistent structured logging with tracing spans
  • Good error handling throughout

Suggestions:

  1. Duplicate comment in providers.rs:1-2

    • Lines 1-2 have identical comments. Consider removing the duplicate.
  2. Magic numbers in buckets.rs

    • The bucket values (especially in TASK_POLL_BUCKETS) would benefit from comments explaining the logarithmic progression or calculation basis (an illustrative sketch of such a progression follows this list)
    • The comment on line 16 mentions "Calculated based on the LogHistogram configuration" but the values are hard to verify without the source
  3. Error handling in providers.rs:96

    • .unwrap() is used when building the OTLP exporter. Consider handling this more gracefully or documenting why it's safe to panic here
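
For illustration, a small self-contained sketch of how a logarithmic (geometric) bucket progression can be generated and documented; the values are illustrative, not the actual TASK_POLL_BUCKETS:

// Hypothetical illustration of a logarithmic bucket progression: each boundary is
// the previous one multiplied by a constant factor, which is the kind of derivation
// a comment in buckets.rs could spell out.
fn log_buckets(start: f64, factor: f64, count: usize) -> Vec<f64> {
    (0..count).map(|i| start * factor.powi(i as i32)).collect()
}

fn main() {
    // e.g. 0.001s, 0.002s, 0.004s, ... doubling up to ~0.5s
    println!("{:?}", log_buckets(0.001, 2.0, 10));
}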

Potential Issues

  1. Missing error context in api-builder/middleware.rs:204-212

    • The metrics recording doesn't handle potential errors from KeyValue creation
    • If error_code formatting fails, the entire metric recording could fail silently
  2. Hardcoded retry configuration in otel-collector-server.ts:70-75

    • Retry intervals and timeouts are hardcoded. Consider making these configurable for different environments
  3. Gasoline metrics consistency in gasoline/metrics.rs

    • Line 17: LAST_PULL_WORKFLOWS_HISTORY_DURATION is u64 while similar metrics like LAST_PULL_WORKFLOWS_DURATION (line 13) are f64
    • This inconsistency could cause confusion. Consider using f64 consistently for duration metrics

Performance Considerations

  1. Metric cardinality in pegboard/workflows/runner.rs:1140-1146

    • The ACTOR_PENDING_ALLOCATION metric uses both namespace_id and runner_name as labels
    • High cardinality metrics (especially with UUIDs) can impact performance at scale
    • Consider if namespace_id is necessary or if aggregation at the runner_name level is sufficient
  2. Batch processing in OpenTelemetry

    • Good use of batch processor in OTEL config with send_batch_size: 10000
    • The 5s timeout is reasonable for development but may need tuning for production

Security Concerns

No major security issues found, but note:

  • Password interpolation in OTEL config uses environment variable correctly (${env:CLICKHOUSE_PASSWORD})
  • All metrics exposure appears to be internal only

Test Coverage

Concern: This PR doesn't include any test updates or additions. Consider:

  • Unit tests for the new histogram bucket configurations
  • Integration tests to verify OTEL metrics are correctly exported
  • Validation that Grafana dashboards correctly query the new metric structure

Additional Notes

  1. Documentation: Consider adding a comment explaining why 0.0 was added to bucket boundaries (OTEL compatibility)

  2. Dashboard changes: The Grafana dashboard changes are extensive (5000+ lines). Ensure these were regenerated from templates and not manually edited

  3. Metric naming: All metrics follow OpenTelemetry naming conventions well (e.g., rivet_gasoline_workflow_total)

Recommendations

Priority:

  • High: Add tests for metric collection and OTEL integration
  • Medium: Fix the inconsistent duration metric types in gasoline
  • Low: Add documentation comments for bucket calculations

Overall, this is a solid infrastructure improvement that enhances observability. The code quality is good, following Rivet's conventions. Main concern is the lack of test coverage for these changes.

Generated with Claude Code

@graphite-app
Contributor

graphite-app bot commented Nov 5, 2025

Merge activity

  • Nov 5, 2:20 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 5, 2:21 AM UTC: CI is running for this pull request on a draft pull request (#3367) due to your merge queue CI optimization settings.
  • Nov 5, 2:29 AM UTC: Merged by the Graphite merge queue via draft PR: #3367.
