Solver participation guard #3257

Open
wants to merge 29 commits into
base: main

Conversation

squadgazzz
Contributor

@squadgazzz squadgazzz commented Jan 29, 2025

Description

From the original issue:

When a solver repeatedly wins consecutive auctions but fails to settle its solutions on-chain, it can lead to system downtime. To prevent this, the autopilot must have the capability to temporarily exclude such solvers from participating in competitions. This ensures no single solver can disrupt the system's operations.

This PR implements it by introducing a new struct, which checks whether the solver is allowed to participate in the next competition by using two different approaches:

  1. Moved the existing Authenticator's is_solver on-chain call into the new struct.
  2. Introduced a new strategy, which finds non-settling solvers using a SQL query. It selects the last 3 auctions (configurable) with a deadline up to the current block, to avoid selecting pending settlements, and checks whether the same solver or solvers (in case of multiple winners) won all of them without settling. This strategy caches the results to avoid redundant DB queries. The query relies on the auction_id column of the settlements table, which is updated separately by the Observer struct, so the cache is refreshed only once the Observer has produced a result.
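
The selection rule of the second strategy can be sketched in plain Rust over in-memory data. This is only an illustration of the logic, not the PR's SQL query: the `Auction` struct, its fields, and this `find_non_settling_solvers` signature are hypothetical, and treating "deadline up to the current block" as `deadline_block <= current_block` is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified view of one auction (not the PR's schema).
#[derive(Clone, Debug)]
pub struct Auction {
    pub id: u64,
    /// Block by which the winning solutions had to be settled.
    pub deadline_block: u64,
    /// Solvers that won this auction (several in case of multiple winners).
    pub winners: Vec<String>,
    /// Solvers whose settlement for this auction was observed on-chain.
    pub settled_by: Vec<String>,
}

/// Winners of an auction that have no observed settlement for it.
fn unsettled_winners(auction: &Auction) -> HashSet<String> {
    auction
        .winners
        .iter()
        .filter(|w| !auction.settled_by.contains(*w))
        .cloned()
        .collect()
}

/// Returns solvers that won each of the last `last_n` auctions whose deadline
/// has passed and settled none of them.
pub fn find_non_settling_solvers(
    auctions: &[Auction],
    current_block: u64,
    last_n: usize,
) -> HashSet<String> {
    // Assumption: auctions whose deadline is still ahead may have pending
    // settlements, so they are skipped.
    let mut finished: Vec<&Auction> = auctions
        .iter()
        .filter(|a| a.deadline_block <= current_block)
        .collect();
    // Newest auctions first.
    finished.sort_by(|a, b| b.id.cmp(&a.id));
    let recent = &finished[..finished.len().min(last_n)];
    // Without enough history nobody is flagged.
    if last_n == 0 || recent.len() < last_n {
        return HashSet::new();
    }
    // Keep only solvers that appear as unsettled winners in every auction.
    let mut candidates = unsettled_winners(recent[0]);
    for auction in &recent[1..] {
        let unsettled = unsettled_winners(auction);
        candidates.retain(|solver| unsettled.contains(solver));
    }
    candidates
}
```

With `last_n = 3`, a solver is flagged only if it appears as an unsettled winner in all three of the most recent finished auctions; one settled auction in between clears it.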

These validators are called sequentially to avoid redundant RPC calls to Authenticator. The guard first checks the DB-based validator's cache and only then sends the RPC call.

Once one of the strategies says the solver is not allowed to participate, the solver gets deny-listed for 5 minutes (configurable).

Each validator can be enabled/disabled separately in case of any issue.
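
A minimal synchronous sketch of this flow, assuming a hypothetical `Validator` trait and `ParticipationGuard` type (the PR's actual types are async and may be shaped differently): validators run in order with the cheap check first, and the first rejection puts the solver on a time-limited deny list.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical validator interface for this sketch.
pub trait Validator {
    fn is_allowed(&self, solver: &str) -> bool;
}

/// Lets plain closures act as validators, which keeps the example short.
impl<F: Fn(&str) -> bool> Validator for F {
    fn is_allowed(&self, solver: &str) -> bool {
        (self)(solver)
    }
}

pub struct ParticipationGuard {
    /// Checked in order; cheap checks (e.g. a DB cache) go first.
    validators: Vec<Box<dyn Validator>>,
    /// Solver name -> instant until which it stays deny-listed.
    banned_until: HashMap<String, Instant>,
    /// How long a rejected solver stays deny-listed.
    ttl: Duration,
}

impl ParticipationGuard {
    pub fn new(validators: Vec<Box<dyn Validator>>, ttl: Duration) -> Self {
        Self { validators, banned_until: HashMap::new(), ttl }
    }

    pub fn can_participate(&mut self, solver: &str, now: Instant) -> bool {
        // A still-valid deny-list entry answers without calling any validator.
        if let Some(until) = self.banned_until.get(solver).copied() {
            if now < until {
                return false;
            }
            self.banned_until.remove(solver);
        }
        // `all` short-circuits: later (more expensive) validators are skipped
        // as soon as one validator rejects the solver.
        let allowed = self.validators.iter().all(|v| v.is_allowed(solver));
        if !allowed {
            self.banned_until.insert(solver.to_string(), now + self.ttl);
        }
        allowed
    }
}
```

Passing `now` explicitly instead of calling `Instant::now()` inside the guard is a choice made here only so that the TTL behaviour is easy to test.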

Metrics

Added a metric that the DB-based validator populates once a solver is marked as banned. The idea is to create an alert that fires if there are more than 4 such occurrences within the last 30 minutes for the same solver, meaning that disabling the solver should be considered.
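
The alert condition itself would normally live in the monitoring stack, not in the autopilot. Purely to make the rule concrete, here is a sketch of the same sliding-window check in Rust; `BanAlert` and its methods are hypothetical names and not part of the PR.

```rust
use std::collections::{HashMap, VecDeque};
use std::time::{Duration, Instant};

/// Hypothetical sliding-window counter of ban events per solver.
pub struct BanAlert {
    window: Duration,
    threshold: usize,
    events: HashMap<String, VecDeque<Instant>>,
}

impl BanAlert {
    pub fn new(window: Duration, threshold: usize) -> Self {
        Self { window, threshold, events: HashMap::new() }
    }

    /// Records a ban and returns true if the solver now has more than
    /// `threshold` bans inside the window ending at `now`.
    pub fn record_ban(&mut self, solver: &str, now: Instant) -> bool {
        let bans = self.events.entry(solver.to_string()).or_default();
        bans.push_back(now);
        // Drop events that fell out of the window.
        while let Some(&oldest) = bans.front() {
            if now.duration_since(oldest) > self.window {
                bans.pop_front();
            } else {
                break;
            }
        }
        bans.len() > self.threshold
    }
}
```

With `window` set to 30 minutes and `threshold` set to 4, the fifth ban of the same solver inside the window is the first one to return `true`.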

Open discussions

  1. Since the current SQL query filters out auctions whose deadline has not been reached, the following case is possible:
    The solver gets banned while it still has a pending settlement. If that settlement then succeeds, the solver remains banned. Although this is a niche case, it would be better to unblock the solver before the cache TTL expires. This has not been implemented in the current PR because it requires some refactoring of the Observer struct. If this is approved, it can be implemented quickly.

  2. Whether it makes sense to introduce a metrics-based strategy, similar to the bad token detector's, where the solver gets banned if >95% (or similar) of its settlements fail.

How to test

A new SQL query test. Existing e2e tests.

Related Issues

Fixes #3221

Summary by CodeRabbit

  • New Features
     - Introduced advanced solver participation controls with configurable eligibility checks, integrating both on-chain and database validations.
     - Enabled asynchronous real-time notifications for settlement updates, enhancing system responsiveness.
     - Added metrics tracking to monitor auction participation and performance.

  • Chores
     - Updated internal dependencies and restructured driver configuration.
     - Reorganized the database schema to support improved auction and settlement processing.

@squadgazzz squadgazzz changed the title Solver participation validator Solver participation gate Jan 29, 2025
@squadgazzz squadgazzz changed the title Solver participation gate Solver participation guard Jan 29, 2025
@squadgazzz squadgazzz marked this pull request as ready for review January 29, 2025 17:00
@squadgazzz squadgazzz requested a review from a team as a code owner January 29, 2025 17:00
@squadgazzz squadgazzz marked this pull request as draft January 29, 2025 17:01
@squadgazzz squadgazzz marked this pull request as ready for review January 29, 2025 18:00
@squadgazzz squadgazzz force-pushed the blacklist-failing-solvers branch 2 times, most recently from f69e174 to 5fc831e Compare January 30, 2025 20:11
Contributor

@mstrug mstrug left a comment

I like the approach of using a trait-based list of validators, as it is easily extensible.

Contributor

@sunce86 sunce86 left a comment

When I initially thought about this task, it was much simpler in my head.

Do we really need to do this in an async way? Sure, it is probably more efficient since we avoid spending time on the critical runloop path, but it comes at a cost of > 600 added lines, coupling this component with settlement::Observer, and introducing caching as well.

Setting the review to "blocking" because of the mentioned issue with settlement::Observer. 👇

Why not just:

  1. Runloop starts
  2. Validator calls one infra function that asks the db whether the solver is supposed to be banned (in 'just in time' manner) and the whole logic is implemented synchronously.

I also don't buy the redundant db calls argument since we call this functionality once per auction at most.


coderabbitai bot commented Feb 7, 2025

Walkthrough

This pull request introduces several additions and modifications across multiple modules. A new dependency is added in Cargo.toml, and key structures in the autopilot package are updated with extra fields and configuration for managing solver participation. New asynchronous methods are implemented to identify non-settling solvers both at the database level and via SQL queries. A dedicated domain module (SolverParticipationGuard) is created with both on‐chain and database validators to determine solver eligibility. Additionally, settlement observer functionality is enhanced and integrated into the run loop, with a minor comment fix in the driver metrics code.

Changes

File(s) | Change Summary
crates/autopilot/{Cargo.toml, src/arguments.rs, src/infra/solvers/mod.rs, src/run.rs, src/run_loop.rs, src/domain/settlement/observer.rs} | Added dashmap dependency; updated solver configuration with new fields for participation guard and unsettled blocking; introduced a settlement updates channel and integrated solver participation checks in the run process.
crates/autopilot/src/domain/competition/{mod.rs, participation_guard/*} | Introduced a new SolverParticipationGuard module with dedicated DB and on-chain validators to control solver eligibility based on settlement history.
crates/{autopilot/src/database/competition.rs, database/src/{solver_competition.rs, lib.rs}} | Added asynchronous methods to identify non-settling solvers and restructured database TABLES to reflect an updated schema.
crates/driver/src/domain/competition/bad_tokens/metrics.rs | Corrected a typographical error in a comment.

Sequence Diagram(s)

sequenceDiagram
  participant RL as RunLoop
  participant SPG as SolverParticipationGuard
  participant DB as DB Validator
  participant OC as Onchain Validator

  RL->>SPG: can_participate(solver)
  SPG->>DB: is_allowed(solver)
  DB-->>SPG: returns allowed (bool)
  alt Further check required
      SPG->>OC: is_allowed(solver)
      OC-->>SPG: returns allowed (bool)
  end
  SPG-->>RL: participation result

Assessment against linked issues

Objective Addressed Explanation
Kick out solver from competition if not settling [#3221]
Settlement performance tracking via DB and on-chain checks [#3221]

Poem

Hoppy code and dandy bytes,
I’m a rabbit on a coding spree,
Hopping through new fields and lights,
With solvers checked so sprightly and free.
Carrots and commits, cheers to the spree!
🥕🐇 Happy hopping in our PR!



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
crates/autopilot/src/domain/competition/participation_guard/db.rs (1)

57-108: Consider a debounce or rate limit on database queries.

start_maintenance triggers a database query whenever a new settlement signal is received. If updates arrive in rapid succession, this could cause overhead. You might consider batching or debouncing queries to reduce load.
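
One way to picture such a limit is a small throttle that runs the query at most once per interval and remembers whether a signal arrived in between. The sketch below is synchronous and uses only the standard library; `QueryThrottle` and its methods are hypothetical, and how it would plug into `start_maintenance` is left open.

```rust
use std::time::{Duration, Instant};

/// Hypothetical throttle: at most one query per `min_interval`, with one
/// deferred run if signals arrived while the throttle was closed.
pub struct QueryThrottle {
    min_interval: Duration,
    last_run: Option<Instant>,
    pending: bool,
}

impl QueryThrottle {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_run: None, pending: false }
    }

    /// True if enough time has passed since the last query (or none ran yet).
    fn ready(&self, now: Instant) -> bool {
        self.last_run
            .map_or(true, |last| now.duration_since(last) >= self.min_interval)
    }

    fn mark_run(&mut self, now: Instant) {
        self.last_run = Some(now);
        self.pending = false;
    }

    /// Call on every settlement signal; returns true if the query should run now.
    pub fn on_signal(&mut self, now: Instant) -> bool {
        if self.ready(now) {
            self.mark_run(now);
            true
        } else {
            // Remember that newer data exists so a later tick can pick it up.
            self.pending = true;
            false
        }
    }

    /// Call periodically; returns true if a deferred query should run now.
    pub fn on_tick(&mut self, now: Instant) -> bool {
        if self.pending && self.ready(now) {
            self.mark_run(now);
            true
        } else {
            false
        }
    }
}
```

The deferred run matters here: without it, a burst of signals followed by silence would leave the cache one update behind.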

crates/database/src/solver_competition.rs (2)

100-149: Consider adding indexes for better query performance.

The find_non_settling_solvers function performs multiple joins on auction_id and solver columns. Ensure that indexes exist for proposed_solutions(auction_id, solver) and settlements(auction_id, solver) to help optimize these queries.


590-755: Reevaluate the ignored test.

postgres_non_settling_solvers_roundtrip is marked with #[ignore], so it won’t run by default. Consider enabling it or documenting why it remains ignored, ensuring fixes to any environment or data dependencies so it can be regularly tested.

crates/autopilot/src/infra/solvers/mod.rs (1)

24-24: Document the new field's purpose and impact.

The accepts_unsettled_blocking field lacks documentation explaining its purpose and how it affects solver behavior.

Add documentation above the field:

+    /// Whether this solver accepts solutions that contain unsettled blocking transactions.
+    /// If false, the solver will be temporarily excluded from participating in auctions
+    /// when it has unsettled solutions.
     pub accepts_unsettled_blocking: bool,
crates/autopilot/src/domain/settlement/observer.rs (1)

63-68: Reconsider the trigger mechanism for solver participation validation.

Based on past review comments, using settlement updates as a trigger for the solver participation guard might not be reliable enough. A solver could start winning without settling for multiple auctions before this trigger is hit.

Consider implementing an additional block-based trigger mechanism that proactively checks for non-settling solvers on each new block, as this would provide more timely detection of problematic behavior.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6b7f719 and 366611d.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • crates/autopilot/Cargo.toml (1 hunks)
  • crates/autopilot/src/arguments.rs (8 hunks)
  • crates/autopilot/src/database/competition.rs (1 hunks)
  • crates/autopilot/src/domain/competition/mod.rs (1 hunks)
  • crates/autopilot/src/domain/competition/participation_guard/db.rs (1 hunks)
  • crates/autopilot/src/domain/competition/participation_guard/mod.rs (1 hunks)
  • crates/autopilot/src/domain/competition/participation_guard/onchain.rs (1 hunks)
  • crates/autopilot/src/domain/mod.rs (1 hunks)
  • crates/autopilot/src/domain/settlement/observer.rs (2 hunks)
  • crates/autopilot/src/infra/solvers/mod.rs (3 hunks)
  • crates/autopilot/src/run.rs (5 hunks)
  • crates/autopilot/src/run_loop.rs (5 hunks)
  • crates/database/src/lib.rs (1 hunks)
  • crates/database/src/solver_competition.rs (3 hunks)
  • crates/driver/src/domain/competition/bad_tokens/metrics.rs (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • crates/driver/src/domain/competition/bad_tokens/metrics.rs
🔇 Additional comments (22)
crates/autopilot/src/domain/competition/participation_guard/mod.rs (3)

21-21: Remove redundant +Send +Sync bounds.

The trait definition at line 69 already explicitly requires implementors to be Send + Sync. Hence, specifying "+ Send + Sync" again in the vector type is redundant.

Also applies to: 32-32
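
The point about redundant bounds can be checked with a small standalone example. The trait below only mirrors the shape described in the comment (a trait with `Send + Sync` supertraits); `AllowAll`, `assert_send_sync`, and the use of `Arc` are assumptions made for the illustration.

```rust
use std::sync::Arc;

/// Hypothetical trait with `Send + Sync` as supertraits.
pub trait Validator: Send + Sync {
    fn is_allowed(&self, solver: &str) -> bool;
}

/// Trivial implementor used only for the demonstration.
pub struct AllowAll;

impl Validator for AllowAll {
    fn is_allowed(&self, _solver: &str) -> bool {
        true
    }
}

/// Compiles only when `T` is both `Send` and `Sync`.
pub fn assert_send_sync<T: Send + Sync>() {}

pub fn check() {
    // No `+ Send + Sync` on the trait object: the supertraits already make
    // `dyn Validator` itself `Send + Sync`.
    assert_send_sync::<Vec<Arc<dyn Validator>>>();
}
```

If the supertraits were removed from the trait, `check` would stop compiling, which is when the explicit `dyn Validator + Send + Sync` spelling becomes necessary.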


25-51: Consider verifying the reliability of current_block().clone().

Since the database-based validator depends on the current block, ensure that calling eth.current_block().clone() provides an accurate and up-to-date block number. If the block number is crucial for eligibility checks, you might consider refreshing it or handling potential stale data from the block watcher.

Would you like me to generate a script to locate all references to current_block usage and verify if it’s always up to date?


53-67: LGTM: Sequential validator flow is clear and concise.

The flow for can_participate is straightforward and reduces redundant RPC calls by checking each validator sequentially. This approach is well-documented in the comments.

crates/autopilot/src/domain/competition/participation_guard/db.rs (2)

68-81: Refactor driver-name extraction in a single pass.

You can avoid collecting non_settling_drivers first and then mapping them again to driver names. Instead, you can use an approach like ‘unzip’ to extract both the driver references and their names in one operation.
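
A sketch of the single-pass idea using `Iterator::unzip` from the standard library. The `Driver` struct and `split_non_settling` function are simplified stand-ins, not the PR's code.

```rust
/// Hypothetical, simplified driver record.
#[derive(Clone, Debug, PartialEq)]
pub struct Driver {
    pub name: String,
}

/// Collects the matching drivers and their names in one pass.
pub fn split_non_settling<'a>(
    drivers: &'a [Driver],
    non_settling: &[&str],
) -> (Vec<&'a Driver>, Vec<&'a str>) {
    drivers
        .iter()
        .filter(|d| non_settling.contains(&d.name.as_str()))
        // Emit a (driver, name) pair so `unzip` can fill both vectors at once.
        .map(|d| (d, d.name.as_str()))
        .unzip()
}
```

Whether this reads better than two short passes is a matter of taste; the list of drivers is small, so the difference is about clarity, not performance.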


93-98: Filter out disabled drivers earlier.

You can skip drivers that have accepts_unsettled_blocking = false at the constructor level instead of filtering them at runtime in the start_maintenance method. This approach aligns with a past suggestion and can help streamline logic.

crates/autopilot/src/domain/competition/participation_guard/onchain.rs (1)

1-20: LGTM! Clean implementation of on-chain validator.

The implementation correctly moves the on-chain call from the Authenticator to this new struct, with proper error handling and async/await patterns.

crates/autopilot/src/domain/mod.rs (2)

21-23: Fix grammatical issue in comment.

The comment should be "How many times the solver was marked" instead of "How many times the solver was marked".


18-31: LGTM! Well-structured metrics implementation.

The implementation correctly uses prometheus for tracking non-settling solvers, with proper labeling and thread-safe static instance pattern.

crates/autopilot/src/domain/competition/mod.rs (1)

8-14: LGTM! Clean module organization.

The changes properly integrate the new participation guard functionality while maintaining clean module organization.

crates/database/src/lib.rs (1)

52-82: Verify database schema consistency.

The table structure shows potential redundancy. Please verify:

  1. The relationship between "orders" and "order_quotes"
  2. The distinction between different types of order-related tables
  3. Data consistency across related tables

This will help ensure the schema restructuring maintains data integrity.

Run this script to analyze table relationships:

crates/autopilot/src/database/competition.rs (2)

143-146: LGTM! Clear and descriptive documentation.

The documentation clearly explains the purpose and behavior of the function.


147-165: Consider enhancing the solver participation criteria.

Based on the past review comments, the current implementation only checks for solvers that win consecutive auctions without settling. However, it might be beneficial to also consider solvers that win less frequently but consistently fail to settle their solutions.

Let's verify if there are any existing metrics for tracking settlement failures:

crates/autopilot/src/infra/solvers/mod.rs (1)

42-42: LGTM! Consistent parameter propagation.

The accepts_unsettled_blocking parameter is consistently propagated through driver initialization.

Also applies to: 75-75

crates/autopilot/src/domain/settlement/observer.rs (1)

22-22: LGTM! Clear field declaration.

The field type clearly indicates its purpose for sending settlement update notifications.

crates/autopilot/src/run.rs (3)

368-369: LGTM! Clear channel setup.

The unbounded channel is appropriately created for settlement updates.


580-589: LGTM! Comprehensive guard initialization.

The SolverParticipationGuard is properly initialized with all necessary dependencies, including Ethereum instance, database, settlement updates receiver, and driver information.


596-596: LGTM! Proper integration with RunLoop.

The solver participation guard is correctly integrated into the RunLoop initialization.

crates/autopilot/src/arguments.rs (2)

413-418: LGTM!

The addition of accepts_unsettled_blocking field to the Solver struct aligns with the requirement to make the participation guard opt-in on a solver-by-solver basis.


466-486: LGTM!

The parsing logic for accepts_unsettled_blocking flag is correctly implemented, allowing solvers to opt-in to the participation guard.

crates/autopilot/src/run_loop.rs (2)

738-742: LGTM!

The error handling for participation check failures is appropriate, using error! log level to indicate system-level failures.


744-747: 🛠️ Refactor suggestion

Add notification logic for solver participation status.

Based on the past review comments, we should notify the driver when they are denied participation or miss an auction. This helps external teams debug issues immediately.

Apply this diff to add notification logic:

     // Do not send the request to the driver if the solver is deny-listed
     if !can_participate {
+        // Notify the driver about their participation status
+        if let Err(err) = driver.notify_participation_status(false).await {
+            tracing::warn!(?err, driver = %driver.name, "failed to notify driver about participation status");
+        }
         return Err(SolveError::SolverDenyListed);
     }

Likely invalid or redundant comment.

crates/autopilot/Cargo.toml (1)

28-28: LGTM!

The addition of the dashmap dependency is appropriate for efficient concurrent access to the solver participation cache.

@squadgazzz
Contributor Author

Validator calls one infra function that asks the db whether the solver is supposed to be banned (in 'just in time' manner) and the whole logic is implemented synchronously.

@sunce86, there is already another proposal for additional DB-based statistics. These SQL queries are not light enough for us to afford executing them right inside the run loop.

@sunce86
Contributor

sunce86 commented Feb 7, 2025

Validator calls one infra function that asks the db whether the solver is supposed to be banned (in 'just in time' manner) and the whole logic is implemented synchronously.

@sunce86, there is already another proposal for additional DB-based statistics. These SQL queries are not light enough for us to afford executing them right inside the run loop.

Understood.

So, as you already mentioned ☝️ , Validator maintenance can also be triggered by each new competition saved to database.
Can we then:

  1. On each new competition saved, infra::Persistence::save_solutions triggers the Validator maintenance loop (it could even pass saved Competition object or just the data validator needs).
  2. If not passed, Validator fetches the latest competition from infra (db) and puts it into its cache.
  3. All types of validators analyze the same data cached in Validator.

Any issue with this approach?

@squadgazzz
Contributor Author

Any issue with this approach?

If I understood correctly, you mean some kind of FIFO cache. There is still the settlements data that needs to be fetched from the DB. I thought about this initially, but it adds more complexity to an already non-trivial solution. With the current approach, all the data comes from one source.

@sunce86
Contributor

sunce86 commented Feb 10, 2025

There is still the settlements data that needs to be fetched from the DB.

Maybe Validator can be signalled from two sources:

  1. Inform Validator when each competition/auction is saved.
  2. Inform Validator when each settlement is observed onchain.

Then combine that data internally in Validator to match each settlement to its competition/auction and deduce which competitions ended up without a settlement (using the auction deadline block).
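
To make the proposal concrete, here is a minimal in-memory sketch of a tracker fed from those two signals. `SettlementTracker` and its methods are hypothetical, it assumes a single winner per auction for brevity, and it ignores cache eviction and restart handling.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical tracker combining "competition saved" and
/// "settlement observed" signals.
#[derive(Default)]
pub struct SettlementTracker {
    /// Auction id -> (deadline block, winning solver).
    competitions: BTreeMap<u64, (u64, String)>,
    /// Auction ids for which a settlement was observed on-chain.
    settled: HashSet<u64>,
}

impl SettlementTracker {
    /// Signal 1: a competition/auction was saved.
    pub fn on_competition_saved(&mut self, auction_id: u64, deadline_block: u64, winner: &str) {
        self.competitions
            .insert(auction_id, (deadline_block, winner.to_string()));
    }

    /// Signal 2: a settlement for the auction was observed on-chain.
    pub fn on_settlement_observed(&mut self, auction_id: u64) {
        self.settled.insert(auction_id);
    }

    /// Auctions whose deadline has passed without an observed settlement,
    /// ordered by auction id, together with the winning solver.
    pub fn unsettled(&self, current_block: u64) -> Vec<(u64, &str)> {
        self.competitions
            .iter()
            .filter(|(id, (deadline, _))| {
                *deadline <= current_block && !self.settled.contains(*id)
            })
            .map(|(id, (_, winner))| (*id, winner.as_str()))
            .collect()
    }
}
```

Even this small version has to decide how long to keep old entries and how to rebuild its state after a restart, which are the costs raised in the reply that follows.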

@squadgazzz
Contributor Author

Then combine that data internally in Validator to match each settlement to its competition/auction and deduce which competitions ended up without a settlement (using the auction deadline block).

Why I initially didn't go with this approach (mostly because of the second point):

  1. At first glance, it seems more complex than SQL queries. Retrieve two different types of data from different sources. Update the data accordingly. Maintain a reasonable cache size.
  2. On each restart, for the statistic-based validators, it would require either executing an SQL query to populate the initial data or waiting for N auctions to accumulate enough data to start blocking the solvers.

@sunce86 Does it make sense? I am not against implementing a more complex and probably more efficient approach, but the only benefit I see is the reduction of DB queries.

@squadgazzz squadgazzz requested a review from sunce86 February 11, 2025 21:06
Contributor

@sunce86 sunce86 left a comment

Mostly nits.

Do you plan to write an e2e test for this? Otherwise I'm afraid we won't have high confidence that it's working properly, and we would just be hoping for the best in prod 🙏

@squadgazzz
Contributor Author

Do you plan to write an e2e test for this?

Yep, I will open another PR since 600+ lines of code are already too much.

Contributor

@sunce86 sunce86 left a comment

LG.

This new logic will always return at least one non-settling solver right? The one that is currently being settled. Not sure what to do with this information but you might want to skip logging/alerting for that special one.

@squadgazzz
Contributor Author

This new logic will always return at least one non-settling solver right? The one that is currently being settled.

I don't see why that would be the case. The currently settling auction gets filtered out in the query because of this: https://github.com/cowprotocol/services/pull/3257/files#diff-ecc7354b24bcc39d93bfb90181abe577203cc25d8f94c9886b2f5f3f1b7894d5R112

Development

Successfully merging this pull request may close these issues.

feat: Kick out solver from competition if not settling
4 participants