-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Solver participation guard #3257
base: main
Are you sure you want to change the base?
Conversation
crates/autopilot/src/domain/competition/solver_participation_guard.rs
Outdated
Show resolved
Hide resolved
f69e174
to
5fc831e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach which uses trait based validators list, as it is easily extensible.
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I initially thought about this task, it was much simpler in my head.
Do we really need to do this in async way? Sure it is probably more efficient since we avoid spending time on the critical runloop path, but it does come with a cost of > 600 lines added, coupling this component with settlement::Observer
and introducing caching as well.
Setting the review to "blocking" because of the mentioned issue with settlement::Observer
. 👇
Why not just:
- Runloop starts
- Validator calls one
infra
function that asks the db whether the solver is supposed to be banned (in 'just in time' manner) and the whole logic is implemented synchronously.
I also don't buy the redundant db calls argument since we call this functionality once per auction at most.
WalkthroughThis pull request introduces several additions and modifications across multiple modules. A new dependency is added in Cargo.toml, and key structures in the autopilot package are updated with extra fields and configuration for managing solver participation. New asynchronous methods are implemented to identify non-settling solvers both at the database level and via SQL queries. A dedicated domain module (SolverParticipationGuard) is created with both on‐chain and database validators to determine solver eligibility. Additionally, settlement observer functionality is enhanced and integrated into the run loop, with a minor comment fix in the driver metrics code. Changes
Sequence Diagram(s)sequenceDiagram
participant RL as RunLoop
participant SPG as SolverParticipationGuard
participant DB as DB Validator
participant OC as Onchain Validator
RL->>SPG: can_participate(solver)
SPG->>DB: is_allowed(solver)
DB-->>SPG: returns allowed (bool)
alt Further check required
SPG->>OC: is_allowed(solver)
OC-->>SPG: returns allowed (bool)
end
SPG-->>RL: participation result
Assessment against linked issues
Poem
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
crates/autopilot/src/domain/competition/participation_guard/db.rs (1)
57-108
: Consider a debounce or rate limit on database queries.start_maintenance triggers a database query whenever a new settlement signal is received. If updates arrive in rapid succession, this could cause overhead. You might consider batching or debouncing queries to reduce load.
crates/database/src/solver_competition.rs (2)
100-149
: Consider adding indexes for better query performance.The find_non_settling_solvers function performs multiple joins on auction_id and solver columns. Ensure that indexes exist for proposed_solutions(auction_id, solver) and settlements(auction_id, solver) to help optimize these queries.
590-755
: Reevaluate the ignored test.postgres_non_settling_solvers_roundtrip is marked with #[ignore], so it won’t run by default. Consider enabling it or documenting why it remains ignored, ensuring fixes to any environment or data dependencies so it can be regularly tested.
crates/autopilot/src/infra/solvers/mod.rs (1)
24-24
: Document the new field's purpose and impact.The
accepts_unsettled_blocking
field lacks documentation explaining its purpose and how it affects solver behavior.Add documentation above the field:
+ /// Whether this solver accepts solutions that contain unsettled blocking transactions. + /// If false, the solver will be temporarily excluded from participating in auctions + /// when it has unsettled solutions. pub accepts_unsettled_blocking: bool,crates/autopilot/src/domain/settlement/observer.rs (1)
63-68
: Reconsider the trigger mechanism for solver participation validation.Based on past review comments, using settlement updates as a trigger for the solver participation guard might not be reliable enough. A solver could start winning without settling for multiple auctions before this trigger is hit.
Consider implementing an additional block-based trigger mechanism that proactively checks for non-settling solvers on each new block, as this would provide more timely detection of problematic behavior.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (15)
crates/autopilot/Cargo.toml
(1 hunks)crates/autopilot/src/arguments.rs
(8 hunks)crates/autopilot/src/database/competition.rs
(1 hunks)crates/autopilot/src/domain/competition/mod.rs
(1 hunks)crates/autopilot/src/domain/competition/participation_guard/db.rs
(1 hunks)crates/autopilot/src/domain/competition/participation_guard/mod.rs
(1 hunks)crates/autopilot/src/domain/competition/participation_guard/onchain.rs
(1 hunks)crates/autopilot/src/domain/mod.rs
(1 hunks)crates/autopilot/src/domain/settlement/observer.rs
(2 hunks)crates/autopilot/src/infra/solvers/mod.rs
(3 hunks)crates/autopilot/src/run.rs
(5 hunks)crates/autopilot/src/run_loop.rs
(5 hunks)crates/database/src/lib.rs
(1 hunks)crates/database/src/solver_competition.rs
(3 hunks)crates/driver/src/domain/competition/bad_tokens/metrics.rs
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- crates/driver/src/domain/competition/bad_tokens/metrics.rs
🔇 Additional comments (22)
crates/autopilot/src/domain/competition/participation_guard/mod.rs (3)
21-21
: Remove redundant +Send +Sync bounds.The trait definition at line 69 already explicitly requires implementors to be Send + Sync. Hence, specifying "+ Send + Sync" again in the vector type is redundant.
Also applies to: 32-32
25-51
: Consider verifying the reliability of current_block().clone().Since the database-based validator depends on the current block, ensure that calling eth.current_block().clone() provides an accurate and up-to-date block number. If the block number is crucial for eligibility checks, you might consider refreshing it or handling potential stale data from the block watcher.
Would you like me to generate a script to locate all references to current_block usage and verify if it’s always up to date?
53-67
: LGTM: Sequential validator flow is clear and concise.The flow for can_participate is straightforward and reduces redundant RPC calls by checking each validator sequentially. This approach is well-documented in the comments.
crates/autopilot/src/domain/competition/participation_guard/db.rs (2)
68-81
: Refactor driver-name extraction in a single pass.You can avoid collecting non_settling_drivers first and then mapping them again to driver names. Instead, you can use an approach like ‘unzip’ to extract both the driver references and their names in one operation.
93-98
: Filter out disabled drivers earlier.You can skip drivers that have accepts_unsettled_blocking = false at the constructor level instead of filtering them at runtime in the start_maintenance method. This approach aligns with a past suggestion and can help streamline logic.
crates/autopilot/src/domain/competition/participation_guard/onchain.rs (1)
1-20
: LGTM! Clean implementation of on-chain validator.The implementation correctly moves the on-chain call from the Authenticator to this new struct, with proper error handling and async/await patterns.
crates/autopilot/src/domain/mod.rs (2)
21-23
: Fix grammatical issue in comment.The comment should be "How many times the solver was marked" instead of "How many times the solver was marked".
18-31
: LGTM! Well-structured metrics implementation.The implementation correctly uses prometheus for tracking non-settling solvers, with proper labeling and thread-safe static instance pattern.
crates/autopilot/src/domain/competition/mod.rs (1)
8-14
: LGTM! Clean module organization.The changes properly integrate the new participation guard functionality while maintaining clean module organization.
crates/database/src/lib.rs (1)
52-82
: Verify database schema consistency.The table structure shows potential redundancy. Please verify:
- The relationship between "orders" and "order_quotes"
- The distinction between different types of order-related tables
- Data consistency across related tables
This will help ensure the schema restructuring maintains data integrity.
Run this script to analyze table relationships:
crates/autopilot/src/database/competition.rs (2)
143-146
: LGTM! Clear and descriptive documentation.The documentation clearly explains the purpose and behavior of the function.
147-165
: Consider enhancing the solver participation criteria.Based on the past review comments, the current implementation only checks for solvers that win consecutive auctions without settling. However, it might be beneficial to also consider solvers that win less frequently but consistently fail to settle their solutions.
Let's verify if there are any existing metrics for tracking settlement failures:
crates/autopilot/src/infra/solvers/mod.rs (1)
42-42
: LGTM! Consistent parameter propagation.The
accepts_unsettled_blocking
parameter is consistently propagated through driver initialization.Also applies to: 75-75
crates/autopilot/src/domain/settlement/observer.rs (1)
22-22
: LGTM! Clear field declaration.The field type clearly indicates its purpose for sending settlement update notifications.
crates/autopilot/src/run.rs (3)
368-369
: LGTM! Clear channel setup.The unbounded channel is appropriately created for settlement updates.
580-589
: LGTM! Comprehensive guard initialization.The
SolverParticipationGuard
is properly initialized with all necessary dependencies, including Ethereum instance, database, settlement updates receiver, and driver information.
596-596
: LGTM! Proper integration with RunLoop.The solver participation guard is correctly integrated into the
RunLoop
initialization.crates/autopilot/src/arguments.rs (2)
413-418
: LGTM!The addition of
accepts_unsettled_blocking
field to theSolver
struct aligns with the requirement to make the participation guard opt-in on a solver-by-solver basis.
466-486
: LGTM!The parsing logic for
accepts_unsettled_blocking
flag is correctly implemented, allowing solvers to opt-in to the participation guard.crates/autopilot/src/run_loop.rs (2)
738-742
: LGTM!The error handling for participation check failures is appropriate, using
error!
log level to indicate system-level failures.
744-747
: 🛠️ Refactor suggestionAdd notification logic for solver participation status.
Based on the past review comments, we should notify the driver when they are denied participation or miss an auction. This helps external teams debug issues immediately.
Apply this diff to add notification logic:
// Do not send the request to the driver if the solver is deny-listed if !can_participate { + // Notify the driver about their participation status + if let Err(err) = driver.notify_participation_status(false).await { + tracing::warn!(?err, driver = %driver.name, "failed to notify driver about participation status"); + } return Err(SolveError::SolverDenyListed); }Likely invalid or redundant comment.
crates/autopilot/Cargo.toml (1)
28-28
: LGTM!The addition of the
dashmap
dependency is appropriate for efficient concurrent access to the solver participation cache.
@sunce86 , there is already another proposal for additional db-based statistics. These SQL queries are not so light we could afford to execute them right inside the run loop. |
Understood. So, as you already mentioned ☝️ ,
Any issue with this approuch? |
If I got it correctly, you mean some kind of FIFO cache. There is still the settlements data that needs to be fetched from the DB. I thought about this initially, but it adds more complexity to the already non-trivial solution. With the current approach, all the data is received from one source. |
Maybe Validator can be signalled from two sources:
Then combine those data internally in Validator to match each settlement to each competition/auction and deduct which competitions ended up without settlement (using auction deadline block). |
Why I initially didn't go with this approach(mostly because of the second point):
@sunce86 Does it make sense? I am not against implementing a more complex and probably more efficient approach, but the only benefit I see is the reduction of DB queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly nits.
Do you have in plan writing an e2e test for this? Otherwise I'm afraid we won't have a high conviction it's working properly and we would hope for the best in prod 🙏
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
crates/autopilot/src/domain/competition/participation_guard/mod.rs
Outdated
Show resolved
Hide resolved
Yep, I will open another PR since 600+ lines of the code are already too much. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG.
This new logic will always return at least one non-settling solver right? The one that is currently being settled. Not sure what to do with this information but you might want to skip logging/alerting for that special one.
crates/autopilot/src/domain/competition/participation_guard/db.rs
Outdated
Show resolved
Hide resolved
I didn't get why is that. The currently settling auction gets filtered out in the query because of this: https://github.com/cowprotocol/services/pull/3257/files#diff-ecc7354b24bcc39d93bfb90181abe577203cc25d8f94c9886b2f5f3f1b7894d5R112 |
# Conflicts: # crates/database/src/lib.rs
Description
From the original issue:
This PR implements it by introducing a new struct, which checks whether the solver is allowed to participate in the next competition by using two different approaches:
Authenticator
'sis_solver
on-chain call into the new struct.Observer
struct, so the cache gets updated only once theObserver
has some result.These validators are called sequentially to avoid redundant RPC calls to
Authenticator
. So it first checks for the DB-based validator cache and, only then, sends the RPC call.Once one of the strategies says the solver is not allowed to participate, it gets deny-listed for 5m(configurable).
Each validator can be enabled/disabled separately in case of any issue.
Metrics
Added a metric that gets populated by the DB-based validator once a solver is marked as banned. The idea is to create an alert that is sent if there are more than 4 such occurrences for the last 30 minutes for the same solver, meaning it should be considered disabling the solver.
Open discussions
Since the current SQL query filters out auctions where a deadline has not been reached, the following case is possible:
The solver gets banned, while the same solver has a pending settlement. In case this gets settled, the solver remains banned. While this is a niche case, it would be better to unblock the solver before the cache TTL deadline is reached. This has not been implemented in the current PR since some refactoring is required in the Observer struct. If this is approved, it can be implemented quickly.
Whether it makes sense to introduce a metrics-based strategy similar to the bad token detector's where the solver gets banned in case >95%(or similar) of settlements fail.
How to test
A new SQL query test. Existing e2e tests.
Related Issues
Fixes #3221
Summary by CodeRabbit
New Features
- Introduced advanced solver participation controls with configurable eligibility checks, integrating both on-chain and database validations.
- Enabled asynchronous real-time notifications for settlement updates, enhancing system responsiveness.
- Added metrics tracking to monitor auction participation and performance.
Chores
- Updated internal dependencies and restructured driver configuration.
- Reorganized the database schema to support improved auction and settlement processing.