[ENH]: TaskRunner based off CompactionManager #5675
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Introduction of TaskRunner based on CompactionManager

This pull request introduces a TaskRunner based on the CompactionManager. The core of the change is a new task execution lifecycle managed by a series of operators.

Key Changes
• Introduced …

Affected Areas
• rust/worker/src/compactor/

This summary was automatically generated by @propel-code-bot
```rust
JobMode::Compaction => {
    let jobs_iter = self.scheduler.get_jobs();
```
[PerformanceOptimization]
Potential N+1 query pattern:

```rust
let jobs_iter = self.scheduler.get_jobs();
for job in jobs_iter {
    // ... process each job
}
```

While not a direct database query, verify that `get_jobs()` doesn't internally make individual queries for each job. If it does, consider fetching all jobs in a single batch operation before the loop.
File: rust/worker/src/compactor/compaction_manager.rs
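The review's concern can be illustrated with a minimal sketch. The `Scheduler`, `Job`, and method names below are hypothetical stand-ins, not the actual types in `compaction_manager.rs`; the point is the call shape: one lookup per job versus a single batched fetch before the loop.

```rust
// Hypothetical sketch of the N+1 concern; `Scheduler`, `Job`, and these
// method names are illustrative, not the real compactor types.
#[derive(Clone, Debug, PartialEq)]
struct Job(u32);

struct Scheduler {
    jobs: Vec<Job>,
}

impl Scheduler {
    // N+1 shape: one call per job (imagine each call hitting storage).
    fn get_job(&self, id: usize) -> Option<Job> {
        self.jobs.get(id).cloned()
    }

    // Batched shape: a single call returns every job at once.
    fn get_jobs(&self) -> Vec<Job> {
        self.jobs.clone()
    }
}

fn main() {
    let scheduler = Scheduler { jobs: vec![Job(1), Job(2), Job(3)] };

    // Per-item access issues one `get_job` call per index.
    let mut one_by_one = Vec::new();
    for id in 0..3 {
        one_by_one.push(scheduler.get_job(id).unwrap());
    }

    // Batched access issues a single call before the loop.
    let batched = scheduler.get_jobs();

    assert_eq!(one_by_one, batched);
    println!("both approaches yield {} jobs", batched.len());
}
```

If `get_jobs()` already returns a fully materialized batch, the loop is fine as written; the check only matters if it lazily issues a query per item.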
Line: 194

```rust
std::time::SystemTime::UNIX_EPOCH + std::time::Duration::from_micros(task.next_run_at);
```
[BestPractice]
Potential integer overflow in timestamp conversion:

```rust
let next_run = std::time::UNIX_EPOCH + std::time::Duration::from_micros(task.next_run_at);
```

Adding the `Duration` to `UNIX_EPOCH` with `+` can panic if `task.next_run_at` is too large to represent as a `SystemTime`. This could happen with invalid or corrupted database values. Use `checked_add` to validate the result before use:

```rust
let duration = std::time::Duration::from_micros(task.next_run_at);
let next_run = std::time::UNIX_EPOCH.checked_add(duration)
    .ok_or_else(|| GetTaskError::ServerReturnedInvalidData)?;
```
File: rust/sysdb/src/sysdb.rs
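The suggested pattern can be sketched as a small runnable example. Note that `Duration::from_micros` itself takes a `u64` and cannot panic; the overflow risk is in adding the duration to `UNIX_EPOCH`, which `checked_add` guards against by returning `None`. The `TimestampError` type below is an illustrative stand-in for `GetTaskError`, not the actual sysdb error type.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Illustrative error type standing in for `GetTaskError` in sysdb.rs.
#[derive(Debug, PartialEq)]
enum TimestampError {
    OutOfRange,
}

// Convert a microsecond timestamp into a SystemTime without panicking:
// `checked_add` returns None when the sum is not representable.
fn system_time_from_micros(micros: u64) -> Result<SystemTime, TimestampError> {
    UNIX_EPOCH
        .checked_add(Duration::from_micros(micros))
        .ok_or(TimestampError::OutOfRange)
}

fn main() {
    // A sane timestamp converts successfully instead of panicking.
    assert!(system_time_from_micros(1_700_000_000_000_000).is_ok());
    println!("conversion handled without panicking");
}
```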
Line: 1968

```go
suite.Require().Equal(common.ErrTaskNotFound, err)
```
```go
suite.db.Unscoped().Delete(&dbmodel.Task{}, "task_id = ?", task.ID)
}
```
[TestCoverage]
The new functions `FinishTask` and `UpdateCompletionOffset` have been added to `taskDb`, but only `UpdateCompletionOffset` has a test. It would be beneficial to add a test case for `FinishTask` as well to ensure the nonce logic for completing a task epoch is fully covered.
File: go/pkg/sysdb/metastore/db/dao/task_test.go
Line: 520

```go
MinRecordsForTask: uint64(task.MinRecordsForTask),
TenantId:          task.TenantID,
DatabaseId:        task.DatabaseID,
NextRunAt:         uint64(task.NextRun.UnixMicro()),
```
[BestPractice]
In `GetTaskByName`, `NextRunAt` is converted using `UnixMicro()`. However, in `PeekScheduleByCollectionId` (line 539) and `AdvanceTask` (line 491), `UnixMilli()` is used for similar timestamp fields. For consistency, should this also be `UnixMilli()` unless microsecond precision is specifically required here?
File: go/pkg/sysdb/coordinator/task.go
Line: 210
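Though the file under review is Go, a quick Rust sketch (matching the other examples in this thread) shows why mixing the two units matters: the same instant rendered in microseconds is exactly 1000x its millisecond rendering, so comparing a `UnixMicro()`-derived field against a `UnixMilli()`-derived one is silently off by three orders of magnitude.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

fn main() {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap();

    // The same instant in two different units.
    let micros = now.as_micros();
    let millis = now.as_millis();

    // Mixing the two units skews timestamps by a factor of 1000, which is
    // why the review asks for one unit across the NextRunAt-style fields.
    assert_eq!(micros / 1_000, millis);
    println!("micros: {micros}, millis: {millis}");
}
```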
3 Jobs Failed:
- PR checks / all-required-pr-checks-passed failed on "Decide whether the needed jobs succeeded or failed"
- PR checks / Go tests / cluster-test failed on "Run bin/cluster-test.sh bash -c 'cd go && make test'"
- PR checks / Lint failed on "Clippy"

Summary: 1 successful workflow, 1 failed workflow
Last updated: 2025-10-21 00:45:57 UTC

Description of changes
This change implements the TaskRunner that pulls scheduled Tasks from the heap and runs them. This reuses a large part of the CompactionManager's code with the following additions:
- `next_nonce` in the `Tasks` table in the SysDB.
- `Tasks.lowest_live_nonce != Tasks.next_nonce`.
- `next_nonce` does not occur. This case implies that there was an incomplete execution of `lowest_live_nonce`.
- PrepareTask compares the `scout_logs` result of the given input collection with `Tasks.completion_offset` to determine whether this Task needs to be re-executed or whether it can skip to the FinishTask operator.
- `completion_offset` that we sent to SysDB in Register: we will schedule a new item in the heap to do this task again at a later time. We then unconditionally set `lowest_live_nonce` to `next_nonce` in the SysDB, marking `lowest_live_nonce` as completed and allowing all heap entries below the new value of `lowest_live_nonce` to be pruned.
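The nonce bookkeeping described above can be sketched as follows. The struct and method names are illustrative only; in the PR this state lives in the SysDB `Tasks` table and is manipulated through the SysDB API, not an in-memory struct.

```rust
// Illustrative model of the nonce lifecycle described above; the real
// state lives in the SysDB `Tasks` table, not an in-memory struct.
#[derive(Debug)]
struct Task {
    next_nonce: u64,
    lowest_live_nonce: u64,
}

impl Task {
    // An earlier epoch did not finish if the two nonces disagree.
    fn has_incomplete_epoch(&self) -> bool {
        self.lowest_live_nonce != self.next_nonce
    }

    // FinishTask: unconditionally advance lowest_live_nonce to next_nonce,
    // marking the epoch complete so heap entries below it can be pruned.
    fn finish(&mut self) {
        self.lowest_live_nonce = self.next_nonce;
    }
}

fn main() {
    let mut task = Task { next_nonce: 5, lowest_live_nonce: 4 };
    assert!(task.has_incomplete_epoch());

    task.finish();
    assert!(!task.has_incomplete_epoch());
    println!("epoch completed at nonce {}", task.lowest_live_nonce);
}
```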
Relevant TODOs after this:
- `lowest_live_nonce`
- FinishTask if the `scout_logs` check deems it necessary.
- `create_task` to do a 2-phase commit and adjust the TaskRunner logic to handle partially created tasks.
- `create_task` must also send over an initial backfill item to the heap if its input collection existed before this Task.
How are these changes tested?
An integration test has been added in compact.rs.
`pytest` for python, `yarn test` for js, `cargo test` for rust
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?