Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement deletion of resources cleanly #1154

Merged
merged 36 commits into from
Jan 14, 2025
Merged

Implement deletion of resources cleanly #1154

merged 36 commits into from
Jan 14, 2025

Conversation

diptanu
Copy link
Collaborator

@diptanu diptanu commented Jan 12, 2025

  • Implemented deletion of graphs, invocations, executors in a way which keeps the state store consistent to what the scheduler and asynchronous event processors are working on.
  • Divided the state changes into global_* and namespace|compute_graph|<optional - invocation_id>|id so that we can process in order the global events and in order namespace or compute graph related events. This will allow us to process events across namespace in a more fair manner in the future using prefix scans. It will require some more work, i.e prefix scan N elements and jump from one namespace to another.
  • Serialized Task Allocation, Task Creation, Invocation Deletion, CG Deletion, etc, since there is no point in doing extra work in allocating tasks if invocations are deleted, etc. This doesn't prevent us from processing them concurrently in the future, as we can always read 10 elements from every group of state changes and do fork-join of task creation and allocation, merge all the results and create a single state machine update. But I wanted to shoot for correctness first, without any locks in the databases before optimizing.
  • The updates to compute graphs - such as tombstoning it, updating it and adding new invocations are still using exclusive transactions in rocksdb, I left it that way because when we introduce raft, they would be serialized anyways.
  • There was a bug which could allocate tasks twice if an executor came online before tasks were allocated and their events were behind the executor registered event because we read all the unalloacated tasks which included tasks for which we also had state change events. We are now not putting new tasks into unallocated task CF, instead they go there after we try placing them for the first time or if the executor on which they are allocated dies.

Differences in speed - Main for 10k tasks - 32 seconds
This branch - 36 seconds
There is no batching here, vs in main I think there are some updates to the SM which is batched. We need CoW to be able to batch. Irrespective of how the SM is updated, in a batch of update a state change needs to be updated to the state only after the previous one was updated. Reference to this in the Omega paper - "3.4 Shared-state scheduling"

Still have to fix tests, etc.

@diptanu diptanu merged commit f56884d into main Jan 14, 2025
8 checks passed
@diptanu diptanu deleted the raft branch January 14, 2025 02:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant