Implement deletion of resources cleanly #1154

diptanu · 2025-01-12T08:34:06Z

Implemented deletion of graphs, invocations, and executors in a way that keeps the state store consistent with what the scheduler and asynchronous event processors are working on.
Divided the state changes into global_* and namespace|compute_graph|<optional - invocation_id>|id keys so that global events are processed in order, and namespace or compute graph related events are processed in order. This will let us process events across namespaces more fairly in the future using prefix scans. That will require some more work, i.e. prefix scanning N elements and then jumping from one namespace to the next.
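A minimal sketch of how such a key layout supports ordered prefix scans. It uses an in-memory `BTreeMap` as a stand-in for the ordered key-value store. The helper names (`global_key`, `scoped_key`, `prefix_scan`) and the zero-padded numeric id are assumptions for illustration, not the code in this PR.

```rust
use std::collections::BTreeMap;

/// Hypothetical key for a global state change: `global_<id>`.
/// The id is zero-padded (an assumption) so that lexicographic
/// key order matches numeric order.
fn global_key(id: u64) -> String {
    format!("global_{id:020}")
}

/// Hypothetical key for a namespace or compute graph state change:
/// `namespace|compute_graph|<optional invocation_id>|id`.
fn scoped_key(
    namespace: &str,
    compute_graph: &str,
    invocation_id: Option<&str>,
    id: u64,
) -> String {
    match invocation_id {
        Some(inv) => format!("{namespace}|{compute_graph}|{inv}|{id:020}"),
        None => format!("{namespace}|{compute_graph}|{id:020}"),
    }
}

/// Return up to `limit` keys that start with `prefix`, in key order.
fn prefix_scan(store: &BTreeMap<String, String>, prefix: &str, limit: usize) -> Vec<String> {
    store
        .range(prefix.to_string()..) // seek to the first key >= prefix
        .take_while(|(key, _)| key.starts_with(prefix))
        .take(limit)
        .map(|(key, _)| key.clone())
        .collect()
}
```

Scanning with the prefix `ns1|` and a limit of N yields at most N of that namespace's state changes in order; moving on to the next namespace prefix afterwards is the "jump" that fair processing would need.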
Serialized Task Allocation, Task Creation, Invocation Deletion, CG Deletion, etc., since there is no point in doing extra work allocating tasks if their invocations are deleted. This doesn't prevent us from processing them concurrently in the future: we can always read 10 elements from every group of state changes, fork-join task creation and allocation, merge all the results, and create a single state machine update. But I wanted to shoot for correctness first, without any locks in the databases, before optimizing.
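To illustrate why in-order processing avoids wasted work, here is a rough sketch under simplified assumptions. The `StateChange` variants, the `StateMachineUpdate` struct, and `process_in_order` are hypothetical names, not the types in this PR. State changes are consumed one at a time, and tasks whose invocation has been deleted never reach allocation.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified state change events.
#[derive(Debug, Clone, PartialEq)]
enum StateChange {
    TaskCreated { invocation_id: String, task_id: String },
    InvocationDeleted { invocation_id: String },
}

/// Hypothetical single update produced from a run of state changes.
#[derive(Debug, Default, PartialEq)]
struct StateMachineUpdate {
    tasks_to_allocate: Vec<String>,
    invocations_to_delete: Vec<String>,
}

/// Process state changes strictly in order and merge the outcome
/// into one update.
fn process_in_order(changes: &[StateChange]) -> StateMachineUpdate {
    let mut deleted: HashSet<&str> = HashSet::new();
    // (invocation_id, task_id) pairs still waiting for allocation.
    let mut pending: Vec<(&str, &str)> = Vec::new();
    let mut update = StateMachineUpdate::default();

    for change in changes {
        match change {
            StateChange::TaskCreated { invocation_id, task_id } => {
                // Skip tasks that belong to an already deleted invocation.
                if !deleted.contains(invocation_id.as_str()) {
                    pending.push((invocation_id.as_str(), task_id.as_str()));
                }
            }
            StateChange::InvocationDeleted { invocation_id } => {
                if deleted.insert(invocation_id.as_str()) {
                    update.invocations_to_delete.push(invocation_id.clone());
                }
                // Drop tasks queued earlier for this invocation.
                pending.retain(|(inv, _)| *inv != invocation_id.as_str());
            }
        }
    }

    update.tasks_to_allocate = pending.into_iter().map(|(_, t)| t.to_string()).collect();
    update
}
```

Because every change sees the effect of the ones before it, no locking is needed in this sketch; a concurrent variant would have to reproduce the same ordering when merging results.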
The updates to compute graphs (tombstoning, updating, and adding new invocations) still use exclusive transactions in RocksDB. I left it that way because when we introduce Raft they would be serialized anyway.
There was a bug that could allocate tasks twice if an executor came online before the tasks were allocated and the tasks' state change events were queued behind the executor registered event. On executor registration we read all the unallocated tasks, which included tasks that also had pending state change events. We no longer put new tasks into the unallocated task CF; they go there only after we try placing them for the first time, or if the executor they are allocated on dies.
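The fix can be sketched roughly as follows, using in-memory sets in place of the RocksDB column families. The `Scheduler` struct, its method names, and the "first executor wins" placement rule are assumptions made for this example only.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical in-memory scheduler state.
#[derive(Default)]
struct Scheduler {
    executors: BTreeSet<String>,
    /// Tasks that were tried at least once, or lost their executor.
    unallocated: BTreeSet<String>,
    /// task id -> executor id
    allocations: HashMap<String, String>,
}

impl Scheduler {
    /// Try to place a task; park it as unallocated if no executor exists.
    fn place(&mut self, task: &str) -> bool {
        match self.executors.iter().next().cloned() {
            Some(executor) => {
                self.unallocated.remove(task);
                self.allocations.insert(task.to_string(), executor);
                true
            }
            None => {
                self.unallocated.insert(task.to_string());
                false
            }
        }
    }

    /// New tasks are only seen through their own state change event;
    /// they are not added to `unallocated` when they are created.
    fn on_task_created(&mut self, task: &str) -> bool {
        self.place(task)
    }

    /// A new executor only picks up tasks already parked as unallocated,
    /// so tasks with a pending creation event are not allocated twice.
    fn on_executor_registered(&mut self, executor: &str) -> Vec<String> {
        self.executors.insert(executor.to_string());
        let parked: Vec<String> = self.unallocated.iter().cloned().collect();
        parked.into_iter().filter(|task| self.place(task)).collect()
    }

    /// Tasks of a dead executor go back to the unallocated set.
    fn on_executor_removed(&mut self, executor: &str) {
        self.executors.remove(executor);
        let orphaned: Vec<String> = self
            .allocations
            .iter()
            .filter(|(_, e)| e.as_str() == executor)
            .map(|(task, _)| task.clone())
            .collect();
        for task in orphaned {
            self.allocations.remove(&task);
            self.unallocated.insert(task);
        }
    }
}
```

In this sketch, an executor that registers before a task's creation event is processed finds nothing to allocate, and the task is placed exactly once when its own event is handled.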

Differences in speed for 10k tasks:
Main - 32 seconds
This branch - 36 seconds
There is no batching here, whereas in main I think some updates to the SM are batched. We need CoW to be able to batch. Irrespective of how the SM is updated, within a batch of updates a state change must be applied to the state only after the previous one has been applied. See the Omega paper, section 3.4, "Shared-state scheduling".
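One way to picture ordered batching is the sketch below. A full clone of the state stands in for a real copy-on-write structure, and `Change`, `State`, `apply`, and `apply_batch` are hypothetical names rather than anything in this branch. Each change is applied to a working copy only after the previous one, and the copy is committed once at the end.

```rust
use std::collections::BTreeMap;

/// Hypothetical state changes.
#[derive(Debug, Clone, PartialEq)]
enum Change {
    CreateTask(String),
    AllocateTask { task: String, executor: String },
}

/// Hypothetical state: task id -> executor it is allocated to, if any.
#[derive(Debug, Clone, Default, PartialEq)]
struct State {
    tasks: BTreeMap<String, Option<String>>,
}

/// Apply one change; it sees every change applied before it.
fn apply(state: &mut State, change: &Change) -> Result<(), String> {
    match change {
        Change::CreateTask(task) => {
            state.tasks.insert(task.clone(), None);
            Ok(())
        }
        Change::AllocateTask { task, executor } => match state.tasks.get_mut(task) {
            Some(slot) => {
                *slot = Some(executor.clone());
                Ok(())
            }
            None => Err(format!("unknown task {task}")),
        },
    }
}

/// Apply a batch in order against a working copy and commit it once.
/// If any change fails, the original state is left untouched.
fn apply_batch(state: &mut State, batch: &[Change]) -> Result<(), String> {
    let mut working = state.clone(); // stand-in for a CoW snapshot
    for change in batch {
        apply(&mut working, change)?;
    }
    *state = working;
    Ok(())
}
```

A batch such as create `t1` followed by allocate `t1` only succeeds because the second change observes the first; applying them against the same stale snapshot would reject the allocation.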

Still have to fix tests, etc.

diptanu added 30 commits January 13, 2025 09:45

update crates

94b8946

fmt

0fc75e7

updated axum metrics

f408669

update dependencies

f62e519

bring back every other metrics

b96cf68

update rocksdb

aaf560e

simplifying code for updating compute graphs

9bfea04

Rewired api to state machine requests for user facing APIs

7d8ba08

updated state store to use graph processor

3ff5885

update code

75f7473

renamed stuff

3804a16

update

19bebf7

implemented deletion

683c945

merge conflicts

b30ec74

scheduling tasks when a new exeuctor comes online

d6789c0

making seek more efficient

64a95b5

update graph processor

84dd3e7

fixing GC

d0f2a5b

Making graph processor make progress even if there are state change update failures

798ca17

shutting down slatedb cleanly

2453712

refreshing executors less frequently

99ab374

fixed a bunch of tests

8b9c453

fixed tests

e2a99a0

lint

5a185c8

fix test

0a4e32d

fix lint

0648e2e

removed test

2ffbc47

clippy

c7237fa

lint

267f1e3

add some comments

135814f

diptanu added 2 commits January 13, 2025 15:11

Added a test

18719e4

fixing tests

9fbcbdc

diptanu force-pushed the raft branch from 55fa1e8 to 9fbcbdc on January 13, 2025 23:17

diptanu added 4 commits January 13, 2025 17:04

update

1d5aa44

fmt

06476f7

creating a tombstone executor event

fd7520c

Updated namespace key schema

322e59a

diptanu merged commit f56884d into main Jan 14, 2025
8 checks passed

diptanu deleted the raft branch January 14, 2025 02:09
