Store HEAD reference in persistent storage #978

zaychenko-sergei · 2024-12-04T19:29:49Z

Blocked by #979.

Checking HEAD is one of the most frequent requests that many algorithms make to the dataset storage system.

Several key actions (ingest, sync, transform, reset, compaction, protocol operations) modify HEAD. Each modification is expected to be transactional, and when operations run in parallel, the conflicting one should fail without damaging the dataset.

Writing HEAD also happens during task execution operations in the background (update, compact, reset flows).

To do in this ticket:

- design how the HEAD pointer of each dataset can be stored in the database
- design and implement repositories for in-memory and Postgres/SQLite storages
- consider how to efficiently cache the HEAD reference within the current operation scope
- consider how to isolate set_ref operations better (preferably in use cases)
- where necessary, restructure services to support the goal
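
A minimal sketch of what the in-memory repository from the list above might look like. All names here (`DatasetHeadRepository`, `InMemoryDatasetHeadRepository`, `SetHeadError`) are hypothetical and not taken from the codebase, and the compare-and-swap style `set_head` is only one assumed way to make a parallel writer fail instead of silently overwriting HEAD.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical aliases; the real project likely uses dedicated ID and hash types.
type DatasetId = String;
type BlockHash = String;

#[derive(Debug, PartialEq)]
pub enum SetHeadError {
    /// HEAD moved since the caller last read it (a concurrent writer won).
    Conflict {
        expected: Option<BlockHash>,
        actual: Option<BlockHash>,
    },
}

/// Hypothetical repository interface; a Postgres/SQLite implementation
/// would expose the same two operations inside a database transaction.
pub trait DatasetHeadRepository {
    fn get_head(&self, dataset_id: &str) -> Option<BlockHash>;

    /// Moves HEAD to `new_head` only if the stored value still equals `expected`.
    fn set_head(
        &self,
        dataset_id: &str,
        expected: Option<&str>,
        new_head: &str,
    ) -> Result<(), SetHeadError>;
}

#[derive(Default)]
pub struct InMemoryDatasetHeadRepository {
    heads: Mutex<HashMap<DatasetId, BlockHash>>,
}

impl DatasetHeadRepository for InMemoryDatasetHeadRepository {
    fn get_head(&self, dataset_id: &str) -> Option<BlockHash> {
        self.heads.lock().unwrap().get(dataset_id).cloned()
    }

    fn set_head(
        &self,
        dataset_id: &str,
        expected: Option<&str>,
        new_head: &str,
    ) -> Result<(), SetHeadError> {
        // The lock is held across the check and the write, so the pair is atomic.
        let mut heads = self.heads.lock().unwrap();
        let actual = heads.get(dataset_id).map(String::as_str);
        if actual != expected {
            return Err(SetHeadError::Conflict {
                expected: expected.map(str::to_owned),
                actual: actual.map(str::to_owned),
            });
        }
        heads.insert(dataset_id.to_owned(), new_head.to_owned());
        Ok(())
    }
}
```

With this shape, a writer that read HEAD before a concurrent update gets `SetHeadError::Conflict` and can retry or abort, while the stored HEAD stays untouched.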

As an outcome we expect:

- transactionality of the HEAD advancement
- slightly improved performance in deployment environments (fewer S3 calls, replaced by quicker and less frequent database calls)

Implementation notes

Setting HEAD ref the new way:

- None of the services should initiate set_ref. They should all append content to the blocks/data/checkpoint repositories and only suggest a new HEAD.
- Avoid append with automatic reference update; separate those calls.
- Reset dataset: without set_ref it mostly becomes a planning problem.
- Calling set_ref is allowed at the use case level (already within a transaction boundary).
- Calling set_ref is also allowed in TaskExecutor. The new suggested HEAD is already returned as TaskResult from the task runner. Another transaction should be opened to commit the HEAD.
- Within the datasets domain, an implementation of set_ref on the writable MetadataChain should initiate a number of synchronous updates, such as:
  - updating the HEAD reference in DB
  - updating the dependency graph in DB and in-memory (if dependencies changed)
    - get rid of the “dependencies updated” event, and use the HEAD update instead
  - future: updating cached key metadata blocks
  - initiating a new Outbox event saying “Dataset head has updated”:
    - in the same transaction as setting the HEAD ref in the database
    - storage-level set_ref should be called as a durable outbox consumer of this event
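
The flow described in these notes could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual design: `SuggestedHead`, `OutboxMessage`, `Database`, `Transaction`, `append_blocks`, and `ingest_use_case` are all hypothetical names, and the "transaction" is a toy that stages writes in memory and applies them together on commit.

```rust
use std::collections::HashMap;

/// Hypothetical value a service returns instead of calling set_ref itself.
#[derive(Debug, Clone, PartialEq)]
pub struct SuggestedHead {
    pub dataset_id: String,
    pub old_head: Option<String>,
    pub new_head: String,
}

/// Hypothetical outbox message announcing that HEAD has moved.
#[derive(Debug, Clone, PartialEq)]
pub enum OutboxMessage {
    DatasetHeadUpdated {
        dataset_id: String,
        old_head: Option<String>,
        new_head: String,
    },
}

/// Toy stand-in for the database: HEAD table plus outbox table.
#[derive(Default)]
pub struct Database {
    pub heads: HashMap<String, String>,
    pub outbox: Vec<OutboxMessage>,
}

/// Toy transaction: writes are staged and become visible only on commit.
/// Dropping it without commit acts as a rollback.
pub struct Transaction<'a> {
    db: &'a mut Database,
    staged_heads: Vec<(String, String)>,
    staged_messages: Vec<OutboxMessage>,
}

impl<'a> Transaction<'a> {
    pub fn begin(db: &'a mut Database) -> Self {
        Transaction { db, staged_heads: Vec::new(), staged_messages: Vec::new() }
    }

    /// Stages the HEAD update and its outbox message together.
    /// Simplification: the check ignores heads staged earlier in the same transaction.
    pub fn set_ref(&mut self, s: SuggestedHead) -> Result<(), String> {
        let actual = self.db.heads.get(&s.dataset_id);
        if actual != s.old_head.as_ref() {
            return Err(format!(
                "HEAD conflict for {}: expected {:?}, found {:?}",
                s.dataset_id, s.old_head, actual
            ));
        }
        self.staged_messages.push(OutboxMessage::DatasetHeadUpdated {
            dataset_id: s.dataset_id.clone(),
            old_head: s.old_head,
            new_head: s.new_head.clone(),
        });
        self.staged_heads.push((s.dataset_id, s.new_head));
        Ok(())
    }

    pub fn commit(self) {
        let Transaction { db, staged_heads, staged_messages } = self;
        for (dataset_id, head) in staged_heads {
            db.heads.insert(dataset_id, head);
        }
        db.outbox.extend(staged_messages);
    }
}

/// Stand-in for a service (ingest, transform, ...): it would append blocks
/// to the repositories and only report the resulting HEAD candidate.
pub fn append_blocks(
    dataset_id: &str,
    current_head: Option<&str>,
    new_blocks: &[&str],
) -> Option<SuggestedHead> {
    let last = new_blocks.last()?;
    Some(SuggestedHead {
        dataset_id: dataset_id.to_owned(),
        old_head: current_head.map(str::to_owned),
        new_head: (*last).to_owned(),
    })
}

/// Use-case level: the only place in this sketch that moves HEAD.
pub fn ingest_use_case(db: &mut Database, dataset_id: &str, new_blocks: &[&str]) -> Result<(), String> {
    let current = db.heads.get(dataset_id).cloned();
    let Some(suggestion) = append_blocks(dataset_id, current.as_deref(), new_blocks) else {
        return Ok(()); // nothing appended, HEAD stays where it is
    };
    let mut tx = Transaction::begin(db);
    tx.set_ref(suggestion)?;
    tx.commit();
    Ok(())
}
```

In this sketch a durable outbox consumer would later read `DatasetHeadUpdated` from the outbox and perform the storage-level set_ref, so the database remains the source of truth for HEAD and the storage copy follows it.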


zaychenko-sergei added the enhancement, rust, and performance labels Dec 4, 2024

zaychenko-sergei self-assigned this Dec 4, 2024
