Store HEAD reference in persistent storage #978

Open
zaychenko-sergei opened this issue Dec 4, 2024 · 0 comments
Labels
enhancement New feature or request performance rust Pull requests that update Rust code

zaychenko-sergei commented Dec 4, 2024

Blocked by #979.

Reading the HEAD reference is one of the most frequent requests that many algorithms make to the dataset storage system.

Some key actions (ingest, sync, transform, reset, compaction, protocol operations) modify the HEAD. These updates are expected to be transactional: when operations run in parallel, a conflicting one should fail without damaging the dataset.

The HEAD is also written in the background during task execution (update, compact, reset flows).

To do in this ticket:

  • design how HEAD pointer of each dataset can be stored in the database
  • design and implement repositories for in-memory and Postgres/SQLite storages
  • consider how to efficiently cache the HEAD reference within the current operation scope
  • consider how to isolate set_ref operations better (preferably in use cases)
  • where necessary, restructure services to support the goal
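
One possible shape for the repository from the list above is sketched here. The trait name `DatasetHeadRepository`, the `String` stand-ins for dataset IDs and block hashes, and the compare-and-set contract of `set_head` are all assumptions made for illustration; the ticket itself only asks that conflicting parallel updates fail safely. A Postgres/SQLite implementation would be expected to offer the same contract.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-ins for the real dataset ID and block hash types.
type DatasetId = String;
type BlockHash = String;

#[derive(Debug, PartialEq)]
pub enum SetHeadError {
    /// The stored HEAD differs from what the caller expected.
    Conflict { actual: Option<BlockHash> },
}

/// Hypothetical repository interface (not the project's actual API).
pub trait DatasetHeadRepository {
    fn get_head(&self, dataset_id: &str) -> Option<BlockHash>;

    /// Moves HEAD to `new_head` only if it currently equals `expected`.
    fn set_head(
        &self,
        dataset_id: &str,
        expected: Option<&str>,
        new_head: &str,
    ) -> Result<(), SetHeadError>;
}

/// In-memory variant: a mutex-guarded map from dataset ID to HEAD hash.
#[derive(Default)]
pub struct InMemoryDatasetHeadRepository {
    heads: Mutex<HashMap<DatasetId, BlockHash>>,
}

impl DatasetHeadRepository for InMemoryDatasetHeadRepository {
    fn get_head(&self, dataset_id: &str) -> Option<BlockHash> {
        self.heads.lock().unwrap().get(dataset_id).cloned()
    }

    fn set_head(
        &self,
        dataset_id: &str,
        expected: Option<&str>,
        new_head: &str,
    ) -> Result<(), SetHeadError> {
        // Check and update happen under one lock, so two writers cannot
        // both succeed from the same starting HEAD.
        let mut heads = self.heads.lock().unwrap();
        let actual = heads.get(dataset_id).cloned();
        if actual.as_deref() != expected {
            return Err(SetHeadError::Conflict { actual });
        }
        heads.insert(dataset_id.to_string(), new_head.to_string());
        Ok(())
    }
}
```

With this contract, a writer that read HEAD `h1` and tries to advance it after another writer has already moved it receives `Conflict` and the stored value is left unchanged.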

As an outcome we expect:

  • transactionality of the HEAD advancement
  • slightly improved performance in deployed environments (fewer S3 calls, replaced by quicker and less frequent database calls).
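
To illustrate how repeated HEAD lookups within one operation could be reduced to a single storage call, here is a minimal sketch of an operation-scoped cache. `ScopedHeadCache`, its `load` closure, and `record_new_head` are hypothetical names, and the design (one cache instance created per operation and dropped with it) is an assumption rather than something the ticket prescribes.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Hypothetical cache that lives only as long as one operation.
/// `load` stands in for the real database (or storage) lookup.
pub struct ScopedHeadCache<F: Fn(&str) -> Option<String>> {
    load: F,
    cached: RefCell<HashMap<String, Option<String>>>,
}

impl<F: Fn(&str) -> Option<String>> ScopedHeadCache<F> {
    pub fn new(load: F) -> Self {
        Self { load, cached: RefCell::new(HashMap::new()) }
    }

    /// Returns the cached HEAD, calling `load` only on the first request
    /// for a given dataset.
    pub fn head(&self, dataset_id: &str) -> Option<String> {
        let hit = self.cached.borrow().get(dataset_id).cloned();
        if let Some(head) = hit {
            return head;
        }
        let loaded = (self.load)(dataset_id);
        self.cached
            .borrow_mut()
            .insert(dataset_id.to_string(), loaded.clone());
        loaded
    }

    /// Keeps the cache in step after this operation itself advances HEAD.
    pub fn record_new_head(&self, dataset_id: &str, new_head: &str) {
        self.cached
            .borrow_mut()
            .insert(dataset_id.to_string(), Some(new_head.to_string()));
    }
}
```

Because the cache is never shared between operations, it cannot serve a value that another operation has since replaced beyond the lifetime of the current one; a transactional write would still detect such a conflict at commit time.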

Implementation notes

Setting HEAD ref the new way:

  • None of the services should initiate set_ref; they should only append content to the blocks/data/checkpoint repositories and suggest a new HEAD.
  • Avoid append with automatic reference update; separate those calls.
  • Reset dataset: without set_ref it mostly becomes a planning problem.
  • Calling set_ref is allowed at use cases level (already within transaction boundary).
  • Calling set_ref is also allowed in TaskExecutor. The new suggested HEAD is already returned as TaskResult from the task runner. A separate transaction should be opened to commit the HEAD.
  • Within datasets domain, an implementation of set_ref on the writable MetadataChain should initiate a number of synchronous updates, such as:
    • updating HEAD reference in DB
    • updating dependency graph in DB and in-memory (if dependencies changed)
      • get rid of “dependencies updated” event, and use HEAD update instead
    • future: updating cached key metadata blocks
    • emitting a new Outbox event, “Dataset HEAD updated”:
      • in the same transaction as setting the HEAD ref in the database
      • storage-level set_ref should be called by a durable outbox consumer of this event
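
The split between services that only suggest a HEAD and use cases that commit it can be sketched as follows. Everything here is a hypothetical illustration: `SuggestedHead`, `Transaction`, `Database`, `OutboxEvent`, `ingest_service` and `ingest_use_case` are invented names, and the `Transaction` type is a toy stand-in for a real database transaction that stages the HEAD update and the Outbox event and applies both together.

```rust
use std::collections::HashMap;

/// What a service returns: content is already appended, HEAD is only suggested.
#[derive(Debug, Clone, PartialEq)]
pub struct SuggestedHead {
    pub dataset_id: String,
    pub old_head: Option<String>,
    pub new_head: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum OutboxEvent {
    DatasetHeadUpdated { dataset_id: String, new_head: String },
}

/// Toy database state: HEAD per dataset plus pending outbox events.
#[derive(Default)]
pub struct Database {
    pub heads: HashMap<String, String>,
    pub outbox: Vec<OutboxEvent>,
}

/// Toy transaction: staged changes are applied together or not at all.
#[derive(Default)]
pub struct Transaction {
    head_updates: Vec<SuggestedHead>,
    outbox: Vec<OutboxEvent>,
}

impl Transaction {
    /// Stages the HEAD update and its outbox event in the same transaction.
    pub fn set_ref(&mut self, suggestion: SuggestedHead) {
        self.outbox.push(OutboxEvent::DatasetHeadUpdated {
            dataset_id: suggestion.dataset_id.clone(),
            new_head: suggestion.new_head.clone(),
        });
        self.head_updates.push(suggestion);
    }

    pub fn commit(self, db: &mut Database) -> Result<(), String> {
        // Validate everything first so a conflict leaves the database untouched.
        for update in &self.head_updates {
            if db.heads.get(&update.dataset_id) != update.old_head.as_ref() {
                return Err(format!("HEAD of {} moved concurrently", update.dataset_id));
            }
        }
        for update in self.head_updates {
            db.heads.insert(update.dataset_id, update.new_head);
        }
        db.outbox.extend(self.outbox);
        Ok(())
    }
}

/// Service level: appends blocks (omitted here) and never calls set_ref.
pub fn ingest_service(
    dataset_id: &str,
    current_head: Option<String>,
    appended_block_hash: &str,
) -> SuggestedHead {
    SuggestedHead {
        dataset_id: dataset_id.to_string(),
        old_head: current_head,
        new_head: appended_block_hash.to_string(),
    }
}

/// Use case level: owns the transaction boundary and calls set_ref.
pub fn ingest_use_case(
    db: &mut Database,
    dataset_id: &str,
    appended_block_hash: &str,
) -> Result<(), String> {
    let current = db.heads.get(dataset_id).cloned();
    let suggestion = ingest_service(dataset_id, current, appended_block_hash);
    let mut tx = Transaction::default();
    tx.set_ref(suggestion);
    tx.commit(db)
}
```

In this sketch a durable outbox consumer would later read `DatasetHeadUpdated` from `db.outbox` and perform the storage-level set_ref; that consumer is left out to keep the example short.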