Checking HEAD is one of the most frequent requests that many algorithms make to the dataset storage system.
Key actions such as ingest, sync, transform, reset, compaction, and protocol operations modify the HEAD. This is expected to happen transactionally, and in case of parallel operations it should fail without corrupting the dataset.
Writing HEAD also happens in the background during task execution (update, compact, reset flows).
To do in this ticket:
design how the HEAD pointer of each dataset can be stored in the database
design and implement repositories for in-memory and Postgres/SQLite storages (see the sketch after this list)
consider how to efficiently cache the HEAD reference within the current operation scope
consider how to better isolate set_ref operations (preferably in use cases)
where necessary, restructure services to support the goal
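One possible shape for such a reference repository, sketched in Rust. All names here (DatasetReferenceRepository, CasFailed, the in-memory implementation) are illustrative placeholders rather than existing kamu interfaces; the important property is the compare-and-swap semantics that lets parallel writers fail cleanly:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

use async_trait::async_trait;

/// Error returned when the stored HEAD no longer matches what the caller
/// expected, i.e. another writer advanced the dataset concurrently.
#[derive(Debug, thiserror::Error)]
#[error("expected HEAD {expected:?}, found {actual:?}")]
pub struct CasFailed {
    pub expected: Option<String>,
    pub actual: Option<String>,
}

/// Illustrative repository interface for dataset HEAD references,
/// to be implemented for in-memory (tests), Postgres, and SQLite storages.
#[async_trait]
pub trait DatasetReferenceRepository: Send + Sync {
    async fn get_head(&self, dataset_id: &str) -> Option<String>;

    /// Compare-and-swap: advance HEAD only if it still equals `expected`,
    /// so concurrent writers fail instead of silently overwriting each other.
    async fn set_head(
        &self,
        dataset_id: &str,
        expected: Option<&str>,
        new_head: &str,
    ) -> Result<(), CasFailed>;
}

/// Minimal in-memory implementation, mainly useful in unit tests.
#[derive(Default)]
pub struct InMemoryDatasetReferenceRepository {
    heads: Mutex<HashMap<String, String>>,
}

#[async_trait]
impl DatasetReferenceRepository for InMemoryDatasetReferenceRepository {
    async fn get_head(&self, dataset_id: &str) -> Option<String> {
        self.heads.lock().unwrap().get(dataset_id).cloned()
    }

    async fn set_head(
        &self,
        dataset_id: &str,
        expected: Option<&str>,
        new_head: &str,
    ) -> Result<(), CasFailed> {
        let mut heads = self.heads.lock().unwrap();
        let actual = heads.get(dataset_id).cloned();
        if actual.as_deref() != expected {
            return Err(CasFailed {
                expected: expected.map(String::from),
                actual,
            });
        }
        heads.insert(dataset_id.to_string(), new_head.to_string());
        Ok(())
    }
}
```

A Postgres/SQLite implementation could map the same interface onto a single row per dataset (the table name is likewise an assumption), using an UPDATE guarded by the expected HEAD value so that zero affected rows signals a lost race.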
As an outcome we expect:
transactionality of the HEAD advancement
slightly improved performance in deployment environments (fewer S3 calls, replaced by quicker and less frequent database calls).
Implementation notes
Setting HEAD ref the new way:
None of the services should initiate set_ref; they should all append content to the blocks/data/checkpoint repositories and only suggest a new HEAD.
Avoid append with automatic reference update; separate those calls.
Reset dataset: without set_ref it mostly becomes a planning problem.
Calling set_ref is allowed at the use case level (already within the transaction boundary) - see the first sketch below.
Calling set_ref is also allowed in TaskExecutor. The newly suggested HEAD is already returned as a TaskResult from the task runner, and another transaction should be opened to commit the HEAD.
Within the datasets domain, the implementation of set_ref on the writable MetadataChain should initiate a number of synchronous updates (see the second sketch below), such as:
updating the HEAD reference in the DB
updating the dependency graph in the DB and in memory (if dependencies changed)
get rid of the "dependencies updated" event, and use the HEAD update instead
future: updating cached key metadata blocks
initiating a new Outbox event saying "Dataset HEAD has updated":
in the same transaction as setting the HEAD ref in the database
storage-level set_ref should be called by a durable Outbox consumer of this event
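First sketch: what the append / set_ref separation could look like from the use case side. Every type below (AppendResult, IngestService, MetadataChainWriter, ingest_use_case) is a placeholder invented for this illustration rather than an existing kamu interface; the point is only that services suggest a HEAD, while the use case, already inside the transaction boundary, is the one that commits it:

```rust
use async_trait::async_trait;

/// What a service returns after appending blocks/data/checkpoints:
/// it only *suggests* a new HEAD, it never advances the reference itself.
pub struct AppendResult {
    pub suggested_head: String,
}

/// Placeholder abstractions for the pieces a use case coordinates.
#[async_trait]
pub trait IngestService {
    async fn append_new_blocks(&self, dataset_id: &str) -> anyhow::Result<AppendResult>;
}

#[async_trait]
pub trait MetadataChainWriter {
    /// The only place the HEAD reference is advanced; runs inside the
    /// use case's (or TaskExecutor's) database transaction.
    async fn set_ref(&self, dataset_id: &str, new_head: &str) -> anyhow::Result<()>;
}

/// Use-case-level flow: the transaction boundary is already open here,
/// so appending content and committing the HEAD stay clearly separated.
pub async fn ingest_use_case(
    ingest: &dyn IngestService,
    chain: &dyn MetadataChainWriter,
    dataset_id: &str,
) -> anyhow::Result<()> {
    // 1. The service appends content and only suggests a new HEAD.
    let appended = ingest.append_new_blocks(dataset_id).await?;

    // 2. The use case decides to advance the HEAD; the service itself
    //    never calls set_ref.
    chain.set_ref(dataset_id, &appended.suggested_head).await
}
```

The TaskExecutor path is analogous: the task runner returns the suggested HEAD inside the TaskResult, and the executor opens a separate transaction in which it calls set_ref.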
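Second sketch: the database side of such a set_ref, written here against sqlx/Postgres purely as an illustration. The dataset_references and outbox_messages tables, their columns, and the message payload are assumptions made for this sketch (the real schema is exactly what this ticket has to design), but it shows the two invariants from the notes above: the HEAD advancement is a compare-and-swap, and the Outbox message is enqueued in the same transaction:

```rust
use sqlx::{Postgres, Transaction};

/// Hypothetical sketch: advance a dataset's HEAD and enqueue the
/// "Dataset HEAD has updated" Outbox message within the same transaction.
/// Table and column names are illustrative, not an actual kamu schema.
pub async fn set_ref_in_db(
    tx: &mut Transaction<'_, Postgres>,
    dataset_id: &str,
    expected_head: Option<&str>,
    new_head: &str,
) -> Result<(), sqlx::Error> {
    // Compare-and-swap on the HEAD row: zero affected rows means another
    // writer advanced the HEAD first, so the caller must fail (and roll
    // back) without touching the dataset.
    let updated = sqlx::query(
        "UPDATE dataset_references
         SET head = $1
         WHERE dataset_id = $2 AND head IS NOT DISTINCT FROM $3",
    )
    .bind(new_head)
    .bind(dataset_id)
    .bind(expected_head)
    .execute(&mut **tx)
    .await?
    .rows_affected();

    if updated == 0 {
        // Stand-in error for "concurrent HEAD advancement detected".
        return Err(sqlx::Error::RowNotFound);
    }

    // Enqueue the Outbox message in the same transaction, so it is
    // published if and only if the HEAD advancement commits. A durable
    // consumer of this message later performs the storage-level set_ref.
    // The synchronous dependency-graph update would also happen inside
    // this transaction; it is omitted here for brevity.
    sqlx::query(
        "INSERT INTO outbox_messages (producer, payload)
         VALUES ($1, $2)",
    )
    .bind("dataset-reference-service") // illustrative producer name
    .bind(serde_json::json!({
        "event": "DatasetHeadUpdated",
        "dataset_id": dataset_id,
        "new_head": new_head,
    }))
    .execute(&mut **tx)
    .await?;

    Ok(())
}
```

Because the storage-level set_ref then runs as a durable Outbox consumer, the hot path makes fewer direct S3 calls, which is the performance outcome the ticket expects.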
Blocked by #979.