Help needed to figure out asynchronous parallel writing during data stream #2066
-
Hey hey! Looking for some help/direction to solve this challenge. Is it possible to write to a single tensor from 2 (or more) different processes/containers at the same time? Ray could help solve the problem, but since we need to fire new processes continuously as the data flows into the system, Ray doesn't fit the streaming approach (a single point is used to distribute the transaction across many 'workers'). Our use case is closer to multiple atomic inserts/updates on a tensor. To demonstrate what we are trying to solve for, here is an example.

Parallel function:

```python
import deeplake
import numpy as np


def add_sample(value: int, index: int):
    src = 's3://bucket/src'
    ds = deeplake.load(src, read_only=False)
    with ds:
        result = np.array([[index], [value]])
        ds.important_stuff.append(result)
        ds.flush()
```

Orchestrator launching x processes (x=10 in this scenario but could be any number):

```python
import deeplake
from multiprocessing import Process

src = 's3://bucket/src'
ds = deeplake.empty(src, overwrite=True)
with ds:
    ds.create_tensor("important_stuff", htype="class_label")

process = []
for index in range(10):
    process.append(Process(target=add_sample, args=(index, index * index)))
    process[index].start()
for index in range(10):
    process[index].join()

ds = deeplake.dataset(src, read_only=True, overwrite=False)
print(ds.important_stuff)
print(ds.important_stuff[0:10].numpy())
# Shows only 1 element
```

The result of this should be an array of 10 [index, value] elements; instead, only a single element is written to the S3 bucket.
Replies: 1 comment
-
Hi @Alegrowin, thanks for posing the question and starting the discussion on this topic. Personally, I am pretty excited about it! Basically, you are looking for ACID transaction support at the storage level so that any process can concurrently access and modify the data. We have two options.
Short-term: Currently Deep Lake is limited to multi-branch concurrent writes. In other words, each process needs to checkout a new branch and write the data. There also needs to be a final process that merges all branches together. In terms of performance at the moment, it would be similar to creating different datasets and combining them together at a later time (without metadata handling). We are happy…
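A minimal sketch of what the branch-per-process pattern described above could look like, assuming deeplake's `checkout()`/`merge()` version-control API; the path, tensor name, and branch naming scheme are hypothetical and mirror the example in the question:

```python
# Sketch only: branch-per-process writes, merged by a final process.
import deeplake
import numpy as np
from multiprocessing import Process

SRC = 's3://bucket/src'  # hypothetical path, matching the example above


def add_sample_on_branch(value: int, index: int):
    # Each writer process checks out its own branch so it never races
    # with the other writers on the main branch.
    ds = deeplake.load(SRC, read_only=False)
    ds.checkout(f"writer-{index}", create=True)  # hypothetical branch name
    with ds:
        ds.important_stuff.append(np.array([[index], [value]]))
    ds.flush()


if __name__ == "__main__":
    procs = [Process(target=add_sample_on_branch, args=(i, i * i)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # A single final process merges every writer branch back into main.
    ds = deeplake.load(SRC, read_only=False)
    ds.checkout("main")
    for i in range(10):
        ds.merge(f"writer-{i}")
    ds.flush()
```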