Help needed to figure out asynchronous parallel writing during data stream #2066
-
Hey hey! Looking for some help/direction to solve this challenge. Is it possible to write to a single tensor from 2 (or more) different processes/containers at the same time? Ray could help solve the problem, but since we need to fire new processes continuously as the data flows into the system, Ray doesn't fit the streaming approach (a single point is used to distribute the transaction across many 'workers'). Our use case is closer to multiple atomic inserts/updates on a tensor. To demonstrate what we are trying to solve for, here is an example.

Parallel function:

```python
import deeplake
import numpy as np


def add_sample(value: int, index: int):
    src = 's3://bucket/src'
    ds = deeplake.load(src, read_only=False)
    with ds:
        result = np.array([[index], [value]])
        ds.important_stuff.append(result)
        ds.flush()
```

Orchestrator launching x processes (x=10 in this scenario but could be any number):

```python
import deeplake
from multiprocessing import Process

src = 's3://bucket/src'
ds = deeplake.empty(src, overwrite=True)
with ds:
    ds.create_tensor("important_stuff", htype="class_label")

process = []
for index in range(10):
    process.append(Process(target=add_sample, args=(index, index * index)))
    process[index].start()
for index in range(10):
    process[index].join()

ds = deeplake.dataset(src, read_only=True, overwrite=False)
print(ds.important_stuff)
print(ds.important_stuff[0:10].numpy())
# Shows only 1 element
```

The result of this should be an array of 10 [index, value] elements; instead, only a single element is written to the S3 bucket.
Replies: 1 comment
-
Hi @Alegrowin, thanks for posing the question and starting the discussion on this topic. Personally, I am pretty excited about it! Basically, you are looking for ACID transaction support at the storage level so that any process can concurrently access and modify the data. We have two options.
Short-term: Currently Deep Lake is limited to multi-branch concurrent writes. In other words, each process needs to checkout a new branch and write the data. There also needs to be a final process that merges all branches together. In terms of performance at the moment, it would be similar to creating different datasets and combining them together at a later time (without metadata handling). We are happy…
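A minimal sketch of what the branch-per-process pattern described above could look like, assuming deeplake's `checkout()`/`merge()` version-control API; the path, tensor name, and branch naming scheme are hypothetical and mirror the example in the question:

```python
# Sketch only: branch-per-process writes, merged by a final process.
import deeplake
import numpy as np
from multiprocessing import Process

SRC = 's3://bucket/src'  # hypothetical path, matching the example above


def add_sample_on_branch(value: int, index: int):
    # Each writer process checks out its own branch so it never races
    # with the other writers on the main branch.
    ds = deeplake.load(SRC, read_only=False)
    ds.checkout(f"writer-{index}", create=True)  # hypothetical branch name
    with ds:
        ds.important_stuff.append(np.array([[index], [value]]))
    ds.flush()


if __name__ == "__main__":
    procs = [Process(target=add_sample_on_branch, args=(i, i * i)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # A single final process merges every writer branch back into main.
    ds = deeplake.load(SRC, read_only=False)
    ds.checkout("main")
    for i in range(10):
        ds.merge(f"writer-{i}")
    ds.flush()
```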