question about writing parallel and group handling #172
Comments
Tensorstore automatically handles chunks in parallel, so you can just issue the write from a single thread and all relevant chunks will be handled in parallel. Groups currently aren't supported unfortunately. |
Thanks, the timing makes more sense now. As an example, if I have {10,10} chunks and want to write to the region [35:45][35:45], will it be divided into four sections, [35:40][35:40], [35:40][40:45], [40:45][35:40], and [40:45][40:45], and written 4-way parallel? And what is the maximum number of parallel writes? Can it be equal to the number of processes? |
Exactly.
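The decomposition confirmed above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not tensorstore's internal code; the helper name `split_by_chunks` is made up for this example.

```python
from itertools import product


def split_by_chunks(region, chunk_shape):
    """Split a half-open N-d region into pieces that each lie in one chunk.

    region: list of (start, stop) pairs, one per dimension.
    chunk_shape: chunk size along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(region, chunk_shape):
        pieces = []
        lo = start
        while lo < stop:
            # End of the chunk that contains `lo`, clipped to the region.
            hi = min((lo // size + 1) * size, stop)
            pieces.append((lo, hi))
            lo = hi
        per_dim.append(pieces)
    # The Cartesian product of the per-dimension pieces gives the sub-regions.
    return [list(p) for p in product(*per_dim)]


parts = split_by_chunks([(35, 45), (35, 45)], [10, 10])
for part in parts:
    print(part)
# [(35, 40), (35, 40)]
# [(35, 40), (40, 45)]
# [(40, 45), (35, 40)]
# [(40, 45), (40, 45)]
```

Each of the four pieces touches exactly one 10x10 chunk, which is why the write can be handled as four independent chunk operations.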
It depends on the underlying kvstore. With the local filesystem, see https://google.github.io/tensorstore/kvstore/file/index.html#json-Context.file_io_concurrency With GCS, see https://google.github.io/tensorstore/kvstore/gcs/index.html#json-Context.gcs_request_concurrency I'm not sure what you mean by it being equal to the number of processes, or how you intend to use MPI. You can indeed also write in parallel from multiple processes, but tensorstore doesn't itself contain any support for that. You will need to partition your work between processes, and for efficiency you should ensure that the partitions are aligned to chunk boundaries. |
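One way to do the chunk-aligned partitioning mentioned above is to deal out whole chunk rows to the processes. The sketch below is plain Python with a hypothetical helper, `chunk_aligned_rows`; in an MPI program the `rank` and `n_procs` values would presumably come from the MPI communicator, which is not shown here.

```python
import math


def chunk_aligned_rows(n_rows, chunk_rows, n_procs, rank):
    """Return the half-open row range [start, stop) owned by `rank`.

    Whole chunk rows are dealt out as evenly as possible, so no two
    processes ever write into the same chunk.
    """
    n_chunks = math.ceil(n_rows / chunk_rows)
    base, extra = divmod(n_chunks, n_procs)
    # The first `extra` ranks get one additional chunk row.
    first = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    start = min(first * chunk_rows, n_rows)
    stop = min((first + count) * chunk_rows, n_rows)
    return start, stop


# 100 rows, chunks of 10 rows, 4 processes.
ranges = [chunk_aligned_rows(100, 10, 4, r) for r in range(4)]
print(ranges)
# [(0, 30), (30, 60), (60, 80), (80, 100)]
```

Every boundary is a multiple of the chunk size, so each process writes only complete chunks and no chunk is shared between processes.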
Thanks. |
It depends. There are two mechanisms in use: thread pools and admission queues. Generally the "shared" concurrency objects create a thread pool with a concurrency limit of at least 4 (https://github.com/google/tensorstore/blob/master/tensorstore/internal/file_io_concurrency_resource.cc). File I/O operations are then queued on the thread pool for completion, and the thread pool allows 4 operations to run in parallel. In the case of writing to GCS, the underlying kvstore uses an admission queue with a default value of 32 to control concurrency (https://github.com/google/tensorstore/blob/master/tensorstore/kvstore/gcs_http/gcs_resource.h). The admission queue then allows up to 32 parallel HTTP operations. |
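To make the admission-queue idea concrete, here is a small standard-library analogy. It is not tensorstore's implementation; the `AdmissionQueue` class and `fake_write` function are invented for illustration, and the limit of 4 is chosen only to keep the example short.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class AdmissionQueue:
    """Allow at most `limit` operations to be in flight at once."""

    def __init__(self, limit):
        self._slots = threading.Semaphore(limit)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def run(self, operation):
        with self._slots:  # Blocks while `limit` operations are running.
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return operation()
            finally:
                with self._lock:
                    self.in_flight -= 1


def fake_write():
    time.sleep(0.01)  # Stand-in for one chunk write.
    return 1


admission = AdmissionQueue(limit=4)
# 16 worker threads submit 32 operations, but only 4 run at a time.
with ThreadPoolExecutor(max_workers=16) as pool:
    done = sum(pool.map(lambda _: admission.run(fake_write), range(32)))

print("completed:", done)
print("peak concurrency:", admission.peak)
```

However many operations are submitted, the recorded peak never exceeds the configured limit, which is the behaviour described for the 32-slot GCS queue.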
Thanks again,
How can I ensure the writing process is complete? I want to ensure the data is on the disk, not the buffer or cache. |
That's an open-ended question. Have you tried logging the
You can look at how we write to tensorstore in examples/compute_percentiles.cc. Can you distill your problem into a function which you can include in a comment? |
Also there was a recent fix for sharded writing --- make sure you are using a sufficiently new version (0.1.63 or later). |
Sorry for the ambiguity. Thanks for letting me know about sharded writing.
I saw the example; I am using the same method. Zarr creates the chunk, but the chunk remains empty. |
Yes, please post your code. |
I created a repo for that.
No matter how much data I write into it, the zarr file remains 4 KB. |
NOTE: I reformatted the above example and removed some unnecessary lines to make it more readable; I also changed the output location from The code above is essentially correct, but I think that you merely misunderstand the data storage format. The
Since your chunk shape is 10x10, and you write into indices 0..99 x 0..99, these files have indices from 0..9. |
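The index range quoted above follows directly from the array and chunk shapes. A minimal sketch of that arithmetic, assuming a 100x100 array as implied by the indices 0..99 (the helper name `chunk_grid` is hypothetical):

```python
import math
from itertools import product


def chunk_grid(shape, chunk_shape):
    """Number of chunks along each dimension (the last one may be partial)."""
    return [math.ceil(n / c) for n, c in zip(shape, chunk_shape)]


grid = chunk_grid([100, 100], [10, 10])
indices = list(product(*(range(n) for n in grid)))

print(grid)                      # [10, 10]
print(len(indices))              # 100
print(indices[0], indices[-1])   # (0, 0) (9, 9)
```

So a fully written array of this shape is stored as 100 chunk files whose indices run from 0 to 9 in each dimension.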
Can I see your code? |
I edited your above comment; that is exactly what I compiled. Are you misunderstanding the
|
Oh, I see. Sorry, I confused that with file size. |
Is there any way to turn off all the compression in zarr3? as compression is no longer enabled. |
Setting the |
Won't that add overhead? It would call the Blosc codec and return without compressing anything.
What do you think about this? |
I'm not too familiar with Zarr3, but it looks like specifying no codec will result in uncompressed data.
import tensorstore as ts
import numpy as np
spec = {
'driver': 'zarr3',
'kvstore': {
'driver': 'file',
'path': 'tmp/zarr3',
},
'create': True,
'delete_existing': True,
'metadata': {
'shape': [100, 100],
'data_type': 'float64',
"codecs": [{"name": "blosc", "configuration": {"cname": "lz4", "clevel": 9}}] # You could remove this line
}
}
tensorstore_object_result = ts.open(spec)
tensorstore_object = tensorstore_object_result.result()
data = np.random.rand(100, 100)
write_future = tensorstore_object.write(data)
write_future.result()  # Block until the write has completed
# tmp/zarr3/c/0 results in 72K according to `du -sh 0`
# tmp/zarr3/c/0 results in 80K according to `du -sh 0` without the codec line
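As a sanity check on the sizes reported above, the uncompressed payload can be worked out by hand. The 4096-byte block size below is an assumption about the filesystem, and `du` may also count directory blocks, so treat this as a rough estimate only.

```python
import math

shape = (100, 100)
itemsize = 8  # bytes per float64 element

raw_bytes = shape[0] * shape[1] * itemsize

block = 4096  # assumed filesystem block size
on_disk = math.ceil(raw_bytes / block) * block

print(raw_bytes)        # 80000
print(on_disk // 1024)  # 80
```

An 80,000-byte payload rounded up to whole blocks is consistent with the 80K figure seen without the codec line, which suggests the data really is stored uncompressed in that case.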
I have never used the |
Yes, you can write any of
If you want full control over the array shape, such as creating an ArrayView with the shape 3x4
|
I followed the exact instructions and I got
This is the same error I got with converting from other types. |
Sorry, I omitted the second You can see that more parameters are required by the error message. |
I'm still getting an error: |
My best advice here is to (1) look for the differences between what you have and working examples, (2) try and determine what you want to be doing based on reading the code, and (3) always post your code when you want help with an error message. Otherwise anyone who tries to help is just guessing. |
My bad! I didn't notice it would be ambiguous.
And I received :
I also tested the code without changing the datatype and received the same error with: |
Write takes two parameters. But additionally, since Write is asynchronous, the array needs to have a Shared element pointer so that Write can retain a reference. Here since you are calling |
I tried all the methods you suggested, and all of them produced an error.
Make error:
And I have no idea how to fix it. |
I can't reproduce your exact code as it's lacking some context, but I believe you have the parameters in the write incorrect. The write is expecting I believe what you want would be |
I'm sorry. I made a typo when writing the question. I am passing the data and target to the write, and it makes that error. |
Try something like:
|
Thanks.
I tried |
Can you share the code you are using for writing? |
same code as : #172 (comment) |
I'm trying to get the best write time from the zarr3 driver. I ran the code mentioned in #172 (comment) with different concurrency limits and different system configurations. Surprisingly, it stops improving at 16, the number of cores in one NUMA node. I tried the approaches mentioned above to overcome this, but none worked. I want to scale my writing to more than one node, and would like to ask how this can be handled. |
It's really hard to tell you what to do since all of your comments lack enough detail to debug and your test repo clearly hasn't been updated to include any of the suggested changes. We have no idea what your NUMA machine looks like, nor the characteristics of the storage, etc. (Mostly IO is going to be storage-bound, not CPU-bound, so adding additional threads is unlikely to produce a large increase in write throughput.) Thus we have no idea what performance you actually see, nor what the expectations are. If you want to run a benchmark with different options you can try our existing benchmarks replacing the default
To run them, try something like this:
git clone https://github.com/google/tensorstore.git
cd tensorstore
bazelisk.py run -c opt \
//tensorstore/internal/benchmark:ts_benchmark -- \
--strategy=random \
--total_read_bytes=-10 \
--total_write_bytes=-2 \
--repeat_reads=16 \
--repeat_writes=8 \
--context_spec='{
"cache_pool": { "total_bytes_limit": 268435456 },
"file_io_concurrency": {"limit": 32}
}' \
--tensorstore_spec='{
"driver": "zarr3",
"kvstore": "file:///tmp/abc/",
"metadata": {
"shape": [16384, 16384],
"chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [1024, 1024]}},
"chunk_key_encoding": {"name": "default"},
"data_type": "int32"
}
}'
When I run this on my vm I see output like:
For my vm and this example, increasing from the default concurrency (32) to 64 looks like this:
And going to 128 looks like this:
That is, total throughput went down and individual file write times went up. And clearly there is no benefit going from 64 to 128.
It also appears to me that some of the issues you're running into have to do with C++ having a more difficult API to decipher and use than Python. Try running your test code in Python and see if that lets you iterate faster to figure out what you actually need to do, then convert that to C++. The Python code is a wrapper for C++, so in theory it will be easier to test the performance of the underlying code in a faster-to-iterate configuration. You can also look at the comment I made here, #179 (comment), which demonstrates creating a relatively large volume.
I think that your best course of action is to make your above GitHub repository into a self-contained test/benchmark which you can use to get an idea of the performance characteristics. Once that works, if there are issues, it's possible to use profiling tools to dig into the performance in a more detailed way. |
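For reference, the size of the volume described by the benchmark spec above can be worked out from its `shape`, `chunk_shape` and `data_type`. This is just arithmetic on the values in the spec and assumes the chunks are stored uncompressed.

```python
import math

shape = (16384, 16384)
chunk_shape = (1024, 1024)
itemsize = 4  # bytes per int32 element

total_bytes = shape[0] * shape[1] * itemsize
chunk_bytes = chunk_shape[0] * chunk_shape[1] * itemsize
n_chunks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunk_shape))

print(total_bytes / 2**30, "GiB in total")   # 1.0 GiB in total
print(chunk_bytes / 2**20, "MiB per chunk")  # 4.0 MiB per chunk
print(n_chunks, "chunks")                    # 256 chunks
```

Dividing the total size by a measured wall-clock time gives a throughput figure that can be compared against the rated speed of the storage device.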
That's a nice update. Given the code above, tensorstore is writing ~1 gigabyte across 128 files. While there aren't numbers, the chart indicates that you're getting around 2 gigabytes/second. For comparison, a Samsung 970 Pro SSD can sustain about 2.8 GB/s. See Tom's Hardware SSD Benchmark page for an idea about SSD performance; the copy (MB/s) column is probably close to the performance you'd expect in real-world applications. Also, can you try running the benchmark bundled with tensorstore on your machine using the same configuration as I've posted above, which is Zarr3, 1GB volume, etc.? I'd be interested in seeing the output of the rates and the metrics. Edit:
This expectation is not warranted. The OS schedules writes to a disk and it's easily possible to saturate the IO bandwidth with a smaller number of writes. |
Here are the numbers for time: y = [4.78404, 2.37955, 1.38456, 0.93017, 0.73465, 0.77801, 0.72834, 0.60902]
I'm trying to figure it out now. I'm running on Delta HPC. If I increase the stripe count to the number of available OSTs, will that prevent IO saturation? |
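The timings listed above can be turned into speedup ratios to see where the gains level off. The concurrency setting used for each run is not stated in the thread, so the sketch below identifies runs only by their position in the list.

```python
times = [4.78404, 2.37955, 1.38456, 0.93017, 0.73465, 0.77801, 0.72834, 0.60902]

baseline = times[0]
# Speedup of every run relative to the first one.
speedups = [baseline / t for t in times]
# Gain of each run relative to the run just before it.
step_gains = [times[i - 1] / times[i] for i in range(1, len(times))]

for i, (t, s) in enumerate(zip(times, speedups)):
    print(f"run {i}: time {t:.5f}  speedup x{s:.2f}")

print("step-to-step gains:", [round(g, 2) for g in step_gains])
```

A step-to-step gain close to 1.0 means the extra parallelism bought almost nothing, which is what one would expect once the storage bandwidth is saturated.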
To diagnose the compiler error I may need more of a log file, but that shouldn't happen. Can you share the log when building all of tensorstore along with your gcc/clang version?
By Delta, you mean this: https://www.ncsa.illinois.edu/research/project-highlights/delta/ Your tensorstore spec is writing to the local NVMe disk and the throughput seems pretty reasonable for local storage. I don't know how the Lustre filesystem works, however it "provides a POSIX standard-compliant UNIX file system interface." So there is some local path that you could use for tensorstore writes, something like |
I'm using the tensorstore C++ API to write a Zarr file, and I need to address these two issues in my code:
1- I want to create groups, as in Zarr Python, and build a hierarchy of namespaces and datasets. Right now, I handle this through the kvstore path when creating them. What is the optimal way? This approach takes more time than the Python version!
2- I'm trying to write data into each chunk in parallel. I want to avoid multithreading for now and use MPI instead. I'm writing everything from scratch for this purpose, as I couldn't find anything in the documentation. Is there a better way to avoid multithreading?
Thanks