Feature request: Parallel file reading #239
Comments
This is an interesting request (the second in one week). Back when I parallelised Mksquashfs for the first time in about 2006, I did extensive experiments reading the source filesystem using one thread and multiple threads. These experiments showed the maximum performance was obtained with a single read thread (and so you're right that there is only one reader thread). But this was in the days of mechanical hard drives with slow seeking, and the results were not that surprising: by and large, anything which caused seeking (including parallel reading of files) produced worse performance. Modern hardware, including RAID (*) and SSD drives, may have changed the situation. So I'll add this to the list of enhancements and see if priorities allow it to be looked at for the next release.

(*) RAID has been around since the late 1980s. In fact I implemented a block-striping RAID system in 1991. But these systems have become more and more widespread in recent years. As far as RAID is concerned, I assume these systems are using block striping rather than bit striping, otherwise there should not be an issue. Also, as readahead should kick in for large files and utilise all the disks with block striping, I assume the issue is with small files which do not benefit from readahead.
I'm interested in this feature too :)
You may have noticed that I have pushed the parallel file reading improvements to the branch reader_improvements. The code in that branch implements the ability to have up to 128 parallel reader threads (the 128 limit is arbitrary, and can be increased). By default the code uses a conservative six parallel reader threads, split into three "small file reader threads" and three "multi-block reader threads". The default number of reader threads can be changed with new, currently undocumented, command-line options.
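As a rough illustration of how such options might be invoked (the option names -small-readers and -block-readers below are assumptions based on this description, not confirmed syntax; check the branch's help output for the actual names):

    # hypothetical invocation: 8 small-file reader threads and 4 multi-block reader threads
    ./mksquashfs source image.squashfs -small-readers 8 -block-readers 4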
The difference between a small-reader thread and a block-reader thread is that a small-reader thread only reads files smaller than the block size, and a block-reader thread only reads files that are a block size or larger. The split is due to performance testing, which showed that distinguishing reader threads in this way can increase performance. The rationale for this will be documented/discussed later.

Can these parallel reader threads increase performance? In testing, whether they increase performance is entirely dependent on a vast combination of variables, such as the number of processors, the I/O speed of the media, and the mix of small and large files in the source files to be compressed. Generally speaking, if you do not have many processors, the medium is fast and the files are large, then a single reader thread has not been your bottleneck; your bottleneck is the speed of compression, which won't change. On the other hand, if you have a fast machine, the media benefits from parallel reading, and the files are small, then a single reader thread has been a major bottleneck limiting the performance of Mksquashfs, and I have cases where Mksquashfs is more than six times faster.

In addition, in testing I have not found an optimal setting for the above options; as you might expect, the optimal settings are highly dependent on the previously mentioned variables. The smaller the files and the more parallel the media, the greater the performance that can be obtained from a large number of parallel reader threads, especially the small-reader threads. Experimentation with the options seems to be necessary to get the optimal performance.

The following test matrix was generated by running Mksquashfs over a source filesystem consisting of 128 byte files, then 256 byte files, 512 byte files etc., with the number of parallel reader threads varied from 1 (a single reader thread), through 1:1 (one small-reader thread and one block-reader thread), to 62:62 (62 small-reader threads and 62 block-reader threads). The amount of data was the same in all cases (i.e. the same amount of data was split across 128 byte files, 256 byte files etc.). The machine has 14 cores/20 threads. The media was a SanDisk Extreme 55AE SSD connected via USB 3.
For instance, with 128 byte files, a single reader thread took 58 minutes, and 62 small-reader threads took 7 minutes. With 4K files, a single reader thread took 1 minute 42 seconds, and 62 small-reader threads took 10 seconds. The larger the files, the smaller the improvement obtained with parallel reader threads, but 512K files still obtain an improvement. Please test, and I'll be interested in what feedback you have.
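For anyone wanting to reproduce this kind of matrix, a rough sketch of the methodology described above (the file count, sizes and reader option names are illustrative assumptions, not the exact test harness used):

    # generate a source tree of fixed-size files (here 128 bytes each)
    mkdir -p src
    for i in $(seq 1 100000); do
        head -c 128 /dev/urandom > src/file_$i
    done

    # time a build with a given reader split (option names assumed from the branch)
    time ./mksquashfs src test.squashfs -noappend -small-readers 62 -block-readers 62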
Hi, thanks a lot for your work! I've made a few quick tests. My main use case is packaging datasets to be stored on magnetic tape. This medium heavily discourages small files (< 2 GiB), so making archives/zips/tars/isos/squashfs is fundamental for efficient read/write operations.

Test dataset: ~50 GB, ~750 files. 480 of those files are ~100 MB; 122 files are < 128K (the block size).

First test: default mksquashfs, gzip compression: real 3m9.225s
Second test: parallel mksquashfs, gzip compression, 64:64 readers (3.4x FASTER): real 0m42.996s

I think the improvement is pretty substantial. Are there any plans to merge this feature? On another topic, I saw that zstd compression is not supported (yet?). Is there an inherent limitation on which compression algorithms may be supported using the parallel readers? Thanks again!
I hit and fixed a deadlock last week, and so I'm going to continue testing for a couple of days, and then hopefully merge it.
Hmm, by default only gzip is enabled in the Makefile. To enable zstd (and the other compression algorithms) you can edit the Makefile and uncomment the relevant line, e.g. #ZSTD_SUPPORT = 1 becomes ZSTD_SUPPORT = 1. Or you can enable it on the make command line, e.g. CONFIG=1 ZSTD_SUPPORT=1 make. Historically, the reason for this is that not all distros had support for the more modern compression algorithms. Such distros are probably quite rare now, so it should probably default to building all the compression algorithms. Thanks for testing and the feedback. A speedup of 3.4x in a real-world scenario was more than I was expecting!
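Putting that together, a minimal sketch of building with zstd enabled and then selecting it at compression time (the build command is the one quoted above; -comp zstd is the standard mksquashfs compressor option):

    # build with zstd support enabled on the make command line
    CONFIG=1 ZSTD_SUPPORT=1 make

    # then select zstd when creating the image
    ./mksquashfs source image.squashfs -comp zstd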
That's great!
Oh, perfect, then I'll test tomorrow again with zstd.
I'm more than happy to help test this. I have data to pack these days, and I'll do some more tests on other real-life data. Thanks again!
Hi, another real-case scenario. I attach a plot with the size distribution. Written to RAM as before. In this case, because of the large number of files and the network latency, it takes a long time to read all the data.

Serial, zstd: real 12m44.803s
Parallel, 64:64 (5.77x FASTER): real 1m52.950s

BTW, could I test with even MORE reader threads? I'm pretty sure our storage system can cope with ~500 small reader threads.
I'm using squashfs as part of rules_appimage. Contents are mostly executable code (lots of ELF and .py files). My squashfs archives look mostly like this:

Code to generate the plot (thx ChatGPT):

import os
import sys
import matplotlib.pyplot as plt
# Check if directory is passed as an argument
if len(sys.argv) < 2:
print("Usage: python3 file_size_distribution.py /path/to/directory")
sys.exit(1)
# Get directory from command-line argument
directory = sys.argv[1]
# Check if the directory exists
if not os.path.isdir(directory):
print(f"Error: Directory '{directory}' not found.")
sys.exit(1)
# Define size buckets and initialize counters
buckets = {
'<10 B': 0,
'10 B': 0,
'100 B': 0,
'1 KB': 0,
'10 KB': 0,
'100 KB': 0,
'1 MB': 0,
'10 MB': 0,
'>100 MB': 0
}
total_files = 0
total_size = 0
# Scan directory and categorize file sizes
for root, dirs, files in os.walk(directory):
for file in files:
file_path = os.path.join(root, file)
try:
size = os.path.getsize(file_path)
total_files += 1
total_size += size
if size < 10:
buckets['<10 B'] += 1
elif size < 100:
buckets['10 B'] += 1
elif size < 1000:
buckets['100 B'] += 1
elif size < 10000:
buckets['1 KB'] += 1
elif size < 100000:
buckets['10 KB'] += 1
elif size < 1000000:
buckets['100 KB'] += 1
elif size < 10000000:
buckets['1 MB'] += 1
elif size < 100000000:
buckets['10 MB'] += 1
else:
buckets['>100 MB'] += 1
except (FileNotFoundError, PermissionError):
pass # Skip files that can't be accessed
# Convert total size to human-readable format
def human_readable_size(size):
for unit in ['B', 'KB', 'MB', 'GB', 'TB']:
if size < 1024:
return f"{size:.1f} {unit}"
size /= 1024
return f"{size:.1f} PB"
total_size_hr = human_readable_size(total_size)
# Prepare data for plotting
labels = list(buckets.keys())
counts = list(buckets.values())
# Plot the data
plt.figure(figsize=(10, 6))
plt.bar(labels, counts, color='blue')
plt.xlabel('File Size')
plt.ylabel('Number of Files')
plt.title('File Size Distribution')
plt.suptitle(f"Total Files: {total_files} | Total Size: {total_size_hr}", fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.xticks(rotation=45)
plt.tight_layout()
# Save and show the plot
output_file = 'file_size_distribution.png'
plt.savefig(output_file)
plt.show()
print(f"Plot saved as {output_file}") I extracted and re-packed this squashfs on a Ryzen 7 3800X 8-Core Processor (16 threads) with the mksquashfs from this branch.
I'm not sure if I'm missing some important piece here or if my setup just doesn't benefit from parallel reads. |
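For context, a generic sketch of the kind of extract-and-repack round trip described above (paths are placeholders; this is not the exact command line used):

    # unpack the existing image into a directory
    unsquashfs -d extracted original.squashfs

    # repack it with the build from the reader_improvements branch and time it
    time ./mksquashfs extracted repacked.squashfs -noappend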
Hi @lalten, I think there are a few factors that may explain why you don't get much difference.
The use cases mostly benefiting from parallel readers should be (a) lots of small files, and (b) remote storage :)
Np, I have increased the maximum to 8192 small reader threads (and 8192 block reader threads) in this commit. The commit message has some important information about maximum open file limits, and what happens when you hit them.
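As a side note, a quick way to check and raise the per-process open file descriptor limit before experimenting with very large reader counts (generic shell usage, not taken from the commit message):

    ulimit -n          # show the current soft limit on open file descriptors
    ulimit -n 16384    # raise it for this shell, up to the hard limit shown by 'ulimit -Hn'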
If the files are cached in memory (which is likely here) then what you're measuring is the speed of memory access, which is very fast even with a single read thread. The CPU utilisation of around 1500% (93.75% of the total) shows the processors are maxed out. To test the actual speed from SSD, before each time you run Mksquashfs you should flush the caches by running the following as root:

echo 3 > /proc/sys/vm/drop_caches
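A minimal sketch of a cache-cold benchmark loop along these lines (run as root; the -small-readers/-block-readers option names are assumptions used for illustration):

    for readers in 1 4 16 64; do
        sync
        echo 3 > /proc/sys/vm/drop_caches    # drop the page cache so reads really hit the SSD
        time ./mksquashfs source out.squashfs -noappend -small-readers $readers -block-readers $readers
    done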
Does the increase in threads improve performance? I won't be disappointed if it doesn't! But it will be interesting for me to know the results. Thanks.
Status update ... Further extensive pre-release testing brought up another unexpected issue, and I have spent the last couple of days tracking it down. About 30% of the core of Mksquashfs has been rewritten to deal with parallel reader threads, and unfortunately as a result this does require extensive pre-release testing to pick up any unknown issues.

For your information, the "new" issue turned out to be a "benign" bug that has been in Mksquashfs for 17 years without being discovered. The reason it has cropped up now is that the parallel reader threads make it about 1000 times more likely to occur. In short, when testing a large filesystem (about 160 Gbytes, 4.2 million files and 1.5 million duplicates), occasionally the number of reported duplicates would be less than expected by between 1 and 5 files, that is at worst 1,625,263 duplicates rather than the correct 1,625,268. The effect of this is a correct filesystem, but one where 1 - 5 files have not been seen as duplicates and so are stored twice, which means a loss of compression. This is why the bug is relatively benign: it doesn't result in an incorrect filesystem.

As I said, this turns out to be a race condition introduced in 2008, which is almost impossible to hit with a single reader thread and so has never shown up before, but with about 10 or more parallel reader threads it does occur rarely (here about 5 in 4.2 million files). The fix is relatively easy to do, and I'll be working on that tomorrow. After a week of testing this is the only issue to show up, and so hopefully I'll be able to merge it later this week.
Hi @plougher, I think that in order to benefit from many, many threads you would need some kind of crappy storage setup but also a powerful machine to handle so many threads: a massive storage system mounted on a remote location with high latency, lots of disks to handle the parallelism, and also millions of tiny files. I pity anyone who has to work with this scenario ;)

In my case, I could not measure much difference. Using other tools for parallel reading, our system can hold up to 700-800 threads, depending on file size. (If you are curious, WebDAV transfers Zürich-Barcelona using 700 threads reach about 500 MiB/s.) So, I don't think I can benefit from more than 1k threads. Thanks again! Pau.
Hi @ptallada
Np, thanks for testing. It's good to know anything over 1k threads isn't going to be of benefit.

I spent a couple of weeks in rather lovely Zurich (sorry, I don't know how to do the umlaut) about 20 years ago. My brother was involved with ETH Zurich at the time, and so we made it a vacation. Out of amusement we stayed at the Hotel Bristol; Bristol obviously sounds exotic in Zurich, but for us it was hardly exotic, being our nearest big city. The only memorable mishap was that I was learning German at the time and often confused words. So one lunchtime I meant to order two glasses of the local wine, instead ordered two bottles, and was too embarrassed to correct my mistake. I don't normally drink a bottle of wine at lunchtime.

I have stayed up all night working on this bug, and so I'm off to bed. Thanks for all your feedback,
The data comes from there: huge dark-matter simulations ;)
Hahahah, good to know :P
This merges the work to add parallel reader threads to Mksquashfs. This is discussed in issue #239. In particular, for more information about the new options and the possible speed improvements, see comment #239 (comment). Signed-off-by: Phillip Lougher <[email protected]>
Currently mksquashfs seems to use a single reader thread.
Many current devices only achieve optimal throughput when files are read from them in parallel.
Could mksquashfs add (configurable) threaded reading? Thanks!
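For context, one generic way to see this effect on a given device is to compare single-job and multi-job random-read throughput, for example with fio (the job parameters below are purely illustrative and not from the original report):

    fio --name=single   --directory=/mnt/test --rw=randread --bs=4k --size=1G --numjobs=1 --group_reporting
    fio --name=parallel --directory=/mnt/test --rw=randread --bs=4k --size=1G --numjobs=8 --group_reporting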