Multithreaded Read and output in chunks #23
I think it would be better to move the "fst" part to another issue. @printsev, what do you think about multithreaded reading? I also think it would be a good idea to make your API more "open" to help other tools use this library. For example, I know that spark-sas7bdat uses reflection to get access to private fields and methods of the SasFileParser class. Do you have any ideas about that?
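To illustrate the kind of workaround downstream tools currently resort to, here is a minimal sketch of reflective access to a private field. The `Parser` class and its `rowCount` field below are stand-ins invented for the demo, not the real `SasFileParser` internals:

```java
import java.lang.reflect.Field;

// Stand-in for a parser class whose state is hidden behind private fields.
class Parser {
    private int rowCount = 42; // pretend internal state a tool wants to read
}

public class ReflectionDemo {
    public static void main(String[] args) throws Exception {
        Parser p = new Parser();
        // Reflection workaround: reach into a private field by name.
        Field f = Parser.class.getDeclaredField("rowCount");
        f.setAccessible(true);
        System.out.println(f.getInt(p)); // prints 42
    }
}
```

This works, but it is brittle: a rename of the private field breaks every downstream tool silently at runtime, which is exactly why a public accessor or a more open API is preferable.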
Another suggestion I have comes from working with SAS Transport (XPORT V5) files regularly. I am not familiar with the internals of sas7bdat files, but I am assuming they're similar. XPORT files have fixed-width columns, which means you know the length/width of every record. Once you read the schema, you can determine this length. Then, once your file position is at the start of the first record, you can efficiently jump to a record at a given offset/index. This makes tasks like simple pagination extremely efficient. In XPORT V5, there are some special cases you need to handle when records are < 80 characters due to ASCII space […].

This would allow multiple threads to create their own independent readers (assuming all threads have read-only access to the file). Each thread would open the file, jump to the chunk it intends to process, and run in parallel. This is much easier and more efficient than introducing threading concerns into the current implementation.

At one point, I implemented my own XPORT V5 reader using RandomAccessFile and later upgraded it to use SeekableByteChannel. I don't have permission to share it with you, but it was fairly straight-forward once you have a version using […].

The other thing to be concerned with is related to efficiency -- you don't have the equivalent of […], something like:

```java
public int readRecords(Object[][] records) { ... } // returns how many records could be read (can be between 0 and records.length)
```

Maybe if I get a spare moment, I could try to dig into your parser code and get more familiar with the SAS7BDAT format. I could probably make similar changes to support random access in that code, like the ones I made to support it in my XPORT V5 stuff.
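The per-thread reader idea above can be sketched as follows. This is a toy, assuming a made-up fixed-width format (the 16-byte record length and header-free layout are invented for the demo; they are not the real XPORT V5 or sas7bdat layout). Each worker opens its own `SeekableByteChannel`, jumps to its chunk, and reads independently:

```java
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class ChunkedReader {
    static final int RECORD_LEN = 16;  // hypothetical fixed record width
    static final long HEADER_LEN = 0;  // demo assumes no file header

    // Byte offset of record `index` -- the key property of fixed-width formats.
    static long offsetOf(long index) {
        return HEADER_LEN + index * RECORD_LEN;
    }

    // Open an independent channel, seek to the first record of the chunk,
    // and read up to `count` records sequentially.
    static List<String> readChunk(Path file, long first, int count) throws Exception {
        try (SeekableByteChannel ch = Files.newByteChannel(file, StandardOpenOption.READ)) {
            ch.position(offsetOf(first));
            List<String> out = new ArrayList<>();
            ByteBuffer buf = ByteBuffer.allocate(RECORD_LEN);
            for (int i = 0; i < count; i++) {
                buf.clear();
                while (buf.hasRemaining() && ch.read(buf) != -1) { }
                if (buf.position() == 0) break; // end of file
                out.add(new String(buf.array(), 0, buf.position(),
                        StandardCharsets.US_ASCII).trim());
            }
            return out;
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a demo file with 8 fixed-width records: rec0 .. rec7.
        Path p = Files.createTempFile("records", ".dat");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) sb.append(String.format("%-16s", "rec" + i));
        Files.write(p, sb.toString().getBytes(StandardCharsets.US_ASCII));

        // Two workers, each with its own reader over a disjoint chunk.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<List<String>> lo = pool.submit(() -> readChunk(p, 0, 4));
        Future<List<String>> hi = pool.submit(() -> readChunk(p, 4, 4));
        System.out.println(lo.get()); // prints [rec0, rec1, rec2, rec3]
        System.out.println(hi.get()); // prints [rec4, rec5, rec6, rec7]
        pool.shutdown();
        Files.delete(p);
    }
}
```

Because each worker has its own channel and position, no synchronization is needed for reads, which is the point of keeping threading out of the parser itself.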
I think the SAS7BDAT format is suitable for multithreaded processing.
Outputting to CSV is a pain, and I think it would be good to output to a more modern format such as fst.