
Multithreaded Read and output in chunks #23

Open
xiaodaigh opened this issue Nov 20, 2017 · 2 comments
xiaodaigh commented Nov 20, 2017

I think the SAS7BDAT format is suitable for multithreaded processing.

Outputting to CSV is a pain, and I think it would be good to output to a more modern format such as fst


mulya commented Jan 31, 2018

I think it would be better to move the "fst" part to another issue. @printsev, what do you think about multithreaded reading? I think it would be a good idea to make your API more "open" to help other tools use this library. For example, I know that spark-sas7bdat uses reflection to get access to private fields and methods of the SasFileParser class. Do you have any ideas about that?

printsev reopened this Aug 1, 2018

jehugaleahsa commented Feb 8, 2021

Another suggestion I have comes from working with SAS Transport (XPORT V5) files regularly. I am not familiar with the internals of sas7bdat files, but I assume they are similar.

XPORT files have fixed-width columns, which means you know the length/width of every record. Once you read the schema, you can determine this length. Then, once your file position is at the start of the first record, you can efficiently jump to a record at a given offset/index. This makes tasks like simple pagination extremely efficient. In XPORT V5, there are some special cases you need to handle when records are shorter than 80 characters, due to ASCII space (0x20) padding at the end of the file, but these are easy enough to handle.
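
For illustration, the offset arithmetic is just this (names like dataStartOffset and recordLength are made up here, not from this library):

// Hypothetical sketch: byte offset of the i-th fixed-width record.
// dataStartOffset = first byte after the header/schema;
// recordLength = fixed record width in bytes.
static long recordOffset(long dataStartOffset, int recordLength, long recordIndex) {
    return dataStartOffset + recordIndex * (long) recordLength;
}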

This would allow multiple threads to create their own independent readers (assuming all threads have read-only access to the file). Each thread would open the file, jump to the chunk it intends to process, and run in parallel. This is much simpler and more efficient than introducing threading concerns into the current implementation.
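
As a rough sketch of what I mean (purely illustrative; processChunk stands in for whatever per-chunk parsing the library would actually do):

import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: each worker gets a contiguous range of records and
// would open its own independent, read-only reader over the same file.
static void readInParallel(Path file, long totalRecords, int threads)
        throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    long chunkSize = (totalRecords + threads - 1) / threads;
    for (int t = 0; t < threads; t++) {
        long first = t * chunkSize;
        long last = Math.min(first + chunkSize, totalRecords);
        pool.submit(() -> processChunk(file, first, last));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
}

static void processChunk(Path file, long first, long last) {
    // placeholder: open a channel, seek to record `first`,
    // then parse records [first, last) without touching other chunks
}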

At one point, I implemented my own XPORT V5 reader using RandomAccessFile and later upgraded it to use SeekableByteChannel. I don't have permission to share it with you, but it was fairly straightforward once you have a version using InputStream working. Note this would mean taking a File or Path object instead of an InputStream, since you need to jump around the file.
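
For what it's worth, the seek-and-read step with SeekableByteChannel is roughly this (illustrative names again; this is not parso's API):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: jump to a record by index and read its raw bytes.
static byte[] readRecord(Path file, long dataStartOffset,
                         int recordLength, long recordIndex) throws IOException {
    try (SeekableByteChannel channel = Files.newByteChannel(file)) {
        // same arithmetic as the recordOffset sketch above
        channel.position(dataStartOffset + recordIndex * (long) recordLength);
        ByteBuffer buffer = ByteBuffer.allocate(recordLength);
        while (buffer.hasRemaining() && channel.read(buffer) != -1) {
            // keep reading until the record is complete or EOF
        }
        buffer.flip();
        byte[] record = new byte[buffer.remaining()];
        buffer.get(record);
        return record;
    }
}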

The other thing to be concerned with is efficiency -- you don't have the equivalent of BufferedInputStream, so jumping around and reading from the file both require OS interactions. To achieve good performance, you would probably want to support reading multiple records into an array (or list). You could even provide a separate implementation of the SASFileReader interface that wraps the random-access file reader and handles the buffering via said array/list internally.

public int readRecords(Object[][] records) { ... } // returns how many records could be read (can be between 0 and records.length)
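
To make the buffering idea concrete, a sketch of such a bulk read might look like this (raw bytes instead of parsed values, just for brevity; none of these names come from this library):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

// Illustrative sketch: read up to records.length fixed-width records in
// one bulk channel read, then slice the buffer per record, so the cost
// is one OS interaction per batch rather than one per record.
static int readRecords(SeekableByteChannel channel, int recordLength,
                       byte[][] records) throws IOException {
    ByteBuffer batch = ByteBuffer.allocate(recordLength * records.length);
    while (batch.hasRemaining() && channel.read(batch) != -1) {
        // keep filling until the batch is full or EOF
    }
    batch.flip();
    int count = batch.remaining() / recordLength; // whole records only
    for (int i = 0; i < count; i++) {
        records[i] = new byte[recordLength];
        batch.get(records[i]);
    }
    return count; // between 0 and records.length
}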

Maybe if I get a spare moment, I could try to dig into your parser code and get more familiar with the SAS7BDAT format. I could probably make changes similar to the ones I made for my XPORT V5 reader to support random access in this code.
