
Multithreaded Read and output in chunks #23

Open
xiaodaigh opened this issue Nov 20, 2017 · 2 comments
xiaodaigh commented Nov 20, 2017

I think the SAS7BDAT format is suitable for multithreaded processing.

Outputting to CSV is a pain, and I think it would be good to output to a more modern format such as fst


mulya commented Jan 31, 2018

I think it would be better to move the "fst" part to another issue. @printsev, what do you think about multithreaded reading? I think it would be a good idea to make your API more "open" to help other tools use this library. For example, I know that spark-sas7bdat uses reflection to get access to private fields and methods of the SasFileParser class. Do you have any ideas about that?

printsev reopened this Aug 1, 2018

jehugaleahsa commented Feb 8, 2021

Another suggestion I have comes from working with SAS Transport (XPORT V5) files regularly. I am not familiar with the internals of sas7bdat files, but I assume they are similar.

XPORT files have fixed-width columns, which means you know the length/width of every record. Once you read the schema, you can determine this length. Then, once your file position is at the start of the first record, you can efficiently jump to a record at a given offset/index. This makes tasks like simple pagination extremely efficient. In XPORT V5, there are some special cases you need to handle when records are shorter than 80 characters, due to ASCII space (0x20) padding at the end of the file, but these are easy enough to handle.
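
For illustration, the offset arithmetic is just this (names like dataStartOffset and recordLength are made up here, not from this library):

// Hypothetical sketch: byte offset of the i-th fixed-width record.
// dataStartOffset = first byte after the header/schema;
// recordLength = fixed record width in bytes.
static long recordOffset(long dataStartOffset, int recordLength, long recordIndex) {
    return dataStartOffset + recordIndex * (long) recordLength;
}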

This would allow multiple threads to create their own independent readers (assuming all threads have read-only access to the file). Each thread would open the file, jump to the chunk it intends to process, and run in parallel. This is much simpler and more efficient than introducing threading concerns into the current implementation.
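
As a rough sketch of what I mean (purely illustrative; processChunk stands in for whatever per-chunk parsing the library would actually do):

import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: each worker gets a contiguous range of records and
// would open its own independent, read-only reader over the same file.
static void readInParallel(Path file, long totalRecords, int threads)
        throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    long chunkSize = (totalRecords + threads - 1) / threads;
    for (int t = 0; t < threads; t++) {
        long first = t * chunkSize;
        long last = Math.min(first + chunkSize, totalRecords);
        pool.submit(() -> processChunk(file, first, last));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
}

static void processChunk(Path file, long first, long last) {
    // placeholder: open a channel, seek to record `first`,
    // then parse records [first, last) without touching other chunks
}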

At one point, I implemented my own XPORT V5 reader using RandomAccessFile and later upgraded it to use SeekableByteChannel. I don't have permission to share it with you, but it was fairly straightforward once you have a version using InputStream working. Note this would mean taking a File or Path object instead of an InputStream, since you need to jump around the file.
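
For what it's worth, the seek-and-read step with SeekableByteChannel is roughly this (illustrative names again; this is not parso's API):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: jump to a record by index and read its raw bytes.
static byte[] readRecord(Path file, long dataStartOffset,
                         int recordLength, long recordIndex) throws IOException {
    try (SeekableByteChannel channel = Files.newByteChannel(file)) {
        // same arithmetic as the recordOffset sketch above
        channel.position(dataStartOffset + recordIndex * (long) recordLength);
        ByteBuffer buffer = ByteBuffer.allocate(recordLength);
        while (buffer.hasRemaining() && channel.read(buffer) != -1) {
            // keep reading until the record is complete or EOF
        }
        buffer.flip();
        byte[] record = new byte[buffer.remaining()];
        buffer.get(record);
        return record;
    }
}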

The other thing to be concerned with is efficiency -- you don't have the equivalent of BufferedInputStream, so jumping around and reading from the file both require OS interactions. To achieve good performance, you would probably want to support reading multiple records into an array (or list). You could even provide a separate implementation of the SASFileReader interface that wraps the random-access file reader and handles the buffering via said array/list internally.

public int readRecords(Object[][] records) { ... } // returns how many records could be read (can be between 0 and records.length)
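
To make the buffering idea concrete, a sketch of such a bulk read might look like this (raw bytes instead of parsed values, just for brevity; none of these names come from this library):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

// Illustrative sketch: read up to records.length fixed-width records in
// one bulk channel read, then slice the buffer per record, so the cost
// is one OS interaction per batch rather than one per record.
static int readRecords(SeekableByteChannel channel, int recordLength,
                       byte[][] records) throws IOException {
    ByteBuffer batch = ByteBuffer.allocate(recordLength * records.length);
    while (batch.hasRemaining() && channel.read(batch) != -1) {
        // keep filling until the batch is full or EOF
    }
    batch.flip();
    int count = batch.remaining() / recordLength; // whole records only
    for (int i = 0; i < count; i++) {
        records[i] = new byte[recordLength];
        batch.get(records[i]);
    }
    return count; // between 0 and records.length
}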

Maybe if I get a spare moment, I could try to dig into your parser code and get more familiar with the SAS7BDAT format. I could probably make changes similar to the ones I made for my XPORT V5 reader to support random access in this code.
