Accessing max length, and or min max post parsing #130

MathFrenchToast · 2023-07-21T15:47:59Z

I'm in the early stages of discovering your library; it looks great and fast.
For prototyping some ETL-like features on CSV, I'm using type inference and stream management (to avoid memory consumption).

Here are two user stories:

As a developer, to avoid a second parse, I want to access some metrics on the parsed data from the ResultColumn,
e.g. the max length in the case of DataType.String, or the min and max for numeric types,
so that I can better determine how to handle the data (in my case, determining the right storage).

Alternatively, as a developer, I would like a hook to inject my own function that can determine those metrics,
but this may interfere with the single-parse philosophy.

kosak · 2023-07-21T17:46:19Z

Hi, thanks for your question.

The library already goes through a lot of work to avoid a reparse. If your min/max ranges are the standard values like Short.MIN_VALUE/MAX_VALUE, Integer.MIN_VALUE/MAX_VALUE, etc., then the library will automatically pick the smallest data type that fits all the data and give it to you. It is able to do this in a single parse pass by using the following trick. It forms a guess of the current target type (say, byte) and begins to fill the target collection. Along the way, if it encounters a value that doesn't fit in a byte, it starts a new collection (of, say, short) and keeps going. Likewise for int and long if it has to.

When it reaches the end of the file, it will be in one of two cases. Either it guessed right the first time and it can just give you the column, or it has a set of collections of different types. In this latter case it copies the data from the smaller collections into the final, right-sized collection, and then gives it back to you. The (almost) worst case is a column of byte-sized numbers where the last one is a long. This would cost a single parse pass through the file, and then a copy of N-1 byte values into a long collection at the end. Other numeric scenarios have the same worst-case cost and still only a single pass through the file.
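To make the single-pass trick concrete, here is a minimal sketch in Java. It is not the library's actual code: the class `WideningSketch`, its `infer` method, and the `Result` holder are hypothetical names invented for illustration, and the final column is kept as a `long[]` plus a type name for brevity, where a real implementation would allocate a `byte[]`, `short[]`, `int[]`, or `long[]`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of single-pass numeric widening; not the library's code. */
class WideningSketch {
    static final String[] TYPE_NAMES = {"byte", "short", "int", "long"};

    /** Rank of the narrowest integral type that can hold v (0 = byte ... 3 = long). */
    static int rankOf(long v) {
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) return 0;
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) return 1;
        if (v >= Integer.MIN_VALUE && v <= Integer.MAX_VALUE) return 2;
        return 3;
    }

    static final class Result {
        final String typeName;          // inferred column type
        final long[] values;            // all values, in input order
        final int copiedFromNarrower;   // values that sat in a too-narrow segment

        Result(String typeName, long[] values, int copiedFromNarrower) {
            this.typeName = typeName;
            this.values = values;
            this.copiedFromNarrower = copiedFromNarrower;
        }
    }

    /** Parses every cell exactly once, widening the working type when needed. */
    static Result infer(String[] cells) {
        List<List<Long>> segments = new ArrayList<>();
        List<Integer> segmentRanks = new ArrayList<>();
        int currentRank = 0;                       // initial guess: byte
        List<Long> current = new ArrayList<>();
        segments.add(current);
        segmentRanks.add(currentRank);

        for (String cell : cells) {
            long v = Long.parseLong(cell.trim());
            int rank = rankOf(v);
            if (rank > currentRank) {
                // Value does not fit: start a new, wider segment and keep going.
                currentRank = rank;
                current = new ArrayList<>();
                segments.add(current);
                segmentRanks.add(currentRank);
            }
            current.add(v);
        }

        // End of input: copy every segment into the final, right-sized column.
        long[] out = new long[cells.length];
        int n = 0;
        int copied = 0;
        for (int s = 0; s < segments.size(); s++) {
            for (long v : segments.get(s)) {
                out[n++] = v;
            }
            if (segmentRanks.get(s) < currentRank) {
                copied += segments.get(s).size();
            }
        }
        return new Result(TYPE_NAMES[currentRank], out, copied);
    }

    public static void main(String[] args) {
        Result r = infer(new String[] {"1", "2", "3", "9999999999"});
        System.out.println(r.typeName + ", widened values: " + r.copiedFromNarrower);
    }
}
```

With three byte-sized values followed by one long, the input is parsed once and only the three earlier values need to be widened at the end, which matches the N-1 copy described above.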

Regarding your other question about String, we don't support a max length; the only type inference there is for strings of length 1. If the Unicode string happens to consist of a single character in the Unicode Basic Multilingual Plane (basically, can be represented as a Java char) then we give you a column of char, otherwise we give you a column of String. We do the single-pass trick there too.
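As an illustration of the "can be represented as a Java char" test, the check can be sketched with standard JDK calls only. The class and method names below are hypothetical and this is not the library's code.

```java
/** Hypothetical illustration of the single-char check; not the library's code. */
class CharInferenceSketch {
    /**
     * True if s is exactly one character in the Basic Multilingual Plane.
     * Characters outside the BMP take two UTF-16 code units, so length() is 2.
     * A lone surrogate has length 1 but is not a complete character.
     */
    static boolean fitsInChar(String s) {
        return s.length() == 1 && !Character.isSurrogate(s.charAt(0));
    }

    /** A column can be a char column only if every cell passes the check. */
    static boolean columnFitsInChar(String[] cells) {
        for (String cell : cells) {
            if (!fitsInChar(cell)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(columnFitsInChar(new String[] {"a", "\u00E9", "Z"}));
        System.out.println(columnFitsInChar(new String[] {"a", "bc"}));
    }
}
```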

I said "(almost) worst case" above because the actual worst case happens when we guess numeric but then have to transition to non-numeric, like String. We can't use the same collection-copying trick because there may have been some loss in character fidelity, e.g. parsing "00016" as numeric 16. For this reason we would take the prefix of the source that was initially guessed as numeric and reparse it.
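The fidelity problem is easy to reproduce with the JDK alone. The sketch below (hypothetical names, not the library's code) checks whether a cell's text survives a round trip through `Long.parseLong`. Any cell that fails cannot be rebuilt from its stored number, which is why the numeric prefix has to be reparsed from the source text.

```java
/** Hypothetical illustration of lost character fidelity; not the library's code. */
class FidelitySketch {
    /** True if parsing the cell as a long and formatting it back yields the same text. */
    static boolean roundTrips(String cell) {
        try {
            return Long.toString(Long.parseLong(cell)).equals(cell);
        } catch (NumberFormatException e) {
            return false; // not a long at all
        }
    }

    public static void main(String[] args) {
        for (String cell : new String[] {"16", "00016", "+16", "hello"}) {
            System.out.println(cell + " round-trips: " + roundTrips(cell));
        }
    }
}
```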

So in answer to your question, we don't currently provide hooks to customize these numeric limits. As you can see the type inferencer is pretty opinionated about the way it wants to do things, and it would be pretty hard to add customization points. But if there's a feature you would like that would make sense in the above framework, let us know and maybe we can provide something.
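In the meantime, the metrics from the original request can be computed without a second parse by making one in-memory pass over the returned column data. The sketch below assumes the column data is available as plain arrays (a `String[]` or a `long[]`); the class and method names are hypothetical. Null handling for numeric columns (for example, sentinel values) is left out and would need to match however the sink represents missing cells.

```java
/** Hypothetical post-parse metrics over in-memory column arrays. */
class ColumnMetricsSketch {
    /**
     * Longest string in the column, measured in UTF-16 code units.
     * Null cells are skipped. Storage sized in bytes would need an
     * encoding-aware length instead.
     */
    static int maxLength(String[] column) {
        int max = 0;
        for (String s : column) {
            if (s != null) {
                max = Math.max(max, s.length());
            }
        }
        return max;
    }

    /** Returns {min, max} of a non-empty numeric column. */
    static long[] minMax(long[] column) {
        if (column.length == 0) {
            throw new IllegalArgumentException("empty column");
        }
        long min = column[0];
        long max = column[0];
        for (long v : column) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new long[] {min, max};
    }

    public static void main(String[] args) {
        System.out.println(maxLength(new String[] {"a", "abcd", null, "xy"}));
        long[] range = minMax(new long[] {3, -7, 42});
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```

This costs one extra scan of data that is already in memory, so it does not avoid holding the column, but it does avoid reading and parsing the CSV a second time.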

All that said, if you're doing ETL work, you may want to try out our free and open Deephaven Community Core. It would allow you to easily load and visualize your CSV datasets, and allows a plethora of more sophisticated use cases as well. You may want to check it out at https://github.com/deephaven/deephaven-core with documentation at http://deephaven.io/core/docs .

MathFrenchToast added the enhancement (New feature or request) label Jul 21, 2023

rcaudy assigned kosak Jul 21, 2023
