
Add explode and / or dynamic model / schema #481

Open · shcheklein opened this issue Sep 27, 2024 · 2 comments
Follow-up to https://github.com/iterative/dvcx/pull/1368.
Based also on this discussion / feedback by @tibor-mach: https://iterativeai.slack.com/archives/C04A9RWEZBN/p1727194987119179
Also based on the iteration on DCLM: https://github.com/iterative/studio/issues/10596

Summary

When we have a single file (JSONL, or CSV/Parquet with a column of JSONs), we need a way to "explode" those JSONs/dicts into a Pythonic model and store them in DataChain not as a single column, but as multiple columns, one per path in the JSON/dict.

E.g. this is how a JSONL file looks after a naive parse:

[Screenshot: result of a naive JSONL parse]

Or from the CSV file (note the meta column):

[Screenshot: naive parse of the CSV file, note the meta column]
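To illustrate the desired result (data and column naming below are made up, just for illustration):

# Hypothetical illustration of what "explode" should do: a single JSON/dict
# column turns into one column per path. Data and column names are made up.
row_before = {"file": "data.jsonl", "meta": '{"url": "http://example.com", "license": "CC-BY"}'}
row_after = {"file": "data.jsonl", "meta.url": "http://example.com", "meta.license": "CC-BY"}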

There is an obvious way to mitigate this: create a Model class and populate it in the UDF. But that seems very annoying and redundant; the model description becomes 2x-3x the size of the parser code.
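A minimal sketch of that workaround (field names and import paths here are assumptions; a real JSON would have many more paths to repeat):

import json
from typing import Iterator

from pydantic import BaseModel

from datachain.lib.file import File


class Meta(BaseModel):
    # every path in the JSON has to be repeated here by hand
    url: str = ""
    language: str = ""
    token_count: int = 0


def parse(file: File) -> Iterator[tuple[File, Meta]]:
    # read a JSONL file and populate the hand-written model line by line
    with file.open() as f:
        for line in f:
            d = json.loads(line)
            yield file, Meta(
                url=d.get("url", ""),
                language=d.get("language", ""),
                token_count=d.get("token_count", 0),
            )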

Suggestions

  • DataChain.explode(C("meta")). This one is more or less obvious, and it requires creating an extra table (a hypothetical usage sketch follows the gen example below).
  • Make functions like map and gen figure out the schema dynamically and create a Pydantic model as they parse files. This requires a more complicated implementation, but it can be faster since it can work in a streaming mode:

Imagine something like this:

import io
import json
from typing import Iterator

import zstandard as zstd
from datachain import DataChain
from datachain.lib.file import File


def extract(file: File) -> Iterator[tuple[File, dict]]:
    # stream-decompress a zstd-compressed JSONL file, yield one dict per line
    with file.open() as f:
        dctx = zstd.ZstdDecompressor()
        stream_reader = dctx.stream_reader(f)
        text_stream = io.TextIOWrapper(stream_reader, encoding="utf-8")
        for line in text_stream:
            yield file, json.loads(line)

DataChain.from_dataset("index").settings(cache=True).limit(1).gen(extract).save("raw_text")
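And for the first option, a hypothetical usage sketch (explode() does not exist yet; the method name, its behavior, and the source path are assumptions):

# Hypothetical sketch of the first suggestion; explode() is not an existing
# DataChain API, so everything about it here is an assumption.
from datachain import C, DataChain

(
    DataChain.from_parquet("s3://bucket/data.parquet")  # hypothetical source
    .explode(C("meta"))  # would materialize one column per path inside "meta"
    .save("exploded")
)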
shcheklein self-assigned this Sep 27, 2024
skshetry (Member) commented Sep 27, 2024

This may already be possible with a combination of read_meta and map, depending on how we want to solve this. (read_meta requires a storage_path with example data to create a schema, or a Pydantic model passed in.)

How are we going to determine the schema? Is it based on a sample of rows (which is what read_meta does; it reads a single row), or by reading all the rows?

shcheklein (Member, Author) commented

> How are we going to determine the schema? Is it based on a sample of rows (which is what read_meta does; it reads a single row), or by reading all the rows?

Yes, based on a sample (like we already do in from_parquet and friends).

> This may already be possible with a combination of read_meta and map, depending on how we want to solve this. (read_meta requires a storage_path with example data to create a schema, or a Pydantic model passed in.)

Yes, the idea is the same, but we need to wrap it into a user-friendly function and maybe generalize it a bit?
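As a rough illustration of the sample-based approach (not an existing DataChain API; the helper and the type mapping below are assumptions), a Pydantic model could be built dynamically from the first parsed row:

import json

from pydantic import create_model


def model_from_sample(name: str, sample: dict):
    # map each top-level key to (type, default) based on the sample value;
    # nested dicts would simply stay typed as dict in this sketch
    fields = {key: (type(value), value) for key, value in sample.items()}
    return create_model(name, **fields)


sample = json.loads('{"url": "http://example.com", "language": "en", "token_count": 42}')
Meta = model_from_sample("Meta", sample)
print(Meta(**sample))  # Meta(url='http://example.com', language='en', token_count=42)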
