[Hotfix] Change to streaming reader for CSV schema inference. (#1471)
This PR leverages pyarrow's streaming CSV reader for schema inference;
instead of reading and parsing the entire CSV file to fetch the schema,
this will fetch the schema from the first "block" read by the streaming
reader. The block size is configurable at the pyarrow level as part of
the CSV `ReadOptions` (although we don't currently expose this to the
user), with a [default of 1
MB](https://github.com/apache/arrow/blob/5ad1cae024a0f3bc67ac49fa6d4d72d36afb2384/cpp/src/arrow/csv/options.h#L144-L149).
clarkzinzow authored Oct 6, 2023
1 parent bb74530 commit 54c666f
Showing 1 changed file with 2 additions and 2 deletions.
daft/table/schema_inference.py (2 additions & 2 deletions):

```diff
@@ -40,7 +40,7 @@ def from_csv(
     else:
         fs = None
     with _open_stream(file, fs) as f:
-        table = pacsv.read_csv(
+        reader = pacsv.open_csv(
             f,
             parse_options=pacsv.ParseOptions(
                 delimiter=csv_options.delimiter,
@@ -50,7 +50,7 @@ def from_csv(
             ),
         )

-    return Table.from_arrow(table).schema()
+    return Schema.from_pyarrow_schema(reader.schema)


 def from_json(
```
