[Hotfix] Change to streaming reader for CSV schema inference. (#1471)
This PR leverages pyarrow's streaming CSV reader for schema inference;
instead of reading and parsing the entire CSV file to fetch the schema,
this will fetch the schema from the first "block" read by the streaming
reader. The block size is configurable at the pyarrow level as part of
the CSV `ReadOptions` (although we don't currently expose this to the
user), with a [default of 1
MB](https://github.com/apache/arrow/blob/5ad1cae024a0f3bc67ac49fa6d4d72d36afb2384/cpp/src/arrow/csv/options.h#L144-L149).
clarkzinzow authored Oct 6, 2023
1 parent bb74530 commit 54c666f
Showing 1 changed file with 2 additions and 2 deletions.
daft/table/schema_inference.py (2 additions & 2 deletions):

```diff
@@ -40,7 +40,7 @@ def from_csv(
     else:
         fs = None
     with _open_stream(file, fs) as f:
-        table = pacsv.read_csv(
+        reader = pacsv.open_csv(
             f,
             parse_options=pacsv.ParseOptions(
                 delimiter=csv_options.delimiter,
@@ -50,7 +50,7 @@ def from_csv(
             ),
         )

-    return Table.from_arrow(table).schema()
+    return Schema.from_pyarrow_schema(reader.schema)


 def from_json(
```
