Hi @jaychia , my expectations could be wrong, but as an end user I would expect that if a schema is provided to read_csv, no inference happens unless an undefined column is found.
From the docs, the related arguments are:
schema (dict[str, DataType]) – A schema that is used as the definitive schema for the CSV if infer_schema is False, otherwise it is used as a schema hint that is applied after the schema is inferred.
infer_schema (bool) – Whether to infer the schema of the CSV, defaults to True.
schema_hints (Optional[Dict[str, DataType]]) – no specific documentation provided (though in the source code there is a deprecation warning).
has_headers (bool) – Whether the CSV has a header or not, defaults to True.
In my case, I supplied has_headers=False, infer_schema=True, and schema=dict(mycolumn1=DataType.int32(), mycolumn2=DataType.string()), and none of the column names were picked up from the schema as hints, though the types were inferred without issue with or without the schema.
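For reference, this is roughly the call I made (the file path is just a placeholder):

```python
import daft
from daft import DataType

# Headerless CSV; path is a placeholder for illustration.
df = daft.read_csv(
    "my_data.csv",
    has_headers=False,
    infer_schema=True,  # the default
    schema={
        "mycolumn1": DataType.int32(),
        "mycolumn2": DataType.string(),
    },
)

# Observed: the schema's column names are not applied; the DataFrame keeps
# whatever names the reader assigns, even though the inferred types look fine.
print(df.schema())
```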
In terms of actual functionality, my CSV doesn't have headers and I'm happy with the inferred types, but I need a way to supply column names. Something like has_headers=False, infer_schema=True, schema=['mycolumn1', 'mycolumn2'], with the resulting DataFrame getting the expected column names, would cover my use case.
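As a stopgap, a rename-after-read sketch along these lines is what I have in mind (assuming DataFrame.column_names, daft.col, alias, and select behave as documented; the path is again a placeholder and this isn't verified against every Daft version):

```python
import daft

wanted_names = ["mycolumn1", "mycolumn2"]

# Read with inference, then rename the inferred columns positionally
# to the names I actually want.
df = daft.read_csv("my_data.csv", has_headers=False)  # placeholder path
df = df.select(
    *[
        daft.col(current).alias(new)
        for current, new in zip(df.column_names, wanted_names)
    ]
)
```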
Describe the bug
When I provide an explicit schema to daft.read_csv, the returned DataFrame ignores the provided schema and instead infers the schema.
To Reproduce
Provide a schema dictionary, but leave infer_schema at its default (True).
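A minimal sketch of that repro step, with a placeholder path and made-up column names:

```python
import daft
from daft import DataType

# Schema provided, infer_schema left at its default of True.
df = daft.read_csv(
    "example.csv",  # placeholder path
    schema={"a": DataType.int64(), "b": DataType.string()},
)

# Observed: the returned schema is the inferred one, not the one supplied above.
print(df.schema())
```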
Expected behavior
If a schema is provided, it should supersede the infer_schema parameter. At a minimum, if a schema is provided and infer_schema = True, throw an error.
Component(s)
CSV
Additional context
No response