daft.read_csv with provided schema does not return dataframe with schema when has_headers=False and infer_schema=True #3517

jpedrick · 2024-12-06T21:42:22Z

Describe the bug

When I provide an explicit schema to daft.read_csv, the returned DataFrame ignores the provided schema and uses the inferred schema instead.

To Reproduce

Provide a schema dictionary, but leave infer_schema at its default (True).

Expected behavior

If a schema is provided, it should supersede the infer_schema parameter. At minimum, read_csv should raise an error when a schema is provided while infer_schema=True.

Component(s)

CSV

Additional context

No response


kevinzwang · 2024-12-06T22:02:46Z

@desmondcheongzx could you take a look at this?

jaychia · 2024-12-06T23:31:41Z

I believe the currently intended behavior is:

infer_schema controls whether or not Daft will attempt to infer the schema from CSV
schema controls the user's hints

If only infer_schema=True is set, the schema is simply inferred
If infer_schema=True and a schema is also provided, the user's schema hints override the inferred schema for columns where there are conflicts
If infer_schema=False, the user should provide an explicit schema, which is used as the entire schema
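
The three cases above can be modeled in plain Python. This is only a sketch of the described behavior, not Daft's implementation: `resolve_schema` is a hypothetical helper, types are represented as strings, and the treatment of hint names that match no inferred column (they are ignored here) is an assumption.

```python
from typing import Dict, Optional

# Column name -> type name. Plain dicts preserve insertion (column) order.
Schema = Dict[str, str]


def resolve_schema(
    inferred: Schema,
    infer_schema: bool = True,
    schema: Optional[Schema] = None,
) -> Schema:
    """Hypothetical model of how infer_schema and schema interact."""
    if not infer_schema:
        # Case 3: no inference, so the user's schema is the entire schema.
        if schema is None:
            raise ValueError("infer_schema=False requires an explicit schema")
        return dict(schema)

    # Case 1: start from the inferred schema.
    resolved = dict(inferred)

    # Case 2: hints override inferred types where the column names match.
    # Assumption: hints naming columns that were never inferred are ignored.
    for name, dtype in (schema or {}).items():
        if name in resolved:
            resolved[name] = dtype
    return resolved
```

Under this model, a hint only takes effect when its key matches an inferred column name, so hints keyed by names that do not appear in the inferred schema would leave the result unchanged.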

jpedrick · 2024-12-07T00:15:50Z

Hi @jaychia, my expectations could be wrong, but as an end user I would expect that if a schema is provided, no inference happens unless an undefined column is found.

From the read_csv docs, the related arguments are:

schema: dict[str, DataType] – A schema that is used as the definitive schema for the CSV if infer_schema is False, otherwise it is used as a schema hint that is applied after the schema is inferred.
infer_schema: (bool) – Whether to infer the schema of the CSV, defaults to True.
schema_hints: Optional[Dict[str, DataType]] – no specific documentation provided (though the source code contains a deprecation warning)
has_headers: (bool) – Whether the CSV has a header or not, defaults to True

In my case, I supplied has_headers = False, infer_schema = True, schema = dict( mycolumn1=DataType.int32(), mycolumn2=DataType.string() ) and none of the column names were picked up from schema as hints, though the types were inferred without an issue with or without the schema.

In terms of actual functionality, my CSV doesn't have headers and I need to provide column names. I'm happy with the inferred types, but I need a way to supply column names. Something like: has_headers = False, infer_schema = True, schema=[ 'mycolumn1', 'mycolumn2'] and the resulting DataFrame has the expected column names.
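
A rough sketch of the positional naming requested above, in plain Python. `positional_renames` is a hypothetical helper, not part of Daft, and the placeholder names `column_1` and `column_2` used in the example are an assumption about what a headerless read might produce.

```python
from typing import Dict, Sequence


def positional_renames(
    current_names: Sequence[str],
    new_names: Sequence[str],
) -> Dict[str, str]:
    """Map each existing column name to a user-supplied name by position."""
    if len(current_names) != len(new_names):
        raise ValueError(
            f"expected {len(current_names)} names, got {len(new_names)}"
        )
    if len(set(new_names)) != len(new_names):
        raise ValueError("new column names must be unique")
    # zip pairs the names in order: first column with first name, and so on.
    return dict(zip(current_names, new_names))


# Example with assumed placeholder names for a headerless CSV.
renames = positional_renames(["column_1", "column_2"], ["mycolumn1", "mycolumn2"])
```

Such a mapping could presumably be applied after the read by selecting each column under its new alias, which would keep the inferred types while supplying the names.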

jpedrick added the bug (Something isn't working) and needs triage labels · Dec 6, 2024

kevinzwang added the p1 (Important to tackle soon, but preemptable by p0) label and removed the needs triage label · Dec 6, 2024

kevinzwang assigned desmondcheongzx · Dec 6, 2024
