Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daft.read_csv with provided schema does not return dataframe with schema when has_headers=False and infer_schema=True #3517

Open
jpedrick opened this issue Dec 6, 2024 · 3 comments
Assignees
Labels
bug Something isn't working p1 Important to tackle soon, but preemptable by p0

Comments

@jpedrick
Copy link

jpedrick commented Dec 6, 2024

Describe the bug

When I provide an explicit schema to daft.read_csv the returned Dataframe ignores the provided schema and instead infers the schema.

To Reproduce

Provide a schema dictionary, but leave infer_schema as default(True).

Expected behavior

If a schema is provided it should supersede the infer_schema parameter. At minimum, if a schema is provided and infer_schema = True throw an error.

Component(s)

CSV

Additional context

No response

@jpedrick jpedrick added bug Something isn't working needs triage labels Dec 6, 2024
@kevinzwang kevinzwang added p1 Important to tackle soon, but preemptable by p0 and removed needs triage labels Dec 6, 2024
@kevinzwang
Copy link
Member

@desmondcheongzx could you take a look at this?

@jaychia
Copy link
Contributor

jaychia commented Dec 6, 2024

I believe the currently intended behavior is:

  1. infer_schema controls whether or not Daft will attempt to infer the schema from CSV
  2. schema controls a users' hints
  • If only infer_schema=True, then schema will just be inferred
  • If infer_schema=True and schema is also provided, the users' schema hints will override the inferred schema for columns where there are conflicts
  • If infer_schema=False, then the user should provide an explicit schema which will be used as the entire schema

@jpedrick
Copy link
Author

jpedrick commented Dec 7, 2024

Hi @jaychia , my expectations could be wrong, but as an end user I would expect if schema is provided I would expect no-inference unless there's an undefined column found. read_csv

From the docs, the related arguments are:

schema: dict[str, [DataType]– A schema that is used as the definitive schema for the CSV if infer_schema is False, otherwise it is used as a schema hint that is applied after the schema is inferred.
infer_schema: (bool) – Whether to infer the schema of the CSV, defaults to True.
schema_hints: Optional[Dict[str, [DataType] -- no specific documentation provided (though in the source code there is a deprecation warning)
has_headers: (bool) – Whether the CSV has a header or not, defaults to True

In my case, I supplied has_headers = False, infer_schema = True, schema = dict( mycolumn1=DataType.int32(), mycolumn2=DataType.string() ) and none of the column names were picked up from schema as hints, though the types were inferred without an issue with or without the schema.

In terms of actual functionality, my CSV doesn't have headers and I need to provide column names. I'm happy with the inferred types, but I need a way to supply column names. Something like: has_headers = False, infer_schema = True, schema=[ 'mycolumn1', 'mycolumn2'] and the resulting DataFrame has the expected column names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working p1 Important to tackle soon, but preemptable by p0
Projects
None yet
Development

No branches or pull requests

4 participants