Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(duckdb): duckdb read_csv return wrong schema #9981

Closed
1 task done
jitingxu1 opened this issue Aug 31, 2024 · 3 comments
Closed
1 task done

bug(duckdb): duckdb read_csv return wrong schema #9981

jitingxu1 opened this issue Aug 31, 2024 · 3 comments
Labels
bug Incorrect behavior inside of ibis duckdb The DuckDB backend

Comments

@jitingxu1
Copy link
Contributor

What happened?

In this example, it reads col2 as boolean

import tempfile
data = b'''
        col1,col2,col3
        f,f,f
        t,t,t
        f,f,c
'''
with tempfile.NamedTemporaryFile(delete=False) as f:
        nbytes = f.write(data)


import ibis
from ibis import _
ibis.options.interactive = True

t = con.read_csv(f.name)
t.schema()

output:

ibis.Schema {
  col1  string
  col2  boolean
  col3  string
}

It looks like treat string "f" "t" as false or true for the second column, but not the first column.

I have to do something like this

t = con.read_csv(f.name,types={'col2': dt.str})
t.schema()

output:

ibis.Schema {
  col1  string
  col2  string
  col3  string
}

Could we treat all of them as string without specify the type.

I do see dataset from one kaggle competition has string "f" or "t" in the first N rows, but has other string somewhere after N rows. I assume it infer the schema by first N rows.

The column does-bruise-or-bleed is inferred as boolean because it has all "f" and "t" in the first N rows.

ConversionException: Conversion Error: CSV Error on Line: 27848
Original Line: 27846,p,4.24,x,t,n,d,d,,o,5.8,5.61,,t,w,,,f,f,,d,u
Error when converting column "does-bruise-or-bleed". Could not convert string "d" to 'BOOLEAN'

What version of ibis are you using?

9.3.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jitingxu1 jitingxu1 added the bug Incorrect behavior inside of ibis label Aug 31, 2024
@gforsyth gforsyth added the duckdb The DuckDB backend label Sep 1, 2024
@gforsyth
Copy link
Member

gforsyth commented Sep 1, 2024

This isn't an ibis bug, nor it is a bug with the DuckDB CSV reader. CSVs aren't typed, so it's always best effort to infer the correct dtypes for a given column.

You can either do what you've done above, and specify the dtypes for the columns that aren't sniffed the way you want, or you can force reading all columns as strings by passing all_varchar=True.

All of the DuckDB CSV reading options are documented here: https://duckdb.org/docs/data/csv/overview.html#parameters

All of those are supported, but you should convert the argument types to Python syntax, e.g. True instead of 'true'

@gforsyth gforsyth closed this as not planned Won't fix, can't repro, duplicate, stale Sep 1, 2024
@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Sep 1, 2024
@jitingxu1
Copy link
Contributor Author

Thanks for the details. It is not a bug. From my side, one concern is the user experience when users switch backends. The same code cannot be used in another backend.

@cpcloud
Copy link
Member

cpcloud commented Sep 3, 2024

For text file formats without any standard for how to interpret their values (like CSV, and unlike JSON), this will always be a problem as long as people are using those formats. Humanity is unlikely to ever abandon CSV entirely, so it'll probably always be a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis duckdb The DuckDB backend
Projects
Archived in project
Development

No branches or pull requests

3 participants