io/read_metadata: ignore pandas DtypeWarning #1238
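The diff itself is not shown on this page. As a rough, hypothetical sketch of what "ignoring the pandas DtypeWarning" during a metadata read could look like (the function name and separator below are made up for illustration and are not the PR's actual change):

```python
# Hypothetical sketch only; not the code from this PR.
import warnings

import pandas as pd


def read_metadata(path):
    """Read a TSV metadata file, silencing only pandas' DtypeWarning."""
    with warnings.catch_warnings():
        # Ignore DtypeWarning (mixed inferred types within a column) while
        # leaving all other warnings untouched.
        warnings.simplefilter("ignore", pd.errors.DtypeWarning)
        return pd.read_csv(path, sep="\t")
```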
As you've noted, I haven't tested, but an example would be if one uses a large metadata file with … Potential ways to address this: …

Maybe? As it stands, the other subcommands already read the entire metadata file into memory. The memory errors would come from pandas having to do type inference across the whole column rather than in chunks.

I see – if there is any file that would give memory issues with …

@corneliusroemer I don't think …
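To make the point about whole-column type inference versus chunked reads concrete, here is a minimal sketch (not code from augur; the file name and chunk size are arbitrary):

```python
# Minimal illustration; "metadata.tsv" and the chunk size are arbitrary.
import pandas as pd

# Reading everything at once: pandas holds all rows in memory and infers each
# column's dtype over the full column, which is where both the memory cost and
# the DtypeWarning can come from on large, mixed-type files.
metadata = pd.read_csv("metadata.tsv", sep="\t")

# Reading in chunks: each chunk is parsed and type-inferred independently, so
# peak memory is bounded by the chunk size rather than the file size.
total_rows = 0
for chunk in pd.read_csv("metadata.tsv", sep="\t", chunksize=10_000):
    total_rows += len(chunk)
```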
victorlin marked this conversation as resolved.
I haven't tested, but does this actually suppress the `DtypeWarning`s when `chunksize` is specified? I'd think that inferred types would still be mixed as long as data is read in chunks where data is split in a way that yields differently inferred types. Or does pandas do something before reading the chunks where it loads all values in a column at once to infer data type?
From my understanding, when `chunksize` is specified, each chunk is read separately, so pandas does not compare dtypes across the chunks. If you specified `chunksize=1`, you would never see any dtype warnings.

This is different from pandas' internal chunking with `low_memory=True`, where it does compare dtypes across chunks because they get loaded into the same dataframe.
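A small self-contained sketch of the behavior described above (the synthetic CSV and chunk size are made up for the example; whether the default read warns depends on pandas' internal buffer sizes, so the first result is "typically" rather than guaranteed):

```python
import io
import warnings

import pandas as pd

# A column that is all integers for the first half and all strings for the
# second half -- the classic trigger for mixed-dtype inference.
csv_data = "id,value\n" + "".join(
    f"{i},{i}\n" for i in range(100_000)
) + "".join(f"{i},x{i}\n" for i in range(100_000))


def saw_dtype_warning(read):
    """Run a read function and report whether it emitted a DtypeWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        read()
    return any(issubclass(w.category, pd.errors.DtypeWarning) for w in caught)


# Default read (low_memory=True): pandas parses in internal chunks but builds
# one DataFrame, so differing dtypes across those internal chunks typically
# raise a DtypeWarning.
print(saw_dtype_warning(lambda: pd.read_csv(io.StringIO(csv_data))))

# Explicit chunksize: each chunk is returned as its own DataFrame and dtypes
# are not compared across chunks, so no warning is expected as long as each
# individual chunk is internally consistent.
def read_in_chunks():
    for _ in pd.read_csv(io.StringIO(csv_data), chunksize=50_000):
        pass

print(saw_dtype_warning(read_in_chunks))
```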