io/read_metadata: ignore pandas DtypeWarning #1238

Merged · 1 commit · Jul 7, 2023
1 change: 1 addition & 0 deletions augur/io/metadata.py
Member

I haven't tested, but does this actually suppress the DtypeWarnings when chunksize is specified? I'd think the inferred types would still be mixed whenever the data is split into chunks in a way that yields different inferred types per chunk. Or does pandas do something before reading the chunks, loading all values in a column at once to infer the data type?

Contributor (author)

From my understanding, when chunksize is specified, each chunk is read separately, so pandas does not compare dtypes across chunks. If you specified chunksize=1, you would never see any dtype warnings.

This is different from pandas' internal chunking with low_memory=True, where it does compare dtypes across chunks, because those chunks all get loaded into the same dataframe.
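A minimal, purely illustrative sketch of that distinction (the inline CSV and chunk sizes below are invented for demonstration and are not from this PR):

```python
import io
import pandas as pd

# Tiny column with mixed values: five integers followed by five strings.
data = "x\n" + "1\n" * 5 + "not-a-number\n" * 5

# Explicit chunking: each chunk is a separate DataFrame, so dtypes are
# inferred per chunk and never compared across chunks; no DtypeWarning
# is possible.
for chunk in pd.read_csv(io.StringIO(data), chunksize=5):
    print(chunk["x"].dtype)  # int64 for the first chunk, object for the second

# Single-DataFrame read: with the default low_memory=True, pandas reads large
# files in internal chunks and emits a DtypeWarning when per-chunk inferences
# disagree; with low_memory=False it infers the dtype over the whole column
# in one pass, so the warning cannot occur.
df = pd.read_csv(io.StringIO(data), low_memory=False)
print(df["x"].dtype)  # object
```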

Member

As you've noted, read_metadata surfaces the chunksize parameter (via chunk_size). However, it's only exposed at the CLI level by augur filter --metadata-chunk-size. Does this mean other subcommands are prone to out-of-memory errors?

I haven't tested, but an example would be using a large metadata file with augur frequencies. With the default low_memory=True, pandas would still make this work by splitting the read into chunks under the hood. With low_memory=False, that would no longer be the case.

Potential ways to address this:

  1. Surface --metadata-chunk-size in all subcommands that call read_metadata.
  2. Force read_metadata to always return a chunk iterator (rough sketch below).
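For what it's worth, a rough sketch of what option 2 could look like (the function name, default chunk size, and tab separator are assumptions for illustration, not existing augur code):

```python
import pandas as pd

DEFAULT_CHUNK_SIZE = 100_000  # hypothetical default, not taken from augur

def read_metadata_chunks(metadata_file, chunk_size=DEFAULT_CHUNK_SIZE, **kwargs):
    """Always return an iterator of DataFrames so callers never hold the
    whole metadata file in memory at once."""
    return pd.read_csv(metadata_file, sep="\t", chunksize=chunk_size, **kwargs)

# Callers wanting the old single-DataFrame behavior could concatenate:
# metadata = pd.concat(read_metadata_chunks("metadata.tsv"))
```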

Contributor (author)

> Does this mean other subcommands are prone to out-of-memory errors?

Maybe? As it stands, the other subcommands already read the entire metadata file into memory. The extra memory usage from low_memory=False would come from pandas doing type inference across the whole column at once rather than in chunks.

Member

I see – if a file would cause memory issues with low_memory=False, it would already cause memory issues with the current approach of reading it in its entirety.

@corneliusroemer I don't think low_memory=False will cause OOM crashes. Either (1) the data is already read in chunks to prevent OOM crashes (as done in augur filter), or (2) OOM crashes would already be happening due to the absence of chunking.

victorlin marked this conversation as resolved.
```diff
@@ -80,6 +80,7 @@ def read_metadata(metadata_file, delimiters=DEFAULT_DELIMITERS, id_columns=DEFAU
         "engine": "c",
         "skipinitialspace": True,
         "na_filter": False,
+        "low_memory": False,
     }

     if chunk_size:
```
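For context, a hedged usage sketch of the change (read_metadata's exact signature beyond what the diff shows, the file name, and the chunk size are assumptions):

```python
from augur.io.metadata import read_metadata

# Single-call read: with "low_memory": False in the shared kwargs above,
# pandas infers each column's dtype over the full file in one pass, so no
# DtypeWarning is emitted for mixed-type columns.
metadata = read_metadata("metadata.tsv")

# Chunked read via the chunk_size parameter referenced in the diff context:
# returns an iterator of DataFrames, with dtypes inferred per chunk.
for chunk in read_metadata("metadata.tsv", chunk_size=10_000):
    ...  # hypothetical per-chunk processing
```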