Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io/read_metadata: ignore pandas DtypeWarning #1238

Merged
merged 1 commit into from
Jul 7, 2023

Conversation

joverlee521
Copy link
Contributor

This suppresses the DtypeWarnings messages from pandas when it infers different dtypes for a column in the metadata. We do not need pandas to internally parse files in chunks since we already surface the chunksize parameter to control memory usage. This change was motivated by internal discussion on Slack about how these warning messages overwhelm the logs of the ncov builds and make debugging a pain.¹

I have seen surprising memory usage in the past with low_memory=False within ncov-ingest². However that was due to the unexpected interaction with the usecols parameter, where the entire file was read before being subset to the columns provided.

In the future, we may want to explicitly set the dtype to string for all columns in the metadata as suggested by @tsibley in a separate PR.³ However, that will require wider changes throughout Augur where uses of the metadata may be expecting the inferred dtypes (such as in augur export⁴).

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² nextstrain/ncov-ingest@7bde90a
³ #1235 (comment)

augur/augur/export_v2.py

Lines 239 to 245 in b61e3e7

# no type supplied => try to guess
if all([all([str(x).lower() in ["false", "true", "1.0", "0.0", "1", "0", "yes", "no"] for x in trait_values])]):
t = "boolean"
elif all([ isinstance(n, float) if isinstance(n, float) else isinstance(n, int) for n in trait_values ]):
t = "continuous"
else:
t = "categorical"

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

@joverlee521 joverlee521 requested a review from a team June 13, 2023 23:51
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested, but does this actually suppress the DtypeWarnings when chunksize is specified? I'd think that inferred types would still be mixed as long as data is read in chunks where data is split in a way that yields differently inferred types. Or does pandas do something before reading the chunks where it loads all values in a column at once to infer data type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding, when chunksize is specified, each chunk is read separately so pandas does not compare dtypes across the chunks. If you specified chunksize=1, you would never see any dtype warnings.

This is different than pandas internal chunking with low_memory=True, where it does compare dtypes across chunks because they get loaded into the same dataframe.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you've noted, read_metadata surfaces the chunksize parameter (via chunk_size). However, it's only surfaced on the CLI level by augur filter --metadata-chunk-size. Does this mean other subcommands are prone to out of memory errors?

I haven't tested, but an example would be if one uses a large metadata file with augur frequencies. With the default low_memory=True, Pandas would still make this work by splitting into chunks under the hood. With low_memory=False, that wouldn't be the case.

Potential ways to address this:

  1. Surface --metadata-chunk-size in all subcommands that call read_metadata.
  2. Force read_metadata to always return a chunk iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean other subcommands are prone to out of memory errors?

Maybe? As it stands, the other subcommands already read the entire metadata file into memory. The memory errors would come from pandas having to do type inference across the whole column rather than in chunks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see – if there is any file that would give memory issues with low_memory=False then it would already have memory issues with the current way it's being read in entirety.

@corneliusroemer I don't think low_memory=False will cause OOM crashes. Either (1) the data is already read in chunks to prevent OOM crashes (done in augur filter), or (2) OOM crashes would already be happening due to absence of chunking.

@corneliusroemer
Copy link
Member

Bit worried this could cause OOM crashes. Can we do this in a backwards compatible manner, or allow opt-out?

@joverlee521 joverlee521 changed the title io/read_metadata: set low_memory=False io/read_metadata: ignore pandas DtypeWarning Jun 16, 2023
@joverlee521
Copy link
Contributor Author

Since there are concerns of memory errors with setting low_memory=False, I went with an alternative method to ignore warnings outside of pandas.

@joverlee521 joverlee521 requested a review from a team June 16, 2023 23:44
@codecov
Copy link

codecov bot commented Jun 16, 2023

Codecov Report

Patch coverage: 83.33% and project coverage change: +0.06 🎉

Comparison is base (7139595) 68.88% compared to head (6a5ac00) 68.95%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1238      +/-   ##
==========================================
+ Coverage   68.88%   68.95%   +0.06%     
==========================================
  Files          64       64              
  Lines        6939     6964      +25     
  Branches     1693     1696       +3     
==========================================
+ Hits         4780     4802      +22     
- Misses       1854     1856       +2     
- Partials      305      306       +1     
Impacted Files Coverage Δ
augur/io/metadata.py 95.00% <83.33%> (-1.30%) ⬇️

... and 6 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After clarification in #1238 (comment), I'd prefer the original 73508e8 over 6a5ac00.

This suppresses the `DtypeWarnings` messages from pandas when it infers
different dtypes for a column in the metadata. We do not need pandas to
internally parse files in chunks since we already surface the `chunksize`
parameter to control memory usage. This change was motivated by internal
discussion on Slack about how these warning messages overwhelm the logs
of the ncov builds and make debugging a pain.¹

I have seen surprising memory usage in the past with `low_memory=False`
within ncov-ingest². However that was due to the unexpected interaction
with the `usecols` parameter, where the entire file was read before
being subset to the columns provided.

In the future, we may want to explicitly set the dtype to `string` for
all columns in the metadata as suggested by @tsibley in a separate PR.³
However, that will require wider changes throughout Augur where uses of
the metadata may be expecting the inferred dtypes (such as in
augur export⁴).

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² nextstrain/ncov-ingest@7bde90a
³ #1235 (comment)https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
augur/io/metadata.py Show resolved Hide resolved
@joverlee521 joverlee521 merged commit 616db55 into master Jul 7, 2023
26 checks passed
@joverlee521 joverlee521 deleted the suppress-dtype-warning branch July 7, 2023 19:21
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants