io/read_metadata: ignore pandas DtypeWarning #1238

joverlee521 · 2023-06-13T23:50:48Z

This suppresses the DtypeWarnings messages from pandas when it infers different dtypes for a column in the metadata. We do not need pandas to internally parse files in chunks since we already surface the chunksize parameter to control memory usage. This change was motivated by internal discussion on Slack about how these warning messages overwhelm the logs of the ncov builds and make debugging a pain.¹

I have seen surprising memory usage in the past with low_memory=False within ncov-ingest². However that was due to the unexpected interaction with the usecols parameter, where the entire file was read before being subset to the columns provided.

In the future, we may want to explicitly set the dtype to string for all columns in the metadata as suggested by @tsibley in a separate PR.³ However, that will require wider changes throughout Augur where uses of the metadata may be expecting the inferred dtypes (such as in augur export⁴).

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² nextstrain/ncov-ingest@7bde90a
³ #1235 (comment)
⁴

augur/augur/export_v2.py

Lines 239 to 245 in b61e3e7

    
           # no type supplied => try to guess 
        
           if all([all([str(x).lower() in ["false", "true", "1.0", "0.0", "1", "0", "yes", "no"] for x in trait_values])]): 
        
               t = "boolean" 
        
           elif all([ isinstance(n, float) if isinstance(n, float) else isinstance(n, int) for n in trait_values ]): 
        
               t = "continuous" 
        
           else: 
        
               t = "categorical"

Testing

What steps should be taken to test the changes you've proposed?
If you added or changed behavior in the codebase, did you update the tests, or do you need help with this?

Checklist

Add a message in CHANGES.md summarizing the changes in this PR that are end user focused. Keep headers and formatting consistent with the rest of the file.

victorlin

Added comments

victorlin · 2023-06-14T23:31:44Z

augur/io/metadata.py

I haven't tested, but does this actually suppress the DtypeWarnings when chunksize is specified? I'd think that inferred types would still be mixed as long as data is read in chunks where data is split in a way that yields differently inferred types. Or does pandas do something before reading the chunks where it loads all values in a column at once to infer data type?

From my understanding, when chunksize is specified, each chunk is read separately so pandas does not compare dtypes across the chunks. If you specified chunksize=1, you would never see any dtype warnings.

This is different than pandas internal chunking with low_memory=True, where it does compare dtypes across chunks because they get loaded into the same dataframe.

victorlin · 2023-06-14T23:32:47Z

augur/io/metadata.py

As you've noted, read_metadata surfaces the chunksize parameter (via chunk_size). However, it's only surfaced on the CLI level by augur filter --metadata-chunk-size. Does this mean other subcommands are prone to out of memory errors?

I haven't tested, but an example would be if one uses a large metadata file with augur frequencies. With the default low_memory=True, Pandas would still make this work by splitting into chunks under the hood. With low_memory=False, that wouldn't be the case.

Potential ways to address this:

Surface --metadata-chunk-size in all subcommands that call read_metadata.

Force read_metadata to always return a chunk iterator.

Does this mean other subcommands are prone to out of memory errors?

Maybe? As it stands, the other subcommands already read the entire metadata file into memory. The memory errors would come from pandas having to do type inference across the whole column rather than in chunks.

I see – if there is any file that would give memory issues with low_memory=False then it would already have memory issues with the current way it's being read in entirety.

@corneliusroemer I don't think low_memory=False will cause OOM crashes. Either (1) the data is already read in chunks to prevent OOM crashes (done in augur filter), or (2) OOM crashes would already be happening due to absence of chunking.

corneliusroemer · 2023-06-15T22:24:42Z

Bit worried this could cause OOM crashes. Can we do this in a backwards compatible manner, or allow opt-out?

joverlee521 · 2023-06-16T23:44:02Z

Since there are concerns of memory errors with setting low_memory=False, I went with an alternative method to ignore warnings outside of pandas.

codecov · 2023-06-16T23:56:38Z

Codecov Report

Patch coverage: 83.33% and project coverage change: +0.06 🎉

Comparison is base (7139595) 68.88% compared to head (6a5ac00) 68.95%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1238      +/-   ##
==========================================
+ Coverage   68.88%   68.95%   +0.06%     
==========================================
  Files          64       64              
  Lines        6939     6964      +25     
  Branches     1693     1696       +3     
==========================================
+ Hits         4780     4802      +22     
- Misses       1854     1856       +2     
- Partials      305      306       +1

Impacted Files	Coverage Δ
augur/io/metadata.py	`95.00% <83.33%> (-1.30%)`	⬇️

... and 6 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

victorlin

After clarification in #1238 (comment), I'd prefer the original 73508e8 over 6a5ac00.

@tsibley

This suppresses the `DtypeWarnings` messages from pandas when it infers different dtypes for a column in the metadata. We do not need pandas to internally parse files in chunks since we already surface the `chunksize` parameter to control memory usage. This change was motivated by internal discussion on Slack about how these warning messages overwhelm the logs of the ncov builds and make debugging a pain.¹ I have seen surprising memory usage in the past with `low_memory=False` within ncov-ingest². However that was due to the unexpected interaction with the `usecols` parameter, where the entire file was read before being subset to the columns provided. In the future, we may want to explicitly set the dtype to `string` for all columns in the metadata as suggested by @tsibley in a separate PR.³ However, that will require wider changes throughout Augur where uses of the metadata may be expecting the inferred dtypes (such as in augur export⁴). ¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8 ² nextstrain/ncov-ingest@7bde90a ³ #1235 (comment) ⁴ https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245

augur/io/metadata.py

joverlee521 requested a review from a team June 13, 2023 23:51

victorlin reviewed Jun 14, 2023

View reviewed changes

joverlee521 force-pushed the suppress-dtype-warning branch from 73508e8 to 6a5ac00 Compare June 16, 2023 23:39

joverlee521 changed the title ~~io/read_metadata: set low_memory=False~~ io/read_metadata: ignore pandas DtypeWarning Jun 16, 2023

joverlee521 requested a review from a team June 16, 2023 23:44

victorlin reviewed Jun 20, 2023

View reviewed changes

joverlee521 force-pushed the suppress-dtype-warning branch from 6a5ac00 to 7582db7 Compare July 7, 2023 19:05

victorlin approved these changes Jul 7, 2023

View reviewed changes

augur/io/metadata.py Show resolved Hide resolved

joverlee521 merged commit 616db55 into master Jul 7, 2023
26 checks passed

joverlee521 deleted the suppress-dtype-warning branch July 7, 2023 19:21

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

io/read_metadata: ignore pandas DtypeWarning #1238

io/read_metadata: ignore pandas DtypeWarning #1238

joverlee521 commented Jun 13, 2023

victorlin left a comment •

edited

Loading

victorlin Jun 14, 2023

joverlee521 Jun 15, 2023

victorlin Jun 14, 2023

joverlee521 Jun 15, 2023

victorlin Jun 20, 2023

corneliusroemer commented Jun 15, 2023

joverlee521 commented Jun 16, 2023

codecov bot commented Jun 16, 2023

victorlin left a comment

	# no type supplied => try to guess
	if all([all([str(x).lower() in ["false", "true", "1.0", "0.0", "1", "0", "yes", "no"] for x in trait_values])]):
	t = "boolean"
	elif all([ isinstance(n, float) if isinstance(n, float) else isinstance(n, int) for n in trait_values ]):
	t = "continuous"
	else:
	t = "categorical"

io/read_metadata: ignore pandas DtypeWarning #1238

io/read_metadata: ignore pandas DtypeWarning #1238

Conversation

joverlee521 commented Jun 13, 2023

Testing

Checklist

victorlin left a comment • edited Loading

Choose a reason for hiding this comment

victorlin Jun 14, 2023

Choose a reason for hiding this comment

joverlee521 Jun 15, 2023

Choose a reason for hiding this comment

victorlin Jun 14, 2023

Choose a reason for hiding this comment

joverlee521 Jun 15, 2023

Choose a reason for hiding this comment

victorlin Jun 20, 2023

Choose a reason for hiding this comment

corneliusroemer commented Jun 15, 2023

joverlee521 commented Jun 16, 2023

codecov bot commented Jun 16, 2023

Codecov Report

victorlin left a comment

Choose a reason for hiding this comment

victorlin left a comment •

edited

Loading