io/read_metadata: set low_memory=False
This suppresses the `DtypeWarning` messages that pandas emits when it
infers different dtypes for a column in the metadata. We do not need
pandas to internally parse files in chunks since we already surface the
`chunksize` parameter to control memory usage. This change was motivated
by internal discussion on Slack about how these warning messages
overwhelm the logs of the ncov builds and make debugging a pain.¹
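
As a rough illustration of the combination described above, here is a minimal sketch that passes `low_memory=False` while still bounding memory through `chunksize`. The in-memory TSV, its column names, and the chunk size of 2 are made up for the example; this is not the actual `read_metadata` implementation.

```python
import io

import pandas as pd

# Hypothetical metadata; real files are far larger.
tsv = (
    "strain\tdate\tclade\n"
    "A\t2020-01-01\t19A\n"
    "B\t2020-02-01\t20B\n"
    "C\t2020-03-01\t20C\n"
)

kwargs = {
    "sep": "\t",
    "engine": "c",
    "skipinitialspace": True,
    "na_filter": False,
    # Turn off the C parser's internal chunked parsing, which is the
    # source of the mixed-dtype warnings.
    "low_memory": False,
}

# chunksize controls memory explicitly: the reader yields DataFrames of
# at most 2 rows each instead of loading the whole file at once.
reader = pd.read_csv(io.StringIO(tsv), chunksize=2, **kwargs)
chunk_lengths = [len(chunk) for chunk in reader]
print(chunk_lengths)  # [2, 1]
```

Note that dtypes are still inferred separately for each chunk the reader yields, so callers that combine chunks may see differing dtypes across them.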

I have seen surprising memory usage in the past with `low_memory=False`
within ncov-ingest². However, that was due to an unexpected interaction
with the `usecols` parameter, where the entire file was read before
being subset to the columns provided.

In the future, we may want to explicitly set the dtype to `string` for
all columns in the metadata as suggested by @tsibley in a separate PR.³
However, that will require wider changes throughout Augur where uses of
the metadata may be expecting the inferred dtypes (such as in
augur export⁴).
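
To make the trade-off concrete, the sketch below contrasts inferred dtypes with `dtype="string"` on a tiny made-up table. The `num_date` column is hypothetical and only stands in for any column that downstream code might treat as numeric; it is not taken from augur export.

```python
import io

import pandas as pd

# Hypothetical two-column metadata table.
tsv = "strain\tnum_date\nA\t2020.5\nB\t2021.0\n"

# Default behavior: pandas infers a numeric dtype for num_date.
inferred = pd.read_csv(io.StringIO(tsv), sep="\t", low_memory=False)

# Explicit string dtype for every column: no inference, hence no
# mixed-dtype warnings, but numeric columns are no longer numeric.
as_string = pd.read_csv(
    io.StringIO(tsv), sep="\t", low_memory=False, dtype="string"
)

print(inferred["num_date"].dtype)
print(as_string["num_date"].dtype)

# Code relying on the inferred dtype can use numeric operations directly...
latest = inferred["num_date"].max()

# ...whereas with string columns it would need an explicit conversion first.
latest_from_string = pd.to_numeric(as_string["num_date"]).max()
```

Any caller that currently relies on inference would need a conversion step like the last line, which is why the change would have to be made across all uses of the metadata at once.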

¹ https://bedfordlab.slack.com/archives/C0K3GS3J8/p1686671582331959?thread_ts=1685568402.393599&cid=C0K3GS3J8
² nextstrain/ncov-ingest@7bde90a
³ #1235 (comment)
⁴ https://github.com/nextstrain/augur/blob/b61e3e7e969ff1b82fce5f2e2f388a10e6f3c305/augur/export_v2.py#L239-L245
joverlee521 committed Jun 13, 2023
1 parent 7139595 commit 73508e8
Showing 1 changed file with 1 addition and 0 deletions.
1 change: 1 addition & 0 deletions augur/io/metadata.py
@@ -80,6 +80,7 @@ def read_metadata(metadata_file, delimiters=DEFAULT_DELIMITERS, id_columns=DEFAU
         "engine": "c",
         "skipinitialspace": True,
         "na_filter": False,
+        "low_memory": False,
     }

     if chunk_size: