
fix: don't load all of an i2b2 file into memory #108

Merged · 1 commit · Dec 28, 2022
Conversation

mikix (Contributor) commented on Dec 21, 2022

Description

The primary change in this commit is to stop loading i2b2 input files
all at once and instead stream them in chunks, sized by the
--batch-size parameter.
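
For illustration only, here's a minimal sketch of that batching pattern (the helper name and its shape are assumptions, not this repo's actual code):

```python
import itertools
from typing import Iterable, Iterator


def batch_iterate(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Group a row stream into lists of at most batch_size rows.

    Only one batch is materialized at a time, so peak memory is bounded
    by batch_size rather than by the size of the input file.
    """
    iterator = iter(rows)
    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch
```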

But this commit also includes several small fixes:

  • Fixes the location of the MS tool during CI
  • Adds comma-formatting to many progress-count prints
  • Continues the ETL even if cTAKES can't process one message (just logs
    the error instead)
  • Changes the default batch size from 10M to 200k. This works more
    reliably on small-memory (8G) machines. The previous number was
    optimized for the size of the resulting parquet files; this number
    is optimized for memory during the run, which feels like a safer
    default.
  • When using --input-format=ndjson and pointing at a local folder,
    we now still use a temporary folder and copy in just the resource
    ndjson files we want. This speeds up the MS deid tool, since it
    doesn't have to read all possible ndjson inputs (see the sketch
    after this list).
  • Adds better progress messaging while reading i2b2 files.
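
As a rough sketch of the ndjson bullet above (the filename-per-resource selection is an assumption; the real code may pick files differently):

```python
import shutil
import tempfile
from pathlib import Path


def copy_resource_ndjson(input_dir: str, resources: list[str]) -> str:
    """Copy only the ndjson files for the requested resources into a temp dir.

    The MS deid tool then scans just these files instead of every
    ndjson file that happens to live in the input folder.
    """
    tmpdir = tempfile.mkdtemp()
    for path in Path(input_dir).glob("*.ndjson"):
        if any(path.name.lower().startswith(res.lower()) for res in resources):
            shutil.copy(path, tmpdir)
    return tmpdir
```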

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@mikix force-pushed the mikix/iterating branch 4 times, most recently from 0f6b309 to 418e8bb, on December 27, 2022 at 14:23
@mikix changed the title from WIP to fix: don't load all of an i2b2 file into memory on Dec 27, 2022
@mikix force-pushed the mikix/iterating branch 3 times, most recently from 7a5ef75 to 27664f6, on December 27, 2022 at 15:16
@@ -0,0 +1,75 @@
"""Mappings of various external coding systems to the concepts they represent"""
mikix (author) commented:
I moved this file out of resources/ -- resources is a name we usually use for folders with data, not code.

The changes I made here are adding Outpatient to SNOMED_ADMISSION and separating the CDC_RACES and CDC_ETHNICITY dictionaries.
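
For context, such mappings often look roughly like this (the codes below are from the standard CDC Race & Ethnicity code system, but the dictionary shape and contents here are an assumption, not copied from this file):

```python
# Hypothetical shape -- the real module may structure these differently.
CDC_RACES = {
    "White": "2106-3",
    "Black or African American": "2054-5",
    "Asian": "2028-9",
}

CDC_ETHNICITY = {
    "Hispanic or Latino": "2135-2",
    "Not Hispanic or Latino": "2186-5",
}
```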

A reviewer (Contributor) replied:
that's fine - i mostly just left this in the same place as the extant fhir_template out of an abundance of caution

@mikix force-pushed the mikix/iterating branch 5 times, most recently from f3e8910 to 59edeb4, on December 27, 2022 at 19:28
from cumulus.loaders.i2b2.schema import ObservationFact, PatientDimension, VisitDimension


-def extract_csv(path_csv: str, sample=1.0) -> pandas.DataFrame:
+def extract_csv(path_csv: str, batch_size: int) -> Iterator[dict]:
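
A plausible reading of the new signature, assuming pandas' chunked CSV reader under the hood (a sketch, not necessarily the PR's exact implementation):

```python
from typing import Iterator

import pandas


def extract_csv(path_csv: str, batch_size: int) -> Iterator[dict]:
    """Stream a CSV as row dicts, reading at most batch_size rows at a time."""
    for chunk in pandas.read_csv(path_csv, chunksize=batch_size):
        # Each chunk is a small DataFrame; emit its rows and let it be freed
        yield from chunk.to_dict(orient="records")
```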
A reviewer (Contributor) commented:
I think this is generally a much better design than the previous iteration, but a potential optimization if this is still the bottleneck - rather than directly iterating over all the chunks, try a map/reduce against the resulting dataframe?

mikix (author) replied:
Yeah fair -- I think as part of #109, a map/reduce style will have to happen to large parts of the code. My brain isn't natively cloud-ified yet, so I didn't think to start that here.

The reviewer (Contributor) replied:
this is not even like a cloud comment per se - i think the deal is that at some point in the dependency chain dataframe map/reduce hits the numpy cython map/reduce, which through Dark Compiler Magics is like the fastest thing you can do in python for accessing inside of large datasets.
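
A tiny illustration of that point (not from the PR): the same reduction done with a Python-level loop versus pandas dispatching to numpy's compiled code.

```python
import pandas

df = pandas.DataFrame({"value": range(1_000_000)})

# Python-level iteration: every row crosses the C/Python boundary
slow_total = sum(row.value for row in df.itertuples())

# Vectorized map/reduce: the whole sum runs inside numpy's compiled loop
fast_total = df["value"].sum()

assert slow_total == fast_total
```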

mikix (author) replied:
Yeah yeah I just meant "cloud thinking" / massive-data-at-scale thinking, etc

The primary change in this commit is to stop loading i2b2 input files
all at once, but rather stream them in, in chunks determined by the
--batch-size parameter.

But this commit also includes several small fixes:
- Fixes location of MS tool during CI
- Adds comma-formatting to a lot of progress-count prints
- Continues ETL even if cTAKES can't process one message (just logs
  the error instead)
- Changes default batch size from 10M to 200k. This works more
  reliably for small-memory (8G) machines. The previous number was
  optimized for the size of the resulting parquet files. This number
  is optimized for memory during the run, which feels like a safer
  default.
- When using --input-format=ndjson and pointing at a local folder,
  we now still use a temporary folder and copy in just the resource
  ndjson files we want. This is to speed up the MS deid tool, so it
  doesn't have to read all possible ndjson inputs.
- Add better progress messaging while reading i2b2 files.
- Separates race & ethnicity, which i2b2 combines into one concept
- Properly sets DocumentReference.type and status (see the sketch below)
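
For reference, the two DocumentReference fields the last bullet mentions, shown on a minimal FHIR resource (the specific codes are illustrative assumptions, not taken from this PR):

```python
# Minimal FHIR DocumentReference with the two fields this commit sets.
doc_ref = {
    "resourceType": "DocumentReference",
    "status": "current",  # required; "current" marks an active document
    "type": {
        "coding": [
            {
                "system": "http://loinc.org",
                "code": "34109-9",  # LOINC "Note" -- an illustrative choice
            }
        ]
    },
}
```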