fix: don't load all of an i2b2 file into memory #108
Conversation
Force-pushed from 0f6b309 to 418e8bb
Force-pushed from 7a5ef75 to 27664f6
@@ -0,0 +1,75 @@
"""Mappings of various external coding systems to the concepts they represent"""
I moved this file out of resources/ -- resources is a name we usually use for folders with data, not code. The changes I made here are adding Outpatient to SNOMED_ADMISSION and separating the CDC_RACES and CDC_ETHNICITY dictionaries.
that's fine - i mostly just left this in the same place as the extant fhir_template out of an abundance of caution
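For illustration, here is a minimal sketch of what the split dictionaries might look like. The i2b2-side keys and the SNOMED_ADMISSION codes are assumptions, not the PR's actual contents; the race/ethnicity codes shown are from the standard CDC Race & Ethnicity code system (urn:oid:2.16.840.1.113883.6.238).

```python
# Illustrative sketch only -- keys and coverage are assumptions, not a copy
# of the PR's actual mappings module.

CDC_SYSTEM = "urn:oid:2.16.840.1.113883.6.238"

# Race and ethnicity are kept as separate dictionaries because i2b2 combines
# them into one field, while FHIR models them as distinct concepts.
CDC_RACES = {
    "White": (CDC_SYSTEM, "2106-3"),
    "Black or African American": (CDC_SYSTEM, "2054-5"),
    "Asian": (CDC_SYSTEM, "2028-9"),
}

CDC_ETHNICITY = {
    "Hispanic or Latino": (CDC_SYSTEM, "2135-2"),
    "Not Hispanic or Latino": (CDC_SYSTEM, "2186-5"),
}

# Maps i2b2 visit values to encounter concepts; codes are placeholders here.
SNOMED_ADMISSION = {
    "Inpatient": "...",   # placeholder
    "Emergency": "...",   # placeholder
    "Outpatient": "...",  # the entry this PR adds
}
```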
Force-pushed from f3e8910 to 59edeb4
 from cumulus.loaders.i2b2.schema import ObservationFact, PatientDimension, VisitDimension

-def extract_csv(path_csv: str, sample=1.0) -> pandas.DataFrame:
+def extract_csv(path_csv: str, batch_size: int) -> Iterator[dict]:
I think this is generally a much better design than the previous iteration, but a potential optimization if this is still the bottleneck - rather than directly iterating over all the chunks, try a map/reduce against the resulting dataframe?
Yeah fair -- I think as part of #109, a map/reduce style will have to happen to large parts of the code. My brain isn't natively cloud-ified yet, so I didn't think to start that here.
this is not even like a cloud comment per se - i think the deal is that at some point in the dependency chain dataframe map/reduce hits the numpy cython map/reduce, which through Dark Compiler Magics is like the fastest thing you can do in python for accessing inside of large datasets.
Yeah yeah I just meant "cloud thinking" / massive-data-at-scale thinking, etc
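For context, a minimal sketch of how a batched extract_csv can be written with pandas' chunksize option -- an illustration of the approach being discussed, not necessarily the PR's exact code:

```python
from typing import Iterator

import pandas


def extract_csv(path_csv: str, batch_size: int) -> Iterator[dict]:
    """Yields rows one at a time, reading the CSV in batch_size chunks."""
    # chunksize makes read_csv return an iterator of DataFrames instead of
    # loading the whole file into memory at once
    for chunk in pandas.read_csv(path_csv, chunksize=batch_size, dtype=str):
        # to_dict(orient="records") turns each chunk into a list of row dicts
        yield from chunk.to_dict(orient="records")
```

The map/reduce suggestion above would instead apply vectorized operations to each chunk DataFrame before handing rows downstream, keeping per-row work in numpy/cython rather than in the Python loop.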
Force-pushed from 59edeb4 to b104b07
Description
The primary change in this commit is to stop loading i2b2 input files all at once, but rather stream them in, in chunks determined by the --batch-size parameter.
But this commit also includes several small fixes:
- Fixes location of MS tool during CI
- Adds comma-formatting to a lot of progress-count prints
- Continues ETL even if cTAKES can't process one message (just logs the error instead)
- Changes default batch size from 10M to 200k. This works more reliably for small-memory (8G) machines. The previous number was optimized for the size of the resulting parquet files. This number is optimized for memory during the run, which feels like a safer default.
- When using --input-format=ndjson and pointing at a local folder, we now still use a temporary folder and copy in just the resource ndjson files we want. This is to speed up the MS deid tool, so it doesn't have to read all possible ndjson inputs. (See the sketch after this list.)
- Adds better progress messaging while reading i2b2 files
- Separates out race & ethnicity from i2b2, which combines them
- Properly sets DocumentReference.type and status
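As a rough illustration of that last ndjson point -- the function name, glob pattern, and resource list here are hypothetical, not the PR's actual code:

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable


def copy_resource_ndjson(input_dir: str, resources: Iterable[str]) -> str:
    """Copies only the ndjson files for the resources we care about into a temp dir."""
    tmpdir = tempfile.mkdtemp()
    for resource in resources:
        # Assumes input files are named after their resource (e.g. Patient.ndjson);
        # the real naming convention may differ.
        for path in Path(input_dir).glob(f"*{resource}*.ndjson"):
            shutil.copy(path, tmpdir)
    return tmpdir


# The MS deid tool is then pointed at tmpdir instead of the full input folder,
# so it only reads the ndjson inputs the ETL actually needs.
```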
Checklist
- Consider whether any documentation (in docs/) needs to be updated