fix: don't load all of an i2b2 file into memory #108

Merged 1 commit on Dec 28, 2022
Commits on Dec 28, 2022

  1. fix: don't load all of an i2b2 file into memory

    The primary change in this commit is to stop loading i2b2 input files
    all at once and instead stream them in, in chunks sized by the
    --batch-size parameter.
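The streaming approach described above can be sketched as a generator that yields fixed-size batches while the underlying file object stays lazy. This is an illustrative sketch only, not the project's actual code; `batch_iterate` and `stream_file_in_batches` are hypothetical names:

```python
import itertools
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batch_iterate(iterable: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield successive lists of up to batch_size items, never holding
    the whole input in memory at once."""
    iterator = iter(iterable)
    while batch := list(itertools.islice(iterator, batch_size)):
        yield batch

def stream_file_in_batches(path: str, batch_size: int = 200_000) -> Iterator[list[str]]:
    """Stream a large text file in batches of lines (200k mirrors the
    new default --batch-size). The file handle is consumed lazily."""
    with open(path, encoding="utf8") as f:
        yield from batch_iterate(f, batch_size)
```

Because `islice` pulls from a shared iterator, each batch picks up exactly where the previous one stopped, so memory use is bounded by one batch rather than the whole file.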
    
    But this commit also includes several small fixes:
    - Fixes location of MS tool during CI
    - Adds comma-formatting to a lot of progress-count prints
    - Continues ETL even if cTAKES can't process one message (just logs
      the error instead)
    - Changes default batch size from 10M to 200k. This works more
      reliably for small-memory (8G) machines. The previous number was
      optimized for the size of the resulting parquet files. This number
      is optimized for memory during the run, which feels like a safer
      default.
    - When using --input-format=ndjson and pointing at a local folder,
      we now still use a temporary folder and copy in just the resource
      ndjson files we want. This is to speed up the MS deid tool, so it
      doesn't have to read all possible ndjson inputs.
    - Adds better progress messaging while reading i2b2 files.
    - Separates out race & ethnicity from i2b2, which combines them
    - Properly sets DocumentReference.type and status
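The "continue ETL even if cTAKES can't process one message" fix above amounts to wrapping the per-message call in a try/except that logs and moves on. A minimal sketch, assuming a hypothetical `ctakes_client.extract` API (not the project's real interface):

```python
import logging

def process_notes(notes, ctakes_client):
    """Run each note through cTAKES, logging failures instead of
    aborting the whole ETL run."""
    results = []
    for note in notes:
        try:
            results.append(ctakes_client.extract(note))
        except Exception:
            # Log the full traceback but keep going with the next note.
            logging.exception("Could not process message; skipping")
    return results
```

The design trade-off is that one malformed note no longer kills a multi-hour run; the error is preserved in the logs for later inspection.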
    mikix committed Dec 28, 2022
    Commit b104b07