Unique number of subject_ids does not match the number stated in documentation #1606

Open
MichalWeisman opened this issue Jul 26, 2023 · 2 comments

Comments

@MichalWeisman

Hello,

In the dataset's documentation, it is mentioned that the data contains information for 40,000 patients.
However, when computing the number of unique values in the subject_id column I receive a much greater number.
For example, I would like to know the number of patients who had a blood glucose test:

glucose_df = labevents[labevents['itemid'] == 50931]  # 50931 is the itemid of the glucose test
print(glucose_df['subject_id'].nunique())

The output is 247,005, which is much greater than 40,000.
Thanks

@heisenbug-1

heisenbug-1 commented Jul 27, 2023

Hi! Correct me if I'm wrong, but I think there are 2 modules: hosp and icu.
hosp contains information about all patients admitted to the hospital, and icu is a subset of hosp (patients who were admitted to the ICU, so every ICU patient has a hadm_id for the hospital admission and a stay_id for the ICU stay).

Unique number of subject ids in hosp is 180733, and icu has 50920 unique subject ids. The documentation says "over 40000 ICU patients", so this checks out :)

The lab events table includes patients that weren't necessarily admitted to the hospital/icu, and there are 255876 unique subject ids in the lab events table.
If you only need glucose values for ICU patients, I'd recommend filtering by subject ids from the icu.icustays table:

SELECT DISTINCT lab.*
FROM mimiciv_hosp.labevents AS lab, mimiciv_icu.icustays AS icu
WHERE lab.subject_id = icu.subject_id
AND lab.itemid = 50931

The result of this query contains 50738 unique subject_ids. You can rewrite it for pandas like:
glucose_df = labevents[(labevents['itemid'] == 50931) & (labevents['subject_id'].isin(icustays_df['subject_id']))]
print(glucose_df['subject_id'].nunique())
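
If it helps, here is a minimal self-contained sketch of the same filter on toy data, so the effect is visible without loading the full tables. The two small DataFrames are made-up stand-ins for labevents and icustays (the values are not from MIMIC-IV), and the variable names other than labevents and icustays_df are only illustrative:

```python
import pandas as pd

# Toy stand-ins for mimiciv_hosp.labevents and mimiciv_icu.icustays.
# All values are invented for illustration.
labevents = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 4, 4],
    "itemid": [50931, 50912, 50931, 50931, 50931, 50931],
})
icustays_df = pd.DataFrame({
    "subject_id": [1, 1, 4],      # subject 1 has two ICU stays
    "stay_id": [100, 101, 102],
})

# All glucose measurements, regardless of whether the patient was in the ICU.
glucose = labevents[labevents["itemid"] == 50931]
all_glucose_patients = glucose["subject_id"].nunique()

# Keep only glucose measurements for patients who appear in icustays.
icu_glucose = glucose[glucose["subject_id"].isin(icustays_df["subject_id"])]
icu_glucose_patients = icu_glucose["subject_id"].nunique()

print(all_glucose_patients, icu_glucose_patients)
```

Because isin only tests membership, a patient with several ICU stays does not duplicate lab rows, unlike a plain join, which is why the SQL version needs DISTINCT.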

Hope this provides you with more insight into the schema :)

@MichalWeisman
Author

@heisenbug-1 Thank you very much for the clarification!
