How to load HPDS data from CSV/DBMS, and the format issue #49

Open
finkbine opened this issue Feb 10, 2023 · 16 comments

Comments

@finkbine

Hi,
I am a new user. I tried to follow the instructions in the "Project Load HPDS Data From CSV" section, but the column definitions are not clear to me:
"PATIENT_NUM","CONCEPT_PATH","NVAL_NUM","TVAL_CHAR","TIMESTAMP". Could you give me a real example file, especially for "CONCEPT_PATH"?

You mentioned that "This job requires datafile in csv format in location - /usr/local/docker-config/hpds_csv/allConcepts.csv". What if I want to upload my own CSV files? After "Run Jenkins job - Start PIC-SURE" finishes, will I see the new samples on the PIC-SURE website?
Thanks a lot!

@anilk2hms
Contributor

Jenkins is configured to use a local directory on the host. The job "Project Load HPDS Data From CSV" uses the folder /usr/local/docker-config/hpds_csv/ on the host, which is mounted into the container at this path: /opt/local/hpds/
So, on the host, place your CSV file in /usr/local/docker-config/hpds_csv/. Make sure you delete the existing allConcepts.csv, put your own file there under the same name, and then run the Jenkins job.

@dmpillion
Contributor

We have examples of how to map and load your data, including how the NHANES data was loaded, in this repo: https://github.com/hms-dbmi/pic-sure-hpds-phenotype-load-example
Here is an example of a concept path from NHANES: \demographics\SEX\female\

Please let us know if you need additional assistance.
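
A minimal sketch of the five-column layout discussed above, written in Python. Only the column names and the NHANES concept path come from this thread. The header row, the patient numbers, the timestamp value, and the choice to put the categorical value in TVAL_CHAR while leaving NVAL_NUM empty are assumptions that should be checked against the example repo.

```python
import csv
import io

# Column names as quoted from the "Project Load HPDS Data From CSV" instructions.
COLUMNS = ["PATIENT_NUM", "CONCEPT_PATH", "NVAL_NUM", "TVAL_CHAR", "TIMESTAMP"]


def example_rows():
    """Return illustrative rows; every value except the concept path is made up."""
    return [
        # Assumption: a categorical value goes in TVAL_CHAR and NVAL_NUM stays empty.
        [1, "\\demographics\\SEX\\female\\", "", "female", 1611273600],
        [2, "\\demographics\\SEX\\female\\", "", "female", 1611273600],
    ]


def write_concepts(stream, rows, include_header=True):
    """Write rows as CSV. Whether the loader expects a header row is an assumption."""
    writer = csv.writer(stream, lineterminator="\n")
    if include_header:
        writer.writerow(COLUMNS)
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_concepts(buffer, example_rows())
    print(buffer.getvalue(), end="")
```

To produce a real file, the same function can be called with a file opened as `open("allConcepts.csv", "w", newline="")` and the result copied to /usr/local/docker-config/hpds_csv/ on the host.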

@finkbine
Author

thank you !

@finkbine
Author

Hi,
Does PIC-SURE allow for multiple databases/projects?
For example, the folder /usr/local/docker-config/hpds_csv/ can only hold one allConcepts.csv file.

thanks a lot!

@dmpillion
Contributor

Hi,

We want to make sure we understand the question. When you say "multiple databases/projects", does that mean that you want them displayed with different root paths?

Can you provide a more detailed example?

Thanks!

@finkbine
Author

finkbine commented Feb 16, 2023

Dear dmpillion,

Yes, the projects have different root paths, although I am not sure exactly what you mean by that. For example, one project has its own set of SUBJECT_ID values as the primary key and another project has a different set, so the two CSV files cannot be combined into a single allConcepts.csv file.

There are also several other questions:

  1. In the CSV file, what if a single SUBJECT_ID has multiple records with the same CONCEPT_PATH? This is common for a patient with many observations, for example:
    1,"\a\b\c",1,,111111
    1,"\a\b\c",2,,111111
    1,"\a\b\c",3,,111111
    Can PIC-SURE handle this correctly? I saw that PIC-SURE can list the number of observations and the number of unique SUBJECT_IDs.

  2. We are interested in searching for keywords in the variable list and also in the contents of variables. It seems that PIC-SURE currently cannot search for keywords in contents; for example, we want to search the 10,000-word text content of a variable.

  3. A character variable with a long string (~10,000 characters) cannot be imported correctly. The error message was:

Feb 21, 2023 4:36:23 AM com.google.common.cache.LocalCache processPendingNotifications
WARNING: Exception thrown by removal listener
java.lang.OutOfMemoryError: Java heap space

  4. Where is the function/mechanism to share our data with other researchers?

  5. Date format: is there a date type, in addition to character and numeric?

thanks a lot!
xiangjun
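
A small pre-load check for question 1, sketched in Python: it counts observations and distinct patients per concept path in a CSV laid out like the rows above. It says nothing about how PIC-SURE itself treats repeated rows; the function name and the header-skipping rule are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# The three repeated rows from question 1, same patient and concept path.
SAMPLE = (
    '1,"\\a\\b\\c",1,,111111\n'
    '1,"\\a\\b\\c",2,,111111\n'
    '1,"\\a\\b\\c",3,,111111\n'
)


def summarize(stream):
    """Return {concept_path: (observation_count, unique_patient_count)}."""
    observations = defaultdict(int)
    patients = defaultdict(set)
    for row in csv.reader(stream):
        # Skip blank lines and an optional header row (assumed to start with PATIENT_NUM).
        if not row or row[0] == "PATIENT_NUM":
            continue
        patient, path = row[0], row[1]
        observations[path] += 1
        patients[path].add(patient)
    return {path: (observations[path], len(patients[path])) for path in observations}


if __name__ == "__main__":
    for path, (n_obs, n_patients) in summarize(io.StringIO(SAMPLE)).items():
        print(path, n_obs, n_patients)
```

Comparing these counts with the numbers PIC-SURE reports after loading is one way to see whether repeated observations were kept.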

@mangmang1216

Thank you both. I'm the PI on one of the pilot AIM-AHEAD projects (I'm a physician scientist and not a data scientist) and Dr. Paul Avilach advised our group to try installing PIC-SURE to be linked to AWS SWB (Xiangjun has been working on this for several weeks). The concept mapping is interesting, but it is unclear how feasible it is. I have extracted clinical data on 20,000 patients, with likely millions of unique longitudinal lab values and several million unique ICD/CPT/HCPCS codes. Would each one require its own concept mapping for PIC-SURE to function properly? We also have semi-structured and unstructured long clinical notes.

I read from the example that the core of PIC-SURE is i2b2, which is a data aggregation/search platform that our institution already has. Personally, I'm trying to understand the benefit of using the PIC-SURE HPDS platform over a standard SQL platform, or over just leaving the data as CSV files that we can easily import into any statistical software for data merging and analyses. Is PIC-SURE more like i2b2 or SlicerDicer, or does it have any built-in NLP capability similar to the EPIC search engine?

@finkbine
Author

Dear dmpillion,

We get an error when importing the CSV into PIC-SURE: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

The heap size was left at the default: + docker run --name=hpds-etl -v /usr/local/docker-config/hpds_temp:/opt/local/hpds -v /usr/local/docker-config/hpds_csv/allConcepts.csv:/opt/local/hpds/allConcepts.csv -e HEAPSIZE=4096 -e LOADER_NAME=CSVLoader --name hpds_data_load_csv hms-dbmi/pic-sure-hpds-etl:LATEST

How can users set the memory size?

thanks

@anilk2hms
Contributor

I assume you are using the job "Load HPDS Data From CSV" to load the data.
To adjust HEAPSIZE, click Configure on the job and go to the Build section, where you will find -e HEAPSIZE=4096. Change this value and save. Set HEAPSIZE to half of your available RAM, then rerun the Jenkins job.
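
A rough Python sketch of the "half of your RAM" rule, assuming HEAPSIZE is given in MB (as a later comment in this thread implies) and that total physical RAM is a fair stand-in for "available RAM". The function names are hypothetical, and the sysconf lookup is Linux-specific.

```python
import os


def recommended_heapsize_mb(total_ram_bytes):
    """Half of the given RAM, in whole MB (1 MB taken as 1024 * 1024 bytes)."""
    return total_ram_bytes // 2 // (1024 * 1024)


def total_ram_bytes():
    """Read physical RAM via sysconf on Linux; return None where unsupported."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    ram = total_ram_bytes()
    if ram is None:
        print("Could not read RAM size on this platform")
    else:
        # Value to paste into the Jenkins build step in place of 4096.
        print(f"-e HEAPSIZE={recommended_heapsize_mb(ram)}")
```

On a real host the reported RAM is usually a little below the nominal size, so the printed value will be slightly under the round figure.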

@finkbine
Author

finkbine commented Mar 30, 2023

Hi,
There is still an error with this setting after the heap size was increased to 100000.
Another question: datetime values can only be imported as a numeric variable (UNIX timestamp in seconds). Is there a function to convert the UNIX timestamp back to a datetime?

+ docker run --name=hpds-etl -v /usr/local/docker-config/hpds_temp:/opt/local/hpds -v /usr/local/docker-config/hpds_csv/allConcepts.csv:/opt/local/hpds/allConcepts.csv -e HEAPSIZE=100000 -e LOADER_NAME=CSVLoader --name hpds_data_load_csv hms-dbmi/pic-sure-hpds-etl:LATEST

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fd4aa000000, 6325010432, 0) failed; error='Not enough space' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 6325010432 bytes for committing reserved memory.
# An error report file with more information is saved as:
# //hs_err_pid7.log
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Thanks

@dmpillion
Contributor

1.) Can you confirm the available RAM on your machine?

2.) Can you explain the use case for wanting to convert the UNIX timestamp?
Once the data is loaded into PIC-SURE, it will be in date-time format, not a UNIX timestamp in seconds.

@finkbine
Author

Hi dmpillion,

  1. Our configuration is:
    EC2 instance: t2.2xlarge, 32 GB of RAM, 8 cores, 100 GB of hard drive

  2. The instructions mention only two variable types, numeric and character. Take the string "2021-1-22" as an example: do you mean that if "2021-1-22" is treated as character, PIC-SURE will convert it to a datetime? In my previous import, "2021-1-22" was not recognized as a datetime, only as a string.

thanks again!
xiao

@mangmang1216

To further clarify Xiao's comment, our dataset has longitudinal date/time stamps. For example, we need to load every complete blood count result from 1/1/2011 to 1/1/2023. Based on the NHANES tutorial, all date/time stamps must first be converted to UNIX timestamps, since they would otherwise be treated as character strings. However, after they are converted, we can't seem to convert them back into a date/time presentation in PIC-SURE. Thank you.
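
For the conversion step described here, a minimal Python sketch that turns date strings such as "2021-1-22" or "1/1/2011" into UNIX seconds before loading and back again afterwards. It assumes the dates mean midnight UTC and that the TIMESTAMP column takes seconds, as stated earlier in this thread; the function names are hypothetical and this is outside PIC-SURE itself.

```python
from datetime import datetime, timezone


def to_unix_seconds(text, fmt="%Y-%m-%d"):
    """Parse a date string and return UNIX seconds, treating it as midnight UTC."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def from_unix_seconds(seconds):
    """Convert UNIX seconds back to an ISO 8601 date string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()


if __name__ == "__main__":
    stamp = to_unix_seconds("2021-1-22")
    print(stamp, from_unix_seconds(stamp))
    # US-style dates need an explicit format string.
    print(to_unix_seconds("1/1/2011", fmt="%m/%d/%Y"))
```

Fixing the time zone to UTC keeps the round trip stable regardless of the machine's local time zone.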

@anilk2hms
Contributor

The machine has 32 GB of RAM, but the job was provisioned with HEAPSIZE=100000, which is 100000/1024, or ~97 GB.
Set HEAPSIZE to half of the available RAM. In this case (32 GB of RAM), it should not be more than 16384.
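
The arithmetic above can be checked with a short Python sketch. It assumes HEAPSIZE is in MB with 1 GB = 1024 MB, as in the calculation in this comment; the function names are made up for illustration.

```python
def heapsize_mb_to_gb(heapsize_mb):
    """Convert a HEAPSIZE value in MB to GB."""
    return heapsize_mb / 1024


def max_heapsize_mb(ram_gb):
    """Upper bound suggested in this thread: half of the machine's RAM, in MB."""
    return ram_gb * 1024 // 2


def check_heapsize(heapsize_mb, ram_gb):
    """Return (is_within_limit, limit_mb) for a configured HEAPSIZE."""
    limit = max_heapsize_mb(ram_gb)
    return heapsize_mb <= limit, limit


if __name__ == "__main__":
    ok, limit = check_heapsize(100000, ram_gb=32)
    print(f"requested {heapsize_mb_to_gb(100000):.1f} GB, limit {limit} MB, ok={ok}")
```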

@finkbine
Author

finkbine commented Apr 5, 2023

thank you, I will try it when our system admin comes back

@finkbine
Author

finkbine commented Apr 12, 2023

@anilk2hms Hi, we upgraded our system to 64 GB of RAM and there are no longer any error messages; please see the attached log file.
However, PIC-SURE does not show the tree structure. Could you help us?

image
test.log

thanks
