
Recap data pull discussion #10

Open
kcho opened this issue Apr 2, 2021 · 26 comments
Labels
documentation (Improvements or additions to documentation) · enhancement (New feature or request) · question (Further information is requested)

Comments

@kcho
Member

kcho commented Apr 2, 2021

lochness.redcap pulls all available data from REDCap into a json file

  • when lochness.redcap.sync is re-executed, lochness pulls the whole dataset again and compares it with the existing json before overwriting.

Problems

1. A daily pull of the data for all subjects may put too much load on the REDCap server

  • Do we know the limit on API data pulls? e.g. 1 GB per week?
  • How big will the json file for a subject be?

2. Extensive work is required on the logbook to select and extract the data from the json dump for visualization in DPDash

  • how many fields are there?
  • will the fields change at any point in the study?

Solutions

  • Add a function in lochness.redcap to pull only specific fields?
    • Justin & Habib's suggestion
  • Add a function that pulls the field showing the date of last edit (do we have such a field in REDCap?)
    • if this field differs from the downloaded json -> re-download all files
    • if this field matches the downloaded json -> skip
  • Include a 'redcap completed' column in metadata.csv to stop the pulling or make it less frequent.
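A field-limited pull could look roughly like the sketch below, which builds the indexed `fields[i]` parameters the REDCap export API expects. The URL is a placeholder, and `build_payload`/`pull_fields` are hypothetical helper names, not part of lochness:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint -- replace with the study's real API URL.
API_URL = 'https://redcap.example.org/api/'

def build_payload(token, record_id, fields):
    """Build the POST body for a partial export: one record, selected fields."""
    data = {
        'token': token,
        'content': 'record',
        'format': 'json',
        'records[0]': record_id,
    }
    # The REDCap export API takes field names as an indexed array
    for i, field in enumerate(fields):
        data['fields[%d]' % i] = field
    return data

def pull_fields(token, record_id, fields):
    """POST the payload and return the decoded JSON records."""
    body = urllib.parse.urlencode(build_payload(token, record_id, fields)).encode()
    with urllib.request.urlopen(API_URL, data=body) as response:
        return json.loads(response.read())
```

Pulling only the needed fields would shrink both the transfer size and the server-side export work per sync.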
@kcho kcho added the documentation, enhancement, and question labels Apr 2, 2021
@tashrifbillah

tashrifbillah commented Apr 2, 2021

Hi @kcho and @sbouix , let's continue the discussion here.

By Kevin (edited by Tashrif):

I found a “Data Entry Trigger” function in REDCap. Whenever a record is modified or updated, it sends a POST signal with a bunch of information to a dedicated server. If the major problem with pulling all the data on a daily basis is REDCap server overloading, do you think implementing the “Data Entry Trigger” and connecting it to lochness would be a solution (or overkill)?

[screenshot: REDCap Data Entry Trigger setting]

Suggested workflow:

  • REDCap record gets updated
  • “Data Entry Trigger” sends the name of the updated data field to AWS
  • On AWS, the list of updated records is stored
  • Lochness pulls this information from AWS
  • Only the updated fields are downloaded

This would solve the REDCap server problem and we would be able to keep all of the up-to-date REDCap data in lochness.

@tashrifbillah

tashrifbillah commented Apr 2, 2021

Okay, here is my modified workflow:

  • REDCap record is updated
  • Data Entry Trigger emits a signal
  • Our very own https://predict.bwh.harvard.edu/ hosted watchdog (TBD) catches the signal
  • The watchdog (TBD) determines whether the update is an essential one
  • If yes, asks lochness to pull the updated record

The last three steps could be done by a cron like bot.

@sbouix

sbouix commented Apr 2, 2021

To add to the agenda: the ability to detect tags for particular variables.

@kcho
Member Author

kcho commented Apr 2, 2021

Thanks for this @tashrifbillah

Could you set up a URL under https://predict.bwh.harvard.edu/ so it can catch the POST signal from the REDCap Data Entry Trigger, please?

Or if we have any other publicly open ports among the PNL servers, please let me know. I'll test receiving the signal.

@sbouix

sbouix commented Apr 2, 2021

The only two externally facing servers I know of are hcpep-xnat and our web server. Predict is behind the firewall.

@tashrifbillah

tashrifbillah commented Apr 2, 2021

Hi Kevin, do you know of a tutorial I can go through to learn how to upload a file to REDCap? I need to be able to upload, trigger, and listen independently to set such a thing up. Also, where did you get the screenshot? If writing is hard, an MS Teams call works for me.

@tashrifbillah

Is this the function I need?

@kcho
Member Author

kcho commented Apr 2, 2021

Hi Kevin, do you know of a tutorial that I can go through to learn to upload a file to REDCap? I need to be able to upload, trigger, and listen independently to be able to set up such a thing.

I have not uploaded a file before, but I would suggest looking at the API playground and trying the Import File API method.
The API doc is here: https://redcap.partners.org/redcap/api/help

Also, where did you get the screenshot?

Screenshot is from
REDCAP - "Project Setup" -> "Enable optional modules and customizations"

@kcho
Member Author

kcho commented Apr 2, 2021

Quickly tested to see if REDCap sends the signal to an open server.

  • Project id
  • Username
  • Record ID
  • Name of instrument modified

are sent to the server. I think it can act as a very useful logging system.

I’ll bring this up in our next meeting, so we can discuss how we can include this.

redcap_url=https%3A%2F%2Fredcap.partners.org%2Fredcap%2F&project_url=https%3A%2F%2Fredcap.partners.org%2Fredcap%2Fredcap_v10.0.30%2Findex.php%3Fpid%3D26709&project_id=26709&username=kc244&record=100111111&instrument=adverse_events_ae&adverse_events_ae_complete=0
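The POST body above is ordinary URL-encoded form data, so the listening side can decode it with the standard library alone. A minimal sketch, using a shortened version of the payload above:

```python
from urllib.parse import parse_qs

# Shortened version of the Data Entry Trigger payload shown above
payload = ('project_id=26709&username=kc244&record=100111111'
           '&instrument=adverse_events_ae&adverse_events_ae_complete=0')

# parse_qs returns lists of values; each DET key appears once, so take [0]
fields = {key: values[0] for key, values in parse_qs(payload).items()}
print(fields['record'])      # 100111111
print(fields['instrument'])  # adverse_events_ae
```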

@tashrifbillah

2. extensive work is required on the logbook to select and extract the data from the json dump to visualize in the DPDash

how many fields are there?

The HCP-EP survey I am working with has 915 fields in each of the six instruments, a.k.a. surveys.

will the fields be changed in any point of the study?

The fields are the same across the six instruments, so they should be consistent across the study.

@kcho
Member Author

kcho commented Apr 6, 2021

@sbouix @tashrifbillah
I thought about the architecture below for what we discussed yesterday about REDCap data pulling. I think there were two main problems: one is PII and the other is server overloading. Below is my suggestion; please let me know what you think. I'll start working on it soon.

Proposed REDCap pulling architecture

PII part

  1. lochness.redcap pulls all data from the REDCap server to PROTECTED/survey/raw/ABCD01.json
  2. Save a json - data free from PII
  • lochness.redcap (or predict_pii.redcap or logbook.redcap)
    • from PROTECTED/survey/raw/ABCD01.json, remove all PII fields
      • using the REDCap "PII" tags (need to review how we can pull this information)
    • and save it in GENERAL/survey/raw/ABCD01.json
  3. Save another json - data with the PII replaced with pseudo-random strings
  • lochness.redcap (or predict_pii.redcap or logbook.redcap)
    • process the PII fields in PROTECTED/survey/raw/ABCD01.json and save the result in PROTECTED/survey/processed/ABCD01.json
    • copy PROTECTED/survey/processed/ABCD01.json to GENERAL/survey/processed/ABCD01.json
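A rough sketch of the two GENERAL outputs, with a hard-coded PII field list as a stand-in (in practice the list would come from REDCap tags or a curated table); `remove_pii` corresponds to point 2 and `mask_pii` to point 3, and both names are illustrative, not lochness functions:

```python
import secrets

# Hypothetical PII field names for illustration only
PII_FIELDS = {'name', 'phone_number', 'address'}

def remove_pii(record):
    """Point 2: drop PII fields entirely (for GENERAL/survey/raw)."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

def mask_pii(record):
    """Point 3: replace PII values with pseudo-random strings
    (for GENERAL/survey/processed)."""
    return {k: (secrets.token_hex(8) if k in PII_FIELDS else v)
            for k, v in record.items()}
```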

Redcap server overloading problem part

  1. before pulling any data from REDCap, lochness.redcap checks for files under PROTECTED/survey/raw

    • if ABCD01.json already exists
      • check the db, which is updated live by listening to the POST signal from the REDCap Data Entry Trigger
        • if ABCD01 is in the db, execute the download
        • if ABCD01 is not in the db, skip the download
  2. repeat the PII part above

  3. in the lochness-to-lochness transfer, changes to ABCD01.json should be detected by sha1 / hash / other methods so that only the updated data is pulled.
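For point 3, the change detection could be as simple as comparing content hashes between syncs; a sketch using sha1 (the helper names are illustrative):

```python
import hashlib

def file_sha1(path):
    """SHA-1 of a file, read in chunks so large json dumps fit in memory."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            digest.update(chunk)
    return digest.hexdigest()

def has_changed(path, known_sha1):
    """True if the file differs from the last transferred version."""
    return file_sha1(path) != known_sha1
```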

@tashrifbillah

What is the distinction between points 2 and 3 under PII Part?

@kcho
Copy link
Member Author

kcho commented Apr 6, 2021

What is the distinction between points 2 and 3 under PII Part?

Sorry - I edited it a bit.
Point 2 is for saving a json in GENERAL - data that has no PII.
Point 3 is for saving a json in GENERAL - data with the PII fields replaced by pseudo-random strings.

@sbouix

sbouix commented Apr 6, 2021

Let's concentrate on REDCap server overloading first.

The PII masking is more complex: some variables can be deleted (e.g. name), others replaced by another variable (e.g. birthdate -> age in years). I am not sure we should have two copies of pretty much the same thing (raw vs processed). Also, because I would like to import the anonymized data into MGB REDCap, we should figure out how that will be affected by (2) vs (3). Finally, we may be better off having a table with a list of PII variables as input rather than trying to extract the tag from REDCap.

@sbouix

sbouix commented Apr 6, 2021

For the lochness-to-lochness transfer, I also think datalad might be useful. Something to discuss with Chris and Mathias on Friday.

@tashrifbillah

Hi @kcho , did you try making a workstation listen to REDCap signal yet? If you haven't, I can try that for my entertainment out of DPDash crisscross ;)

@kcho
Member Author

kcho commented Apr 7, 2021

Hi @kcho , did you try making a workstation listen to REDCap signal yet? If you haven't, I can try that for my entertainment out of DPDash crisscross ;)

I haven't tried it on the workstation yet, but I've drafted a command-line tool and a module in lochness.redcap for listening to the POST signal from the REDCap server:
https://github.com/PREDICT-DPACC/lochness/blob/devel/kcho/redcap_new_arch/scripts/listen_to_redcap.py
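For reference, a stripped-down version of such a capture server can be written with the standard library alone. This sketch (not the actual listen_to_redcap.py, and `DB_PATH` is a hypothetical location) appends each decoded Data Entry Trigger signal to a local CSV:

```python
import csv
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

DB_PATH = 'det_db.csv'  # hypothetical location of the DET database

class DETHandler(BaseHTTPRequestHandler):
    """Capture REDCap Data Entry Trigger POST signals into a CSV."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        fields = {k: v[0] for k, v in
                  parse_qs(self.rfile.read(length).decode()).items()}
        with open(DB_PATH, 'a', newline='') as f:
            csv.writer(f).writerow([time.time(),
                                    fields.get('project_id'),
                                    fields.get('username'),
                                    fields.get('record'),
                                    fields.get('instrument')])
        self.send_response(200)
        self.end_headers()

# To run: HTTPServer(('', 8080), DETHandler).serve_forever()
```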

@kcho
Member Author

kcho commented Apr 8, 2021

Let's concentrate on REDCap server overloading first.

The model shown below has been uploaded to the devel/kcho/redcap_new_arch.
master...PREDICT-DPACC:devel/kcho/redcap_new_arch

To do

  • test in PNL workstation
  • record a demo
  • discuss the consequences of the Data Entry Trigger (DET) capture server going down

Figure

[figure: proposed Data Entry Trigger capture architecture]

Summary

1. Make a database from the POST signals from the REDCap Data Entry Trigger

  • listen_to_redcap.py: a live server that captures and saves all the POST signals received from the REDCap Data Entry Trigger
    • saves a table like the one below

| timestamp | project_id | redcap_username | record | instrument |
| --- | --- | --- | --- | --- |
| 1617823322.701979 | 26709 | kc244 | subject0002 | inclusionexclusion |
| 1617823322.711633 | 26709 | kc244 | subject0001 | inclusionexclusion |

  • The path of the DB above is entered into config.yml

2. lochness.redcap checks for any updates in the Data Entry Trigger database before executing datapull

  • lochness.redcap.get_data_entry_trigger_df: loads the DET database
  • lochness.redcap.check_if_modified: compares st_mtime of already saved jsons vs DET database for any recent updates
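The check could then be a small stdlib routine; a sketch under the assumption that the DET-DB is the CSV above (the signature is illustrative, not the actual lochness.redcap.check_if_modified API):

```python
import csv
import os

def check_if_modified(json_path, det_db_path, record):
    """Re-download only when the DET database has an update for this
    record newer than the saved json's st_mtime."""
    if not os.path.exists(json_path):
        return True  # nothing saved yet -> always pull
    saved_mtime = os.stat(json_path).st_mtime
    with open(det_db_path, newline='') as f:
        for row in csv.DictReader(f):
            if row['record'] == record and float(row['timestamp']) > saved_mtime:
                return True
    return False
```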

@tashrifbillah

In

check DET-DB
recent update

Do you plan to compare checksums like mediaflux does? Here are nipype's ways of computing a checksum:

@kcho
Member Author

kcho commented Apr 8, 2021

In

check DET-DB
recent update

Do you plan to compare checksum like mediaflux does? Here are nipype ways of computing checksum:

Since the Data Entry Trigger database (DET-DB) is a CSV file containing all the REDCap field updates and the timestamp of each POST signal, I compare the last-modified date of the already-existing json file against the last update captured in the DET-DB for each subject (if the subject exists in the DET-DB).

@tashrifbillah

Hi @kcho , is it expecting an empty csv file?

@kcho
Member Author

kcho commented Apr 8, 2021

Hi @kcho , is it expecting an empty csv file?

It's expecting the path of the DET-DB csv file. If the csv already exists, the live capture server will append new information to the existing csv file.

@tashrifbillah

Currently, how is it being programmed--listen_to_redcap.py running sync.py --source redcap sort of?

@kcho
Member Author

kcho commented Apr 17, 2021

Currently, the two python scripts have to be executed separately. I just realized it could be useful to design it following your comment.

listen_to_redcap.py running sync.py --source redcap sort of?

Any downside to doing this? Programmatically, how would you spin out a continuously running sync.py while also continuously running listen_to_redcap.py from a single execution? The multiprocessing module?
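One way is the standard library multiprocessing module: run the listener in a child process and hand updated record IDs to the sync side over a queue. In this sketch, `listener` and `sync_updated` are placeholder functions standing in for listen_to_redcap.py and sync.py; they are not real lochness code:

```python
import multiprocessing

def listener(queue):
    """Stand-in for listen_to_redcap.py: report updated record IDs.
    A real listener would loop forever, capturing DET POST signals."""
    queue.put('subject0001')

def sync_updated(queue, n_expected):
    """Stand-in for one sync.py pass: pull each record the listener reported."""
    return [queue.get(timeout=10) for _ in range(n_expected)]

def main():
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=listener, args=(queue,))
    proc.start()
    updated = sync_updated(queue, 1)  # blocks until the listener reports
    proc.join()
    return updated
```

In a real deployment both sides would loop indefinitely, so crash handling for the listener process (the "DET capture server going down" item above) still needs thought.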

@tashrifbillah

tashrifbillah commented Apr 17, 2021

multiprocess module?

It should be a chained process--trigger comes first and then pull. We shall discuss more during our Monday brainstorming session.

By the way, do we have access to @sbouix 's presentation on what data reside on which platforms? I am trying to understand which platforms should trigger data entry signals. I understand that for PRoNET it would be REDCap. What would it be for PRESCIENT?

@sbouix

sbouix commented Apr 17, 2021

The primary database system for PRESCIENT will be RPMS (Research Project Management System). It is custom built by the Orygen team and doesn't have the extensive documentation or API functionality of REDCap. We're working to get access to their IT infrastructure to set up a development environment and start developing the Lochness RPMS module.
