Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cohort sampling #1657

Merged
merged 37 commits into from
Nov 9, 2020
Merged

Cohort sampling #1657

merged 37 commits into from
Nov 9, 2020

Conversation

MaximMoinat
Copy link
Contributor

@MaximMoinat MaximMoinat commented Oct 14, 2020

Backend code for a the Atlas cohort sampling feature (OHDSI/Atlas#2357), allowing to store created samples per cohort and data source.

For more details see this forum post.

blootsvoets and others added 28 commits January 28, 2020 11:04
Note: this makes reading a sample much much slower.
Add endpoint for has-samples by sourceKey
@chrisknoll
Copy link
Collaborator

Hi, thanks for the PR. I think it's worth trying to get this update in the 2.8 release since we have a few extra weeks (so we don't interrupt the symposium). If things go smoothly, let's try to get this in.

The conflict was related to a recent file update on cohort definition service, which involved re-formatting it to fit the decision to use tab-indentation. Can you resolve the conflict (should be simple)? In addition, it would be helpful to submit the files with tab-indent so that we won't have to go back later and update the formatting.

@MaximMoinat
Copy link
Contributor Author

Thanks @chrisknoll. Will look into converting spaces into tabs.

@MaximMoinat
Copy link
Contributor Author

Last commit solves the BIGINT issue mentioned by @chrisknoll in the Atlas PR:
OHDSI/Atlas#2357 (comment)

@chrisknoll
Copy link
Collaborator

chrisknoll commented Oct 28, 2020

Hi, the changes to covert to string look good, I confirmed that the payload in the JSON is comming as a string, however the values are still corrupted. But, i figured out where:

In line 467 of CohortSamplingService.java, you have a RowMapper that pulls the data out of the result set. For personId, you're using rs.getInt("person_id"), but this should be rs.getLong("person_id"). I changed this locally, and the person_id's returned the correct values (as strings).

Once I made both changes on the Atlas and WebAPI side, I can pull up the patient profile and the visualization appears!

@MaximMoinat
Copy link
Contributor Author

@chrisknoll Thanks again for your tests. Changes made in our local branch

@MaximMoinat MaximMoinat reopened this Oct 30, 2020
Copy link
Collaborator

@chrisknoll chrisknoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested sample endpoints on MSSQL and PDW.

@chrisknoll chrisknoll merged commit ae36601 into OHDSI:master Nov 9, 2020
@jduke99
Copy link
Contributor

jduke99 commented Mar 24, 2021

Hi @MaximMoinat @chrisknoll we would like to leverage the cohort_sample_element table for the validation tool. Ideally we'd like to add a cohort_sample_element_id (ie a unique id for each row in this table) and an annotation_result_id field to connect between the sample patients and the annotations that are created.

Any concerns about this? And if okay, should we add to this table create or are we able to do flyway type additions to $results_schema tables?

Thanks!

Jon

For reference, the code:

IF OBJECT_ID('@results_schema.cohort_sample_element', 'U') IS NULL
CREATE TABLE @results_schema.cohort_sample_element(
cohort_sample_id int NOT NULL,
rank_value int NOT NULL,
person_id bigint NOT NULL,
age int,
gender_concept_id int
);

@blootsvoets
Copy link
Contributor

Sounds alright to me. The existing sampling could would of course need to be updated. I’m not sure what the policy for results DDL migrations is; flyway style to update a table, including rules on how to create new id’s for existing samples would be more convenient for users that already have the samples table.

@chrisknoll
Copy link
Collaborator

chrisknoll commented Mar 24, 2021

I’m not sure what the policy for results DDL migrations

There are no flyway migrations that manage results schema upates (there are far too many platforms to support). Only WebAPI tables are managed by flyway. So, in cases where there's updates to a results schema, we call it out in the release notes and reference the relevant DDL. Users need to manually manage these changes (including migrating existing data if necessary).

@jduke99 : is there any reason why you can't use the sample_id-person_id in the sample_element table to link to your own table dedicated to tracking annotation results? I'm nervous about having cross dependencies on this table (ie: every time someone wants to attach information about a sample, it needs to add a colum n to cohort_sample_element).

@jduke99
Copy link
Contributor

jduke99 commented Mar 24, 2021

Thanks both for the feedback.

@chrisknoll This could be done re linking to the sample_id-person_id. So the cohort_sample_element_id field is kind of just for convenience and can be dropped if it complicates things. But the annotation_result_id is a bit trickier for the following reason:

We really wanted to have Samples be the home for launching the validation because it just makes sense from a workflow perspective. You create a sample of patients, you launch a validation from the Sample page
eg as an action here

image
and when you look at the sample list you can see which patients have been reviewed, eg as an extra column here

image

it is clean and we believe will make for a very sensible workflow.

it will be relatively inefficient to run through the whole annotations table looking for matches of who has been annotated, compared with just writing a flag to the sample_elements table. Essentially if it is null they have not been annotated. If it has a value then they have. No need to join any tables.

So I just would prefer for it to run smoothly and quickly rather than potentially be slow and grind unnecessarily. Adding an annotation_result_id column would achieve this goal. However having everyone manually update is a big bummer. Too bad we can't do flyway on $results...

Anyway, those are my thoughts. Ultimately we will go with whatever you recommend.

Thanks,

Jon

@chrisknoll
Copy link
Collaborator

it will be relatively inefficient to run through the whole annotations table looking for matches of who has been annotated, compared with just writing a flag to the sample_elements table. Essentially if it is null they have not been annotated. If it has a value then they have. No need to join any tables.

Hey, @jduke99 . I understand your concerns,, but we need to balance clean design with performance. If every new feature attached to samples (and there could be many) results in adding a new column to this table, it quickly gets out of control. The cohort annotation feature is a auxiliary function of cohort samples. There could be others, and for each of these auxiliary functions (such as, mark a person 'complete' for a given sample and annotation set) then we should have a table that tracks the person, the annotation set and the sample, and there completion status. With this form, you can also track other information about their responses: is it complete or partial? What % of completion? Does the result mean they pass or fail? All these points of information can't be put into the cohort_sample_element table.... the cohort_sample_element table is just supposed to track members of a sample.

So, I'd ask that you plan on having a dedicated table that is keyed on the sample_id, person_id + {annootation identifier}. Yes, this does result in a join (and a left join people without annotation status shoudl be returned) but the performance impact will be minimal (how large of a sample are we talking about? 10...100..1000? That will take no time to join). We definitely can't design the database schema around 'removing joins' we're going to have to join at some point.

@jduke99
Copy link
Contributor

jduke99 commented Mar 25, 2021

Sounds good @chrisknoll. Thanks for the insight and we will take your suggested approach of keying on the sample_id, person_id + {annotation identifier} in a separate table. Thanks for the quick response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants