Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve mapped concept set query #299

Open
ablack3 opened this issue Jul 5, 2024 · 1 comment
Open

improve mapped concept set query #299

ablack3 opened this issue Jul 5, 2024 · 1 comment
Labels
CohortDiagnostics enhancement New feature or request

Comments

@ablack3
Copy link

ablack3 commented Jul 5, 2024

The mapped concept set query is very slow on sqlite when the concept set is large sometimes taking minutes to complete.

mappedConceptSet <- function(dataSource,

It can be improved by adding indexes and performing the final join in R instead of the database.

@cebarboza and I were working on it.

Indexes drastically sped up this part of the query

WITH resolved_concepts_mapped
    AS (
    	SELECT concept_sets.concept_id AS resolved_concept_id,
    		c1.concept_id,
    		c1.concept_name,
    		c1.domain_id,
    		c1.vocabulary_id,
    		c1.concept_class_id,
    		c1.standard_concept,
    		c1.concept_code
    	FROM (
    		SELECT DISTINCT concept_id
    		FROM @schema.@resolved_concepts
    		WHERE database_id IN (@databaseIds)
    			AND cohort_id = @cohort_id
    		) concept_sets
    	INNER JOIN @schema.@concept_relationship cr ON concept_sets.concept_id = cr.concept_id_2
    	INNER JOIN @schema.@concept c1 ON cr.concept_id_1 = c1.concept_id
    	WHERE relationship_id = 'Maps to'
    		AND standard_concept IS NULL
    	)

However the final join was still slow. It is a many to many join that multiplies the rows of concept table (result from above) for every combination of database, concept set, and cohort.

 SELECT
        c.database_id,
    	c.cohort_id,
    	c.concept_set_id,
    	mapped.*
    FROM (SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id FROM @schema.@resolved_concepts) c
    INNER JOIN resolved_concepts_mapped mapped ON c.concept_id = mapped.resolved_concept_id
    {@cohort_id != ''} ? { WHERE c.cohort_id = @cohort_id};

However when we downloaded this SELECT DISTINCT concept_id, database_id, cohort_id, concept_set_id FROM @schema.@resolved_concepts) into R and then did the join it was very fast.

Perhaps there is a way to speed the whole query up in SQL. For us adding indexes and splitting the query into two and doing the final join in R did fix the slowness in the shiny app.

I was also thinking that the app might not really need to do this final join since it duplicates a lot of data. In any case we are recording our findings in this issue and can make a PR if the approach we came up with seems reasonable.

The proposal would be to optimize the mappedConceptSet function by adding indexes if they are not there (for sqlite only since I'm not sure if adding indexes really works across all dbms), split the query into two parts as shown above, and then do the final join in R instead of in sql.

@jreps jreps added CohortDiagnostics enhancement New feature or request labels Aug 22, 2024
@ablack3
Copy link
Author

ablack3 commented Sep 19, 2024

Adding these indexs to the sqlite database improves the speed and user experience of the cohort diagnostics shiny app.

DBI::dbExecute(con, "CREATE INDEX idx_resolved_concepts ON resolved_concepts (concept_id, database_id, cohort_id, concept_set_id);")
DBI::dbExecute(con, "CREATE INDEX idx_concept_relationship ON concept_relationship (concept_id_2, relationship_id);")
DBI::dbExecute(con, "CREATE INDEX idx_concept_main ON concept (concept_id, standard_concept);")

We could also consider pre-joining this data once when we build the sqlite database rather than joining in the database while the use is interacting with the app. It would add some columns to the resolved_concepts table so would require more disk space but might be better for the app's user experience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CohortDiagnostics enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants