Is134 get db cohort method data #136

mvankessel-EMC · 2023-04-11T13:00:04Z

Pull Request for issue: #134

This PR contains the following changes:

Moved the downsampling code to the function downSample.
Deprecated the boolean (logical) support for removeDuplicateSubjects, which now defaults to "keep all", and updated test-simulation.R to accommodate this change.
Moved the studyStartDate and studyEndDate NULL updates after the assertions.
Moved the presampling code to the function preSample, which is run once for the target cohort and once for the comparator cohort.
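
As an illustration of the removeDuplicateSubjects deprecation listed above, here is a minimal sketch of how a logical value could be accepted with a warning and converted to a string option. The helper name normalizeRemoveDuplicateSubjects is hypothetical, the TRUE/FALSE mapping is an assumption, and the option names other than "keep all" are recalled from the CohortMethod documentation rather than taken from this PR, so they should be checked against the package.

```r
# Hypothetical helper, not part of this PR.
# Assumption: TRUE maps to "remove all" and FALSE maps to "keep all".
normalizeRemoveDuplicateSubjects <- function(removeDuplicateSubjects = "keep all") {
  if (is.logical(removeDuplicateSubjects)) {
    warning(
      "Logical values for removeDuplicateSubjects are deprecated; ",
      "use \"keep all\", \"keep first\", or \"remove all\" instead.",
      call. = FALSE
    )
    removeDuplicateSubjects <- if (isTRUE(removeDuplicateSubjects)) {
      "remove all"
    } else {
      "keep all"
    }
  }
  # Reject anything that is not one of the supported string options
  match.arg(removeDuplicateSubjects, c("keep all", "keep first", "remove all"))
}

# New-style call: a string passes through unchanged
normalizeRemoveDuplicateSubjects("keep first")

# Old-style call: still works, but emits a deprecation warning
suppressWarnings(normalizeRemoveDuplicateSubjects(TRUE))
```

Keeping the conversion in one place means the rest of the function only ever sees the string form.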

schuemie · 2023-04-11T13:45:51Z

R/DataLoadingSaving.R

+                 tempEmulationSchema,
+                 targetId,
+                 maxCohortSize,
+                 sampled)


Why does downSample() need sampled, which is guaranteed to be FALSE?

Finally: if a function has more than 2 arguments, I really prefer to use named arguments, even if it looks dumb, just to avoid accidentally assigning a value to the wrong parameter. So connection = connection, etc.

Yes, I'll remove the sampled argument, and just specify it in downSample as FALSE at the start.

This is the bit of code from the develop branch that downSample replaces (DataLoadingSaving.R, lines 196 to 238):

renderedSql <- SqlRender::loadRenderTranslateSql("CountCohorts.sql",
  packageName = "CohortMethod",
  dbms = connectionDetails$dbms,
  tempEmulationSchema = tempEmulationSchema,
  target_id = targetId
)
counts <- DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE)
ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount))
preSampleCounts <- dplyr::tibble(dummy = 0)
idx <- which(counts$treatment == 1)
if (length(idx) == 0) {
  preSampleCounts$targetPersons <- 0
  preSampleCounts$targetExposures <- 0
} else {
  preSampleCounts$targetPersons <- counts$personCount[idx]
  preSampleCounts$targetExposures <- counts$rowCount[idx]
}
idx <- which(counts$treatment == 0)
if (length(idx) == 0) {
  preSampleCounts$comparatorPersons <- 0
  preSampleCounts$comparatorExposures <- 0
} else {
  preSampleCounts$comparatorPersons <- counts$personCount[idx]
  preSampleCounts$comparatorExposures <- counts$rowCount[idx]
}
preSampleCounts$dummy <- NULL
if (preSampleCounts$targetExposures > maxCohortSize) {
  message("Downsampling target cohort from ", preSampleCounts$targetExposures, " to ", maxCohortSize)
  sampled <- TRUE
}
if (preSampleCounts$comparatorExposures > maxCohortSize) {
  message("Downsampling comparator cohort from ", preSampleCounts$comparatorExposures, " to ", maxCohortSize)
  sampled <- TRUE
}
if (sampled) {
  renderedSql <- SqlRender::loadRenderTranslateSql("SampleCohorts.sql",
    packageName = "CohortMethod",
    dbms = connectionDetails$dbms,
    tempEmulationSchema = tempEmulationSchema,
    max_cohort_size = maxCohortSize
  )
  DatabaseConnector::executeSql(connection, renderedSql)
}
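
The named-argument advice in this thread can be illustrated with a small sketch. The function describeSample and its parameters are hypothetical and exist only to show how a positional call can silently swap two values of the same type, while a named call cannot.

```r
# Hypothetical function, used only to illustrate the point.
describeSample <- function(targetId, maxCohortSize, minCohortSize) {
  sprintf("target %d: between %d and %d rows", targetId, minCohortSize, maxCohortSize)
}

# Positional call: the caller assumed (targetId, min, max), so the last two
# values end up in the wrong parameters without any error.
positional <- describeSample(1, 100, 100000)

# Named call: argument order no longer matters and the intent is visible.
named <- describeSample(targetId = 1, minCohortSize = 100, maxCohortSize = 100000)

print(positional)
print(named)
```

Because both swapped values are numeric, R raises no error in the positional call; only the named form protects against the mistake.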

schuemie · 2023-04-11T13:49:53Z

R/DataLoadingSaving.R

-    }
+  DatabaseConnector::executeSql(connection, renderedSql)
+
+  sampled <- FALSE


I recommend moving sampled <- FALSE to an else clause of if (maxCohortSize != 0) {

Done: 3056327.

codecov · 2023-04-11T14:03:00Z

Codecov Report

Merging #136 (e53d78e) into develop (456537f) will decrease coverage by 1.62%.
The diff coverage is 58.04%.

❗ Current head e53d78e differs from pull request most recent head bbbb1d4. Consider uploading reports for the commit bbbb1d4 to get more accurate results

@@             Coverage Diff             @@
##           develop     #136      +/-   ##
===========================================
- Coverage    88.63%   87.02%   -1.62%     
===========================================
  Files           22       23       +1     
  Lines         5172     5316     +144     
===========================================
+ Hits          4584     4626      +42     
- Misses         588      690     +102

Impacted Files	Coverage Δ
R/HelperFunctions.R	`66.66% <0.00%> (-21.57%)`	⬇️
R/Viewer.R	`0.00% <0.00%> (ø)`
R/Export.R	`89.44% <33.33%> (-0.42%)`	⬇️
R/PsFunctions.R	`81.63% <68.00%> (ø)`
R/RunAnalyses.R	`92.91% <77.77%> (ø)`
R/StudyPopulation.R	`94.41% <90.90%> (ø)`
R/DataLoadingSaving.R	`93.10% <92.59%> (+0.86%)`	⬆️
R/Balance.R	`76.67% <100.00%> (ø)`
R/OutcomeModels.R	`93.11% <100.00%> (ø)`
R/Simulation.R	`99.52% <100.00%> (ø)`

schuemie · 2023-04-11T14:04:17Z

R/DataLoadingSaving.R

  }
-  return(covariateSettings)
+
+preSample <- function(idx, colType, counts, preSampleCounts) {


In general, function names should be verb + noun. So here, maybe call it countPreSample()?

Done: 3056327.

schuemie · 2023-04-11T14:06:27Z

R/DataLoadingSaving.R

+      DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE)
+    ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount))
+    preSampleCounts <- dplyr::tibble(dummy = 0)
+    idx <- which(counts$treatment == 1)


Why not move the computation of idx into the preSample() function? You can pass the treatment (0 or 1) as an argument, which would allow you to remove the colType argument.

Done: 3056327.

However, the implementation is a bit more verbose; let me know what you prefer.

schuemie · 2023-04-11T14:09:08Z

R/DataLoadingSaving.R

+    counts <-
+      DatabaseConnector::querySql(connection, renderedSql, snakeCaseToCamelCase = TRUE)
+    ParallelLogger::logDebug("Pre-sample total row count is ", sum(counts$rowCount))
+    preSampleCounts <- dplyr::tibble(dummy = 0)


Instead of starting with a tibble with a dummy variable that preSample() operates on, why not have preSample() create a tibble with one row of variables, and then simply bind_cols() those variables for the target and comparator?

I misread this comment. I'll have a look.

Done: bbbb1d4.
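
The two suggestions in the threads above (computing idx inside the helper from a treatment argument, and returning a one-row tibble that is combined with bind_cols()) could look roughly like the sketch below. This is not the code from commit bbbb1d4: the signature of countPreSample, the prefix argument, and the example counts are assumptions made for illustration. The column names follow the develop-branch snippet quoted earlier.

```r
# Hypothetical signature; the merged implementation may differ.
countPreSample <- function(counts, treatment, prefix) {
  # The row lookup now lives inside the helper
  idx <- which(counts$treatment == treatment)
  if (length(idx) == 0) {
    persons <- 0
    exposures <- 0
  } else {
    persons <- counts$personCount[idx]
    exposures <- counts$rowCount[idx]
  }
  # Return a one-row tibble instead of modifying a tibble passed in
  result <- dplyr::tibble(persons, exposures)
  names(result) <- paste0(prefix, c("Persons", "Exposures"))
  result
}

# Made-up counts, shaped like the result of the CountCohorts.sql query
counts <- dplyr::tibble(
  treatment = c(1, 0),
  personCount = c(90, 40),
  rowCount = c(120, 50)
)

preSampleCounts <- dplyr::bind_cols(
  countPreSample(counts, treatment = 1, prefix = "target"),
  countPreSample(counts, treatment = 0, prefix = "comparator")
)
print(preSampleCounts)
```

With this shape the dummy column is no longer needed, since each call builds its own columns.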

schuemie · 2023-04-11T14:13:45Z

Hi @mvankessel-EMC . I added some comments throughout.

I also see you're using a very specific code style. It is not clear to me why you sometimes break up a line into multiple lines. As a rule of thumb, I try to break up lines if they exceed 80 characters (although I'll go to a maximum of 100 characters if it reads better).

Also, this is my personal preference, but I prefer

value <- computeValue(argument)

instead of

value <- 
  computeValue(argument)

schuemie · 2023-04-12T13:02:27Z

@mvankessel-EMC : let me know when I can review again

mvankessel-EMC · 2023-04-13T08:25:14Z

Hi @schuemie, I pushed new updates. Latest commit: bbbb1d4.

Please let me know what you think.

schuemie · 2023-04-13T15:16:14Z

Looks great! As discussed, further refactoring would probably require turning the meta-data into some nice object that can be passed around by reference. But I'll merge what you've done so far, and leave it to you if you want to work on that.
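
One way to read "passed around by reference" in R is to hold the meta-data in an environment, since environments are not copied when passed to a function. The sketch below is only an illustration of that idea under this assumption; the names newMetaData and markSampled and the sampled field are hypothetical and not part of CohortMethod.

```r
# Hypothetical constructor: an environment used as a mutable meta-data object
newMetaData <- function() {
  metaData <- new.env(parent = emptyenv())
  metaData$sampled <- FALSE
  metaData
}

# Hypothetical helper: modifies the caller's object in place,
# so nothing has to be returned and reassigned
markSampled <- function(metaData) {
  metaData$sampled <- TRUE
  invisible(NULL)
}

metaData <- newMetaData()
markSampled(metaData)
print(metaData$sampled)  # TRUE
```

With a plain list, the same helper would change only its local copy, which is why values such as sampled currently have to be returned or threaded through as arguments.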

mvankessel-EMC added 13 commits March 23, 2023 13:51

Offloaded parameter checks to seperate function

df17b36

Reverted OOS change

a719eef

Rstudio standard formatting

6fde503

env ref fix

1e7eec1

Added test

ee8dd67

Simple additional tests of getDbCohortMethodData break R-CMD-check

bd9550f

Restructured and Refactored

31ad5c5

Reverted changes

731177b

Only updated downSample

22e6f98

Deprecated removeDuplicateSubjects boolean support

804c814

Updated man page

18dffba

Updated tests, to comply with deprecated removeDuplicateSubjects boolean support

632d08d

Added preSample function

da3d147

mvankessel-EMC added the enhancement New functionality that could be added label Apr 11, 2023

mvankessel-EMC requested a review from schuemie April 11, 2023 13:00