-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add three new temporal checks: plausibleAfterBirth
, plausibleBeforeDeath
, plausibleStartBeforeEnd
#516
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #516 +/- ##
===========================================
+ Coverage 86.42% 86.48% +0.06%
===========================================
Files 16 16
Lines 884 888 +4
===========================================
+ Hits 764 768 +4
Misses 120 120 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we keep plausibleTemporalAfter and plausibleDuringLife checks?
aren't they covered now by these new 3 checks?
Removing the 'replaced' checks might be a breaking change for some people (e.g. when comparing number of violating records over time). My thinking was to implement a 'soft' disable, making it possible to execute the old checks if desired. e.g. by removing the plausibleDuringLife and plausibleTemportalAfter checks from the |
R/executeDqChecks.R
Outdated
@@ -252,6 +253,10 @@ executeDqChecks <- function(connectionDetails, | |||
|
|||
if (length(checkNames) > 0) { | |||
checkDescriptionsDf <- checkDescriptionsDf[checkDescriptionsDf$checkName %in% checkNames, ] | |||
} else { | |||
# No checkNames specified. Do not run plausibleDuringLife and plausibleTemportalAfter as these are superseded three new checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimshitc This disables the old checks, if user did not specify them specifically. We do need to document this behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me.
Should we set threshold in all these checks to 0? as these are obvious mistakes and shouldn't be in the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking this over. I think we should actually enable all of the checks (new & old) by default, and instruct users how to disable the ones they don't want in the documentation. Otherwise this is essentially a breaking change because users relying on the old behavior will need to take action to adapt to the change. We can also add a warning that raises when the 2 old checks are enabled to alert users that the new checks are preferred.
Does that sound OK to you guys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes even more sense. I added the following warnings if one of the old checks is run:
"DEPRECATION WARNING - The plausibleDuringLife check has been reimplemented with the plausibleBeforeDeath check."
DEPRECATION WARNING - The plausibleTemporalAfter check has been reimplemented with the plausibleAfterBirth and plausibleStartBeforeEnd checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding disabling the checks by the user; there is not really a nice solution as executeDqChecks
only takes the names of checks to be run, not an argument with checks to exclude. So the user needs to list all checks except these two. Or does one of you see a more elegant solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warnings look great; thanks!
And I wouldn't call it elegant, but I did add these instructions to the docs earlier this year on how to exclude a list of checks:
DataQualityDashboard/vignettes/DataQualityDashboard.rmd
Lines 114 to 120 in 81230bd
# want to EXCLUDE a pre-specified list of checks? run the following code: | |
# | |
# checksToExclude <- c() # Names of check types to exclude from your DQD run | |
# allChecks <- DataQualityDashboard::listDqChecks() | |
# checkNames <- allChecks$checkDescriptions %>% | |
# subset(!(checkName %in% checksToExclude)) %>% | |
# select(checkName) |
I think for now we can point to this in the release notes. In the future I'd love to improve the workflow for enabling/disabling checks and editing thresholds (a la #253 ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for this @MaximMoinat ! I left a couple suggestions.
R/executeDqChecks.R
Outdated
@@ -252,6 +253,10 @@ executeDqChecks <- function(connectionDetails, | |||
|
|||
if (length(checkNames) > 0) { | |||
checkDescriptionsDf <- checkDescriptionsDf[checkDescriptionsDf$checkName %in% checkNames, ] | |||
} else { | |||
# No checkNames specified. Do not run plausibleDuringLife and plausibleTemportalAfter as these are superseded three new checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been thinking this over. I think we should actually enable all of the checks (new & old) by default, and instruct users how to disable the ones they don't want in the documentation. Otherwise this is essentially a breaking change because users relying on the old behavior will need to take action to adapt to the change. We can also add a warning that raises when the 2 old checks are enabled to alert users that the new checks are preferred.
Does that sound OK to you guys?
p.year_of_birth, '-', | ||
COALESCE(p.month_of_birth, 1), '-', | ||
COALESCE(p.day_of_birth, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will actually want to use the date format YYYYMMDD
here to ensure we're compatible with all SQL dialects (see this thread here where I was discussing this with Martijn).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I actually came across a similar issues in sqlite.
Not a trivial change, but I came up with the following code, which is a bit complex but should be compatible with all Sql dialects:
CAST(CONCAT(
p.year_of_birth,
COALESCE(
RIGHT('00' + CAST(p.month_of_birth AS VARCHAR), 2),
'01'
),
COALESCE(
RIGHT('00' + CAST(p.day_of_birth AS VARCHAR), 2),
'01'
)
) AS DATE)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hah, crafty little solution! only question - do we need 00
or just 0
there? (won't actually impact the result)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well spotted, will change it
AND c.COHORT_DEFINITION_ID = @cohortDefinitionId | ||
} | ||
JOIN @cdmDatabaseSchema.person p ON cdmTable.person_id = p.person_id | ||
WHERE CAST(cdmTable.@cdmFieldName AS DATE) < COALESCE( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think to be safe (here and in other queries) we should also have a condition that the field is non-null.
also what do you think about adding that condition to the denominator as well? i.e. among all records where we have a date, that date must be plausible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, added same condition for the other two checks as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I promise this is my last review 😅 I did a deep review and left a couple more comments on the SQL comments. I also noticed the following in the threshold files (I only checked 5.4 but assume they're all the same):
- EPISODE start_date and start_datetime are missing a threshold for plausibleStartBeforeEnd
- The plausibleStartBeforeEnd threshold for OBSERVATION_PERIOD is 1. Should this actually be 0, since this is a derived table and folks can control it in ETL (and also would prob cause some pretty weird stuff in analytics if the dates were flipped)?
- plausibleAfterBirth and plausibleBeforeDeath need to be activated for MEASUREMENT.measurement_datetime
Thanks @katy-sadowski for this thorough review, I agree with all your comments and adjusted the code accordingly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks so much @MaximMoinat! I will go ahead and merge this into develop.
As discussed in #454: these checks separate out the quality checks in plausibleDuringLife and plausibleTemportalAfter.
plausibleAfterBirth
checks for any events happening before birth (was implicitly checked in plausibleTemportalAfter)plausibleBeforeDeath
checks for any events happening after death (improving upon implementation of plausibleDuringLife)plausibleStartBeforeEnd
checks for end dates that are before start date in the same table (specific implementation of plausibleTemportalAfter)TODO:
plausibleDuringLife
andplausibleTemportalAfter
by default. These are redundant if the three new checks are run, but kept in for backwards compatibility.fixes #513 and fixes #445