Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add three new temporal checks: plausibleAfterBirth, plausibleBeforeDeath, plausibleStartBeforeEnd #516

Merged
merged 12 commits into from
Jan 6, 2024

Conversation

MaximMoinat
Copy link
Collaborator

@MaximMoinat MaximMoinat commented Dec 21, 2023

As discussed in #454: these checks separate out the quality checks in plausibleDuringLife and plausibleTemportalAfter.

  • plausibleAfterBirth checks for any events happening before birth (was implicitly checked in plausibleTemportalAfter)
  • plausibleBeforeDeath checks for any events happening after death (improving upon implementation of plausibleDuringLife)
  • plausibleStartBeforeEnd checks for end dates that are before start date in the same table (specific implementation of plausibleTemportalAfter)

TODO:

  • Add threshold settings for v5.2 and v5.3 (after validating that this works for v5.4)
  • To discuss: disable plausibleDuringLife and plausibleTemportalAfter by default. These are redundant if the three new checks are run, but kept in for backwards compatibility.

fixes #513 and fixes #445

Copy link

codecov bot commented Dec 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2883027) 86.42% compared to head (cdb6db7) 86.48%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #516      +/-   ##
===========================================
+ Coverage    86.42%   86.48%   +0.06%     
===========================================
  Files           16       16              
  Lines          884      888       +4     
===========================================
+ Hits           764      768       +4     
  Misses         120      120              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@dimshitc dimshitc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we keep plausibleTemporalAfter and plausibleDuringLife checks?
aren't they covered now by these new 3 checks?

@MaximMoinat
Copy link
Collaborator Author

why do we keep plausibleTemporalAfter and plausibleDuringLife checks? aren't they covered now by these new 3 checks?

Removing the 'replaced' checks might be a breaking change for some people (e.g. when comparing number of violating records over time). My thinking was to implement a 'soft' disable, making it possible to execute the old checks if desired. e.g. by removing the plausibleDuringLife and plausibleTemportalAfter checks from the checkDescriptions object here, and only keep them if specified specifically in the checkNames parameter.

@@ -252,6 +253,10 @@ executeDqChecks <- function(connectionDetails,

if (length(checkNames) > 0) {
checkDescriptionsDf <- checkDescriptionsDf[checkDescriptionsDf$checkName %in% checkNames, ]
} else {
# No checkNames specified. Do not run plausibleDuringLife and plausibleTemportalAfter as these are superseded three new checks
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimshitc This disables the old checks, if user did not specify them specifically. We do need to document this behaviour.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me.

Should we set threshold in all these checks to 0? as these are obvious mistakes and shouldn't be in the data.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking this over. I think we should actually enable all of the checks (new & old) by default, and instruct users how to disable the ones they don't want in the documentation. Otherwise this is essentially a breaking change because users relying on the old behavior will need to take action to adapt to the change. We can also add a warning that raises when the 2 old checks are enabled to alert users that the new checks are preferred.

Does that sound OK to you guys?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes even more sense. I added the following warnings if one of the old checks is run:
"DEPRECATION WARNING - The plausibleDuringLife check has been reimplemented with the plausibleBeforeDeath check."

DEPRECATION WARNING - The plausibleTemporalAfter check has been reimplemented with the plausibleAfterBirth and plausibleStartBeforeEnd checks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding disabling the checks by the user; there is not really a nice solution as executeDqChecks only takes the names of checks to be run, not an argument with checks to exclude. So the user needs to list all checks except these two. Or does one of you see a more elegant solution?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warnings look great; thanks!

And I wouldn't call it elegant, but I did add these instructions to the docs earlier this year on how to exclude a list of checks:

# want to EXCLUDE a pre-specified list of checks? run the following code:
#
# checksToExclude <- c() # Names of check types to exclude from your DQD run
# allChecks <- DataQualityDashboard::listDqChecks()
# checkNames <- allChecks$checkDescriptions %>%
# subset(!(checkName %in% checksToExclude)) %>%
# select(checkName)

I think for now we can point to this in the release notes. In the future I'd love to improve the workflow for enabling/disabling checks and editing thresholds (a la #253 ).

Copy link
Collaborator

@katy-sadowski katy-sadowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this @MaximMoinat ! I left a couple suggestions.

@@ -252,6 +253,10 @@ executeDqChecks <- function(connectionDetails,

if (length(checkNames) > 0) {
checkDescriptionsDf <- checkDescriptionsDf[checkDescriptionsDf$checkName %in% checkNames, ]
} else {
# No checkNames specified. Do not run plausibleDuringLife and plausibleTemportalAfter as these are superseded three new checks
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking this over. I think we should actually enable all of the checks (new & old) by default, and instruct users how to disable the ones they don't want in the documentation. Otherwise this is essentially a breaking change because users relying on the old behavior will need to take action to adapt to the change. We can also add a warning that raises when the 2 old checks are enabled to alert users that the new checks are preferred.

Does that sound OK to you guys?

Comment on lines 45 to 47
p.year_of_birth, '-',
COALESCE(p.month_of_birth, 1), '-',
COALESCE(p.day_of_birth, 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will actually want to use the date format YYYYMMDD here to ensure we're compatible with all SQL dialects (see this thread here where I was discussing this with Martijn).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I actually came across a similar issues in sqlite.

Not a trivial change, but I came up with the following code, which is a bit complex but should be compatible with all Sql dialects:

CAST(CONCAT(
    p.year_of_birth,
    COALESCE(
        RIGHT('00' + CAST(p.month_of_birth AS VARCHAR), 2),
        '01'
    ),
    COALESCE(
        RIGHT('00' + CAST(p.day_of_birth AS VARCHAR), 2),
        '01'
    )
) AS DATE)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah, crafty little solution! only question - do we need 00 or just 0 there? (won't actually impact the result)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted, will change it

AND c.COHORT_DEFINITION_ID = @cohortDefinitionId
}
JOIN @cdmDatabaseSchema.person p ON cdmTable.person_id = p.person_id
WHERE CAST(cdmTable.@cdmFieldName AS DATE) < COALESCE(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think to be safe (here and in other queries) we should also have a condition that the field is non-null.

also what do you think about adding that condition to the denominator as well? i.e. among all records where we have a date, that date must be plausible.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, added same condition for the other two checks as well

inst/sql/sql_server/field_plausible_before_death.sql Outdated Show resolved Hide resolved
Copy link
Collaborator

@katy-sadowski katy-sadowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I promise this is my last review 😅 I did a deep review and left a couple more comments on the SQL comments. I also noticed the following in the threshold files (I only checked 5.4 but assume they're all the same):

  • EPISODE start_date and start_datetime are missing a threshold for plausibleStartBeforeEnd
  • The plausibleStartBeforeEnd threshold for OBSERVATION_PERIOD is 1. Should this actually be 0, since this is a derived table and folks can control it in ETL (and also would prob cause some pretty weird stuff in analytics if the dates were flipped)?
  • plausibleAfterBirth and plausibleBeforeDeath need to be activated for MEASUREMENT.measurement_datetime

inst/sql/sql_server/field_plausible_after_birth.sql Outdated Show resolved Hide resolved
inst/sql/sql_server/field_plausible_start_before_end.sql Outdated Show resolved Hide resolved
inst/sql/sql_server/field_plausible_after_birth.sql Outdated Show resolved Hide resolved
inst/sql/sql_server/field_plausible_before_death.sql Outdated Show resolved Hide resolved
inst/sql/sql_server/field_plausible_start_before_end.sql Outdated Show resolved Hide resolved
@MaximMoinat
Copy link
Collaborator Author

Thanks @katy-sadowski for this thorough review, I agree with all your comments and adjusted the code accordingly.

Copy link
Collaborator

@katy-sadowski katy-sadowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks so much @MaximMoinat! I will go ahead and merge this into develop.

@katy-sadowski katy-sadowski merged commit 1c48fcb into develop Jan 6, 2024
10 checks passed
@MaximMoinat MaximMoinat deleted the issue-445 branch May 10, 2024 11:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new check New DQ check to be added
Projects
None yet
3 participants