Refactor Strategus module approach #145

Merged: 40 commits into v1.0-main from remove-deps-add-module-interface on Jul 31, 2024

Conversation

@anthonysena (Collaborator) commented on Jul 2, 2024

This draft pull request aims to address several open issues.

The current design is based on discussion in the Strategus HADES working group (WG); I'll describe the changes in detail below.

Overview

The major aim of Strategus v1.0 is to in-source the modules that currently reside in their own code repositories (e.g., the CohortGenerator module). To do this, we aim to provide a set of common interfaces that all modules will implement to perform their specific analytical use cases. In addition, we'd like to remove the keyring, renv, and targets dependencies, as they created too much complexity and too many headaches. We will still use renv for dependency management; however, we will remove the use of renv::run() from Strategus, and may retain a soft dependency on renv if we want to ensure that a project has properly initialized it.

This branch & PR assume that the structure of the analysis specification remains unchanged, mainly for my own sanity while working out these changes. More discussion on the structure of the analysis specification is happening in #144.

R/Module-StrategusModule.R

This file holds the R6 classes that aim to provide a common interface (a base class, in the case of R6) for all Strategus modules:

  • execute(connectionDetails, analysisSpecifications, executionSettings): execute the module
  • createResultsDataModel(resultsConnectionDetails, resultsDatabaseSchema, tablePrefix = ""): create the results data model for the module
  • uploadResults(resultsConnectionDetails, analysisSpecifications, resultsUploadSettings): upload the module results
  • createModuleSpecifications(moduleSpecifications): create the module's specifications (inputs)
  • createSharedResourcesSpecifications(className, sharedResourcesSpecifications): create the shared resources specification
  • validateModuleSpecifications(moduleSpecifications): validate the module specifications
  • validateSharedResourcesSpecifications(className, sharedResourcesSpecifications): validate the shared resources specifications

I'll note that in the function signatures above, connectionDetails and resultsConnectionDetails are purposely separated from the execution settings so that we can serialize the execution settings without worrying that they may include sensitive information. This replaces the behavior from Strategus v0.x, in which we had a connection details reference that we used to grab the credentials from keyring.
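To make the interface concrete, here is a heavily simplified sketch of the base class (method bodies are elided or approximated, and the shared-resources methods are omitted for brevity; the actual implementation lives in R/Module-StrategusModule.R):

# Simplified sketch of the StrategusModule base class; not the actual source.
StrategusModule <- R6::R6Class(
  classname = "StrategusModule",
  public = list(
    moduleName = NULL,
    initialize = function() {
      # The module's name is inferred from the R6 class name
      self$moduleName <- class(self)[[1]]
    },
    execute = function(connectionDetails, analysisSpecifications, executionSettings) {
      # connectionDetails is kept separate so executionSettings stays serializable
      checkmate::assertClass(connectionDetails, "ConnectionDetails")
      checkmate::assertClass(analysisSpecifications, "AnalysisSpecifications")
      # ... builds the private jobContext for the child class ...
    },
    createResultsDataModel = function(resultsConnectionDetails, resultsDatabaseSchema, tablePrefix = "") {
      stop("Not implemented; child classes must override")
    },
    uploadResults = function(resultsConnectionDetails, analysisSpecifications, resultsUploadSettings) {
      stop("Not implemented; child classes must override")
    },
    createModuleSpecifications = function(moduleSpecifications) {
      # Tags the settings with the module's specification class name
      class(moduleSpecifications) <- c(class(moduleSpecifications), "ModuleSpecifications")
      invisible(moduleSpecifications)
    },
    validateModuleSpecifications = function(moduleSpecifications) {
      checkmate::assertClass(moduleSpecifications, "ModuleSpecifications")
    }
  )
)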

The StrategusModule base class does have some base implementation that is worth pointing out here:

  • A module's name & specification class are inferred from the R6 class name; this is set in the StrategusModule constructor.
  • In the case where a function takes both analysisSpecifications and executionSettings (CDM or results), the base class builds a JobContext that is held as a private member. This is mainly for backwards compatibility, since the v0.x module code uses the JobContext to access settings. I've aimed to make this an internal behavior of StrategusModule so that users can focus on constructing analysis & execution settings; this should also help developers who need to test their modules.
  • The create*Specifications and validate*Specifications methods are basic and simply confirm that the appropriate class name is applied to the settings. Modules can do further validation as necessary, but these base methods provide an initial implementation that should be inherited.
  • The StrategusModule functions provide type checking using checkmate to ensure that parameters are valid before the child class implementation runs (see the child-module sketch below).
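To make the inheritance contract concrete, a child module ends up looking roughly like this (a hypothetical module, heavily simplified):

# Hypothetical child module illustrating the inheritance pattern.
MyAnalysisModule <- R6::R6Class(
  classname = "MyAnalysisModule",
  inherit = StrategusModule,
  public = list(
    execute = function(connectionDetails, analysisSpecifications, executionSettings) {
      # The super call performs the checkmate type checks and builds the
      # private jobContext for backwards compatibility with v0.x module code
      super$execute(connectionDetails, analysisSpecifications, executionSettings)
      # ... module-specific analytic code goes here ...
    },
    validateModuleSpecifications = function(moduleSpecifications) {
      # The base method confirms the expected class is applied to the settings;
      # modules can layer additional checks on top
      super$validateModuleSpecifications(moduleSpecifications)
    }
  )
)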

I've included the following modules:

  • R/Module-Characterization
  • R/Module-CohortDiagnostics
  • R/Module-CohortGenerator: this is working off of the develop branch of CohortGenerator.
  • R/Module-CohortIncidence
  • R/Module-CohortMethod
  • R/Module-EvidenceSynthesis
  • R/Module-PatientLevelPrediction
  • R/Module-SelfControlledCaseSeries

Test Harness

There is a sample script in extras/R6ClassFun.R that aims to provide a simple test harness. It will: create an analysis specification (only for CG and CM for now), execute the analysis via Strategus, create an empty SQLite database, create the results schema tables, and upload results. There is also some code to launch a Shiny results viewer to inspect the results. This has been my primary way of testing these changes and is a good reference for the main changes in this branch. To point out a few specific items:

Creating module settings: This is now done by creating an instance of the module class. For example:

cgModuleSettingsCreator <- CohortGeneratorModule$new()
# Create the settings & validate them
cohortSharedResourcesSpecifications <- cgModuleSettingsCreator$createCohortSharedResourceSpecifications(cohortDefinitionSet)
cgModuleSettingsCreator$validateCohortSharedResourceSpecifications(cohortSharedResourcesSpecifications)

ncoCohortSharedResourceSpecifications <- cgModuleSettingsCreator$createNegativeControlOutcomeCohortSharedResourceSpecifications(ncoCohortSet, "first", TRUE)
cgModuleSettingsCreator$validateNegativeControlOutcomeCohortSharedResourceSpecifications(ncoCohortSharedResourceSpecifications)

cgModuleSettings <- cgModuleSettingsCreator$createModuleSpecifications()
cgModuleSettingsCreator$validateModuleSpecifications(cgModuleSettings)

This removes the need to source("/SettingsFunction.R") from the module's repo, as was done in v0.x. Additionally, some basic validation may be performed by the module (as was desired in #9, though what is in this branch is not close to a complete solution for that issue). The remainder of the harness flow is sketched below.
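As a rough sketch, the rest of the flow looks something like the following (abbreviated; this assumes the analysis-specification helpers carry over from v0.x with similar signatures, and most settings arguments are elided):

# Abbreviated sketch; see extras/R6ClassFun.R for the actual harness script.
analysisSpecifications <- createEmptyAnalysisSpecifications() |>
  addSharedResources(cohortSharedResourcesSpecifications) |>
  addSharedResources(ncoCohortSharedResourceSpecifications) |>
  addModuleSpecifications(cgModuleSettings)

executionSettings <- createCdmExecutionSettings(
  workDatabaseSchema = "main",
  cdmDatabaseSchema = "main",
  cohortTableNames = CohortGenerator::getCohortTableNames(),
  workFolder = file.path(outputFolder, "work"),
  resultsFolder = file.path(outputFolder, "results")
)

# connectionDetails is passed separately so executionSettings can be
# serialized without credentials in the payload
execute(
  analysisSpecifications = analysisSpecifications,
  executionSettings = executionSettings,
  connectionDetails = connectionDetails
)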

Development Notes

  • I had some trouble getting module-specific log files created. I was finding that when I attached multiple loggers via ParallelLogger, only the messages logged through ParallelLogger's log*() functions wound up in the logs. For now, the execute method of Strategus creates a single log file for the overall analysis run, but I'd like to have a log file per module's results folder as we've done in v0.x. UPDATE: This is resolved in the most recent version of the branch.
  • Strategus v0.x had the notion of a moduleIndex in the code, since an analysis specification could contain multiple references to the same module. This feels overly complex, since our packages are capable of handling multiple analyses, so I'd propose we restrict v1.x analysis specifications to contain only one reference per module.
  • Documenting R6 classes is a bit annoying - we can probably use roxygen templates for this, but it doesn't seem possible to "inherit" documentation in a child class that implements a parent's function. Additionally, R6 classes are useful but not sufficient to enforce a real interface in the code. UPDATE: Now using roxygen templates, which makes this much cleaner.
  • Base packages (such as CohortIncidence and EvidenceSynthesis) will require a modification to ensure they include the resultsDataModelSpecification.csv in the package. In Strategus v0.x, a module could hold the resultsDataModelSpecification.csv in the module's R project or in the package. By in-sourcing the modules, each package is required to put the resultsDataModelSpecification.csv in a well-defined location so it may be included in the results output for the module. UPDATE: These modules will store their resultsDataModelSpecification.csv in Strategus for now. CohortIncidence will look to move this functionality into the package for the v5.x release line.
  • Result storage & upload: I'd propose that each module's results consist solely of .csv files (with file names that match the resultsDataModelSpecification.csv) plus the resultsDataModelSpecification.csv itself. There is no need for a .zip file that contains the results - this can be created after the execution is complete. Noting this since CohortMethod's uploadResults function takes a .zip file for input, which was different from the pattern I had expected. Furthermore, I think each module package should adopt RMM (ResultModelManager) for results table creation and upload where possible (a rough sketch follows this list). This is, in part, to make sure that we can migrate the results data model when necessary. Maybe more importantly, it lets us document the results data model per #143 (Document results data model for modules). UPDATE: This is also a note for CohortDiagnostics and SelfControlledCaseSeries.
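For illustration, the RMM pattern for a module's results might look roughly like this (a sketch only - the names follow the RMM API, but check the ResultModelManager documentation for exact signatures and the expected specification format):

# Rough sketch of results table creation and upload via ResultModelManager.
specFile <- file.path(resultsFolder, "resultsDataModelSpecification.csv")

# Generate DDL from the specification and create the results tables
sql <- ResultModelManager::generateSqlSchema(csvFilepath = specFile)
sql <- SqlRender::render(sql, database_schema = resultsDatabaseSchema)
sql <- SqlRender::translate(sql, targetDialect = resultsConnectionDetails$dbms)
connection <- DatabaseConnector::connect(resultsConnectionDetails)
DatabaseConnector::executeSql(connection, sql)
DatabaseConnector::disconnect(connection)

# Upload the .csv files whose names match the specification
# (the exact specification-loading step may differ; see the RMM docs)
ResultModelManager::uploadResults(
  connectionDetails = resultsConnectionDetails,
  schema = resultsDatabaseSchema,
  resultsFolder = resultsFolder,
  specifications = readr::read_csv(specFile, show_col_types = FALSE)
)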

I welcome any feedback/suggestions - tagging some of the HADES Strategus WG members including @schuemie @azimov @chrisknoll @jreps.

@schuemie (Member) commented on Jul 5, 2024

Excellent work @anthonysena ! This seems to cover pretty much everything we discussed. I'm almost sorry to see all that complicated code go ;-) Some notes:

  1. I'll dig into why ParallelLogger isn't behaving. It should definitely be possible to have multiple loggers at the same time.
  2. EvidenceSynthesis is somewhat of an outlier, since the base package is on CRAN and only holds the statistical functions. All the data wrangling is done in the module. I don't think we'd want all of that to go into the base package, also because it can't depend on RMM, which is not on CRAN. Currently the base package knows nothing about results models. I'm not sure if we should put the data wrangling in some new package, or leave it here in Strategus. I think for the foreseeable future, we can just move the resultsDataModelSpecification.csv into Strategus?

@anthonysena (Collaborator, Author) commented:

Thanks for the review of this PR @schuemie! I'm working out changes for the logging (I had missed creating the results folder).

Let me address your point about EvidenceSynthesis: I think for ES, we can move the resultsDataModelSpecification.csv into Strategus since the logic for constructing the results lies in the module. For the other packages (e.g., CohortIncidence), I'd prefer to see the resultsDataModelSpecification.csv reside in the package, since the logic for creating the results lives there.

@chrisknoll (Contributor) commented:

Posting this here per request of @anthonysena 👍

JobContext vs (connectionDetails, analysisSpecifications, executionSettings):
Initially we had an object (a JobContext) that could encapsulate the design and the execution settings together, to give us a single message format to pass into the module functions. The new form of the API accepts the 3 params separately, and depends on a parent-class function called .createJobContext (which is invoked in the super of execute, upload, validate, etc.). This adds another dimension to understanding what the classes at the parent level are doing; normally that should help you along with your implementation, but forgetting to call super$ in the specific functions becomes more of a burden than a help. I'll talk about the idea around the R6 class hierarchy later.

I think we need to rethink what constitutes the 2 main elements of the Strategus API: the analysisSpec and the executionSettings. I'd like it if we constructed an R6 class, StrategusAnalysis, which contains the elements for cohort defs and each of our analysis specifications (CI, CD, PLP, etc.). We shouldn't need to construct a job context; rather, if the PLP module wants the PLP input, it can just go to the Strategus analysis object and fetch the elements from it that contain its analysis spec. executionSettings is something separate, which encapsulates information about the execution (what is the root folder, is there a table prefix for results, what are the schemas, what is the connection, etc.). I think we should just reduce the API calls to needing one or both of the above, and not have connectionDetails as a separate param (we can get into details of why that was necessary, unless it was a matter of wanting to serialize executionSettings as JSON without connectionDetails in that payload - but I don't think you need to serialize executionSettings; it's just encapsulating data).

Certain functions would only need analysisSpecifications as an input; I'm thinking validate() would fall into this case. In CI's case, I would fetch the IR design from the analysisSpecifications and do validation work on it (maybe I want to ensure that the target and outcome cohorts in the design have been defined in the cohortDefinitionSet, for example). But since validation isn't about execution, you wouldn't need executionSettings in this context. Unfortunately, it looks like validateModuleSpecifications takes a third construct, moduleSpecifications, which is none of the above and introduces a new type of information that I believe could be found inside the analysis specifications. Although I also remember something about providing 'module parameters' like 'include inclusion stats' in CohortGenerator, I think we should get away from these special locations for this type of information: if the analysis calls for inclusion rule statistics for cohort generator, it should be part of the analysis specs.
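To sketch the shape I'm proposing (all names hypothetical):

# Hypothetical sketch of a StrategusAnalysis R6 class; not an existing API.
StrategusAnalysis <- R6::R6Class(
  classname = "StrategusAnalysis",
  public = list(
    cohortDefinitionSet = NULL,
    moduleSpecifications = list(),
    addModuleSpecifications = function(specifications) {
      self$moduleSpecifications[[specifications$module]] <- specifications
      invisible(self)
    },
    # A module (e.g. PLP) fetches just its own piece of the design:
    getModuleSpecifications = function(moduleName) {
      self$moduleSpecifications[[moduleName]]
    }
  )
)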

Logging:
Logging bit me a bit due to the way the inheritance was being implemented. It seemed the super$ of execute and upload (maybe validate?) would do some 'class setup' at the start of the function call to create loggers, and the problem is that if execute calls over to validate and each of those methods does a super$ call to initialize a log, they will both try to open an output file and result in an error. So, I think we should take logging out of the classes and just use the standard R API to log (like message, warning, etc.); I believe ParallelLogger handles redirecting those calls to the target file somehow. I'm not exactly sure how ParallelLogger does this, but I can say from the many other logging frameworks I have worked with: you set up your logging configuration at the application level (the location of the log file, how it should roll over - by time or size - and even what logging levels should be written to the log), and you write to the logs using a standard logging API. I'm not sure we have anything this sophisticated in R, but I think we're attempting to reinvent the logging wheel in Strategus, and I think we are going against some standard/best practices in logging.
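The application-level pattern would look something like this with ParallelLogger (a sketch: configure once for the run, log from anywhere):

# Configure logging once at the application level...
logFileName <- file.path(resultsFolder, "strategus-log.txt")
ParallelLogger::addDefaultFileLogger(logFileName, name = "STRATEGUS_LOGGER")

# ...then any code in the run logs through the standard API:
ParallelLogger::logInfo("Starting module execution")

# Tear down when the run is finished:
ParallelLogger::unregisterLogger("STRATEGUS_LOGGER")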

R6 Class inheritance:
I think we're depending too heavily on a class structure where child classes have dependencies on the behavior of the parent, making it more complicated to implement a module because you have to understand the details of the parent in order to fully implement the child. This can work, but only after very careful planning. In all my years, I have found an effective class hierarchy emerges over time, or when there was a very specific plan about how the parent's class behavior was to be extended by the child. I think the classic effective inheritance application is the notion of an abstract base class, where you have the general process defined in the parent and the children 'fill in the blanks' by implementing the parent's abstract methods. However, that's not what's happening in the Strategus module design. I recommend we strip most (if not all, except the core module functions) of the inheritance and instead just have a base class that provides the 'not implemented' functions for the 3 or 4 methods (execute, validate, upload, createResultsModel), and have the children responsible for 'filling in the blanks' of how to implement those operations. If we see that some of these functions depend on similar operations (such as: get the moduleSettings for this module from the analysisSettings), I'd rather have a utility function that allows the reuse of the operation (and easier unit testing) than a parent function that is inherited by the child. To this point: it's not specifically a module function to extract module settings from an analysisSpec. I definitely want to avoid 'special parent state' that needs to be initialized in order for the child to function properly. This just leads to errors.
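The utility-function alternative could be as simple as this (hypothetical):

# Hypothetical package-level utility; usable and unit-testable without any class state.
getModuleSpecifications <- function(analysisSpecifications, moduleName) {
  found <- Filter(
    function(x) x$module == moduleName,
    analysisSpecifications$moduleSpecifications
  )
  if (length(found) == 0) {
    stop("No specifications found for module: ", moduleName)
  }
  found[[1]]
}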

Strategus Design spec R6 Class:
I feel like this has been put off for a long time, and I'd like to not push it off any further: let's build the StrategusDesignSpec class that encapsulates the elements of a Strategus analysis. This will simplify the navigation to the analysis elements. My worry is that we'll get it all 'working' under the old model, and it's going to be too painful to make the switch (which will fundamentally impact all the modules, because the analysis spec is changing). So let's bite that bullet now; I'm 100% confident that 3 hours of focus time can specify the design elements (we already know what goes into it; it's just arranging it in a simpler form that is easier to navigate). Once we have this, the analysisSpecs params of all of our module functions become an easy-to-navigate, known quantity that each module can consume to do its work. Let's carve out the time, get it done, and then adjust the modules to work with the new structure.

I thought analysisSpecification was a class; I tried to search for it in RStudio but only found an AnalysisSpecifications.R under man-roxygen, which led me to createEmptyAnalysisSpecifications. Currently the analysisSpec is a list of sharedResources and moduleSpecifications, with a class() attribute of AnalysisSpecifications. Let's make this a full-blown R6 class.

@anthonysena (Collaborator, Author) commented:

Thanks @chrisknoll for the review! Let me respond to your feedback, in a slightly different order:

> Logging:
> Logging bit me a bit due to the way the inheritance was being implemented. [...] I think we should take logging out of the classes and just use the standard R API to log (like message, warning, etc.). [...] I think we're attempting to reinvent the logging wheel in Strategus, and I think we are going against some standard/best practices in logging.

Agreed - I've removed the logging piece of the StrategusModule per 87b33b4.

> JobContext vs (connectionDetails, analysisSpecifications, executionSettings):
> Initially we had an object (a JobContext) that could encapsulate the design and the execution settings together, to give us a single message format to pass into the module functions. [...]
>
> I think we need to rethink what constitutes the 2 main elements of the Strategus API: the analysisSpec and the executionSettings. [...] I think we should just reduce the API calls to needing one or both of the above, and not have connectionDetails as a separate param. [...]
>
> Certain functions would only need analysisSpecifications as an input. [...] Unfortunately, it looks like validateModuleSpecifications takes a third construct, moduleSpecifications, which is none of the above. [...]
>
> Strategus Design spec R6 Class:
> I feel like this has been put off for a long time, and I'd like to not push it off any further: let's build the StrategusDesignSpec class that encapsulates the elements of a Strategus analysis. [...]
>
> I thought analysisSpecification was a class [...] Let's make this a full-blown R6 class.

There are a few ideas in this feedback so let me take them one at a time. First, I retained the notion of a jobContext specifically to keep the scope of this particular feature branch small. This was a practical consideration since it made the job of moving the code from 7 different repos into Strategus much easier - I can visually inspect the code in this branch and the corresponding source code repository to verify it is the same. When this work is merged into the v1.0-main branch, I'll squash this work down and we'll have a commit that shows that we've moved the code. We can and will continue to refactor from this point forward until we release v1.0. Please view the JobContext class as something internal to the behavior of the modules for now and simply to help us move forward with the refactoring. It may be completely refactored out later as we address other issues, such as changing the analysis specification.

Changing the analysis specification is an issue for #144, and we can revisit these ideas there, including how modules work with the analysis specification. I think making it an R6 class makes sense. To your point about moduleSpecifications vs. analysisSpecifications: an analysis specification is made up of one or more module specifications, so my thinking is that a module is only responsible for confirming that its own settings are valid, not the entire analysis specification. Additionally, we'd like to be able to validate the module specifications when assembling the analysis specification. We probably should have a validate(analysisSpecification) function in Strategus that works with the analysis specification and uses each module to validate its individual parts (sketched below). We can open a related issue on this topic if we want to discuss validation and its impact on the API.
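Such a function could be a thin loop that delegates to the modules (a hypothetical sketch; the mechanism for resolving a module class from its name is assumed):

# Hypothetical top-level validation that delegates to each module.
validateAnalysisSpecifications <- function(analysisSpecifications) {
  for (moduleSpecifications in analysisSpecifications$moduleSpecifications) {
    # Resolve the R6 generator from the module name, e.g. "CohortGeneratorModule"
    module <- get(moduleSpecifications$module)$new()
    module$validateModuleSpecifications(moduleSpecifications)
  }
  invisible(TRUE)
}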

I'd also prefer to keep the connection details parameter so that we can consider both the analysis specifications and the execution settings when performing an incremental run as requested in #95. I'd like to avoid putting anything that requires sensitive information (like the connection details) in the execution settings so we can serialize those settings and use them to determine when a module needs to run based on the inputs changing.
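To illustrate why that separation helps (a sketch; assumes the settings serialize cleanly to JSON):

# Because executionSettings contains no credentials, it can be serialized and
# hashed to detect when a module's inputs have changed between runs.
settingsJson <- ParallelLogger::convertSettingsToJson(executionSettings)
checksum <- digest::digest(settingsJson, algo = "sha256")
if (identical(checksum, previousChecksum)) {
  message("Inputs unchanged; skipping module execution")
}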

> R6 Class inheritance:
> I think we're depending too heavily on a class structure where child classes have dependencies on the behavior of the parent. [...] I recommend we strip most (if not all, except the core module functions) of the inheritance and instead just have a base class that provides the 'not implemented' functions for the 3 or 4 methods (execute, validate, upload, createResultsModel). [...] I definitely want to avoid 'special parent state' that needs to be initialized in order for the child to function properly.

I chose to use an inheritance structure here on purpose, since I want to mandate certain behavior of a StrategusModule. Most of what is in the base class at this point is: defining how a module is named, how its settings class is defined, validating inputs for the functions (which is where child classes can use super$ to access them), and building the jobContext. Putting aside the jobContext, since I expect that will change, I think the inheritance structure works well because I'd like to make sure that each child module has some standard behavior defined at the parent level. I think the child classes are still "filling in the blanks", to your point, with the exception of the jobContext, which we can hopefully refactor out as we move forward. If we need to refactor out utility functions to expose them at the package level, I'm open to that as well, but for now I like the encapsulation of having private functions for the needs of the module.

anthonysena and others added 3 commits July 18, 2024 11:14
* Implementation of CohortIncidence module.
Fixed param typo in ResultDataModel.R

* Set runCheckAndFixCommands = TRUE to upload database_id properly.

* Added target_outcome_xref to results model.

* runCheckAndFixCommands = TRUE in DatabaseMetaData.R.
@anthonysena anthonysena marked this pull request as ready for review July 31, 2024 15:10
@anthonysena anthonysena merged commit 0a0171b into v1.0-main Jul 31, 2024
2 of 8 checks passed
@anthonysena anthonysena deleted the remove-deps-add-module-interface branch September 26, 2024 13:39
anthonysena added a commit that referenced this pull request Oct 7, 2024
* Implementation of CohortIncidence module. (#147)
* Adjustments from testing

---------

Co-authored-by: Chris Knoll <[email protected]>