MetaData class proposal #142

mvankessel-EMC opened this issue May 17, 2023 · 5 comments
Comments

@mvankessel-EMC
Collaborator

mvankessel-EMC commented May 17, 2023

Hi @schuemie,

As discussed, here is a proposal for a MetaData class for CohortMethod. I've implemented a first version of the class using R6.

The code is available in the R/R6-MetaData.R file, on the R6-proposal branch.

In this post I will go through the implementation bit by bit.

Class definition

Public

Fields

The following fields are specified:

targetId = 0,
comparatorId = 0,
studyStartDate = "",
studyEndDate = "",
attrition = NULL,
outcomeIds = NULL,
populationSize = 0,
deletedRedundantCovariateIds = NULL,
deletedInfrequentCovariateIds = NULL,
deletedRedundantCovariateIdsForOutcomeModel = NULL,
deletedInfrequentCovariateIdsForOutcomeModel = NULL,
psModelCoef = NULL,
psModelPriorVariance = NULL,
psError = "",
psHighCorrelation = NULL,
estimator = "att",

targetId, comparatorId, studyStartDate, and studyEndDate are required when initializing an instance of the MetaData class.

It is not yet entirely clear to me which fields can be private, although I do prefer to store fields privately. Right now I'm thinking of the fields required for initialization as candidates.

Sources of fields:
targetId
comparatorId
studyStartDate
studyEndDate
attrition
outcomeIds
populationSize
deletedRedundantCovariateIds
deletedInfrequentCovariateIds
deletedRedundantCovariateIdsForOutcomeModel
deletedInfrequentCovariateIdsForOutcomeModel
psModelCoef
psModelPriorVariance
psError
psHighCorrelation
estimator

Methods

initialize

initialize = function(targetId, comparatorId, studyStartDate, studyEndDate) {
      self$targetId <- targetId
      self$comparatorId <- comparatorId
      self$studyStartDate <- studyStartDate
      self$studyEndDate <- studyEndDate
      private$formatStudyDates()

      self$validate()
      return(invisible(self))
    },

Initializer method used when MetaData$new() is called. The validate() method is called when a new object is initialized.

validate

validate = function() {
      errorMessages <- checkmate::makeAssertCollection()

      checkmate::assertInt(self$targetId, add = errorMessages)
      checkmate::assertInt(self$comparatorId, add = errorMessages)
      checkmate::assertCharacter(self$studyStartDate, len = 1, add = errorMessages)
      checkmate::assertCharacter(self$studyEndDate, len = 1, add = errorMessages)
      checkmate::assertDataFrame(self$attrition, null.ok = TRUE, add = errorMessages)
      # ID fields can hold several values, so the vector check assertIntegerish is used
      checkmate::assertIntegerish(self$outcomeIds, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
      checkmate::assertInt(self$populationSize, lower = 0, add = errorMessages)
      checkmate::assertIntegerish(self$deletedRedundantCovariateIds, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
      checkmate::assertIntegerish(self$deletedInfrequentCovariateIds, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
      checkmate::assertIntegerish(self$deletedRedundantCovariateIdsForOutcomeModel, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
      checkmate::assertIntegerish(self$deletedInfrequentCovariateIdsForOutcomeModel, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
      checkmate::assertNumeric(self$psModelCoef, null.ok = TRUE, add = errorMessages)
      checkmate::assertNumeric(self$psModelPriorVariance, null.ok = TRUE, add = errorMessages)
      checkmate::assertCharacter(self$psError, add = errorMessages)
      checkmate::assertDataFrame(self$psHighCorrelation, null.ok = TRUE, add = errorMessages)
      checkmate::assertChoice(self$estimator, c("ate", "att", "ato"), add = errorMessages)

      # Report all collected failures in a single error
      checkmate::reportAssertions(collection = errorMessages)
      return(invisible(self))
    },

Validation method that validates each field (taken from: DataLoadingSaving.R, psFunctions.R).
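To illustrate how collection-based validation behaves, here is a minimal standalone sketch, assuming the checkmate package is installed. The helper name validateIds and its arguments are hypothetical and only mirror a few of the fields above; this is not the code on the branch.

```r
library(checkmate)

# Hypothetical helper: collect all validation failures before reporting them.
validateIds <- function(targetId, comparatorId, outcomeIds = NULL) {
  errorMessages <- makeAssertCollection()
  assertInt(targetId, add = errorMessages)
  assertInt(comparatorId, add = errorMessages)
  # outcomeIds may hold several IDs, so a vector check is used here.
  assertIntegerish(outcomeIds, any.missing = FALSE, null.ok = TRUE, add = errorMessages)
  # Throws one error listing every collected message, if there are any.
  reportAssertions(collection = errorMessages)
  invisible(TRUE)
}

# Valid input passes silently.
validateIds(targetId = 1, comparatorId = 2, outcomeIds = c(3, 4))

# Two invalid arguments: both problems end up in the same error message.
result <- tryCatch(
  validateIds(targetId = "a", comparatorId = 2.5),
  error = function(e) conditionMessage(e)
)
cat(result, "\n")
```

Passing add = errorMessages to every assertion is what makes the failures accumulate; an assertion without it stops at the first problem instead.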

getMetaData

getMetaData = function() {
      return(list(
        targetId = self$targetId,
        comparatorId = self$comparatorId,
        studyStartDate = self$studyStartDate,
        studyEndDate = self$studyEndDate,
        attrition = self$attrition,
        outcomeIds = self$outcomeIds,
        populationSize = self$populationSize,
        deletedRedundantCovariateIds = self$deletedRedundantCovariateIds,
        deletedInfrequentCovariateIds = self$deletedInfrequentCovariateIds,
        deletedRedundantCovariateIdsForOutcomeModel = self$deletedRedundantCovariateIdsForOutcomeModel,
        deletedInfrequentCovariateIdsForOutcomeModel = self$deletedInfrequentCovariateIdsForOutcomeModel,
        psModelCoef = self$psModelCoef,
        psModelPriorVariance = self$psModelPriorVariance,
        psError = self$psError,
        psHighCorrelation = self$psHighCorrelation,
        estimator = self$estimator
      ))
    },

Method that returns all specified fields in a list. Individual public fields can be obtained like so:

metaData <- MetaData$new(targetId = 1, comparatorId = 2, studyStartDate = "", studyEndDate = "")

# Get psError
metaData$psError

print (Overload)

print = function(x, ...) {
      writeLines(paste("Class:", paste0(class(self), collapse = " ")))
      writeLines(paste("Target ID: ", self$targetId))
      writeLines(paste("Comparator ID: ", self$comparatorId))
      writeLines(paste("Study Start Date: ", self$studyStartDate))
      writeLines(paste("Study End Date: ", self$studyEndDate))
      # Collapse the dimensions so a data frame prints on one line
      writeLines(paste("Attrition: ", paste(dim(self$attrition), collapse = " x ")))
      writeLines(paste("Number of Outcome IDs: ", length(self$outcomeIds)))
      writeLines(paste("Population size: ", self$populationSize))
      # Print counts, as the labels say, rather than the IDs themselves
      writeLines(paste("Number of redundant covariate IDs deleted: ", length(self$deletedRedundantCovariateIds)))
      writeLines(paste("Number of infrequent covariate IDs deleted: ", length(self$deletedInfrequentCovariateIds)))
      writeLines(paste("Number of redundant outcome model covariate IDs deleted: ", length(self$deletedRedundantCovariateIdsForOutcomeModel)))
      writeLines(paste("Number of infrequent outcome model covariate IDs deleted: ", length(self$deletedInfrequentCovariateIdsForOutcomeModel)))
      writeLines(paste("Propensity Score Model Coefficient: ", self$psModelCoef))
      writeLines(paste("Propensity Score Model Variance: ", self$psModelPriorVariance))
      writeLines(paste("Propensity Score Error: ", self$psError))
      writeLines(paste("High Correlation Propensity Scores: ", paste(dim(self$psHighCorrelation), collapse = " x ")))
      writeLines(paste("Estimator: ", self$estimator))
      return(invisible(self))
    }

Overload the print generic to nicely print the current fields.

print(metaData)

Class: MetaData R6
Target ID:  1
Comparator ID:  2
Study Start Date:  
Study End Date:  
Attrition:  
Number of Outcome IDs:  0
Population size:  0
Number of redundant covariate IDs deleted:  0
Number of infrequent covariate IDs deleted:  0
Number of redundant outcome model covariate IDs deleted:  0
Number of infrequent outcome model covariate IDs deleted:  0
Propensity Score Model Coefficient:  
Propensity Score Model Variance:  
Propensity Score Error:  
High Correlation Propensity Scores:  
Estimator:  att

Private

Methods

formatStudyDates

formatStudyDates = function() {
      if (is.null(self$studyStartDate)) {
        self$studyStartDate <- ""
      }
      if (is.null(self$studyEndDate)) {
        self$studyEndDate <- ""
      }
      if (self$studyStartDate != "" &&
          regexpr("^[12][0-9]{3}[01][0-9][0-3][0-9]$", self$studyStartDate) == -1) {
        stop("Study start date must have format YYYYMMDD")
      }
      if (self$studyEndDate != "" &&
          regexpr("^[12][0-9]{3}[01][0-9][0-3][0-9]$", self$studyEndDate) == -1) {
        stop("Study end date must have format YYYYMMDD")
      }
      return(invisible(self))
    }

Method that replaces NULL study start and end dates with empty strings and checks that non-empty dates have the format YYYYMMDD (from: DataLoadingSaving.R).
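The same check can be sketched as a standalone base R function, which makes its behaviour easy to try out. The function name checkStudyDate and its label argument are hypothetical and not part of the proposal; the regular expression is the one used above.

```r
# Hypothetical standalone version of the study date check.
checkStudyDate <- function(date, label = "Study date") {
  # NULL is normalised to an empty string, meaning "no restriction".
  if (is.null(date)) {
    date <- ""
  }
  if (date != "" && !grepl("^[12][0-9]{3}[01][0-9][0-3][0-9]$", date)) {
    stop(paste(label, "must have format YYYYMMDD"))
  }
  date
}

checkStudyDate(NULL)        # normalised to ""
checkStudyDate("20230517")  # accepted unchanged

# A date with separators is rejected.
bad <- tryCatch(
  checkStudyDate("2023-05-17", label = "Study start date"),
  error = function(e) conditionMessage(e)
)

# The pattern only checks the shape: month 13 and day 40 still match it.
shapeOnly <- grepl("^[12][0-9]{3}[01][0-9][0-3][0-9]$", "20231340")
# Parsing with an explicit format yields NA for such a value.
notARealDate <- is.na(as.Date("20231340", format = "%Y%m%d"))
```

If stricter validation is wanted, parsing with as.Date and an explicit format could be added on top of the pattern check.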


In the following section, an outcomeModel object is built from metaData and then given the class OutcomeModel.

  outcomeModel <- metaData
  outcomeModel$outcomeModelTreatmentVarId <- treatmentVarId
  outcomeModel$outcomeModelCoefficients <- coefficients
  outcomeModel$logLikelihoodProfile <- logLikelihoodProfile
  outcomeModel$outcomeModelPriorVariance <- priorVariance
  outcomeModel$outcomeModelLogLikelihood <- logLikelihood
  outcomeModel$outcomeModelType <- modelType
  outcomeModel$outcomeModelStratified <- stratified
  outcomeModel$outcomeModelUseCovariates <- useCovariates
  outcomeModel$inversePtWeighting <- inversePtWeighting
  outcomeModel$outcomeModelTreatmentEstimate <- treatmentEstimate
  outcomeModel$outcomeModelmainEffectEstimates <- mainEffectEstimates
  if (length(interactionCovariateIds) != 0) {
    outcomeModel$outcomeModelInteractionEstimates <- interactionEstimates
  }
  outcomeModel$outcomeModelStatus <- status
  outcomeModel$populationCounts <- getCounts(population, "Population count")
  outcomeModel$outcomeCounts <- getOutcomeCounts(population, modelType)
  outcomeModel$timeAtRisk <- getTimeAtRisk(population, modelType)
  if (!is.null(subgroupCounts)) {
    outcomeModel$subgroupCounts <- subgroupCounts
  }
  class(outcomeModel) <- "OutcomeModel"

My suggestion would be to create another class called OutcomeModel that inherits from MetaData and extends its functionality.
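A minimal sketch of what such inheritance could look like in R6, assuming the R6 package is installed. The MetaData class is reduced to two fields here, and the subclass name OutcomeModelR6, its constructor arguments, and the example values are hypothetical.

```r
library(R6)

# Reduced stand-in for the proposed MetaData class (only two fields kept).
MetaData <- R6Class(
  "MetaData",
  public = list(
    targetId = 0,
    comparatorId = 0,
    initialize = function(targetId, comparatorId) {
      self$targetId <- targetId
      self$comparatorId <- comparatorId
      invisible(self)
    }
  )
)

# Hypothetical subclass: adds outcome-model fields to the inherited ones.
OutcomeModelR6 <- R6Class(
  "OutcomeModelR6",
  inherit = MetaData,
  public = list(
    outcomeModelType = NULL,
    outcomeModelStatus = NULL,
    initialize = function(targetId, comparatorId, modelType, status) {
      # Reuse the parent constructor for the shared fields.
      super$initialize(targetId, comparatorId)
      self$outcomeModelType <- modelType
      self$outcomeModelStatus <- status
      invisible(self)
    }
  )
)

model <- OutcomeModelR6$new(1, 2, modelType = "cox", status = "OK")
model$targetId               # field inherited from MetaData
inherits(model, "MetaData")  # the subclass is also a MetaData
```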

@schuemie
Member

Thanks!

The meta data keeps growing through the pipeline (e.g. psModelCoef doesn't get added until createPs is called). I wonder if we should use a more 'compositional' approach? We would have data-loading meta data and PS-model meta data that together combine into an overall meta data object. What do you think?
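One way to picture the compositional idea is sketched below, assuming the R6 package is installed. All three class names (LoadingMetaData, PsMetaData, CombinedMetaData) and the reduced field sets are hypothetical; the point is only that the overall object holds its parts rather than inheriting from them.

```r
library(R6)

# Hypothetical component produced by the data-loading step.
LoadingMetaData <- R6Class("LoadingMetaData", public = list(
  targetId = NULL,
  comparatorId = NULL,
  initialize = function(targetId, comparatorId) {
    self$targetId <- targetId
    self$comparatorId <- comparatorId
    invisible(self)
  }
))

# Hypothetical component produced by the propensity-score step.
PsMetaData <- R6Class("PsMetaData", public = list(
  psModelCoef = NULL,
  psError = "",
  initialize = function(psModelCoef = NULL, psError = "") {
    self$psModelCoef <- psModelCoef
    self$psError <- psError
    invisible(self)
  }
))

# Overall object that holds the components instead of inheriting from them.
CombinedMetaData <- R6Class("CombinedMetaData", public = list(
  loading = NULL,
  ps = NULL,
  initialize = function(loading, ps = NULL) {
    self$loading <- loading
    self$ps <- ps
    invisible(self)
  },
  getMetaData = function() {
    list(
      targetId = self$loading$targetId,
      comparatorId = self$loading$comparatorId,
      psModelCoef = self$ps$psModelCoef,
      psError = self$ps$psError
    )
  }
))

metaData <- CombinedMetaData$new(LoadingMetaData$new(targetId = 1, comparatorId = 2))
# Later in the pipeline, the PS component is attached.
metaData$ps <- PsMetaData$new(psModelCoef = c(0.1, -0.2))
str(metaData$getMetaData())
```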

Should we have a separate class for attrition?

@mvankessel-EMC
Collaborator Author

I think splitting the metadata into separate classes is a good approach. For the sake of inheritance, I think it would be best for the "data-loading meta data" class to inherit from the "PS-model meta data" class, because in my first example the only private method pertains to the "data-loading meta data" class. The print method is then inherited. The fields can be packaged in either a named list or a data.frame, which keeps the fields generic for the print method and for any other methods we might think of in the future.

MetaDataPS
Attributes
+ fields (data.frame/list)
Methods
+ initialize
+ print
+ getMetaData
- validate

The fields data.frame or list would contain: outcomeIds, populationSize, deletedRedundantCovariateIds, deletedInfrequentCovariateIds, deletedRedundantCovariateIdsForOutcomeModel, deletedInfrequentCovariateIdsForOutcomeModel, psModelCoef, psModelPriorVariance, psError, psHighCorrelation, estimator.

MetaDataLoading (MetaDataPS)
Attributes
Methods
+ initialize
- validate
- formatStudyDates

* Names in italics are overloaded methods

The fields data.frame or list would contain: targetId, comparatorId, studyStartDate, studyEndDate.

outcomeModel would then also be able to inherit from MetaDataPS:

OutcomeModel (MetaDataPS)
Attributes
Methods
+ initialize
+ coef
+ confint
- validate

The fields data.frame or list would contain: outcomeModelTreatmentVarId, outcomeModelCoefficients, logLikelihoodProfile, outcomeModelPriorVariance, outcomeModelLogLikelihood, outcomeModelType, outcomeModelStratified, outcomeModelUseCovariates, inversePtWeighting, outcomeModelTreatmentEstimate, outcomeModelmainEffectEstimates, outcomeModelInteractionEstimates, outcomeModelStatus, populationCounts, outcomeCounts, timeAtRisk, subgroupCounts.

coef would overload stats::coef and confint would overload stats::confint.
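A method defined inside an R6 class is called as object$coef(), so making coef(object) work needs a small S3 wrapper on top. The sketch below assumes the R6 package is installed; the class name OutcomeModelR6 and the element names logRr, logLb95, and logUb95 of the treatment estimate are assumptions made for illustration.

```r
library(R6)

# Hypothetical R6 outcome model holding only a treatment estimate.
OutcomeModelR6 <- R6Class("OutcomeModelR6", public = list(
  treatmentEstimate = NULL,
  initialize = function(treatmentEstimate) {
    self$treatmentEstimate <- treatmentEstimate
    invisible(self)
  },
  coef = function() {
    self$treatmentEstimate$logRr
  },
  confint = function() {
    c(self$treatmentEstimate$logLb95, self$treatmentEstimate$logUb95)
  }
))

# S3 methods that delegate to the R6 methods. R6 objects carry an S3 class
# attribute, so the stats generics dispatch on "OutcomeModelR6".
coef.OutcomeModelR6 <- function(object, ...) {
  object$coef()
}
confint.OutcomeModelR6 <- function(object, parm, level = 0.95, ...) {
  # parm and level are accepted for compatibility but ignored in this sketch.
  object$confint()
}

model <- OutcomeModelR6$new(list(logRr = 0.25, logLb95 = 0.10, logUb95 = 0.40))
coef(model)
confint(model)
```

In a package, the two S3 methods would also need to be registered in the NAMESPACE for dispatch to work outside the package.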

Regarding attrition, I think we could handle it in a similar manner:

Attrition (MetaDataPS)
Attributes
Methods
+ initialize
- validate

The fields data.frame or list would contain: description, targetPersons, comparatorPersons, targetExposures, comparatorExposures, rowCount, treatment, personCount.


A couple of choices that need to be made:

  1. Do we allow fields to be freely editable, as public attributes?
  2. Should we generalize class and method naming, as this extends beyond metadata?
  3. Are there any of the listed fields that would be useful to have in multiple classes?

@schuemie
Member

schuemie commented Jun 6, 2023

Should MetaDataPS inherit from MetaDataLoading, instead of the other way around? This would follow the order in which they are created: getDbCohortMethodData() would create MetaDataLoading, and createPs() would extend that with the attributes (and any methods) to represent the PS meta-data. The outcome model meta data would extend the MetaDataPS. And whatever is the top class would need to have the attrition attribute.

In response to your questions:

  1. If we're going to be strict (which I propose we do), meta-data should be set when the object is created (via the constructor), and then not be allowed to be modified. (so only getters, not setters). However, we do want the attrition table to grow over time, which can be viewed as modification although we shouldn't touch earlier entries in the table.

  2. In general I try to avoid making up new things, so I probably would name the metadata classes after the functions that generate them. That would lead to long names though, like 'GetDbCohortMethodDataMetaData' :-( Note that currently the class name 'OutcomeModel' conflicts with this.

  3. I don't think so, but I guess that depends on what inherits from what.
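The "getters only" idea from point 1 can be sketched in R6 with private fields exposed through active bindings, assuming the R6 package is installed. The class name StudyMetaData, the dot-prefixed private field names, and the error messages are hypothetical.

```r
library(R6)

# Hypothetical class whose fields are set once in the constructor
# and can afterwards only be read.
StudyMetaData <- R6Class(
  "StudyMetaData",
  public = list(
    initialize = function(targetId, comparatorId) {
      private$.targetId <- targetId
      private$.comparatorId <- comparatorId
      invisible(self)
    }
  ),
  private = list(
    .targetId = NULL,
    .comparatorId = NULL
  ),
  active = list(
    # An active binding receives a value only when it is assigned to.
    targetId = function(value) {
      if (!missing(value)) stop("targetId is read-only")
      private$.targetId
    },
    comparatorId = function(value) {
      if (!missing(value)) stop("comparatorId is read-only")
      private$.comparatorId
    }
  )
)

meta <- StudyMetaData$new(targetId = 1, comparatorId = 2)
meta$targetId  # reading works as with a public field

# Assigning triggers the binding with a value, which raises the error.
attempt <- tryCatch(
  {
    meta$targetId <- 5
    "modified"
  },
  error = function(e) conditionMessage(e)
)
```

From the caller's side the fields read like ordinary public fields, so existing code that uses metaData$targetId would not need to change.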

@mvankessel-EMC
Collaborator Author

> Should MetaDataPS inherit from MetaDataLoading, instead of the other way around? This would follow the order in which they are created: getDbCohortMethodData() would create MetaDataLoading, and createPs() would extend that with the attributes (and any methods) to represent the PS meta-data. The outcome model meta data would extend the MetaDataPS. And whatever is the top class would need to have the attrition attribute.

My reasoning for setting up the classes like that is that MetaDataPS is the simplest of the bunch, and we'd extend the child classes with additional methods where needed, i.e. formatStudyDates() for MetaDataLoading, and coef() and confint() for OutcomeModel.

If we implemented it as you propose, the formatStudyDates method from MetaDataLoading would also be inherited by all child classes, but I don't think we would need this method in any of them.

It boils down to an organizational choice to keep methods in places where they're needed.

In response to your questions:

> 1. If we're going to be strict (which I propose we do), meta-data should be set when the object is created (via the constructor), and then not be allowed to be modified. (so only getters, not setters). However, we do want the attrition table to grow over time, which can be viewed as modification although we shouldn't touch earlier entries in the table.

I agree, we can add an appendAttrition(data.frame) method to the attrition class, that would just add new rows.
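A minimal sketch of such an append-only attrition class, assuming the R6 package is installed. Only the method name appendAttrition comes from the suggestion above; the getter getAttrition, the private field name, the column check, and the example rows are assumptions.

```r
library(R6)

# Hypothetical append-only attrition table.
Attrition <- R6Class(
  "Attrition",
  public = list(
    appendAttrition = function(rows) {
      stopifnot(is.data.frame(rows))
      if (is.null(private$table)) {
        private$table <- rows
      } else {
        # New rows must have the same columns as the existing table.
        if (!identical(names(rows), names(private$table))) {
          stop("New attrition rows must have the same columns as the existing table")
        }
        # Earlier entries are left untouched; new rows go at the end.
        private$table <- rbind(private$table, rows)
      }
      invisible(self)
    },
    getAttrition = function() {
      private$table
    }
  ),
  private = list(
    table = NULL
  )
)

attrition <- Attrition$new()
attrition$appendAttrition(data.frame(
  description = "Original cohorts",
  targetPersons = 1000,
  comparatorPersons = 1200,
  stringsAsFactors = FALSE
))
attrition$appendAttrition(data.frame(
  description = "Matched on propensity score",
  targetPersons = 800,
  comparatorPersons = 800,
  stringsAsFactors = FALSE
))
attrition$getAttrition()
```

Because the table is private and the getter returns a copy of the data frame, callers can add rows but cannot alter earlier entries in place.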

> 2. In general I try to avoid making up new things, so I probably would name the metadata classes after the functions that generate them. That would lead to long names though, like 'GetDbCohortMethodDataMetaData' :-(

Generally I agree, though I don't know whether we should just bite the bullet on this. The class only really contains targetId, comparatorId, studyStartDate, and studyEndDate, so I'll make one more suggestion: StudyMetaData. Otherwise it can just be GetDbCohortMethodDataMetaData as a working name.

> Note that currently the class name 'OutcomeModel' conflicts with this.

I thought we'd eventually replace the S3 implementation with the R6 one. We can name it OutcomeModelR6 for now: the idea is the same, only the implementation differs.

> 3. I don't think so, but I guess that depends on what inherits from what.

@schuemie
Member

schuemie commented Jun 7, 2023

But the meta-data coming out of the createPs() function will also contain all the meta-data that came out of getDbCohortMethodData(), right? The trail keeps growing as more data is applied. So the createPs metadata should have a formatStudyDates() function because it will contain the study dates used to create the CohortMethodData that was the input to createPs.

Why would the class for the meta-data for the outcome model be called 'OutcomeModel'? I thought we'd distinguish between the data itself and its meta-data, but perhaps you're thinking of those being represented by a single class?
