create a list of important characteristics to be auto-checked after data.prep runs #6

Open · 4 tasks
cjyetman opened this issue Nov 1, 2022 · 11 comments
Labels: enhancement (New feature or request)

Comments

@cjyetman (Member) commented Nov 1, 2022

It would be nice to start compiling a list of data characteristics that should be checked once data.prep is done, and eventually write some code that will automatically check/report on them after every run.

I'll start with a few, and please add to it if you think of any...

  • number of unique funds in the fund database
  • verify that in every fund, the sum of the holdings' values adds up to the total value
  • verify that each isin in isin_to_fund_table points to a single fund, i.e. is unique
  • verify that all output RDS files are ungrouped (relates to RMI-PACTA/archive.pacta.data.preparation#218)
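
A minimal sketch of what such an automated check script might look like, assuming hypothetical object and column names (`fund_data` with one row per holding, carrying `fund_isin`, `holding_value`, and `fund_total_value`; `isin_to_fund_table` with `isin` and `fund_isin`; outputs in an `outputs/` directory); the real names in the data.prep output may differ:

```r
library(dplyr)

# hypothetical inputs; actual object/column names in the data.prep output may differ
fund_data <- readRDS("fund_data.rds")
isin_to_fund_table <- readRDS("isin_to_fund_table.rds")

# report the number of unique funds in the fund database
message("unique funds: ", dplyr::n_distinct(fund_data$fund_isin))

# verify that, for every fund, the holdings' values sum to the fund's total value
value_mismatch <- fund_data %>%
  group_by(fund_isin) %>%
  summarise(
    total = first(fund_total_value),
    holdings_sum = sum(holding_value),
    .groups = "drop"
  ) %>%
  filter(abs(total - holdings_sum) > 1e-6)
stopifnot("holdings do not sum to fund total" = nrow(value_mismatch) == 0)

# verify that each ISIN maps to exactly one fund
dup_isins <- isin_to_fund_table %>% count(isin) %>% filter(n > 1)
stopifnot("ISINs mapping to multiple funds" = nrow(dup_isins) == 0)

# verify that all output RDS files are ungrouped
rds_files <- list.files("outputs", pattern = "[.]rds$", full.names = TRUE)
grouped <- vapply(rds_files, function(f) inherits(readRDS(f), "grouped_df"), logical(1))
stopifnot("grouped data frames in output" = !any(grouped))
```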
@cjyetman (Member, Author) commented Nov 2, 2022

Case in point: RMI-PACTA/archive.pacta.data.preparation#210

@jdhoffa added the "enhancement" (New feature or request) label Nov 17, 2022
@cjyetman (Member, Author) commented

I think this should be transferred to workflow.transition.monitor? @jdhoffa @AlexAxthelm

@AlexAxthelm commented

I would say it's better suited to https://github.com/RMI-PACTA/workflow.pacta.data.qa

@jdhoffa (Member) commented Mar 14, 2024

Based on the actual checklist that @cjyetman wrote, I think I agree that it should live in https://github.com/RMI-PACTA/workflow.pacta.data.qa

Maybe we need to discuss and focus our overall QA strategy a bit?

My first proposal would be that:

  • https://github.com/RMI-PACTA/workflow.pacta.data.qa would be the repo that checks any raw input data the second it "comes through the door" (e.g. AI data, FactSet, maybe scenario data, etc.) and flags to us interactively any big differences that may cause issues

  • In https://github.com/RMI-PACTA/workflow.transition.monitor, we probably want to run only basic input checks that ensure the actual input data is structurally valid (and contains self-consistent sector and technology names, etc.). For example, I can imagine a future case where we want Transition Monitor to run with a stripped-down dataset containing no funds for some very particular user, so validating that the data has "enough" funds there may not be the best idea. (See the sketch after this list.)
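
For illustration, a minimal sketch of the kind of basic structural check meant here, using hypothetical column names and a hypothetical sector list; it asserts only that the input is structurally valid and internally consistent, not that it contains "enough" of anything (such as funds):

```r
# hypothetical structural validity check; required columns and sector names
# are illustrative, not the actual Transition Monitor schema
check_portfolio_structure <- function(portfolio,
                                      required_cols = c("isin", "market_value", "currency"),
                                      valid_sectors = c("Power", "Oil&Gas", "Coal", "Automotive")) {
  missing_cols <- setdiff(required_cols, names(portfolio))
  if (length(missing_cols) > 0) {
    stop("missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  if ("sector" %in% names(portfolio)) {
    unknown <- setdiff(unique(portfolio$sector), valid_sectors)
    if (length(unknown) > 0) {
      stop("unknown sector names: ", paste(unknown, collapse = ", "))
    }
  }
  invisible(portfolio)
}
```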

@cjyetman (Member, Author) commented

I don't like the idea of workflow.transition.monitor silently exporting bogus data, so I intend to implement some version of this there. Not opposed to similar checks living in the QA repo.

@jdhoffa (Member) commented Mar 14, 2024

Yeah totally fair.

I think mainly the point:

  • number of unique funds in the fund database

seems like it's not really appropriate for https://github.com/RMI-PACTA/workflow.transition.monitor

A priori, how should https://github.com/RMI-PACTA/workflow.transition.monitor know how many funds should be expected?

But the other basic self-consistency checks totally make sense.

Maybe just close this issue, and make two more focused issues in https://github.com/RMI-PACTA/workflow.transition.monitor and https://github.com/RMI-PACTA/workflow.pacta.data.qa

@cjyetman (Member, Author) commented Mar 14, 2024

>   • number of unique funds in the fund database
>
> seems like it's not really appropriate for https://github.com/RMI-PACTA/workflow.transition.monitor
>
> A priori, how should https://github.com/RMI-PACTA/workflow.transition.monitor know how many funds should be expected?

I think the original intent behind "number of funds" was for the "reporting" part of it versus the "checking" part of it, e.g. potentially adding that to the manifest so it's super easy for someone to inspect what's in the data without having to load it.

I think @AlexAxthelm did something similar with workflow.factset recording data object stats in the manifest.
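
For illustration only (not the actual workflow.factset code), recording such stats in a manifest could look roughly like this; the file names and fields are hypothetical:

```r
library(jsonlite)

# hypothetical input and manifest fields
fund_data <- readRDS("fund_data.rds")

manifest <- list(
  created_at = format(Sys.time(), tz = "UTC", usetz = TRUE),
  fund_data = list(
    rows = nrow(fund_data),
    unique_funds = length(unique(fund_data$fund_isin))
  )
)

jsonlite::write_json(manifest, "manifest.json", auto_unbox = TRUE, pretty = TRUE)
```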

@jdhoffa (Member) commented Mar 14, 2024

Ah that's a cool idea and makes a lot of sense!

Ok then as you were!! XD

@AlexAxthelm commented

> I think @AlexAxthelm did something similar with workflow.factset recording data object stats in the manifest.

NB: this is also included in the new pacta.workflow.utils manifest export that I'm working on. https://github.com/RMI-PACTA/pacta.workflow.utils/blob/ba9b91444cd3a2fe35f3acb94b9458108a992353/R/get_file_metadata.R#L69-L75

@AlexAxthelm commented

> I don't like the idea of workflow.transition.monitor silently exporting bogus data, so I intend to implement some version of this there. Not opposed to similar checks living in the QA repo.

I think that it might be a bit simpler to maintain if we add the QA repo to Suggests, and call the checks with an `if (require(workflow.pacta.data.qa)) run_data_prep_checks()`
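
Sketched out, that pattern could look like the following; note that `run_data_prep_checks()` is hypothetical (it does not exist in workflow.pacta.data.qa today), and `requireNamespace()` avoids attaching the package, which is the usual way to call a Suggests dependency:

```r
# run the QA checks only if the suggested package is installed;
# run_data_prep_checks() is a hypothetical function name
if (requireNamespace("workflow.pacta.data.qa", quietly = TRUE)) {
  workflow.pacta.data.qa::run_data_prep_checks()
} else {
  warning("workflow.pacta.data.qa not installed; skipping data prep QA checks")
}
```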

@jdhoffa (Member) commented Mar 15, 2024

> I don't like the idea of workflow.transition.monitor silently exporting bogus data, so I intend to implement some version of this there. Not opposed to similar checks living in the QA repo.

> I think that it might be a bit simpler to maintain if we add the QA repo to Suggests, and call the checks with an `if (require(workflow.pacta.data.qa)) run_data_prep_checks()`

Just want to bear in mind that we are talking about large hypotheticals here, since there is no way to run https://github.com/RMI-PACTA/workflow.pacta.data.qa non-interactively at this stage (the main outputs are .pdfs)

@jdhoffa transferred this issue from another repository Apr 15, 2024