[DC-2692] synthetic dataset script version 1 #1369

Open
wants to merge 19 commits into
base: develop

lrwb-aou
Contributor

  • write a script that automates the largest pain points of generating synthetic data from the RDR team

@ksdkalluri ksdkalluri self-requested a review September 1, 2022 18:12
@lrwb-aou lrwb-aou force-pushed the lb/synthetic_on_stable branch 2 times, most recently from b7f6657 to 55bd1a6 Compare September 21, 2022 22:12
@lrwb-aou lrwb-aou force-pushed the lb/synthetic_on_stable branch from 55bd1a6 to 7843d34 Compare September 30, 2022 21:51
@lrwb-aou lrwb-aou force-pushed the lb/synthetic_on_stable branch from edc91ca to b60d1b3 Compare October 10, 2022 22:22
@lrwb-aou lrwb-aou force-pushed the lb/synthetic_on_stable branch from b60d1b3 to d878b40 Compare November 11, 2022 19:23
@lrwb-aou lrwb-aou force-pushed the lb/synthetic_on_stable branch from 7cdee18 to 408b85f Compare February 27, 2023 22:53
@lrwb-aou lrwb-aou force-pushed the lb/synthetic_on_stable branch from 5f1e99b to 7daee0f Compare March 30, 2023 17:28
@lrwb-aou lrwb-aou force-pushed the lb/synthetic_on_stable branch from 7daee0f to e023eef Compare April 24, 2023 17:49
* altering the base class to have another attribute which defaults to false
* adding True attributes to those classes we want to run on a synthetic data set
* altering the clean engine to only run those rules for synthetic datasets when synthetic is selected and the rule says it should be executed
* works with listing the queries as well
* need to work on the opentelemetry implementation to not throw errors when running something from the command line locally
* fixing two unit tests
* ignoring the redefinition here.
* redefinition will be removed once all classes are base classed
* alter the import script to warn curation whenever we see a table in the bucket that we do not process
* fixes failing unit tests impacted by the addition of the run_synthetic parameter to infer_rule()
* loads data from a bucket into a raw dataset
* creates a synthetic dataset and its appropriate versions (staging, sandbox, and clean)
* runs synthetic pipeline data stage on the data in the staging dataset
* TODO: add publishing guidelines to script.
* leverage function in `create_combined_backup_dataset.py` to create rudimentary rdr mapping tables.
* update the synthetic data stage to leverage the Registered Tier dataset cleaning rules
* allow extension table generation and cope survey versioning to run on synthetic data.
* TODO: "publish" data to an internal dataset.
* making sure person table columns are appended
* The txt file was not meant for inclusion.
* some changes to the script while trying to run it initially
* adding vocab_dataset parameter
* changes required when running the synthetic script all the way through
* the script did finish
* more changes are expected
* sets some run_for_synthetic rules to False to avoid dropping too much test data
* changed f-string usage to jinja2 templates
* used pre-defined variable for constant value
* removed redundant code to reuse existing dataset copy utility
* removed conflict code
* uses cleaning rules to clean survey_conduct table data
* removes duplicated code to create cleaned survey_conduct table data
* prepares to potentially run all rules from RDR ingest to RT clean dataset
* still only runs a subset of rules marked as run_for_synthetic
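
The commit notes above describe a per-rule flag that defaults to false and an engine that only runs flagged rules when a synthetic run is selected. A minimal sketch of that idea follows. It is not the code in this PR: the class names `BaseRule`, `DropDuplicates`, and `SuppressRecords` and the helper `select_rules` are hypothetical, and only the attribute name `run_for_synthetic` is taken from the commit messages.

```python
class BaseRule:
    """Hypothetical stand-in for the cleaning-rule base class."""

    # Defaults to False, so a rule is skipped on synthetic data
    # unless it opts in explicitly.
    run_for_synthetic = False

    def name(self):
        return type(self).__name__


class DropDuplicates(BaseRule):
    """Hypothetical rule that opts in to synthetic runs."""
    run_for_synthetic = True


class SuppressRecords(BaseRule):
    """Hypothetical rule that keeps the default and is skipped."""


def select_rules(rules, synthetic=False):
    """Return the rules to execute (hypothetical helper).

    A normal run keeps every rule. A synthetic run keeps only the
    rules whose run_for_synthetic attribute is True.
    """
    if not synthetic:
        return list(rules)
    return [rule for rule in rules if rule.run_for_synthetic]


if __name__ == "__main__":
    rules = [DropDuplicates(), SuppressRecords()]
    print([r.name() for r in select_rules(rules, synthetic=True)])
    print([r.name() for r in select_rules(rules)])
```

Keeping the default at false means a rule has to be reviewed and flagged before it touches synthetic data, which fits the later note about setting some `run_for_synthetic` rules to False to avoid dropping too much test data.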
@lrwb-aou lrwb-aou force-pushed the lb/synthetic_on_stable branch from 8ed9dcd to 70ed906 Compare August 28, 2023 21:29