RMHDR-252 Update external parquet pipeline to use the internal parquet archive #26

Merged

pranavanba
Collaborator

Major Changes

  1. Sync the latest internal parquet archive instead of the top-level parquet data
    • This enables provenance tracking of which internal data was used to build the external datasets
  2. Use the date from the archive timestamp instead of the current date
  3. Update file provenance to use the latest archive's S3 URI instead of the parent-level internal parquet bucket URI
  4. Store file provenance in a separate variable created with the synapser Activity method
  5. Use the specific script URL instead of the repo URL at the latest commit for file provenance
  6. Move scripts to a new scripts folder
  7. Update the max_rows_per_file parameter and specify the max_open_files parameter for write_dataset() calls
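
Items 1, 2, 3, and 5 above can be sketched roughly as follows. This is only an illustration in base R, not the code from this PR: the archive prefix layout, the `YYYY-MM-DD` timestamp format, the bucket name, the repo URL, the commit hash, and the script path are all hypothetical placeholders.

```r
# Hypothetical archive prefixes; the real bucket layout is not shown in this PR.
archive_prefixes <- c(
  "archive/2024-04-01/",
  "archive/2024-04-15/",
  "archive/2024-03-18/"
)

# Extract the timestamp embedded in each prefix (assumed format: YYYY-MM-DD).
date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
archive_dates <- as.Date(
  regmatches(archive_prefixes, regexpr(date_pattern, archive_prefixes))
)

# Pick the most recent archive and keep its date, rather than using Sys.Date().
latest_idx <- which.max(as.numeric(archive_dates))
latest_archive_prefix <- archive_prefixes[latest_idx]
latest_archive_date <- archive_dates[latest_idx]

# Build the S3 URI of that archive for use in file provenance.
bucket <- "example-internal-parquet-bucket"  # hypothetical bucket name
latest_archive_uri <- paste0("s3://", bucket, "/", latest_archive_prefix)

# Build a URL that points at one specific script at one specific commit,
# instead of the repository as a whole.
script_url <- function(repo_url, commit, script_path) {
  paste(repo_url, "blob", commit, script_path, sep = "/")
}

executed_url <- script_url(
  repo_url = "https://github.com/example-org/example-repo",  # hypothetical
  commit = "abc1234",                                        # hypothetical
  script_path = "scripts/example_step.R"                     # hypothetical
)
```

Values like `latest_archive_uri` and `executed_url` could then be supplied as the `used` and `executed` entries of a synapser `Activity`, assuming that is how the provenance is recorded in the pipeline.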
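
For item 7, a minimal sketch of a `write_dataset()` call from the arrow R package that sets both parameters. The data frame, output path, and numeric limits are illustrative assumptions, not the values chosen in this PR. `max_rows_per_group` is also set here so that it does not exceed `max_rows_per_file`.

```r
library(arrow)

# Small illustrative table; the real pipeline writes much larger datasets.
df <- data.frame(id = 1:10, value = letters[1:10])

# Write to a fresh temporary directory (hypothetical output location).
out_dir <- tempfile("example_dataset_")

write_dataset(
  df,
  path = out_dir,
  format = "parquet",
  max_open_files = 100L,     # cap on files held open at once while writing
  max_rows_per_file = 4L,    # start a new file once this many rows are written
  max_rows_per_group = 4L    # keep row groups no larger than a single file
)

# List the parquet files that were produced.
parquet_files <- list.files(out_dir, pattern = "\\.parquet$", recursive = TRUE)
n_files <- length(parquet_files)
```

A lower `max_open_files` generally trades some write throughput for fewer simultaneously open file handles, which can matter when a dataset is split across many partitions.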

Minor Changes

  1. Add a parameter to the config file specifying the location of the Dictionaries folder in Synapse
  2. In Deidentification.R (the de-identification script):
    • Remove numbers from dictionary file names
    • Update provenance of files containing new PII values to review: use the latest git commit of the de-identification script
  3. Update formatting of messages printed during the filtering step
  4. Remove unused libraries

pranavanba self-assigned this Apr 17, 2024
pranavanba merged commit 4f0e58d into Sage-Bionetworks:main Apr 17, 2024
2 checks passed