
CanCOGeN Contextual Data Datasets for Harmonization Training and Quality Control

Rhiannon Cameron edited this page Apr 28, 2023 · 1 revision

Harmonizing contextual data to a data standard can be challenging when your original dataset contains different data structures. Examples of harmonized data can help demonstrate expectations and show how to troubleshoot data transformations. Below is a series of mock datasets intended to aid in data harmonization training and quality control when using the DataHarmonizer to implement the CanCOGeN SARS-CoV-2 contextual data standard.

Datasets based on DataHarmonizer v0.15.1

“Gold Standard Dataset”

The “Gold Standard Dataset” provides examples of well-structured information in required fields, as well as in commonly filled optional fields. Usage: This dataset is ideal for testing a new software install or making sure software is running as expected.

Find the Gold Standard Dataset here.

“Common Errors Dataset”

The “Common Errors Dataset” contains errors commonly encountered by curators, such as missing information, incorrectly formatted dates, non-ideal “purpose of sequencing” tags, incorrectly used picklists and null values (e.g. wrong terms, capitalization differences), and information in the wrong cells. Most errors will be flagged upon validation, but some must be identified by the data provider/curator. Usage: This dataset is ideal for evaluating how version changes to the DataHarmonizer may affect laboratory processes, as well as for operator proficiency testing.

Find the Common Errors Dataset here.
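To make the error categories above more concrete, here is a minimal Python sketch of the kinds of checks involved. It is not the DataHarmonizer's validation logic. The field names, picklist terms, and null values below are assumptions chosen for illustration, and the expected date format (YYYY-MM-DD) is likewise assumed; consult the actual template for the real definitions.

```python
import re
from datetime import datetime

# Hypothetical field names, picklist terms, and null values (illustration only;
# these are assumptions, not taken from the DataHarmonizer template).
REQUIRED_FIELDS = ["sample_id", "sample_collection_date", "purpose_of_sequencing"]
PURPOSE_PICKLIST = {
    "Baseline surveillance (random sampling)",
    "Cluster/Outbreak investigation",
}
NULL_VALUES = {"Not Applicable", "Not Collected", "Not Provided", "Missing"}

ISO_DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_iso_date(value):
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE_PATTERN.match(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def check_row(row):
    """Return a list of problem descriptions for one row (a dict of strings)."""
    problems = []

    # Missing information in required fields.
    for field in REQUIRED_FIELDS:
        if not row.get(field, "").strip():
            problems.append(f"{field}: missing value")

    # Incorrectly formatted dates (null values are allowed in place of a date).
    date_value = row.get("sample_collection_date", "").strip()
    if date_value and date_value not in NULL_VALUES and not is_iso_date(date_value):
        problems.append("sample_collection_date: not in YYYY-MM-DD format")

    # Picklist misuse: wrong terms versus capitalization differences.
    purpose = row.get("purpose_of_sequencing", "").strip()
    allowed = PURPOSE_PICKLIST | NULL_VALUES
    if purpose and purpose not in allowed:
        by_lowercase = {term.lower(): term for term in allowed}
        match = by_lowercase.get(purpose.lower())
        if match:
            problems.append(
                f"purpose_of_sequencing: capitalization differs from '{match}'"
            )
        else:
            problems.append("purpose_of_sequencing: term not in picklist")

    return problems
```

Calling `check_row` on each row of a spreadsheet export yields an empty list for clean rows and one message per detected problem otherwise. Checks like these cannot detect information entered in the wrong cell when the value is still valid for that cell, which is one reason some errors in the dataset require identification by the data provider/curator.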

“Scenarios Dataset”

The “Scenarios Dataset” consists of a document describing common scenarios, each containing many pieces of information that must be identified, interpreted, and then structured using the data standard. This dataset also provides an “answer key” containing recommended data structures for the information given. The Scenarios Dataset is intended as an exercise for improving data transfer and data management practices from “in-the-field” situations. Usage: This dataset is ideal for training new staff or operators in proper use of the DataHarmonizer.

Find the Scenario Descriptions here.

Find the Scenarios Dataset here.