From 163e75007d70999fb0d366db68203b2b93b2bc95 Mon Sep 17 00:00:00 2001 From: <> Date: Fri, 8 Sep 2023 22:34:42 +0000 Subject: [PATCH] Deployed 84f723d with MkDocs version: 1.5.2 --- search/search_index.json | 2 +- sitemap.xml.gz | Bin 336 -> 336 bytes vlmd/extract/exceldata/index.html | 19 ++++++++++--------- 3 files changed, 11 insertions(+), 10 deletions(-) diff --git a/search/search_index.json b/search/search_index.json index c1b6017..411cbff 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"HEAL Data Utilities","text":"

The HEAL Data Utilities python package provides data packaging tools for the HEAL Data Ecosystem to facilitate data discovery, sharing, and harmonization on the HEAL Platform.

Currently, the focus of this repository is generating standardized variable level metadata (VLMD) in the form of data dictionaries. See the quick start section to get started without installing any of the prerequisites. (Click here for the Variable-level Metadata documentation section).

However, in the future, this will be expanded for all HEAL-specific data packaging functions (e.g., study- and file-level metadata and data).

"},{"location":"#quick-start","title":"Quick start","text":"

Note

If using the quick start option, no prerequisites are required.

Double-click the vlmd (or vlmd.exe) executable, or run the vlmd executable without any arguments, to quickly start using this tool. This \"quick start\" will walk you through the process step by step, prompting you for the various options.

Important

Stand alone applications for different operating systems are available here. These allow you to run the vlmd tool without needing to install anything else. Just (1) download, (2) unzip, and (3) double click on the vlmd application icon.

"},{"location":"#prerequisites","title":"Prerequisites","text":""},{"location":"#python","title":"Python","text":"

While the HEAL Data Utilities should be compatible with most versions of Python, you can download the latest version of Python here and install it on your local computer. We recommend installing Python version 3.10.

"},{"location":"#installation","title":"Installation","text":"

To install the latest official release of healdata-utils, from your computer's command prompt, run:

pip install healdata-utils

OR for the most up-to-date unreleased version run:

pip install git+https://github.com/norc-heal/healdata-utils.git

Note

Installing the unreleased version requires having git software installed.

"},{"location":"vlmd/","title":"Variable-level Metadata (Data Dictionaries)","text":""},{"location":"vlmd/#motivation","title":"Motivation","text":"

Variable level metadata (VLMD), in the form of standardized data dictionaries, provides an exciting opportunity:

For an example of this searchability in the context of study level metadata, see the platform's discovery page

"},{"location":"vlmd/#functions","title":"Functions","text":"

extract: Extract the variable level metadata from an existing file with a specific type/format

start: Start a data dictionary from an empty template

validate: Check (validate) whether an existing HEAL data dictionary file follows the HEAL specifications, either after filling out a template or after further annotating a data dictionary extracted from a different format.

Typical workflows for creating a HEAL-compliant data dictionary include:

  1. Create your data dictionary

    (a) Run the vlmd extract command (or convert_to_vlmd if in python) to generate a HEAL-compliant data dictionary via your desired input format

    (b) Run the vlmd template command to start from an empty template.

  2. Add/annotate with additional information in your preferred HEAL data dictionary format (either json or csv).

  3. Run the vlmd validate command with your HEAL data dictionary as the input to validate.

  4. Repeat (2) and (3) until you are ready to submit. Please note, currently only name and description are required.
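
The requirement in step 4 (only name and description are required) can be illustrated with a small check. This is only a sketch and not the logic used by vlmd validate; the function name find_missing_required and the sample fields are hypothetical.

```python
# Properties the documentation lists as currently required.
REQUIRED_PROPERTIES = ('name', 'description')


def find_missing_required(fields):
    """Return (index, property) pairs for required properties that are absent or empty."""
    missing = []
    for index, field in enumerate(fields):
        for prop in REQUIRED_PROPERTIES:
            value = field.get(prop)
            if value is None or str(value).strip() == '':
                missing.append((index, prop))
    return missing


# Hypothetical fields: the second one has an empty description.
example_fields = [
    {'name': 'age', 'description': 'Age of participant in years'},
    {'name': 'site', 'description': ''},
]
print(find_missing_required(example_fields))
```

An empty result would mean every field carries both required properties; the real validator checks the full HEAL schema, which this sketch does not attempt.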

"},{"location":"vlmd/#definitions","title":"Definitions","text":"

Important

The main difference between the CSV and JSON definitions lies in the way the data dictionaries are structured and the additional metadata included in the JSON data dictionary.

The CSV data dictionary is a plain tabular representation with no additional metadata, while the JSON data dictionary includes the fields along with additional metadata in the form of a root description and title.

For more information on variable-level metadata properties (fields), see the csv field specification and json data dictionary specification.
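
To make the structural difference concrete, here is a minimal sketch that wraps tabular (CSV) field rows in a JSON-style object. The keys title, description, and data_dictionary come from the JSON data dictionary schema; the function name csv_to_json_data_dictionary and the sample rows are hypothetical, and this is not how the package itself performs the conversion.

```python
import csv
import io
import json

# Hypothetical CSV data dictionary: one row per variable, no root metadata.
csv_text = 'name,description\nage,Age of participant in years\nsite,Enrollment site\n'


def csv_to_json_data_dictionary(text, title, description=''):
    """Wrap CSV field rows in an object that adds a root title and description."""
    fields = [dict(row) for row in csv.DictReader(io.StringIO(text))]
    return {'title': title, 'description': description, 'data_dictionary': fields}


json_dd = csv_to_json_data_dictionary(csv_text, title='Example dictionary')
print(json.dumps(json_dd, indent=2))
```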

"},{"location":"vlmd/extract/","title":"Extract VLMD from another data type and format","text":"

The healdata-utils variable-level metadata (vlmd) tool accepts a variety of input file types and extracts HEAL-compliant data dictionaries (in JSON and CSV formats). Additionally, exported validation (i.e., \"error\") reports tell the user a) whether the exported data dictionary is valid according to HEAL specifications and b) how to modify the data dictionary to make it HEAL-compliant.

Warning

Currently, the Python function is named convert_to_vlmd, but it will be renamed to extract_to_vlmd to be consistent with the CLI. The name extract was chosen to better reflect the functionality.

Command Line Interface (CLI)Python
vlmd extract --inputtype spss myproject/myfile.sav\n

Note

To continue, it's recommended to go to the input types and formats. Also, for more details on the different flags/options, run vlmd --help

from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(input_filepath=\"myproject/myfile.sav\",inputtype=\"spss\")\n

Note

To continue, it's recommended to go to the input types and formats. For a complete set of options with convert_to_vlmd see the docstring (if in a notebook, one can enter convert_to_vlmd?)

"},{"location":"vlmd/extract/#input-types-and-formats","title":"Input Types and Formats","text":"

This section lists the specific syntax for each of the supported input types for generating HEAL-compliant data dictionaries. Additional instructions on how to obtain the necessary input files/software are also provided.

Note

To further annotate your outputted data dictionaries, see the variable-level metadata field properties (with examples) for either the csv data dictionary click here or the json data dictionary click here.
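
As a rough illustration of how the per-type commands fit together, the sketch below builds a vlmd extract command line from a file extension. The --inputtype values are the ones used in the commands in this documentation; the helper build_extract_command and the extension mapping are hypothetical conveniences and are not part of the package.

```python
from pathlib import Path

# Extension-to-inputtype pairs taken from the commands shown in this documentation.
INPUTTYPE_BY_SUFFIX = {
    '.sav': 'spss',
    '.dta': 'stata',
    '.sas7bdat': 'sas',
    '.xlsx': 'excel-data',
}


def build_extract_command(input_path):
    """Return the argument list for a vlmd extract call, guessing --inputtype."""
    suffix = Path(input_path).suffix.lower()
    if suffix not in INPUTTYPE_BY_SUFFIX:
        raise ValueError(f'cannot guess --inputtype for {input_path!r}')
    return ['vlmd', 'extract', '--inputtype', INPUTTYPE_BY_SUFFIX[suffix], str(input_path)]


print(' '.join(build_extract_command('data/mydatafile.dta')))
```

Input types such as redcap and frictionless share generic extensions (csv, json), so they cannot be guessed this way and would need to be passed explicitly.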

Extract variable level metadata from your data:

"},{"location":"vlmd/extract/#output","title":"Output","text":"

Both the python and command line routes will result in a JSON and CSV version of the HEAL data dictionary in the output folder along with the validation reports in the errors folder. See below:

If valid, this file will contain:

{\n\"valid\": true,\n\"errors\": []\n}\n
- errors/heal-json-errors.json: outputted jsonschema validation report.
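
A validation report of the shape shown above can be summarized with a few lines of Python. This is a sketch under the assumption that every report has a boolean valid key and an errors list; the structure of individual error entries is not documented here, so the example only counts them. The function name summarize_report is hypothetical.

```python
import json


def summarize_report(report_text):
    """Summarize a validation report given as JSON text."""
    report = json.loads(report_text)
    if report.get('valid'):
        return 'valid'
    # Only the number of errors is reported; entry contents are not inspected.
    return 'invalid: {} error(s)'.format(len(report.get('errors', [])))


# Same content as the valid report shown above.
valid_report = json.dumps({'valid': True, 'errors': []})
print(summarize_report(valid_report))
```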

If no outputdir specified, the resulting HEAL-compliant data dictionaries will be named:

"},{"location":"vlmd/extract/csvdata/","title":"csv Datasets","text":"

CSV (comma-separated values) is the main open tabular data format for storage and exchange. It is easy to create and understand using basic text editors in addition to popular spreadsheet software like Google Sheets and Excel. Importantly, CSVs are simple and can be easily integrated into web applications and just about any software.

Currently, the HEAL Data Utilities vlmd function can generate a minimal, HEAL-compliant data dictionary by inferring name, type, and enum (i.e., possible values). After this minimal data dictionary is generated, the researcher can further annotate it with each field's description and other optional properties in either the HEAL-compliant csv- or json-formatted data dictionary (see the HEAL data dictionary template sections below for more information).
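
The idea of inferring name, type, and enum from a CSV can be sketched with the standard library alone. This is not the inference the package performs; the type names are assumed to follow the frictionless Table Schema (integer, number, string), and infer_type, infer_fields, and the max_enum_size cutoff are hypothetical.

```python
import csv
import io


def infer_type(values):
    """Guess 'integer', 'number', or 'string' from non-empty string values."""
    non_empty = [v for v in values if v != '']
    if not non_empty:
        return 'string'
    for type_name, caster in (('integer', int), ('number', float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue
    return 'string'


def infer_fields(csv_text, max_enum_size=5):
    """Infer a minimal field list (name, type, optional enum) from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = []
    for name in reader.fieldnames:
        values = [row[name] for row in rows]
        field = {'name': name, 'type': infer_type(values)}
        unique = sorted(set(v for v in values if v != ''))
        # Treat short string columns as categorical and record their values.
        if field['type'] == 'string' and len(unique) <= max_enum_size:
            field['constraints.enum'] = '|'.join(unique)
        fields.append(field)
    return fields


example_csv = 'age,score,sex\n34,1.5,F\n51,2,M\n'
for field in infer_fields(example_csv):
    print(field)
```

The description property is deliberately absent: it cannot be inferred from the data and has to be added during annotation.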

"},{"location":"vlmd/extract/exceldata/","title":"Excel (xlsx) dataset","text":"

Excel workbooks contain tabular data tables across named worksheets.

This vlmd extraction tool provides the ability to extract vlmd from all of these worksheets either as a combined data dictionary or as multiple data dictionaries.

"},{"location":"vlmd/extract/exceldata/#run-the-vlmd-command","title":"Run the vlmd command","text":"CLIPython
vlmd extract --inputtype excel-data myexcelfile.xlsx\n
"},{"location":"vlmd/extract/exceldata/#to-output-multiple-sheets-as-separate-data-dictionaries","title":"To output multiple sheets as separate data dictionaries","text":"
from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(input_filepath=\"myexcelfile.xlsx\",inputtype=\"excel-data\")\n
"},{"location":"vlmd/extract/exceldata/#to-extract-multiple-sheets-as-one-data-dictionary","title":"To extract multiple sheets as one data dictionary","text":"

Note

Be careful about using multiple_data_dicts=False. In most instances, one sheet should correspond to one separate data table and thus have one corresponding data dictionary.

Note, this option combines (i.e., concatenates) all data tables and then infers fields. The use case is when sheets are viewed as \"chunks\" of one resource/dataset.

from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(\n    input_filepath=\"myexcelfile.xlsx\",\n    inputtype=\"excel-data\",\n    multiple_data_dicts=False\n    )\n
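
What combining sheets means can be sketched without Excel at all: the rows of each sheet are concatenated first, and field names are collected from the combined rows afterwards. The sheet contents, combine_sheets, and field_names below are hypothetical and only illustrate the order of operations, not the package's implementation.

```python
from itertools import chain

# Hypothetical workbook: sheet name -> list of row dictionaries.
sheets = {
    'mysheet1': [{'id': 1, 'age': 34}, {'id': 2, 'age': 51}],
    'mysheet2': [{'id': 3, 'age': 29, 'site': 'A'}],
}


def combine_sheets(sheets, sheet_names=None):
    """Concatenate rows from all sheets, or from a chosen subset, in order."""
    selected = sheet_names if sheet_names is not None else list(sheets)
    return list(chain.from_iterable(sheets[name] for name in selected))


def field_names(rows):
    """Collect field names across rows, keeping first-seen order."""
    names = []
    for row in rows:
        for key in row:
            if key not in names:
                names.append(key)
    return names


combined = combine_sheets(sheets)
print(len(combined), field_names(combined))
```

Because fields are inferred after concatenation, a column that exists in only one sheet (site here) still ends up in the single combined data dictionary.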
"},{"location":"vlmd/extract/exceldata/#to-extract-a-subset-of-sheets-as-one-data-dictionary","title":"To extract a subset of sheets as one data dictionary","text":"

from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(\n    input_filepath=\"myexcelfile.xlsx\",\n    inputtype=\"excel-data\",\n    multiple_data_dicts=False,\n    sheet_name=[\"mysheet1\",\"mysheet2\"]\n    )\n

"},{"location":"vlmd/extract/frictionlessschema/","title":"Frictionless Table Schema","text":"

While the vlmd specifications are designed (and still being developed) to support interoperability with the HEAL platform, minor naming translations may be needed. This function supports any of said translations (e.g., frictionless fields name --> heal data_dictionary).

Note, this conversion supports either yaml or json format (currently, only the json format is tested, but yaml should work).

"},{"location":"vlmd/extract/frictionlessschema/#creating-a-frictionless-table-schema","title":"Creating a frictionless table schema","text":"

Below are the official frictionless table schema specifications, which overlap to a high degree with the HEAL variable-level metadata specifications.

See here for the frictionless table schema specs

"},{"location":"vlmd/extract/frictionlessschema/#run-the-vlmd-command","title":"Run the vlmd command","text":"
vlmd extract --inputtype frictionless data/frictionless_dataset1.frictionless.schema.json\n
"},{"location":"vlmd/extract/redcapcsv/","title":"REDCap: Data Dictionary CSV Export","text":"

For users collecting data in a REDCap data management system, HEAL-compliant data dictionaries can be generated directly from REDCap exports.

The REDCap data dictionary export provides variable-level metadata in a standardized, tabular format and is generally easy to export. The HEAL Data Utilities leverage this user experience and standardized format to enable HEAL researchers to generate a HEAL-compliant data dictionary.

"},{"location":"vlmd/extract/redcapcsv/#export-your-redcap-data-dictionary","title":"Export your Redcap data dictionary","text":"

To download a REDCap CSV export, do the following*:

  1. After logging in to your REDCap project page, locate the Data dictionary page. A link to this page may be available on the project side bar (see image below) or in the Project Setup tab at the top of your page.

  2. After arriving at the Data dictionary page, click on Download the current data dictionary to export the dictionary (see below).

*there may be slight differences depending on your specific REDCap instance and version

"},{"location":"vlmd/extract/redcapcsv/#run-the-vlmd-command","title":"Run the vlmd command","text":"
vlmd extract --inputtype redcap input/example_redcap_demo.redcap.csv 
"},{"location":"vlmd/extract/sas/","title":"SAS sas7bdat (and sas7bcat) files","text":"

To accommodate SAS users, the HEAL Data Utilities supports the binary sas7bdat file format, which contains the actual data values (observations/records). This file also includes variable metadata (variable names and variable labels/ descriptions).

The HEAL Data Utilities also provides the option to include a catalog file \u2013 sas7bcat format - with the sas7bdat. A sas7bcat file contains variable value labels, or encodings, that can be mapped onto the corresponding data from a sas7bdat file.

"},{"location":"vlmd/extract/sas/#creating-a-sas7bdat-and-a-sas7bcat-file","title":"Creating a sas7bdat and a sas7bcat file","text":"

Many SAS users build formats and labels into their data processing and analysis scripts. In this section, we provide syntax that can be easily copy-pasted into these existing workflows to create sas7bdat and sas7bcat files to input into the vlmd tool.

This script template can be run separately or inserted directly at the end of a SAS user's workflow.

Note

If inserted directly, remember to delete the lines with %INCLUDE.

Template template.sas
/*1. Read in data file without value labels and run full code. \n        Note: The most important pieces to run here are the PROC FORMAT statement(s) and any data steps \n        that assign formats and variable labels which are needed for the data dictionary. You may have defined variable labels and values in separate scripts for different analyses. In order to capture all your defined variable labels and values across scripts, you will need an %INCLUDE statement for each SAS script that defines unique variable labels or value labels.*/\n\n%INCLUDE \"<INSERT SAS SCRIPT HERE FILE PATH HERE>\"; /* THIS WILL RUN A SEPARATE SAS SCRIPT*/\n%INCLUDE \"<INSERT SAS SCRIPT HERE FILE PATH HERE>\"; /* THIS WILL RUN A SECOND SEPARATE SAS SCRIPT*/ \n\n/*2. Output the format catalog (sas7bcat) */\n/*2a. If you do not have an out directory, assign one to output the SAS catalog and data file. If you already have an out directory assigned, skip this step and replace \u201cout\u201d with your out directory libname in the flow*/\n\nlibname out \"<INSERT THE DESIRED LOCATION (FILE PATH) TO YOUR SAS7BCAT AND SAS7BDAT FILES HERE>\";\n\n/*2b. Output the format catalog.\n        Note: The format catalog is automatically stored in work.formats. This step copies the format file to the \n        out directory as a sas7bcat file.*/\nproc catalog cat=work.FORMATS;\n    copy out=out.FORMATS;\n    run;\n\n/*3. Output the data file (sas7bdat) */\ndata out.yourdata;\n    set <INSERT THE NAME OF YOUR FINAL SAS DATASET HERE>;\n    run;\n

The below SAS syntax is an example of how to use the template within your SAS workflow.

The below sample script creates all of our variable and value labels. Your workflow may include multiple SAS scripts with multiple format statements and may include analyses and other PROC calls for data exploration, but for demonstration purposes, this example only uses one script and focuses on defining the variable and value labels.

Example my_existing_sas_workflow.sas
/*1. Read in input data */\nproc import datafile=\"myprojectfolder/input/mydata.csv\"\n    out=raw\n    dbms=csv replace;\n    getnames=yes;\nrun;\n\n/*2. Set up proc format and apply formats and variable labels in data step */\n/*Create encodings (value labels)*/\nproc format;\n    VALUE YESNO\n    0       =\"No\"\n    1       =\"Yes\";\n\n    VALUE PUBLIC\n    1='State mental health authority (SMHA)'\n    2='Other state government agency or department'\n    3='Regional/district authority or county, local, or municipal government'\n    4='Tribal government'\n    5='Indian Health Service'\n    6='Department of Veterans Affairs'\n    7='Other';\n\n    VALUE FOCUS\n    1='Mental health treatment'\n    2='Substance abuse treatment'\n    3='Mix of mental health and substance abuse treatment (neither is primary)'\n    4='General health care'\n    5='Other service focus';\nrun;\n\n**Apply formats to dataset;\ndata processed;\n    set raw;\n\n    /*Assign formats*/\n    format YOUNGADULTS TREATPSYCHOTHRPY TREATTRAUMATHRPY YESNO. FOCUS FOCUS. PUBLIC PUBLIC.;\n    /*Add variable labels*/\n    label YOUNGADULTS=\"Accepts young adults (aged 18-25 years old) for Tx\"\n            TREATPSYCHOTHRPY=\"Facility offers individual psychotherapy\"\n            TREATTRAUMATHRPY=\"Facility offers trauma therapy\"\n            FOCUS=\"Primary treatment focus of facility\"\n            PUBLIC=\"Public agency or department that operates facility\";\nrun;\n

This second script, called my_output.sas, is the filled-out template. Note the %INCLUDE statement that runs my_existing_sas_workflow.sas.

my_output.sas
/*1. Read in data file without value labels and run full code. \n        Note: The most important pieces to run here are the PROC FORMAT statement(s) and any data steps \n        that assign formats and variable labels which are needed for the data dictionary. You may have defined variable labels and values in separate scripts for different analyses. In order to capture all your defined variable labels and values across scripts, you will need an %INCLUDE statement for each SAS script that defines unique variable labels or value labels.*/\n\n%INCLUDE \"myprojectfolder/my_existing_sas_workflow.sas\"; /* THIS WILL RUN A SEPARATE SAS SCRIPT*/\n\n/*2. Output the format catalog (sas7bcat) */\n/*2a. If you do not have an out directory, assign one to output the SAS catalog and data file.*/\nlibname out \"myprojectfolder/output\";\n\n/*2b. Output the format catalog.\n        Note: The format catalog is automatically stored in work.formats. This step copies the format file to the \n        out directory as a sas7bcat file.*/\nproc catalog cat=work.FORMATS;\n    copy out=out.FORMATS;\n    run;\n\n/*3. Output the data file (sas7bdat) to your output folder*/\ndata out.yourdata;\n    set processed;\n    run;\n
"},{"location":"vlmd/extract/sas/#run-the-vlmd-command","title":"Run the vlmd command","text":"

After creating the necessary sas7bdat and sas7bcat files, you can then run the vlmd command. The tool will automatically detect the sas7bcat file if it is located in the same directory as your data file. If it is not detected, the command will run without the sas7bcat catalog file, and the encodings (i.e., value labels) will not be extracted.

vlmd extract --inputtype sas input/data.sas7bdat 
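
The same-directory detection described above could look roughly like the sketch below. How vlmd actually locates the catalog is not documented here, so this is only an assumption-laden illustration; find_catalog is a hypothetical helper, and picking the first match alphabetically is an arbitrary choice.

```python
from pathlib import Path
import tempfile


def find_catalog(data_path):
    """Return the first .sas7bcat file next to data_path, or None if there is none."""
    candidates = sorted(Path(data_path).parent.glob('*.sas7bcat'))
    return candidates[0] if candidates else None


# Demonstrate with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / 'data.sas7bdat'
    data_file.touch()
    print(find_catalog(data_file))  # no catalog present yet
    (Path(tmp) / 'formats.sas7bcat').touch()
    print(find_catalog(data_file).name)
```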
"},{"location":"vlmd/extract/spss/","title":"SPSS .sav files","text":"

For SPSS users, the HEAL Data Utilities generates HEAL-compliant data dictionaries from SPSS's default file format for storing datasets: a SAV file. It stores not only the data itself but also metadata such as variable names, variable labels, types, and value labels. The HEAL Data Utilities extracts these data and metadata to create HEAL-compliant data dictionaries.

"},{"location":"vlmd/extract/spss/#run-the-vlmd-command","title":"Run the vlmd command","text":"
vlmd extract --inputtype spss data/example_pyreadstat_output.sav 
"},{"location":"vlmd/extract/stata/","title":"Stata .dta files","text":"

For Stata users, the HEAL Data Utilities generates HEAL-compliant data dictionaries through Stata's default file format: a DTA file. DTA files store not only the data itself but also metadata such as variable names, variable labels, types, and value labels.

"},{"location":"vlmd/extract/stata/#run-the-vlmd-command","title":"Run the vlmd command","text":"
vlmd extract --inputtype stata data/mydatafile.dta 
"},{"location":"vlmd/schemas/","title":"HEAL data dictionary schemas","text":"

Click on each data dictionary schema below to view information about each format's data dictionary properties (such as a description, examples, etc).

CSV fields

JSON data dictionary

Note

enum type means that a field can only be one of a certain set of possible values.

"},{"location":"vlmd/schemas/csv-fields/","title":"Tabular (CSV) data dictionary","text":"HEAL Variable Level Metadata Fields HEAL Variable Level Metadata Fields Type: object

Variable level metadata individual fields integrated into the variable level metadata object within the HEAL platform metadata service.

Note, only name and description are required. Listed at the end of the description are suggested \"priority\" levels in brackets (e.g., []): 1. [Required]: Needs to be filled out to be valid. 2. [Highly recommended]: Greatly help using the data dictionary but not required. 3. [Optional, if applicable]: May only be applicable to certain fields. 4. [Autopopulated, if not filled]: These fields are intended to be autopopulated from other fields but can be filled out if desired. 5. [Experimental]: These fields are not currently used but are in development. module root moduleType: string

The section, form, survey instrument, set of measures or other broad category used to group variables.

Examples:
\"Demographics\"\n
\"PROMIS\"\n
\"Substance use\"\n
\"Medical History\"\n
\"Sleep questions\"\n
\"Physical activity\"\n
name Required root nameType: string

The name of a variable (i.e., field) as it appears in the data.

[Required]

title root titleType: string

The human-readable title or label of the variable.

[Highly recommended]

Example:
\"My Variable (for name of my_variable)\"\n
description Required root descriptionType: string

An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).

[Required]

Examples:
\"Definition\"\n
\"Question text (if a survey)\"\n
type root typeType: enum (of string)

A classification or category of a particular data element or property expected or allowed in the dataset.

Must be one of: format root format

A format taken from one of the frictionless specification schemas. For example, for tabular data, there is the Table Schema specification

Each format is dependent on the type specified. For example: If type is \"string\", then see the String formats. If type is one of the date-like formats, then see Date formats.

Any of root format anyOf String FormatType: enum (of string) Must be one of: root format anyOf Date FormatType: object

A format for a date variable (date, time, datetime).

- default: An ISO8601 format string.
- any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies.
- {PATTERN}: The value can be parsed according to {PATTERN}, which MUST follow the date formatting syntax of C / Python strftime.

Examples:

- %Y-%m-%d (for date, e.g., 2023-05-25)
- %Y%m%d (for date without dashes, e.g., 20230525)
- %Y-%m-%dT%H:%M:%S (for datetime, e.g., 2023-05-25T10:30:45)
- %Y-%m-%dT%H:%M:%SZ (for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z)
- %Y-%m-%dT%H:%M:%S%z (for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300)
- %Y-%m-%dT%H:%M (for datetime without seconds, e.g., 2023-05-25T10:30)
- %Y-%m-%dT%H (for datetime without minutes and seconds, e.g., 2023-05-25T10)
- %H:%M:%S (for time, e.g., 10:30:45)
- %H:%M:%SZ (for time with UTC timezone, e.g., 10:30:45Z)
- %H:%M:%S%z (for time with timezone offset, e.g., 10:30:45+0300)
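
Because {PATTERN} follows the C / Python strftime syntax, a pattern can be checked against sample values with the standard datetime module before it is written into a data dictionary. The sketch below is only an illustration; the pattern and value pairs are a subset of the examples listed above.

```python
from datetime import datetime

# (pattern, sample value) pairs taken from the examples above.
examples = [
    ('%Y-%m-%d', '2023-05-25'),
    ('%Y-%m-%dT%H:%M:%S', '2023-05-25T10:30:45'),
    ('%Y-%m-%dT%H:%M:%S%z', '2023-05-25T10:30:45+0300'),
    ('%H:%M:%S', '10:30:45'),
]

for pattern, text in examples:
    parsed = datetime.strptime(text, pattern)
    # Formatting with the same pattern should reproduce the original text.
    assert parsed.strftime(pattern) == text
    print(pattern, '->', parsed.isoformat())
```

A value that does not match its pattern makes strptime raise ValueError, which is a quick way to catch a mistyped format.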

root format anyOf Geopoint Format

The two types of formats for geopoint (describing a geographic point).

One of root format anyOf Geopoint Format oneOf item 0Type: array

A JSON array or a string parsable as a JSON array where each item is a number with the first as the latitude and the second as longitude.

root format anyOf Geopoint Format oneOf item 1Type: object

Contains latitude and longitude with two keys (\"lat\" and \"long\") with number items mapped to each key.
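
Both geopoint representations can be normalized to a single (latitude, longitude) pair. The sketch below follows the ordering and key names given in the descriptions above (latitude first; keys lat and long); parse_geopoint is a hypothetical helper, and the coordinates are made up.

```python
import json


def parse_geopoint(value):
    """Return (lat, long) from an array, an object, or JSON text holding either."""
    if isinstance(value, str):
        value = json.loads(value)
    if isinstance(value, dict):
        return float(value['lat']), float(value['long'])
    lat, long_ = value  # array form: latitude first, then longitude
    return float(lat), float(long_)


print(parse_geopoint('[41.88, -87.63]'))
print(parse_geopoint({'lat': 41.88, 'long': -87.63}))
```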

root format anyOf geojsonType: enum (of string)

The JSON object according to the geojson spec.

Must be one of: constraints.maxLength root constraints.maxLengthType: integer

Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.

[Optional,if applicable]

constraints.enum root constraints.enumType: string

Constrains possible values to a set of values.

[Optional,if applicable]

Must match regular expression: ^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$ constraints.pattern root constraints.patternType: string

A regular expression pattern the data MUST conform to.

[Optional,if applicable]

constraints.maximum root constraints.maximumType: integer

Specifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer, etc.). Note, this is different from the maxLength property.

[Optional,if applicable]

encodings root encodingsType: string

Variable value encodings provide a way to further annotate any value within any variable type, making values easier to understand.

Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings, and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.

Additionally, as another use case, this field provides a way to store categoricals that are stored as \"short\" labels (such as abbreviations).

[Optional,if applicable]

Must match regular expression: ^(?:.*?=.*?(?:\\||$))+$ Examples:
\"0=No|1=Yes\"\n
\"HW=Hello world|GBW=Good bye world|HM=Hi,Mike\"\n
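
The value=label pairs in an encodings string can be split into a mapping with plain string operations. This is a minimal sketch, assuming that neither values nor labels contain the | or = characters; parse_encodings is a hypothetical helper and not a function of the package.

```python
def parse_encodings(text):
    """Split 'value=label' pairs separated by '|' into a dictionary."""
    mapping = {}
    for pair in text.split('|'):
        if not pair:
            continue  # tolerate a trailing separator
        value, _, label = pair.partition('=')
        mapping[value.strip()] = label.strip()
    return mapping


print(parse_encodings('0=No|1=Yes'))
print(parse_encodings('HW=Hello world|GBW=Good bye world|HM=Hi,Mike'))
```

Note that commas are safe inside labels (as in Hi,Mike) because only the pipe separates pairs.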
ordered root orderedType: boolean

Indicates whether a categorical variable is ordered. This variable is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).

[Optional,if applicable]

missingValues root missingValuesType: string

A list of missing values specific to a variable.

[Optional, if applicable]

Must match regular expression: ^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$ trueValues root trueValuesType: string

For boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.

[Optional, if applicable]

Must match regular expression: ^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$ Examples:
\"Required|REQUIRED\"\n
\"required|Yes|Y|Checked\"\n
\"Checked\"\n
\"Required\"\n
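
A pipe-delimited trueValues string can be used to cast raw values to booleans, as in the sketch below. The helper cast_boolean is hypothetical; it also accepts an optional pipe-delimited string of false values and returns None when a raw value matches neither list, which is a design choice of this example rather than documented behavior.

```python
def cast_boolean(raw, true_values, false_values=''):
    """Cast raw to True/False using pipe-delimited value lists; None if unmatched."""
    trues = set(true_values.split('|')) if true_values else set()
    falses = set(false_values.split('|')) if false_values else set()
    if raw in trues:
        return True
    if raw in falses:
        return False
    return None


print(cast_boolean('Checked', 'required|Yes|Y|Checked'))
print(cast_boolean('Unchecked', 'Checked', 'Unchecked'))
print(cast_boolean('maybe', 'Checked'))
```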
falseValues root falseValuesType: string

For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.

Must match regular expression: ^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$ repo_link root repo_linkType: string

A link to the variable as it exists on the home repository, if applicable

cde_id.source root cde_id.sourceType: string cde_id.id root cde_id.idType: string ontology_id.relation root ontology_id.relationType: string ontology_id.source root ontology_id.sourceType: string ontology_id.id root ontology_id.idType: string standardsMappings.type root standardsMappings.typeType: string

The type of mapping linked to a published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]

Examples:
\"cde\"\n
\"ontology\"\n
\"reference_list\"\n
standardsMappings.label root standardsMappings.labelType: string

A free text label of a mapping indicating a mapping(s) to a published set of standard variables such as the NIH Common Data Elements program.

[Autopopulated, if not filled]

Examples:
\"substance use\"\n
\"chemical compound\"\n
\"promis\"\n
standardsMappings.url root standardsMappings.urlType: stringFormat: uri

The url that links out to the published, standardized mapping.

[Autopopulated, if not filled]

Example:
\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\n
standardsMappings.source root standardsMappings.sourceType: string

The source of the standardized variable.

Example:
\"TBD (will have controlled vocabulary)\"\n
standardsMappings.id root standardsMappings.idType: string

The id locating the individual mapping within the given source.

relatedConcepts.type root relatedConcepts.typeType: string

The type of mapping to a published set of concepts related to the given field, such as ontological information (e.g., NCI thesaurus, bioportal, etc.).

[Autopopulated, if not filled]

relatedConcepts.label root relatedConcepts.labelType: string

A free text label of a mapping to a published set of concepts related to the given field, such as ontological information (e.g., NCI thesaurus, bioportal, etc.).

[Autopopulated, if not filled]

relatedConcepts.url root relatedConcepts.urlType: stringFormat: uri

The url that links out to the published, standardized concept.

[Autopopulated, if not filled]

Example:
\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\n
relatedConcepts.source root relatedConcepts.sourceType: string

The source of the related concept.

[Autopopulated, if not filled]

Example:
\"TBD (will have controlled vocabulary)\"\n
relatedConcepts.id root relatedConcepts.idType: string

The id locating the individual mapping within the given source.

[Autopopulated, if not filled]

univarStats.median root univarStats.medianType: number univarStats.mean root univarStats.meanType: number univarStats.std root univarStats.stdType: number univarStats.min root univarStats.minType: number univarStats.max root univarStats.maxType: number univarStats.mode root univarStats.modeType: number univarStats.count root univarStats.countType: integer

Value must be greater or equal to 0

univarStats.twentyFifthPercentile root univarStats.twentyFifthPercentileType: number univarStats.seventyFifthPercentile root univarStats.seventyFifthPercentileType: number univarStats.categoricalMarginals.name root univarStats.categoricalMarginals.nameType: string univarStats.categoricalMarginals.count root univarStats.categoricalMarginals.countType: integer Additional Properties

Additional Properties of any type are allowed.

root additionalPropertiesType: object

Generated using json-schema-for-humans on 2023-07-05 at 17:11:06 -0500

"},{"location":"vlmd/schemas/json-data-dictionary/","title":"JSON data dictionary","text":"Variable Level Metadata (Data Dictionaries) Variable Level Metadata (Data Dictionaries) Type: object

This schema defines the variable level metadata for one data dictionary for a given study. Note, a given study can have multiple data dictionaries.

title Required root titleType: string description root descriptionType: string data_dictionary Required root data_dictionaryType: array of object Each item of this array must be: root data_dictionary HEAL Variable Level Metadata FieldsType: object

Variable level metadata individual fields integrated into the variable level metadata object within the HEAL platform metadata service.

Note, only name and description are required. Listed at the end of the description are suggested \"priority\" levels in brackets (e.g., []): 1. [Required]: Needs to be filled out to be valid. 2. [Highly recommended]: Greatly help using the data dictionary but not required. 3. [Optional, if applicable]: May only be applicable to certain fields. 4. [Autopopulated, if not filled]: These fields are intended to be autopopulated from other fields but can be filled out if desired. 5. [Experimental]: These fields are not currently used but are in development. module root data_dictionary HEAL Variable Level Metadata Fields moduleType: string

The section, form, survey instrument, set of measures or other broad category used to group variables.

Examples:
\"Demographics\"\n
\"PROMIS\"\n
\"Substance use\"\n
\"Medical History\"\n
\"Sleep questions\"\n
\"Physical activity\"\n
name Required root data_dictionary HEAL Variable Level Metadata Fields nameType: string

The name of a variable (i.e., field) as it appears in the data.

[Required]

title root data_dictionary HEAL Variable Level Metadata Fields titleType: string

The human-readable title or label of the variable.

[Highly recommended]

Example:
\"My Variable (for name of my_variable)\"\n
description Required root data_dictionary HEAL Variable Level Metadata Fields descriptionType: string

An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).

[Required]

Examples:
\"Definition\"\n
\"Question text (if a survey)\"\n
type root data_dictionary HEAL Variable Level Metadata Fields typeType: enum (of string)

A classification or category of a particular data element or property expected or allowed in the dataset.

Must be one of: format root data_dictionary HEAL Variable Level Metadata Fields format

A format taken from one of the frictionless specification schemas. For example, for tabular data, there is the Table Schema specification.

Each format is dependent on the type specified. For example: If type is \"string\", then see the String formats. If type is one of the date-like formats, then see Date formats.

Any of root data_dictionary HEAL Variable Level Metadata Fields format anyOf String FormatType: enum (of string) Must be one of: root data_dictionary HEAL Variable Level Metadata Fields format anyOf Date FormatType: object

A format for a date variable (date, time, datetime).

* default: An ISO8601 format string.
* any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies.
* {PATTERN}: The value can be parsed according to {PATTERN}, which MUST follow the date formatting syntax of C / Python strftime.

Examples:

%Y-%m-%d (for date, e.g., 2023-05-25) %Y%m%d (for date without dashes, e.g., 20230525) %Y-%m-%dT%H:%M:%S (for datetime, e.g., 2023-05-25T10:30:45) %Y-%m-%dT%H:%M:%SZ (for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z) %Y-%m-%dT%H:%M:%S%z (for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300) %Y-%m-%dT%H:%M (for datetime without seconds, e.g., 2023-05-25T10:30) %Y-%m-%dT%H (for datetime without minutes and seconds, e.g., 2023-05-25T10) %H:%M:%S (for time, e.g., 10:30:45) %H:%M:%SZ (for time with UTC timezone, e.g., 10:30:45Z) %H:%M:%S%z (for time with timezone offset, e.g., 10:30:45+0300)
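
As a rough illustration of how a {PATTERN} format could be checked, the sketch below tries a few of the listed patterns with Python's standard datetime.strptime. The helper name matches_pattern and the sample values are assumptions made for this example; they are not part of healdata-utils.

```python
from datetime import datetime

# (pattern, sample value) pairs taken from the examples listed above
samples = [
    ('%Y-%m-%d', '2023-05-25'),
    ('%Y-%m-%dT%H:%M:%S', '2023-05-25T10:30:45'),
    ('%Y-%m-%dT%H:%M:%S%z', '2023-05-25T10:30:45+0300'),
    ('%H:%M:%S', '10:30:45'),
]


def matches_pattern(value, pattern):
    '''Return True if value parses with the strftime-style pattern (hypothetical helper).'''
    try:
        datetime.strptime(value, pattern)
    except ValueError:
        return False
    return True


# Map each sample value to whether it parses with its pattern
results = {value: matches_pattern(value, pattern) for pattern, value in samples}
```

A value that does not follow the pattern (for example a day/month/year string checked against %Y-%m-%d) makes strptime raise ValueError, which the helper reports as False.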

root data_dictionary HEAL Variable Level Metadata Fields format anyOf Geopoint Format

The two types of formats for geopoint (describing a geographic point).

One of root data_dictionary HEAL Variable Level Metadata Fields format anyOf Geopoint Format oneOf item 0Type: array

A JSON array or a string parsable as a JSON array where each item is a number with the first as the latitude and the second as longitude.

root data_dictionary HEAL Variable Level Metadata Fields format anyOf Geopoint Format oneOf item 1Type: object

Contains latitude and longitude with two keys (\"lat\" and \"long\") with number items mapped to each key.

root data_dictionary HEAL Variable Level Metadata Fields format anyOf geojsonType: enum (of string)

The JSON object according to the geojson spec.

Must be one of: constraints root data_dictionary HEAL Variable Level Metadata Fields constraintsType: object maxLength root data_dictionary HEAL Variable Level Metadata Fields constraints maxLengthType: integer

Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.

[Optional, if applicable]

enum root data_dictionary HEAL Variable Level Metadata Fields constraints enumType: array

Constrains possible values to a set of values.

[Optional, if applicable]

pattern root data_dictionary HEAL Variable Level Metadata Fields constraints patternType: string

A regular expression pattern the data MUST conform to.

[Optional, if applicable]

maximum root data_dictionary HEAL Variable Level Metadata Fields constraints maximumType: integer

Specifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer, etc.). Note, this is different from the maxLength property.

[Optional, if applicable]
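
To make the four constraints above concrete, here is a minimal sketch of checking one value against a constraints object. The function check_constraints is hypothetical and is not how healdata-utils validates data; it also assumes that pattern must match the whole value, which is an interpretation made for this example.

```python
import re


def check_constraints(value, constraints):
    '''Return the names of the constraints that value violates (hypothetical helper).'''
    failed = []
    # maxLength: compare the length of the value's string form
    if 'maxLength' in constraints and len(str(value)) > constraints['maxLength']:
        failed.append('maxLength')
    # enum: the value must be one of the listed values
    if 'enum' in constraints and value not in constraints['enum']:
        failed.append('enum')
    # pattern: assumed here to apply to the whole value
    if 'pattern' in constraints and not re.fullmatch(constraints['pattern'], str(value)):
        failed.append('pattern')
    # maximum: the value must not exceed the stated maximum
    if 'maximum' in constraints and value > constraints['maximum']:
        failed.append('maximum')
    return failed


ok = check_constraints('Hello World', {'maxLength': 11, 'pattern': '[A-Za-z ]+'})
too_long = check_constraints('Hello World!', {'maxLength': 11})
```

Here 'Hello World' has exactly 11 characters, so it satisfies a maxLength of 11, while adding one more character violates it.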

encodings root data_dictionary HEAL Variable Level Metadata Fields encodingsType: object

Variable value encodings provide a way to further annotate any value within any variable type, making values easier to understand.

Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.

Additionally, as another use case, this field provides a way to store categoricals that are stored as \"short\" labels (such as abbreviations).

[Optional, if applicable]

Examples:
{\n\"0\": \"No\",\n\"1\": \"Yes\"\n}\n
{\n\"HW\": \"Hello world\",\n\"GBW\": \"Good bye world\",\n\"HM\": \"Hi, Mike\"\n}\n
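
A short sketch of how such an encodings object might be applied when reading data: stored values are looked up by their string form and replaced with the label. The helper decode and the fallback of returning the raw value unchanged are assumptions for illustration only.

```python
# Encodings as in the first example above (keys are strings)
encodings = {'0': 'No', '1': 'Yes'}


def decode(value, mapping):
    '''Return the label for a stored value, or the value itself if it has no encoding.'''
    return mapping.get(str(value), value)


raw_values = [0, 1, 1, 0, 9]
labels = [decode(value, encodings) for value in raw_values]
```

The value 9 has no entry in the encodings, so it passes through unchanged in this sketch; in practice such a value would more likely be listed in missingValues.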
ordered root data_dictionary HEAL Variable Level Metadata Fields orderedType: boolean

Indicates whether a categorical variable is ordered. This variable is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).

[Optional, if applicable]

missingValues root data_dictionary HEAL Variable Level Metadata Fields missingValuesType: array

A list of missing values specific to a variable.

[Highly recommended]

trueValues root data_dictionary HEAL Variable Level Metadata Fields trueValuesType: array of string

For boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.

[Optional, if applicable]

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields trueValues trueValues itemsType: string Examples:
\"Required\"\n
\"REQUIRED\"\n
\"required\"\n
\"Yes\"\n
\"Checked\\\"\"\n
falseValues root data_dictionary HEAL Variable Level Metadata Fields falseValuesType: array

For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.

repo_link root data_dictionary HEAL Variable Level Metadata Fields repo_linkType: string

A link to the variable as it exists on the home repository, if applicable

cde_id root data_dictionary HEAL Variable Level Metadata Fields cde_idType: array of object

[FUTURE WARNING: WILL BE DEPRECATED] Use standardsMappings. The source and id for the NIH Common Data Elements program.

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields cde_id cde_id itemsType: object source root data_dictionary HEAL Variable Level Metadata Fields cde_id cde_id items sourceType: string id root data_dictionary HEAL Variable Level Metadata Fields cde_id cde_id items idType: string ontology_id root data_dictionary HEAL Variable Level Metadata Fields ontology_idType: array of object

[FUTURE WARNING: WILL BE DEPRECATED] - Use relatedConcepts. Ontological information for the given variable as indicated by the source, id, and relation to the specified classification. One or more ontology classifications can be specified.

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields ontology_id ontology_id itemsType: object relation root data_dictionary HEAL Variable Level Metadata Fields ontology_id ontology_id items relationType: string source root data_dictionary HEAL Variable Level Metadata Fields ontology_id ontology_id items sourceType: string id root data_dictionary HEAL Variable Level Metadata Fields ontology_id ontology_id items idType: string standardsMappings root data_dictionary HEAL Variable Level Metadata Fields standardsMappingsType: array of object

A published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings itemsType: object type root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items typeType: string

The type of mapping linked to a published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]

Examples:
\"cde\"\n
\"ontology\"\n
\"reference_list\"\n
label root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items labelType: string

A free text label for a mapping to a published set of standard variables such as the NIH Common Data Elements program.

[Autopopulated, if not filled]

Examples:
\"substance use\"\n
\"chemical compound\"\n
\"promis\"\n
url root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items urlType: stringFormat: uri

The url that links out to the published, standardized mapping.

[Autopopulated, if not filled]

Example:
\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\n
source root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items sourceType: string

The source of the standardized variable.

Example:
\"TBD (will have controlled vocabulary)\"\n
id root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items idType: string

The id locating the individual mapping within the given source.

relatedConcepts root data_dictionary HEAL Variable Level Metadata Fields relatedConceptsType: array of object

Mappings to a published set of concepts related to the given field such as ontological information (e.g., NCI thesaurus, bioportal, etc.). [Autopopulated, if not filled]

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts itemsType: object type root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items typeType: string

The type of mapping to a published set of concepts related to the given field such as ontological information (e.g., NCI thesaurus, bioportal, etc.).

[Autopopulated, if not filled]

label root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items labelType: string

A free text label for a mapping to a published set of concepts related to the given field such as ontological information (e.g., NCI thesaurus, bioportal, etc.).

[Autopopulated, if not filled]

url root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items urlType: stringFormat: uri

The url that links out to the published, standardized concept.

[Autopopulated, if not filled]

Example:
\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\n
source root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items sourceType: string

The source of the related concept.

[Autopopulated, if not filled]

Example:
\"TBD (will have controlled vocabulary)\"\n
id root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items idType: string

The id locating the individual mapping within the given source.

[Autopopulated, if not filled]

univarStats root data_dictionary HEAL Variable Level Metadata Fields univarStatsType: object

Univariate statistics inferred from the data about the given variable

[Experimental]

median root data_dictionary HEAL Variable Level Metadata Fields univarStats medianType: number mean root data_dictionary HEAL Variable Level Metadata Fields univarStats meanType: number std root data_dictionary HEAL Variable Level Metadata Fields univarStats stdType: number min root data_dictionary HEAL Variable Level Metadata Fields univarStats minType: number max root data_dictionary HEAL Variable Level Metadata Fields univarStats maxType: number mode root data_dictionary HEAL Variable Level Metadata Fields univarStats modeType: number count root data_dictionary HEAL Variable Level Metadata Fields univarStats countType: integer

Value must be greater than or equal to 0

twentyFifthPercentile root data_dictionary HEAL Variable Level Metadata Fields univarStats twentyFifthPercentileType: number seventyFifthPercentile root data_dictionary HEAL Variable Level Metadata Fields univarStats seventyFifthPercentileType: number categoricalMarginals root data_dictionary HEAL Variable Level Metadata Fields univarStats categoricalMarginalsType: array of object Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields univarStats categoricalMarginals categoricalMarginals itemsType: object name root data_dictionary HEAL Variable Level Metadata Fields univarStats categoricalMarginals categoricalMarginals items nameType: string count root data_dictionary HEAL Variable Level Metadata Fields univarStats categoricalMarginals categoricalMarginals items countType: integer Additional Properties

Additional Properties of any type are allowed.

root data_dictionary HEAL Variable Level Metadata Fields additionalPropertiesType: object

Generated using json-schema-for-humans on 2023-07-03 at 09:08:41 -0500

"},{"location":"vlmd/start/","title":"Start from a template","text":"

Some folks may prefer to create their HEAL data dictionary from scratch. To support this, we have created a utility that creates either a json or csv template.

Warning

Currently, the command is template but will change to start to be consistent with the verb subcommand vocabulary.

"},{"location":"vlmd/start/#csv-template","title":"csv template","text":"

The HEAL Data Utilities can also input a csv HEAL data dictionary either from a manually filled out template or as an additional step after further annotation (e.g., from the csv HEAL data dictionary output of the other file formats).

To create a template csv version with 10 fields (variables):

Command line interface (CLI)Python
vlmd template myhealdd.csv --numfields 10\n
from pathlib import Path\n\nfrom healdata_utils import write_vlmd_template\n\n# Write a csv template with 10 empty fields to the current directory\nwrite_vlmd_template(Path(\"myhealdd.csv\"),numfields=10)\n

Click here to download an example of a filled out csv HEAL data dictionary template

"},{"location":"vlmd/start/#json-template","title":"json template","text":"

While the csv HEAL data dictionary provides a tabular format for HEAL-compliant data dictionaries, ultimately, these csv data dictionary files are converted to a json file (the most common format to store and exchange data within web applications such as the HEAL Data Platform).

Another advantage of json HEAL data dictionaries is that one can specify metadata describing the data dictionary as a whole (e.g., the description and title).

To create a template json version with 10 fields (variables):

Command line interface (CLI)Python
vlmd template myhealdd.json --numfields 10\n
from pathlib import Path\n\nfrom healdata_utils import write_vlmd_template\n\n# Write a json template with 10 empty fields to the current directory\nwrite_vlmd_template(Path(\"myhealdd.json\"),numfields=10)\n

Click here to download an example of a filled out json HEAL data dictionary template

"},{"location":"vlmd/validate/","title":"Validate Check (validate) an existing HEAL data dictionary file","text":"

The validate command indicates whether the data dictionary complies with the HEAL specifications.

Command line interface (CLI)Python
vlmd validate data/myhealcsvdd.csv\n\nvlmd validate data/myhealjsondd.json\n
from healdata_utils import validate_vlmd_csv,validate_vlmd_json\n\nvalidate_vlmd_csv(\"data/myhealcsvdd.csv\")\n\nvalidate_vlmd_json(\"data/myhealjsondd.json\")\n
"}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"HEAL Data Utilities","text":"

The HEAL Data Utilities python package provides data packaging tools for the HEAL Data Ecosystem to facilitate data discovery, sharing, and harmonization on the HEAL Platform.

Currently, the focus of this repository is generating standardized variable level metadata (VLMD) in the form of data dictionaries. See the quick start section to get started without installing any of the prerequisites. (Click here for the Variable-level Metadata documentation section).

However, in the future, this will be expanded for all HEAL-specific data packaging functions (e.g., study- and file-level metadata and data).

"},{"location":"#quick-start","title":"Quick start","text":"

Note

If using the quick start option, no prerequisites are required.

Double click on the vlmd (or vlmd.exe) executable or run the vlmd executable without any arguments to quickly start using this tool. This \"quick start\" will walk you through step by step, prompting you for the various options.

Important

Stand alone applications for different operating systems are available here. These allow you to run the vlmd tool without needing to install anything else. Just (1) download, (2) unzip, and (3) double click on the vlmd application icon.

"},{"location":"#prerequisites","title":"Prerequisites","text":""},{"location":"#python","title":"Python","text":"

While the HEAL Data Utilities should be compatible with most versions of Python, you can download the latest version of Python here and install it on your local computer. We recommend installing Python version 3.10.

"},{"location":"#installation","title":"Installation","text":"

To install the latest official release of healdata-utils, from your computer's command prompt, run:

pip install healdata-utils

OR for the most up-to-date unreleased version run:

pip install git+https://github.com/norc-heal/healdata-utils.git

Note

Installing the unreleased version requires having git software installed.

"},{"location":"vlmd/","title":"Variable-level Metadata (Data Dictionaries)","text":""},{"location":"vlmd/#motivation","title":"Motivation","text":"

Variable level metadata (VLMD), in the form of standardized data dictionaries, provides an exciting opportunity:

For an example of this searchability in the context of study level metadata, see the platform's discovery page

"},{"location":"vlmd/#functions","title":"Functions","text":"

extract: Extract the variable level metadata from an existing file with a specific type/format

start: Start a data dictionary from an empty template

validate: Check (validate) an existing HEAL data dictionary file to see if it follows the HEAL specifications after filling out a template or further annotation after extracting from a different format.

Typical workflows for creating a HEAL-compliant data dictionary include:

  1. Create your data dictionary

    (a) Run the vlmd extract command (or convert_to_vlmd if in python) to generate a HEAL-compliant data dictionary via your desired input format

    (b) Run the vlmd template command to start from an empty template.

  2. Add/annotate with additional information in your preferred HEAL data dictionary format (either json or csv).

  3. Run the vlmd validate command with your HEAL data dictionary as the input to validate.

  4. Repeat (2) and (3) until you are ready to submit. Please note, currently only name and description are required.
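
Before running vlmd validate in step (3), a quick pre-check of the required properties can catch obvious gaps. The sketch below is a hypothetical helper, not part of healdata-utils and not a substitute for vlmd validate; it assumes the json data dictionary layout (root title and data_dictionary, with name and description on each field).

```python
def find_missing_required(data_dictionary_doc):
    '''List required properties that are absent or empty (hypothetical pre-check).'''
    problems = []
    # Root-level properties marked Required in the json data dictionary schema
    for key in ('title', 'data_dictionary'):
        if key not in data_dictionary_doc:
            problems.append(key)
    # Each field currently requires only name and description
    for index, field in enumerate(data_dictionary_doc.get('data_dictionary', [])):
        for key in ('name', 'description'):
            if not field.get(key):
                problems.append(f'data_dictionary[{index}].{key}')
    return problems


draft = {
    'title': 'My study data dictionary',
    'data_dictionary': [
        {'name': 'age', 'description': 'Age of participant in years'},
        {'name': 'sex'},
    ],
}
missing = find_missing_required(draft)
```

In this draft the second field still lacks a description, so it is the only entry reported.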

"},{"location":"vlmd/#definitions","title":"Definitions","text":"

Important

The main difference* between the CSV and JSON definitions lies in the way the data dictionaries are structured and the additional metadata included in the JSON data dictionary.

The CSV data dictionary is a plain tabular representation with no additional metadata, while the JSON dataset includes fields along with additional metadata in the form of a root description and title.

For more information on variable-level metadata properties (fields), see the csv field specification and json data dictionary specification.

"},{"location":"vlmd/extract/","title":"Extract VLMD from another data type and format","text":"

The healdata-utils variable-level metadata (vlmd) tool inputs a variety of different input file types and extracts HEAL-compliant data dictionaries (JSON and CSV formats). Additionally, exported validation (i.e., \"error\") reports provide the user information as to a) if the exported data dictionary is valid according to HEAL specifications and b) how to modify one's data dictionary to make it HEAL-compliant.

Warning

Currently, the python function is convert_to_vlmd but it will be renamed to extract_to_vlmd to be consistent with the CLI. extract was chosen to better reflect the functionality.

Command Line Interface (CLI)Python
vlmd extract --inputtype spss myproject/myfile.sav\n

Note

To continue, it's recommended to go to the input types and formats. Also, for more details on the different flags/options, run vlmd --help

from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(input_filepath=\"myproject/myfile.sav\",inputtype=\"spss\")\n

Note

To continue, it's recommended to go to the input types and formats. For a complete set of options with convert_to_vlmd see the docstring (if in a notebook, one can enter convert_to_vlmd?)

"},{"location":"vlmd/extract/#input-types-and-formats","title":"Input Types and Formats","text":"

This section provides the specific syntax for running each of the supported input types for generating HEAL-compliant data dictionaries. Additional instructions on how to obtain the necessary input files/software are also provided.

Note

To further annotate your outputted data dictionaries, see the variable-level metadata field properties (with examples) for either the csv data dictionary click here or the json data dictionary click here.

Extract variable level metadata from your data:

"},{"location":"vlmd/extract/#output","title":"Output","text":"

Both the python and command line routes will result in a JSON and CSV version of the HEAL data dictionary in the output folder along with the validation reports in the errors folder. See below:

If valid, this file will contain:

{\n\"valid\": true,\n\"errors\": []\n}\n
- errors/heal-json-errors.json: outputted jsonschema validation report.
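
A validation report of the shape shown above can be inspected programmatically. The following is a minimal sketch with hypothetical helper names; the structure of individual entries in errors is not documented here, so the code only counts them.

```python
import json
from pathlib import Path


def load_report(path):
    '''Read a validation report such as errors/heal-json-errors.json from disk.'''
    return json.loads(Path(path).read_text())


def summarize_report(report):
    '''Return a one-line summary of a report with valid and errors keys.'''
    if report.get('valid'):
        return 'valid'
    return 'invalid: {} error(s)'.format(len(report.get('errors', [])))


# In-memory example mirroring the valid report shown above
summary = summarize_report({'valid': True, 'errors': []})
```

With real output, summarize_report(load_report('errors/heal-json-errors.json')) would give the same kind of summary, assuming the errors folder sits in the current directory.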

If no outputdir is specified, the resulting HEAL-compliant data dictionaries will be named:

"},{"location":"vlmd/extract/csvdata/","title":"csv Datasets","text":"

CSV (comma-separated values) is the main open tabular data format for storage and exchange. It is easy to create and understand using basic text editors in addition to popular spreadsheet software like Google Sheets and Excel. Importantly, CSVs are simple and can be easily integrated into web applications and just about any software.

Currently, the HEAL Data Utilities vlmd function can generate a minimal, HEAL-compliant data dictionary by inferring name, type, and enum (i.e., possible values) from the data. After this minimal data dictionary is generated, the researcher can further annotate it with each field's description and other optional properties in either the HEAL-compliant csv- or json-formatted data dictionary (see the HEAL data dictionary template sections below for more information).
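
To give a feel for what inferring name, type, and enum involves, here is a simplified sketch using only the Python standard library. It is not the inference logic used by healdata-utils; the function names, the max_enum cutoff, and the rule that only string columns receive an enum are all assumptions for this example.

```python
import csv


def infer_type(values):
    '''Guess integer, number, or string from a list of raw csv strings.'''
    for caster, type_name in ((int, 'integer'), (float, 'number')):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return 'string'


def infer_fields(lines, max_enum=5):
    '''Build minimal field entries (name, type, optional enum) from csv lines.'''
    rows = list(csv.DictReader(lines))
    if not rows:
        return []
    fields = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        field = {'name': name, 'type': infer_type(values)}
        distinct = sorted(set(values))
        # Treat a string column with few distinct values as categorical
        if field['type'] == 'string' and len(distinct) <= max_enum:
            field['constraints'] = {'enum': distinct}
        fields.append(field)
    return fields


fields = infer_fields(['id,score,answer', '1,2.5,Yes', '2,3,No', '3,4.0,Yes'])
```

The resulting entries still lack a description, which is why the annotation step described above is needed before the data dictionary will validate.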

"},{"location":"vlmd/extract/exceldata/","title":"Excel (xlsx) dataset","text":"

Excel workbooks contain tabular data tables across named worksheets.

This vlmd extraction tool provides the ability to extract vlmd from all of these worksheets either as a combined data dictionary or as multiple data dictionaries.

"},{"location":"vlmd/extract/exceldata/#run-the-vlmd-command","title":"Run the vlmd command","text":"CLIPython
vlmd extract --inputtype excel-data myexcelfile.xlsx\n
"},{"location":"vlmd/extract/exceldata/#to-output-multiple-sheets-as-separate-data-dictionaries","title":"To output multiple sheets as separate data dictionaries","text":"
from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(input_filepath=\"myexcelfile.xlsx\",inputtype=\"excel-data\")\n
"},{"location":"vlmd/extract/exceldata/#to-extract-multiple-sheets-as-one-data-dictionary","title":"To extract multiple sheets as one data dictionary","text":"

Note

Be careful when using multiple_data_dicts=False. In most instances, one sheet should correspond to one separate data table and thus have one corresponding data dictionary.

Note, this combines (i.e., concatenates) all data tables and then infers fields. This use case applies when sheets are viewed as \"chunks\" of one resource/dataset.

from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(\n    input_filepath=\"myexcelfile.xlsx\",\n    inputtype=\"excel-data\",\n    multiple_data_dicts=False\n    )\n
"},{"location":"vlmd/extract/exceldata/#to-extract-a-subset-of-sheets-as-one-data-dictionary","title":"To extract a subset of sheets as one data dictionary","text":"
from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(\n    input_filepath=\"myexcelfile.xlsx\",\n    inputtype=\"excel-data\",\n    multiple_data_dicts=False,\n    sheet_name=[\"mysheet1\",\"mysheet2\"]\n    )\n
"},{"location":"vlmd/extract/frictionlessschema/","title":"Frictionless Table Schema","text":"

While the vlmd specifications are designed (and still being developed) to support interoperability with the HEAL platform, minor naming translations may be needed. This function supports any such translations (e.g., the frictionless fields property --> the HEAL data_dictionary property).

Note, this conversion supports either yaml or json format (currently only tests for json format but should work with yaml).

"},{"location":"vlmd/extract/frictionlessschema/#creating-a-frictionless-table-schema","title":"Creating a frictionless table schema","text":"

Below are the official frictionless table schema specifications, which overlap to a high degree with the HEAL variable level metadata specifications.

See here for the frictionless table schema specs

"},{"location":"vlmd/extract/frictionlessschema/#run-the-vlmd-command","title":"Run the vlmd command","text":"
vlmd extract --inputtype frictionless data/frictionless_dataset1.frictionless.schema.json\n
"},{"location":"vlmd/extract/redcapcsv/","title":"REDCap: Data Dictionary CSV Export","text":"

For users collecting data in a REDCap data management system, HEAL-compliant data dictionaries can be generated directly from REDCap exports.

The REDCap data dictionary export provides variable-level metadata in a standardized, tabular format and is generally easy to export. The HEAL Data Utilities leverage this user experience and standardized format to enable HEAL researchers to generate a HEAL-compliant data dictionary.

"},{"location":"vlmd/extract/redcapcsv/#export-your-redcap-data-dictionary","title":"Export your Redcap data dictionary","text":"

To download a REDCap CSV export, do the following*:

  1. After logging in to your REDCap project page, locate the Data dictionary page. A link to this page may be available on the project side bar (see image below) or in the Project Setup tab at the top of your page.

  2. After arriving at the Data dictionary page, click on Download the current data dictionary to export the dictionary (see below).

*there may be slight differences depending on your specific REDCap instance and version

"},{"location":"vlmd/extract/redcapcsv/#run-the-vlmd-command","title":"Run the vlmd command","text":"
vlmd extract --inputtype redcap input/example_redcap_demo.redcap.csv 
"},{"location":"vlmd/extract/sas/","title":"SAS sas7bdat (and sas7bcat) files","text":"

To accommodate SAS users, the HEAL Data Utilities supports the binary sas7bdat file format, which contains the actual data values (observations/records). This file also includes variable metadata (variable names and variable labels/ descriptions).

The HEAL Data Utilities also provides the option to include a catalog file \u2013 sas7bcat format - with the sas7bdat. A sas7bcat file contains variable value labels, or encodings, that can be mapped onto the corresponding data from a sas7bdat file.

"},{"location":"vlmd/extract/sas/#creating-a-sas7bdat-and-a-sas7bcat-file","title":"Creating a sas7bdat and a sas7bcat file","text":"

Many SAS users build formats and labels into their data processing and analysis scripts. In this section, we provide syntax that can be easily copy-pasted into these existing workflows to create sas7bdat and sas7bcat files to input into the vlmd tool.

This script template can be run separately or inserted directly at the end of a SAS user's workflow.

Note

If inserted directly, remember to delete the lines with %INCLUDE.

Template template.sas
/*1. Read in data file without value labels and run full code. \n        Note: The most important pieces to run here are the PROC FORMAT statement(s) and any data steps \n        that assign formats and variable labels which are needed for the data dictionary. You may have defined variable labels and values in separate scripts for different analyses. In order to capture all your defined variable labels and values across scripts, you will need an %INCLUDE statement for each SAS script that defines unique variable labels or value labels.*/\n\n%INCLUDE \"<INSERT SAS SCRIPT HERE FILE PATH HERE>\"; /* THIS WILL RUN A SEPARATE SAS SCRIPT*/\n%INCLUDE \"<INSERT SAS SCRIPT HERE FILE PATH HERE>\"; /* THIS WILL RUN A SECOND SEPARATE SAS SCRIPT*/ \n\n/*2. Output the format catalog (sas7bcat) */\n/*2a. If you do not have an out directory, assign one to output the SAS catalog and data file. If you already have an out directory assigned, skip this step and replace \u201cout\u201d with your out directory libname in the flow*/\n\nlibname out \"<INSERT THE DESIRED LOCATION (FILE PATH) TO YOUR SAS7BCAT AND SAS7BDAT FILES HERE>\";\n\n/*2b. Output the format catalog.\n        Note: The format catalog is automatically stored in work.formats. This step copies the format file to the \n        out directory as a sas7bcat file.*/\nproc catalog cat=work.FORMATS;\n    copy out=out.FORMATS;\n    run;\n\n/*3. Output the data file (sas7bdat) */\ndata out.yourdata;\n    set <INSERT THE NAME OF YOUR FINAL SAS DATASET HERE>;\n    run;\n

The below SAS syntax is an example of how to use the template within your SAS workflow.

The below sample script creates all of our variable and value labels. Your workflow may include multiple SAS scripts with multiple format statements and may include analyses and other PROC calls for data exploration, but for demonstration purposes, this example only uses one script and focuses on defining the variable and value labels.

Example my_existing_sas_workflow.sas
/*1. Read in input data */\nproc import datafile=\"myprojectfolder/input/mydata.csv\"\n    out=raw\n    dbms=csv replace;\n    getnames=yes;\nrun;\n\n/*2. Set up proc format and apply formats and variable labels in data step */\n/*Create encodings (value labels); each VALUE statement ends with a semicolon*/\nproc format;\n    VALUE YESNO\n    0       =\"No\"\n    1       =\"Yes\";\n\n    VALUE PUBLIC\n    1='State mental health authority (SMHA)'\n    2='Other state government agency or department'\n    3='Regional/district authority or county, local, or municipal government'\n    4='Tribal government'\n    5='Indian Health Service'\n    6='Department of Veterans Affairs'\n    7='Other';\n\n    VALUE FOCUS\n    1='Mental health treatment'\n    2='Substance abuse treatment'\n    3='Mix of mental health and substance abuse treatment (neither is primary)'\n    4='General health care'\n    5='Other service focus';\nrun;\n\n**Apply formats to dataset;\ndata processed;\n    set raw;\n\n    /*Assign formats*/\n    format YOUNGADULTS TREATPSYCHOTHRPY TREATTRAUMATHRPY YESNO. FOCUS FOCUS. PUBLIC PUBLIC.;\n    /*Add variable labels*/\n    label YOUNGADULTS=\"Accepts young adults (aged 18-25 years old) for Tx\"\n            TREATPSYCHOTHRPY=\"Facility offers individual psychotherapy\"\n            TREATTRAUMATHRPY=\"Facility offers trauma therapy\"\n            FOCUS=\"Primary treatment focus of facility\"\n            PUBLIC=\"Public agency or department that operates facility\";\nrun;\n

This second script, called my_output.sas, is the filled out template. Note the %INCLUDE statement that calls my_existing_sas_workflow.sas.

my_output.sas
/*1. Read in data file without value labels and run full code. \n        Note: The most important pieces to run here are the PROC FORMAT statement(s) and any data steps \n        that assign formats and variable labels which are needed for the data dictionary. You may have defined variable labels and values in separate scripts for different analyses. In order to capture all your defined variable labels and values across scripts, you will need an %INCLUDE statement for each SAS script that defines unique variable labels or value labels.*/\n\n%INCLUDE \"myprojectfolder/my_existing_sas_workflow.sas\"; /* THIS WILL RUN A SEPARATE SAS SCRIPT*/\n\n/*2. Output the format catalog (sas7bcat) */\n/*2a. If you do not have an out directory, assign one to output the SAS catalog and data file.*/\nlibname out \"myprojectfolder/output\";\n\n/*2b. Output the format catalog.\n        Note: The format catalog is automatically stored in work.formats. This step copies the format file to the \n        out directory as a sas7bcat file.*/\nproc catalog cat=work.FORMATS;\n    copy out=out.FORMATS;\n    run;\n\n/*3. Output the data file (sas7bdat) to your output folder*/\ndata out.yourdata;\n    set processed;\n    run;\n
"},{"location":"vlmd/extract/sas/#run-the-vlmd-command","title":"Run the vlmd command","text":"

After creating the necessary sas7bdat and sas7bcat files, you can run the vlmd command. The tool automatically detects the sas7bcat file if it is located in the same directory as your data file. If no catalog file is detected, the command still runs, but the encodings (i.e., value labels) are not extracted.

vlmd extract --inputtype sas input/data.sas7bdat 
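
Because the encodings are only extracted when the sas7bcat file sits next to the data file, it can help to check for it before running the command. The following is a minimal sketch of such a check; find_catalog is a hypothetical helper written for illustration and is not the tool's own detection logic.

```python
from pathlib import Path


def find_catalog(data_file):
    # Hypothetical helper: return the first .sas7bcat file located in the
    # same directory as the given .sas7bdat file, or None if there is none.
    folder = Path(data_file).parent
    catalogs = sorted(folder.glob('*.sas7bcat'))
    return catalogs[0] if catalogs else None
```

If find_catalog('input/data.sas7bdat') returns None, the resulting data dictionary would be expected to lack value labels.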
"},{"location":"vlmd/extract/spss/","title":"SPSS .sav files","text":"

For SPSS users, the HEAL Data Utilities generates HEAL-compliant data dictionaries from SPSS's default file format for storing datasets: a SAV file. It stores not only the data itself but also metadata such as variable names, variable labels, types, and value labels. The HEAL Data Utilities extracts these data and metadata to create HEAL-compliant data dictionaries.

"},{"location":"vlmd/extract/spss/#run-the-vlmd-command","title":"Run the vlmd command","text":"
vlmd extract --inputtype spss data/example_pyreadstat_output.sav 
"},{"location":"vlmd/extract/stata/","title":"Stata .dta files","text":"

For Stata users, the HEAL Data Utilities generates HEAL-compliant data dictionaries through Stata's default file format: a DTA file. DTA files store not only the data itself but also metadata such as variable names, variable labels, types, and value labels.

"},{"location":"vlmd/extract/stata/#run-the-vlmd-command","title":"Run the vlmd command","text":"
vlmd extract --inputtype stata data/mydatafile.dta 
"},{"location":"vlmd/schemas/","title":"HEAL data dictionary schemas","text":"

Click on each data dictionary schema below to view information about each format's data dictionary properties (such as a description, examples, etc).

CSV fields

JSON data dictionary

Note

enum type means that a field can only be one of a certain set of possible values.

"},{"location":"vlmd/schemas/csv-fields/","title":"Tabular (CSV) data dictionary","text":"HEAL Variable Level Metadata Fields HEAL Variable Level Metadata Fields Type: object

Variable level metadata individual fields integrated into the variable level metadata object within the HEAL platform metadata service.

Note: only name and description are required. Listed at the end of each description are suggested \"priority\" levels in brackets (e.g., []):

1. [Required]: Needs to be filled out to be valid.
2. [Highly recommended]: Greatly helps when using the data dictionary but is not required.
3. [Optional, if applicable]: May only be applicable to certain fields.
4. [Autopopulated, if not filled]: These fields are intended to be autopopulated from other fields but can be filled out if desired.
5. [Experimental]: These fields are not currently used but are in development.

module root moduleType: string

The section, form, survey instrument, set of measures or other broad category used to group variables.

Examples:
\"Demographics\"\n
\"PROMIS\"\n
\"Substance use\"\n
\"Medical History\"\n
\"Sleep questions\"\n
\"Physical activity\"\n
name Required root nameType: string

The name of a variable (i.e., field) as it appears in the data.

[Required]

title root titleType: string

The human-readable title or label of the variable.

[Highly recommended]

Example:
\"My Variable (for name of my_variable)\"\n
description Required root descriptionType: string

An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).

[Required]

Examples:
\"Definition\"\n
\"Question text (if a survey)\"\n
type root typeType: enum (of string)

A classification or category of a particular data element or property expected or allowed in the dataset.

Must be one of: format root format

A format taken from one of the frictionless specification schemas. For example, for tabular data, there is the Table Schema specification

Each format is dependent on the type specified. For example: If type is \"string\", then see the String formats. If type is one of the date-like formats, then see Date formats.

Any of root format anyOf String FormatType: enum (of string) Must be one of: root format anyOf Date FormatType: object

A format for a date variable (date, time, datetime).

default: An ISO8601 format string.
any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies.
{PATTERN}: The value can be parsed according to {PATTERN}, which MUST follow the date formatting syntax of C / Python strftime.

Examples:

%Y-%m-%d (for date, e.g., 2023-05-25)
%Y%m%d (for date without dashes, e.g., 20230525)
%Y-%m-%dT%H:%M:%S (for datetime, e.g., 2023-05-25T10:30:45)
%Y-%m-%dT%H:%M:%SZ (for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z)
%Y-%m-%dT%H:%M:%S%z (for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300)
%Y-%m-%dT%H:%M (for datetime without seconds, e.g., 2023-05-25T10:30)
%Y-%m-%dT%H (for datetime without minutes and seconds, e.g., 2023-05-25T10)
%H:%M:%S (for time, e.g., 10:30:45)
%H:%M:%SZ (for time with UTC timezone, e.g., 10:30:45Z)
%H:%M:%S%z (for time with timezone offset, e.g., 10:30:45+0300)
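
Since a {PATTERN} follows Python strftime syntax, a pattern can be checked against a sample value with the standard library before it is recorded in the format field. This is a small sketch using a few of the patterns listed above; the variable names are illustrative only.

```python
from datetime import datetime

# Each pair is (pattern, example value) taken from the examples above.
examples = [
    ('%Y-%m-%d', '2023-05-25'),
    ('%Y-%m-%dT%H:%M:%S', '2023-05-25T10:30:45'),
    ('%Y-%m-%dT%H:%M:%S%z', '2023-05-25T10:30:45+0300'),
    ('%H:%M:%S', '10:30:45'),
]

# strptime raises ValueError when a value does not match its pattern.
parsed = [datetime.strptime(value, pattern) for pattern, value in examples]
```

A ValueError here would suggest that the pattern and the stored values disagree.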

root format anyOf Geopoint Format

The two types of formats for geopoint (describing a geographic point).

One of root format anyOf Geopoint Format oneOf item 0Type: array

A JSON array or a string parsable as a JSON array where each item is a number with the first as the latitude and the second as longitude.

root format anyOf Geopoint Format oneOf item 1Type: object

Contains latitude and longitude with two keys (\"lat\" and \"long\") with number items mapped to each key.

root format anyOf geojsonType: enum (of string)

The JSON object according to the geojson spec.

Must be one of: constraints.maxLength root constraints.maxLengthType: integer

Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.

[Optional,if applicable]

constraints.enum root constraints.enumType: string

Constrains possible values to a set of values.

[Optional,if applicable]

Must match regular expression: ^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$ constraints.pattern root constraints.patternType: string

A regular expression pattern the data MUST conform to.

[Optional,if applicable]

constraints.maximum root constraints.maximumType: integer

Specifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer, etc.). Note that this is different from the maxLength property.

[Optional,if applicable]

encodings root encodingsType: string

Variable value encodings provide a way to further annotate any value within any variable type, making values easier to understand.

Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.

Additionally, as another use case, this field provides a way to store categoricals that are stored as \"short\" labels (such as abbreviations).

[Optional,if applicable]

Must match regular expression: ^(?:.*?=.*?(?:\\||$))+$ Examples:
\"0=No|1=Yes\"\n
\"HW=Hello world|GBW=Good bye world|HM=Hi,Mike\"\n
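
In the CSV format the encodings are a single pipe-delimited string of value=label pairs, while the JSON data dictionary stores them as an object. A minimal sketch of converting one into the other follows; parse_encodings is a hypothetical helper and not part of healdata_utils.

```python
def parse_encodings(text):
    # Split the value=label pairs on the pipe, then split each pair on the
    # first equals sign only, so a label may itself contain an equals sign.
    encodings = {}
    for pair in text.split('|'):
        if not pair:
            continue
        value, _, label = pair.partition('=')
        encodings[value.strip()] = label.strip()
    return encodings
```

Keys stay strings in this sketch because JSON object keys are always strings.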
ordered root orderedType: boolean

Indicates whether a categorical variable is ordered. This variable is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).

[Optional,if applicable]

missingValues root missingValuesType: string

A list of missing values specific to a variable.

[Optional, if applicable]

Must match regular expression: ^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$ trueValues root trueValuesType: string

For boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.

[Optional, if applicable]

Must match regular expression: ^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$ Examples:
\"Required|REQUIRED\"\n
\"required|Yes|Y|Checked\"\n
\"Checked\"\n
\"Required\"\n
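
The CSV format writes list-like properties such as trueValues as one pipe-delimited string, whereas the JSON data dictionary uses an array. As a rough sketch, assuming no value contains a pipe character, the string can be split like this (split_values is a hypothetical name):

```python
def split_values(text):
    # Split a pipe-delimited string into a list, dropping empty items
    # produced by leading, trailing, or doubled pipes.
    return [item for item in text.split('|') if item]
```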
falseValues root falseValuesType: string

For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.

Must match regular expression: ^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$ repo_link root repo_linkType: string

A link to the variable as it exists on the home repository, if applicable

cde_id.source root cde_id.sourceType: string cde_id.id root cde_id.idType: string ontology_id.relation root ontology_id.relationType: string ontology_id.source root ontology_id.sourceType: string ontology_id.id root ontology_id.idType: string standardsMappings.type root standardsMappings.typeType: string

The type of mapping linked to a published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]

Examples:
\"cde\"\n
\"ontology\"\n
\"reference_list\"\n
standardsMappings.label root standardsMappings.labelType: string

A free text label of a mapping indicating a mapping(s) to a published set of standard variables such as the NIH Common Data Elements program.

[Autopopulated, if not filled]

Examples:
\"substance use\"\n
\"chemical compound\"\n
\"promis\"\n
standardsMappings.url root standardsMappings.urlType: stringFormat: uri

The url that links out to the published, standardized mapping.

[Autopopulated, if not filled]

Example:
\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\n
standardsMappings.source root standardsMappings.sourceType: string

The source of the standardized variable.

Example:
\"TBD (will have controlled vocabulary)\"\n
standardsMappings.id root standardsMappings.idType: string

The id locating the individual mapping within the given source.

relatedConcepts.type root relatedConcepts.typeType: string

The type of mapping to a published set of concepts related to the given field, such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.)

[Autopopulated, if not filled]

relatedConcepts.label root relatedConcepts.labelType: string

A free text label of a mapping to a published set of concepts related to the given field, such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.)

[Autopopulated, if not filled]

relatedConcepts.url root relatedConcepts.urlType: stringFormat: uri

The url that links out to the published, standardized concept.

[Autopopulated, if not filled]

Example:
\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\n
relatedConcepts.source root relatedConcepts.sourceType: string

The source of the related concept.

[Autopopulated, if not filled]

Example:
\"TBD (will have controlled vocabulary)\"\n
relatedConcepts.id root relatedConcepts.idType: string

The id locating the individual mapping within the given source.

[Autopopulated, if not filled]

univarStats.median root univarStats.medianType: number univarStats.mean root univarStats.meanType: number univarStats.std root univarStats.stdType: number univarStats.min root univarStats.minType: number univarStats.max root univarStats.maxType: number univarStats.mode root univarStats.modeType: number univarStats.count root univarStats.countType: integer

Value must be greater or equal to 0

univarStats.twentyFifthPercentile root univarStats.twentyFifthPercentileType: number univarStats.seventyFifthPercentile root univarStats.seventyFifthPercentileType: number univarStats.categoricalMarginals.name root univarStats.categoricalMarginals.nameType: string univarStats.categoricalMarginals.count root univarStats.categoricalMarginals.countType: integer Additional Properties

Additional Properties of any type are allowed.

root additionalPropertiesType: object

Generated using json-schema-for-humans on 2023-07-05 at 17:11:06 -0500

"},{"location":"vlmd/schemas/json-data-dictionary/","title":"JSON data dictionary","text":"Variable Level Metadata (Data Dictionaries) Variable Level Metadata (Data Dictionaries) Type: object

This schema defines the variable level metadata for one data dictionary for a given study. Note that a given study can have multiple data dictionaries.
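
To make the overall shape concrete, here is a minimal sketch of a JSON data dictionary that carries only the properties this schema marks as Required (title and data_dictionary at the top level, name and description for each field). The example values and the missing_required helper are hypothetical and are not a substitute for the validate command.

```python
import json

# Hypothetical content: one data dictionary with a single field.
data_dictionary = {
    'title': 'My data dictionary',
    'description': 'Illustrative example content.',
    'data_dictionary': [
        {'name': 'my_variable', 'description': 'Definition of my_variable'},
    ],
}


def missing_required(dd):
    # Collect the required keys that are absent at the top level
    # and within each field of the data_dictionary array.
    missing = [key for key in ('title', 'data_dictionary') if key not in dd]
    for index, field in enumerate(dd.get('data_dictionary', [])):
        for key in ('name', 'description'):
            if key not in field:
                missing.append('data_dictionary[%d].%s' % (index, key))
    return missing


print(json.dumps(data_dictionary, indent=2))
```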

title Required root titleType: string description root descriptionType: string data_dictionary Required root data_dictionaryType: array of object Each item of this array must be: root data_dictionary HEAL Variable Level Metadata FieldsType: object

Variable level metadata individual fields integrated into the variable level metadata object within the HEAL platform metadata service.

Note: only name and description are required. Listed at the end of each description are suggested \"priority\" levels in brackets (e.g., []):

1. [Required]: Needs to be filled out to be valid.
2. [Highly recommended]: Greatly helps when using the data dictionary but is not required.
3. [Optional, if applicable]: May only be applicable to certain fields.
4. [Autopopulated, if not filled]: These fields are intended to be autopopulated from other fields but can be filled out if desired.
5. [Experimental]: These fields are not currently used but are in development.

module root data_dictionary HEAL Variable Level Metadata Fields moduleType: string

The section, form, survey instrument, set of measures or other broad category used to group variables.

Examples:
\"Demographics\"\n
\"PROMIS\"\n
\"Substance use\"\n
\"Medical History\"\n
\"Sleep questions\"\n
\"Physical activity\"\n
name Required root data_dictionary HEAL Variable Level Metadata Fields nameType: string

The name of a variable (i.e., field) as it appears in the data.

[Required]

title root data_dictionary HEAL Variable Level Metadata Fields titleType: string

The human-readable title or label of the variable.

[Highly recommended]

Example:
\"My Variable (for name of my_variable)\"\n
description Required root data_dictionary HEAL Variable Level Metadata Fields descriptionType: string

An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).

[Required]

Examples:
\"Definition\"\n
\"Question text (if a survey)\"\n
type root data_dictionary HEAL Variable Level Metadata Fields typeType: enum (of string)

A classification or category of a particular data element or property expected or allowed in the dataset.

Must be one of: format root data_dictionary HEAL Variable Level Metadata Fields format

A format taken from one of the frictionless specification schemas. For example, for tabular data, there is the Table Schema specification

Each format is dependent on the type specified. For example: If type is \"string\", then see the String formats. If type is one of the date-like formats, then see Date formats.

Any of root data_dictionary HEAL Variable Level Metadata Fields format anyOf String FormatType: enum (of string) Must be one of: root data_dictionary HEAL Variable Level Metadata Fields format anyOf Date FormatType: object

A format for a date variable (date, time, datetime).

default: An ISO8601 format string.
any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies.
{PATTERN}: The value can be parsed according to {PATTERN}, which MUST follow the date formatting syntax of C / Python strftime.

Examples:

%Y-%m-%d (for date, e.g., 2023-05-25)
%Y%m%d (for date without dashes, e.g., 20230525)
%Y-%m-%dT%H:%M:%S (for datetime, e.g., 2023-05-25T10:30:45)
%Y-%m-%dT%H:%M:%SZ (for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z)
%Y-%m-%dT%H:%M:%S%z (for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300)
%Y-%m-%dT%H:%M (for datetime without seconds, e.g., 2023-05-25T10:30)
%Y-%m-%dT%H (for datetime without minutes and seconds, e.g., 2023-05-25T10)
%H:%M:%S (for time, e.g., 10:30:45)
%H:%M:%SZ (for time with UTC timezone, e.g., 10:30:45Z)
%H:%M:%S%z (for time with timezone offset, e.g., 10:30:45+0300)

root data_dictionary HEAL Variable Level Metadata Fields format anyOf Geopoint Format

The two types of formats for geopoint (describing a geographic point).

One of root data_dictionary HEAL Variable Level Metadata Fields format anyOf Geopoint Format oneOf item 0Type: array

A JSON array or a string parsable as a JSON array where each item is a number with the first as the latitude and the second as longitude.

root data_dictionary HEAL Variable Level Metadata Fields format anyOf Geopoint Format oneOf item 1Type: object

Contains latitude and longitude with two keys (\"lat\" and \"long\") with number items mapped to each key.

root data_dictionary HEAL Variable Level Metadata Fields format anyOf geojsonType: enum (of string)

The JSON object according to the geojson spec.

Must be one of: constraints root data_dictionary HEAL Variable Level Metadata Fields constraintsType: object maxLength root data_dictionary HEAL Variable Level Metadata Fields constraints maxLengthType: integer

Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.

[Optional,if applicable]

enum root data_dictionary HEAL Variable Level Metadata Fields constraints enumType: array

Constrains possible values to a set of values.

[Optional,if applicable]

pattern root data_dictionary HEAL Variable Level Metadata Fields constraints patternType: string

A regular expression pattern the data MUST conform to.

[Optional,if applicable]

maximum root data_dictionary HEAL Variable Level Metadata Fields constraints maximumType: integer

Specifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer, etc.). Note that this is different from the maxLength property.

[Optional,if applicable]

encodings root data_dictionary HEAL Variable Level Metadata Fields encodingsType: object

Variable value encodings provide a way to further annotate any value within any variable type, making values easier to understand.

Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.

Additionally, as another use case, this field provides a way to store categoricals that are stored as \"short\" labels (such as abbreviations).

[Optional,if applicable]

Examples:
{\n\"0\": \"No\",\n\"1\": \"Yes\"\n}\n
{\n\"HW\": \"Hello world\",\n\"GBW\": \"Good bye world\",\n\"HM\": \"Hi, Mike\"\n}\n
ordered root data_dictionary HEAL Variable Level Metadata Fields orderedType: boolean

Indicates whether a categorical variable is ordered. This variable is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).

[Optional,if applicable]

missingValues root data_dictionary HEAL Variable Level Metadata Fields missingValuesType: array

A list of missing values specific to a variable.

[Highly recommended]

trueValues root data_dictionary HEAL Variable Level Metadata Fields trueValuesType: array of string

For boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.

[Optional, if applicable]

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields trueValues trueValues itemsType: string Examples:
\"Required\"\n
\"REQUIRED\"\n
\"required\"\n
\"Yes\"\n
\"Checked\"\n
falseValues root data_dictionary HEAL Variable Level Metadata Fields falseValuesType: array

For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.

repo_link root data_dictionary HEAL Variable Level Metadata Fields repo_linkType: string

A link to the variable as it exists on the home repository, if applicable

cde_id root data_dictionary HEAL Variable Level Metadata Fields cde_idType: array of object

[FUTURE WARNING: WILL BE DEPRECATED] Use standardsMapping. The source and id for the NIH Common Data Elements program.

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields cde_id cde_id itemsType: object source root data_dictionary HEAL Variable Level Metadata Fields cde_id cde_id items sourceType: string id root data_dictionary HEAL Variable Level Metadata Fields cde_id cde_id items idType: string ontology_id root data_dictionary HEAL Variable Level Metadata Fields ontology_idType: array of object

[FUTURE WARNING: WILL BE DEPRECATED] - Use relatedConcepts. Ontological information for the given variable as indicated by the source, id, and relation to the specified classification. One or more ontology classifications can be specified.

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields ontology_id ontology_id itemsType: object relation root data_dictionary HEAL Variable Level Metadata Fields ontology_id ontology_id items relationType: string source root data_dictionary HEAL Variable Level Metadata Fields ontology_id ontology_id items sourceType: string id root data_dictionary HEAL Variable Level Metadata Fields ontology_id ontology_id items idType: string standardsMappings root data_dictionary HEAL Variable Level Metadata Fields standardsMappingsType: array of object

A published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings itemsType: object type root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items typeType: string

The type of mapping linked to a published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]

Examples:
\"cde\"\n
\"ontology\"\n
\"reference_list\"\n
label root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items labelType: string

A free text label of a mapping indicating a mapping(s) to a published set of standard variables such as the NIH Common Data Elements program.

[Autopopulated, if not filled]

Examples:
\"substance use\"\n
\"chemical compound\"\n
\"promis\"\n
url root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items urlType: stringFormat: uri

The url that links out to the published, standardized mapping.

[Autopopulated, if not filled]

Example:
\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\n
source root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items sourceType: string

The source of the standardized variable.

Example:
\"TBD (will have controlled vocabulary)\"\n
id root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items idType: string

The id locating the individual mapping within the given source.

relatedConcepts root data_dictionary HEAL Variable Level Metadata Fields relatedConceptsType: array of object

Mappings to a published set of concepts related to the given field, such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.) [Autopopulated, if not filled]

Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts itemsType: object type root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items typeType: string

The type of mapping to a published set of concepts related to the given field, such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.)

[Autopopulated, if not filled]

label root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items labelType: string

A free text label of a mapping to a published set of concepts related to the given field, such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.)

[Autopopulated, if not filled]

url root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items urlType: stringFormat: uri

The url that links out to the published, standardized concept.

[Autopopulated, if not filled]

Example:
\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\n
source root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items sourceType: string

The source of the related concept.

[Autopopulated, if not filled]

Example:
\"TBD (will have controlled vocabulary)\"\n
id root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items idType: string

The id locating the individual mapping within the given source.

[Autopopulated, if not filled]

univarStats root data_dictionary HEAL Variable Level Metadata Fields univarStatsType: object

Univariate statistics inferred from the data about the given variable

[Experimental]

median root data_dictionary HEAL Variable Level Metadata Fields univarStats medianType: number mean root data_dictionary HEAL Variable Level Metadata Fields univarStats meanType: number std root data_dictionary HEAL Variable Level Metadata Fields univarStats stdType: number min root data_dictionary HEAL Variable Level Metadata Fields univarStats minType: number max root data_dictionary HEAL Variable Level Metadata Fields univarStats maxType: number mode root data_dictionary HEAL Variable Level Metadata Fields univarStats modeType: number count root data_dictionary HEAL Variable Level Metadata Fields univarStats countType: integer

Value must be greater or equal to 0

twentyFifthPercentile root data_dictionary HEAL Variable Level Metadata Fields univarStats twentyFifthPercentileType: number seventyFifthPercentile root data_dictionary HEAL Variable Level Metadata Fields univarStats seventyFifthPercentileType: number categoricalMarginals root data_dictionary HEAL Variable Level Metadata Fields univarStats categoricalMarginalsType: array of object Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields univarStats categoricalMarginals categoricalMarginals itemsType: object name root data_dictionary HEAL Variable Level Metadata Fields univarStats categoricalMarginals categoricalMarginals items nameType: string count root data_dictionary HEAL Variable Level Metadata Fields univarStats categoricalMarginals categoricalMarginals items countType: integer Additional Properties

Additional Properties of any type are allowed.

root data_dictionary HEAL Variable Level Metadata Fields additionalPropertiesType: object

Generated using json-schema-for-humans on 2023-07-03 at 09:08:41 -0500

"},{"location":"vlmd/start/","title":"Start from a template","text":"

Some folks may prefer to create their HEAL data dictionary from scratch. To support this, we have created a utility that creates either a json or csv template.

Warning

Currently, the command is template but will change to start to be consistent with the verb subcommand vocabulary.

"},{"location":"vlmd/start/#csv-template","title":"csv template","text":"

The HEAL Data Utilities can also input a csv HEAL data dictionary either from a manually filled out template or as an additional step after further annotation (e.g., from the csv HEAL data dictionary output of the other file formats).

To create a template csv version with 10 fields (variables):

Command line interface (CLI)Python
vlmd template myhealdd.csv --numfields 10\n
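from pathlib import Path\n\nfrom healdata_utils import write_vlmd_template\n\n# tmpdir can be any existing output directory; the current directory is used here\ntmpdir = Path(\".\")\n\nwrite_vlmd_template(tmpdir.joinpath(\"heal.csv\"),numfields=10)\n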

Click here to download an example of a filled out csv HEAL data dictionary template

"},{"location":"vlmd/start/#json-template","title":"json template","text":"

While the csv HEAL data dictionary provides a tabular format for HEAL-compliant data dictionaries, ultimately, these csv data dictionary files are converted to a json file (the most common format to store and exchange data within web applications such as the HEAL Data Platform).

Another advantage of json HEAL data dictionaries is that one can specify metadata describing the data dictionary as a whole (e.g., the description and title).

To create a template json version with 10 fields (variables):

Command line interface (CLI)Python
vlmd template myhealdd.json --numfields 10\n
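from pathlib import Path\n\nfrom healdata_utils import write_vlmd_template\n\n# tmpdir can be any existing output directory; the current directory is used here\ntmpdir = Path(\".\")\n\nwrite_vlmd_template(tmpdir.joinpath(\"heal.json\"),numfields=10)\n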

Click here to download an example of a filled out json HEAL data dictionary template

"},{"location":"vlmd/validate/","title":"Validate Check (validate) an existing HEAL data dictionary file","text":"

The validate command indicates whether the data dictionary complies with the HEAL specifications.

Command line interface (CLI)Python
vlmd validate data/myhealcsvdd.csv\n\nvlmd validate data/myhealjsondd.json\n
from healdata_utils import validate_vlmd_csv,validate_vlmd_json\n\nvalidate_vlmd_csv(\"data/myhealcsvdd.csv\")\n\nvalidate_vlmd_json(\"data/myhealjsondd.json\")\n
"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 77def968778ef4a35e2ef438d57c1ec98c1226a5..71e91679102e7b2a40153c6612fedf0fa02319bc 100644 GIT binary patch delta 15 Wcmcb>bb*OYzMF%iY2ijTFGc_-as-M1 delta 15 Wcmcb>bb*OYzMF%?YW_wxFGc_*xCB4| diff --git a/vlmd/extract/exceldata/index.html b/vlmd/extract/exceldata/index.html index 412a258..d92922d 100644 --- a/vlmd/extract/exceldata/index.html +++ b/vlmd/extract/exceldata/index.html @@ -942,20 +942,21 @@

To extract multiple s
from healdata_utils import convert_to_vlmd
 
 convert_to_vlmd(
-    filepath="myexcelfile.xlsx",
+    input_filepath="myexcelfile.xlsx",
     inputtype="excel-data",
     multiple_data_dicts=False
     )
 

To extract a subset of sheets as one data dictionary

-

```python

-

from healdata_utils import convert_to_vlmd

-

convert_to_vlmd( - filepath="myexcelfile.xlsx", - inputtype="excel-data", - multiple_data_dicts=False, - sheet_name=["mysheet1","mysheet2"] - )

+
from healdata_utils import convert_to_vlmd
+
+convert_to_vlmd(
+    input_filepath="myexcelfile.xlsx",
+    inputtype="excel-data",
+    multiple_data_dicts=False,
+    sheet_name=["mysheet1","mysheet2"]
+    )
+