From 163e75007d70999fb0d366db68203b2b93b2bc95 Mon Sep 17 00:00:00 2001 From: <> Date: Fri, 8 Sep 2023 22:34:42 +0000 Subject: [PATCH] Deployed 84f723d with MkDocs version: 1.5.2 --- search/search_index.json | 2 +- sitemap.xml.gz | Bin 336 -> 336 bytes vlmd/extract/exceldata/index.html | 19 ++++++++++--------- 3 files changed, 11 insertions(+), 10 deletions(-) diff --git a/search/search_index.json b/search/search_index.json index c1b6017..411cbff 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"HEAL Data Utilities","text":"
The HEAL Data Utilities python package provides data packaging tools for the HEAL Data Ecosystem to facilitate data discovery, sharing, and harmonization on the HEAL Platform.
Currently, the focus of this repository is generating standardized variable level metadata (VLMD) in the form of data dictionaries. See the quick start section to get started without installing any of the prerequisites. (Click here for the Variable-level Metadata documentation section).
However, in the future, this will be expanded for all HEAL-specific data packaging functions (e.g., study- and file-level metadata and data).
"},{"location":"#quick-start","title":"Quick start","text":"Note
If using the quick start option, no prerequisites are required.
Double click on the vlmd
(or vlmd.exe
) executable or run the vlmd
executable without any arguments to quickly start using this tool. This \"quick start\" will walk you through step by step, prompting you for the various options.
Important
Stand alone applications for different operating systems are available here. These allow you to run the vlmd
tool without needing to install anything else. Just (1) download, (2) unzip, and (3) double click on the vlmd
application icon.
While the HEAL Data Utilities should be compatible with most versions of Python, you can download the latest version of Python here and install it on your local computer. We recommend installing Python version 3.10.
"},{"location":"#installation","title":"Installation","text":"To install the latest official release of healdata-utils, from your computer's command prompt, run:
pip install healdata-utils
OR for the most up-to-date unreleased version run:
pip install git+https://github.com/norc-heal/healdata-utils.git
Note
Installing the unreleased version requires having git
software installed.
Variable level metadata (VLMD), in the form of standardized data dictionaries, provides an exciting opportunity:
For an example of this searchability in the context of study level metadata, see the platform's discovery page
When data is available, VLMD provides a way to validate the data as well.
Supports HEAL projects and goals such as the common data elements program
extract
: Extract the variable level metadata from an existing file with a specific type/format
start
: Start a data dictionary from an empty template
validate
: Check (validate) an existing HEAL data dictionary file to see if it follows the HEAL specifications after filling out a template or further annotation after extracting from a different format.
Typical workflows for creating a HEAL-compliant data dictionary include:
Create your data dictionary
(a) Run the vlmd extract
command (or convert_to_vlmd
if in python) to generate a HEAL-compliant data dictionary via your desired input format
(b) Run the vlmd template
command to start from an empty template.
Add/annotate with additional information in your preferred HEAL data dictionary format (either json
or csv
).
csv
data dictionary
json
data dictionary
Run the vlmd validate
command with your HEAL data dictionary as the input to validate.
Repeat (2) and (3) until you are ready to submit. Please note, currently only name
and description
are required.
Important
The main difference* between the CSV and JSON definitions lies in the way the data dictionaries are structured and the additional metadata included in the JSON data dictionary.
The CSV data dictionary is a plain tabular representation with no additional metadata, while the JSON dataset includes fields along with additional metadata in the form of a root description and title.
For more information on variable-level metadata properties (fields), see the csv
field specification and json
data dictionary specification.
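The structural difference described above can be sketched in a few lines of Python. This is only an illustration, assuming the root keys title, description, and data_dictionary from the JSON data dictionary schema; the helper name csv_to_json_dictionary and the two example fields are hypothetical and not part of healdata-utils.

```python
import csv
import io
import json

# Hypothetical CSV data dictionary: a plain table, one row per variable.
CSV_TEXT = '''name,description
age,Age of the participant in years
sex,Sex assigned at birth
'''


def csv_to_json_dictionary(csv_text, title, description=''):
    '''Wrap the tabular field rows in a root object with a title and description.'''
    fields = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        'title': title,
        'description': description,
        'data_dictionary': fields,
    }


json_dd = csv_to_json_dictionary(CSV_TEXT, title='Example study dictionary')
print(json.dumps(json_dd, indent=2))
```

The field rows themselves are the same in both formats; the JSON version only adds the root-level metadata around them.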
Extract
VLMD from another data type and format","text":"The healdata-utils variable-level metadata (vlmd) tool inputs a variety of different input file types and extracts HEAL-compliant data dictionaries (JSON and CSV formats). Additionally, exported validation (i.e., \"error\") reports provide the user information as to a) if the exported data dictionary is valid according to HEAL specifications and b) how to modify one's data dictionary to make it HEAL-compliant.
Warning
Currently the python subcommand is convert
but will be changed to extract_to_vlmd
to be consistent with CLI. extract
was chosen to better reflect the functionality.
vlmd extract --inputtype spss myproject/myfile.sav\n
Note
To continue, it's recommended to go to the input types and formats. Also, for more details on the different flags/options, run vlmd --help
from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(input_filepath=\"myproject/myfile.sav\",inputtype=\"spss\")\n
Note
To continue, it's recommended to go to the input types and formats. For a complete set of options with convert_to_vlmd
see the docstring (if in a notebook, one can enter convert_to_vlmd?
)
This section lists the specific syntax for running each of the supported input types to generate HEAL-compliant data dictionaries. Additional instructions on how to obtain the necessary input files/software are also provided.
Note
To further annotate your outputted data dictionaries, see the variable-level metadata field properties (with examples) for either the csv data dictionary
click here or the json data dictionary
click here.
Extract variable level metadata from your data:
Both the python and command line routes will result in a JSON and CSV version of the HEAL data dictionary in the output folder along with the validation reports in the errors
folder. See below:
errors/heal-csv-errors.json
: outputted validation report for table in csv file against frictionless schema
If valid, this file will contain:
{\n\"valid\": true,\n\"errors\": []\n}\n
- errors/heal-json-errors.json
: outputted jsonschema validation report. {\n\"valid\": true,\n\"errors\": []\n}\n
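A validation report with the shape shown above can be checked programmatically. The sketch below relies only on the valid and errors keys from the example report; the structure of individual error entries is not documented here, so the code just counts them. The helper name summarize_report and the placeholder error entry are hypothetical.

```python
import json


def summarize_report(report_text):
    '''Return a one-line summary of a validation report given as JSON text.'''
    report = json.loads(report_text)
    if report['valid']:
        return 'valid'
    return 'invalid: {} error(s)'.format(len(report['errors']))


# A report matching the documented "valid" case.
valid_report = json.dumps({'valid': True, 'errors': []})
# Hypothetical invalid report; the entry is a placeholder, not real tool output.
invalid_report = json.dumps({'valid': False, 'errors': ['placeholder error']})

print(summarize_report(valid_report))
print(summarize_report(invalid_report))
```

In practice the text would be read from errors/heal-csv-errors.json or errors/heal-json-errors.json in the output folder.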
If no outputdir
specified, the resulting HEAL-compliant data dictionaries will be named:
heal-csvtemplate-data-dictionary.csv
: This is the CSV data dictionary
heal-jsontemplate-data-dictionary.json
: This is the JSON version of the data dictionarycsv
Datasets","text":"CSV (comma-separated values) is the main open tabular data format for storage and exchange. It is easy to create and understand using basic text editors in addition to popular spreadsheet software like Google Sheets and Excel. Importantly, CSVs are simple and can be easily integrated into web applications and just about any software.
Currently, the HEAL Data Utilities vlmd
function can generate a minimal, HEAL-compliant data dictionary by inferring name
, type
, and enum
(i.e., possible values). After this minimal data dictionary is generated, the researcher can further annotate it with fields' description
and other optional properties in either the HEAL-compliant csv
- or json
-formatted data dictionary (see the HEAL data dictionary template sections below for more information).
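To make the idea of inferring name, type, and enum concrete, here is a minimal sketch using only the standard library. It is not the inference logic used by the vlmd tool; the function names, the max_enum_size threshold, and the sample data are all assumptions made for illustration.

```python
import csv
import io


def infer_type(values):
    '''Return 'integer', 'number', or 'string' for a list of raw cell strings.'''
    for type_name, caster in (('integer', int), ('number', float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue
    return 'string'


def infer_fields(csv_text, max_enum_size=5):
    '''Build one minimal field description per column of a CSV dataset.'''
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = []
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        field = {'name': name, 'type': infer_type(values)}
        unique = sorted(set(values))
        # Assumption: a string column with few distinct values is categorical.
        if field['type'] == 'string' and len(unique) <= max_enum_size:
            field['enum'] = unique
        fields.append(field)
    return fields


# Hypothetical dataset with three columns.
DATA = '''participant_id,weight_kg,smoker
1,70.5,yes
2,82.0,no
3,65.2,yes
'''

fields = infer_fields(DATA)
for field in fields:
    print(field)
```

A description still has to be added by the researcher afterwards, since it cannot be inferred from the data values.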
Excel workbooks contain tabular data tables across named worksheets.
This vlmd extraction tool provides the ability to extract vlmd from all of these worksheets either as a combined data dictionary or as multiple data dictionaries.
"},{"location":"vlmd/extract/exceldata/#run-the-vlmd-command","title":"Run the vlmd
command","text":"CLI: vlmd extract --inputtype excel-data myexcelfile.xlsx\n
"},{"location":"vlmd/extract/exceldata/#to-output-multiple-sheets-as-separate-data-dictionaries","title":"To output multiple sheets as separate data dictionaries","text":"from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(input_filepath=\"myexcelfile.xlsx\",inputtype=\"excel-data\")\n
"},{"location":"vlmd/extract/exceldata/#to-extract-multiple-sheets-as-one-data-dictionary","title":"To extract multiple sheets as one data dictionary","text":"Note
Be careful about using multiple_data_dicts=False
. In most instances, one sheet should correspond to one separate data table and thus have one corresponding data dictionary.
Note, this combines (i.e., concatenates) all data tables and then infers fields. This use case is when sheets are viewed as \"chunks\" of one resource/dataset.
from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(\n input_filepath=\"myexcelfile.xlsx\",\n inputtype=\"excel-data\",\n multiple_data_dicts=False\n )\n
"},{"location":"vlmd/extract/exceldata/#to-extract-a-subset-of-sheets-as-one-data-dictionary","title":"To extract a subset of sheets as one data dictionary","text":"from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(\n input_filepath=\"myexcelfile.xlsx\",\n inputtype=\"excel-data\",\n multiple_data_dicts=False,\n sheet_name=[\"mysheet1\",\"mysheet2\"]\n )\n
"},{"location":"vlmd/extract/frictionlessschema/","title":"Frictionless Table Schema","text":"While the vlmd specifications are designed (and are still being developed) to support interoperability with the HEAL platform, minor naming translations may be needed. This function supports any such translations (e.g., frictionless fields
name --> heal data_dictionary
)
Note, this conversion supports either yaml
or json
format (currently only tests for json
format but should work with yaml).
Below are the official frictionless table schema specifications, which have a high degree of overlap with the HEAL variable level metadata specifications.
See here for the frictionless table schema specs
"},{"location":"vlmd/extract/frictionlessschema/#run-the-vlmd-command","title":"Run the vlmd
command","text":"vlmd extract --inputtype frictionless data/frictionless_dataset1.frictionless.schema.json\n
"},{"location":"vlmd/extract/redcapcsv/","title":"REDCap: Data Dictionary CSV Export","text":"For users collecting data in a REDCap data management system, HEAL-compliant data dictionaries can be generated directly from REDCap exports.
The REDCap data dictionary export provides variable-level metadata in a standardized, tabular format and is generally easy to export. The HEAL Data Utilities leverages this user experience and standardized format to enable HEAL researchers to generate a HEAL-compliant data dictionary.
"},{"location":"vlmd/extract/redcapcsv/#export-your-redcap-data-dictionary","title":"Export your REDCap data dictionary","text":"To download a REDCap CSV export, do the following*:
1. Go to the Data dictionary
page. A link to this page may be available on the project side bar (see image below) or in the Project Setup tab
at the top of your page.
2. On the Data dictionary
page, click on Download the current data dictionary
to export the dictionary (see below).
*There may be slight differences depending on your specific REDCap instance and version.
"},{"location":"vlmd/extract/redcapcsv/#run-the-vlmd-command","title":"Run the vlmd
command","text":"vlmd extract --inputtype redcap input/example_redcap_demo.redcap.csv
"},{"location":"vlmd/extract/sas/","title":"SAS sas7bdat
(and sas7bcat
) files","text":"To accommodate SAS users, the HEAL Data Utilities supports the binary sas7bdat
file format, which contains the actual data values (observations/records). This file also includes variable metadata (variable names
and variable labels/ descriptions
).
The HEAL Data Utilities also provides the option to include a catalog file \u2013 sas7bcat
format - with the sas7bdat
. A sas7bcat
file contains variable value labels, or encodings
, that can be mapped onto the corresponding data from a sas7bdat
file.
sas7bdat
and a sas7bcat
file","text":"Many SAS users build formats and labels into their data processing and analysis scripts. In this section, we provide syntax that can be easily copy-pasted into these existing workflows to create sas7bdat
and sas7bcat
files to input into the vlmd
tool.
This script template can be run separately or inserted directly at the end of a SAS user's workflow.
Note
If inserted directly, remember to delete the lines with %INCLUDE
.
/*1. Read in data file without value labels and run full code. \n Note: The most important pieces to run here are the PROC FORMAT statement(s) and any data steps \n that assign formats and variable labels which are needed for the data dictionary. You may have defined variable labels and values in separate scripts for different analyses. In order to capture all your defined variable labels and values across scripts, you will need an %INCLUDE statement for each SAS script that defines unique variable labels or value labels.*/\n\n%INCLUDE \"<INSERT SAS SCRIPT HERE FILE PATH HERE>\"; /* THIS WILL RUN A SEPARATE SAS SCRIPT*/\n%INCLUDE \"<INSERT SAS SCRIPT HERE FILE PATH HERE>\"; /* THIS WILL RUN A SECOND SEPARATE SAS SCRIPT*/ \n\n/*2. Output the format catalog (sas7bcat) */\n/*2a. If you do not have an out directory, assign one to output the SAS catalog and data file. If you already have an out directory assigned, skip this step and replace \u201cout\u201d with your out directory libname in the flow*/\n\nlibname out \"<INSERT THE DESIRED LOCATION (FILE PATH) TO YOUR SAS7BCAT AND SAS7BDAT FILES HERE>\";\n\n/*2b. Output the format catalog.\n Note: The format catalog is automatically stored in work.formats. This step copies the format file to the \n out directory as a sas7bcat file.*/\nproc catalog cat=work.FORMATS;\n copy out=out.FORMATS;\n run;\n\n/*3. Output the data file (sas7bdat) */\ndata out.yourdata;\n set <INSERT THE NAME OF YOUR FINAL SAS DATASET HERE>;\n run;\n
The below SAS syntax is an example of how to use the template within your SAS workflow.
The below sample script creates all of our variable and value labels. Your workflow may include multiple SAS scripts with multiple format statements and may include analyses and other PROC calls for data exploration, but for demonstration purposes, this example only uses one script and focuses on defining the variable and value labels.
Example my_existing_sas_workflow.sas
/*1. Read in input data */\nproc import datafile=\"myprojectfolder/input/mydata.csv\"\n out=raw\n dbms=csv replace;\n getnames=yes;\nrun;\n\n/*2. Set up proc format and apply formats and variable labels in data step */\n/*Create encodings (value labels)*/\nproc format;\n VALUE YESNO\n 0 =\"No\"\n 1 =\"Yes\";\n\n VALUE PUBLIC\n 1='State mental health authority (SMHA)'\n 2='Other state government agency or department'\n 3='Regional/district authority or county, local, or municipal government'\n 4='Tribal government'\n 5='Indian Health Service'\n 6='Department of Veterans Affairs'\n 7='Other';\n\n VALUE FOCUS\n 1='Mental health treatment'\n 2='Substance abuse treatment'\n 3='Mix of mental health and substance abuse treatment (neither is primary)'\n 4='General health care'\n 5='Other service focus';\n\n**Apply formats to dataset;\ndata processed;\n set raw;\n\n /*Assign formats*/\n format YOUNGADULTS TREATPSYCHOTHRPY TREATTRAUMATHRPY YESNO. FOCUS FOCUS. PUBLIC PUBLIC.;\n /*Add variable labels*/\n label YOUNGADULTS=\"Accepts young adults (aged 18-25 years old) for Tx\"\n TREATPSYCHOTHRPY=\"Facility offers individual psychotherapy\"\n TREATTRAUMATHRPY=\"Facility offers trauma therapy\"\n FOCUS=\"Primary treatment focus of facility\"\n PUBLIC=\"Public agency or department that operates facility\";\nrun;\n
This second script called my_output.sas
is the filled out template. Note the %INCLUDE
statement that calls my_existing_sas_workflow.sas
/*1. Read in data file without value labels and run full code. \n Note: The most important pieces to run here are the PROC FORMAT statement(s) and any data steps \n that assign formats and variable labels which are needed for the data dictionary. You may have defined variable labels and values in separate scripts for different analyses. In order to capture all your defined variable labels and values across scripts, you will need an %INCLUDE statement for each SAS script that defines unique variable labels or value labels.*/\n\n%INCLUDE \"myprojectfolder/my_existing_sas_workflow.sas\"; /* THIS WILL RUN A SEPARATE SAS SCRIPT*/\n\n/*2. Output the format catalog (sas7bcat) */\n/*2a. If you do not have an out directory, assign one to output the SAS catalog and data file.*/\nlibname out \"myprojectfolder/output\";\n\n/*2b. Output the format catalog.\n Note: The format catalog is automatically stored in work.formats. This step copies the format file to the \n out directory as a sas7bcat file.*/\nproc catalog cat=work.FORMATS;\n copy out=out.FORMATS;\n run;\n\n/*3. Output the data file (sas7bdat) to your output folder*/\ndata out.yourdata;\n set processed;\n run;\n
"},{"location":"vlmd/extract/sas/#run-the-vlmd-command","title":"Run the vlmd
command","text":"After creating the necessary sas7bdat
and sas7bcat
files, you can then run the vlmd
command. The tool will automatically detect the sas7bcat file if it is located in the same directory as your data file. If not detected, the command will run without the sas7bcat catalog file and the encodings
(i.e., value labels) will not be extracted from the catalog file.
vlmd extract --inputtype sas input/data.sas7bdat
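Before running the command, it can be useful to confirm that a catalog file sits next to the data file. The sketch below shows one way to check this with pathlib. It is not the detection logic used by the vlmd tool, which is not documented here; the function name find_catalog is hypothetical.

```python
from pathlib import Path


def find_catalog(data_path):
    '''Return the first .sas7bcat file in the same directory as data_path, or None.'''
    directory = Path(data_path).parent
    if not directory.is_dir():
        return None
    # Sort so the result is deterministic when several catalogs are present.
    catalogs = sorted(directory.glob('*.sas7bcat'))
    return catalogs[0] if catalogs else None


catalog = find_catalog('input/data.sas7bdat')
if catalog is None:
    print('No sas7bcat file found; value labels would not be extracted.')
else:
    print('Found catalog:', catalog)
```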
"},{"location":"vlmd/extract/spss/","title":"SPSS .sav
files","text":"For SPSS users, the HEAL Data Utilities generates HEAL-compliant data dictionaries from SPSS's default file format for storing datasets: a SAV
file. It stores not only the data itself but also metadata such as variable names, variable labels, types, and value labels. The HEAL Data Utilities extracts these data and metadata to create HEAL-compliant data dictionaries.
vlmd
command","text":"vlmd extract --inputtype spss data/example_pyreadstat_output.sav
"},{"location":"vlmd/extract/stata/","title":"Stata .dta
files","text":"For Stata users, the HEAL Data Utilities generates HEAL-compliant data dictionaries through Stata's default file format: a DTA
file. DTA
files store not only the data itself but also metadata such as variable names, variable labels, types, and value labels.
vlmd
command","text":"vlmd extract --inputtype stata data/mydatafile.dta
"},{"location":"vlmd/schemas/","title":"HEAL data dictionary schemas","text":"Click on each data dictionary schema below to view information about each format's data dictionary properties (such as a description, examples, etc).
CSV fields
JSON data dictionary
Note
enum
type means that a field can only be one of a certain set of possible values.
Variable level metadata individual fields integrated into the variable level metadata object within the HEAL platform metadata service.
Note, only name
and description
are required. Listed at the end of the description are suggested \"priority\" levels in brackets (e.g., []):
1. [Required]: Needs to be filled out to be valid.
2. [Highly recommended]: Greatly helps when using the data dictionary but is not required.
3. [Optional, if applicable]: May only be applicable to certain fields.
4. [Autopopulated, if not filled]: These fields are intended to be autopopulated from other fields but can be filled out if desired.
5. [Experimental]: These fields are not currently used but are in development.
module root moduleType: string
The section, form, survey instrument, set of measures or other broad category used to group variables.
Examples:\"Demographics\"\n
\"PROMIS\"\n
\"Substance use\"\n
\"Medical History\"\n
\"Sleep questions\"\n
\"Physical activity\"\nname Required root nameType: string
The name of a variable (i.e., field) as it appears in the data.
[Required]
title root titleType: stringThe human-readable title or label of the variable.
[Highly recommended]
Example:\"My Variable (for name of my_variable)\"\ndescription Required root descriptionType: string
An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).
[Required]
Examples:\"Definition\"\n
\"Question text (if a survey)\"\ntype root typeType: enum (of string)
A classification or category of a particular data element or property expected or allowed in the dataset.
number
(A numeric value with optional decimal places. (e.g., 3.14))
integer
(A whole number without decimal places. (e.g., 42))
string
(A sequence of characters. (e.g., \\\"test\\\"))
any
(Any type of data is allowed. (e.g., true))
boolean
(A binary value representing true or false. (e.g., true))
date
(A specific calendar date. (e.g., \\\"2023-05-25\\\"))
datetime
(A specific date and time, including timezone information. (e.g., \\\"2023-05-25T10:30:00Z\\\"))
time
(A specific time of day. (e.g., \\\"10:30:00\\\"))
year
(A specific year. (e.g., 2023))
yearmonth
(A specific year and month. (e.g., \\\"2023-05\\\"))
duration
(A length of time. (e.g., \\\"PT1H\\\"))
geopoint
(A pair of latitude and longitude coordinates. (e.g., [51.5074, -0.1278]))
A format taken from one of the frictionless specification schemas. For example, for tabular data, there is the Table Schema specification.
Each format is dependent on the type
specified. For example: If type
is \"string\", then see the String formats. If type
is one of the date-like formats, then see Date formats.
A format for a date variable (date, time, datetime).
* default: An ISO8601 format string.
* any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies.
* {PATTERN}: The value can be parsed according to {PATTERN}, which MUST follow the date formatting syntax of C / Python strftime.
Examples:
* %Y-%m-%d (for date, e.g., 2023-05-25)
* %Y%m%d (for date without dashes, e.g., 20230525)
* %Y-%m-%dT%H:%M:%S (for datetime, e.g., 2023-05-25T10:30:45)
* %Y-%m-%dT%H:%M:%SZ (for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z)
* %Y-%m-%dT%H:%M:%S%z (for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300)
* %Y-%m-%dT%H:%M (for datetime without seconds, e.g., 2023-05-25T10:30)
* %Y-%m-%dT%H (for datetime without minutes and seconds, e.g., 2023-05-25T10)
* %H:%M:%S (for time, e.g., 10:30:45)
* %H:%M:%SZ (for time with UTC timezone, e.g., 10:30:45Z)
* %H:%M:%S%z (for time with timezone offset, e.g., 10:30:45+0300)
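Since a {PATTERN} format follows Python strftime syntax, a pattern can be tried against sample values with the standard library before it is recorded in a data dictionary. The sketch below is a minimal check of that kind; the helper name matches_format is hypothetical, and the sample values are taken from the examples above.

```python
from datetime import datetime

# (pattern, sample value) pairs; the date-without-dashes pattern is written as %Y%m%d.
EXAMPLES = [
    ('%Y-%m-%d', '2023-05-25'),
    ('%Y%m%d', '20230525'),
    ('%Y-%m-%dT%H:%M:%S', '2023-05-25T10:30:45'),
    ('%Y-%m-%dT%H:%M:%S%z', '2023-05-25T10:30:45+0300'),
    ('%H:%M:%S', '10:30:45'),
]


def matches_format(value, pattern):
    '''Return True if value can be parsed with the given strftime-style pattern.'''
    try:
        datetime.strptime(value, pattern)
        return True
    except ValueError:
        return False


for pattern, value in EXAMPLES:
    print(pattern, value, matches_format(value, pattern))
```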
The two types of formats for geopoint
(describing a geographic point).
A JSON array or a string parsable as a JSON array where each item is a number with the first as the latitude and the second as longitude.
root format anyOf Geopoint Format oneOf item 1Type: objectContains latitude and longitude with two keys (\"lat\" and \"long\") with number items mapped to each key.
root format anyOf geojsonType: enum (of string)The JSON object according to the geojson spec.
Must be one of:Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.
[Optional,if applicable]
constraints.enum root constraints.enumType: stringConstrains possible values to a set of values.
[Optional,if applicable]
Must match regular expression:^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$
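In the CSV format, a set of possible values is written as a single pipe-delimited string. The sketch below checks a string against the pattern quoted above (with the JSON escaping removed) and splits it into a list. The helper name parse_pipe_list and the sample value are assumptions made for illustration.

```python
import re

# The pattern quoted above, unescaped from its JSON form.
PIPE_LIST_PATTERN = re.compile(r'^(?:[^|]+\||[^|]*)(?:[^|]*\|)*[^|]*$')


def parse_pipe_list(text):
    '''Split a pipe-delimited string such as '1|2|3' into a list of values.'''
    if not PIPE_LIST_PATTERN.match(text):
        raise ValueError('not a pipe-delimited list: {!r}'.format(text))
    return text.split('|')


print(parse_pipe_list('1|2|3|4'))
```

The same pipe-delimited convention is used by the missingValues, trueValues, and falseValues properties.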
constraints.pattern root constraints.patternType: string A regular expression pattern the data MUST conform to.
[Optional,if applicable]
constraints.maximum root constraints.maximumType: integer Specifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer etc). Note, this is different than the maxLength property.
[Optional,if applicable]
encodings root encodingsType: string Variable value encodings provide a way to further annotate any value within any variable type, making values easier to understand.
Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.
Additionally, as another use case, this field provides a way to store categoricals that are stored as \"short\" labels (such as abbreviations).
[Optional,if applicable]
Must match regular expression:^(?:.*?=.*?(?:\\||$))+$
Examples: \"0=No|1=Yes\"\n
\"HW=Hello world|GBW=Good bye world|HM=Hi,Mike\"\nordered root orderedType: boolean
Indicates whether a categorical variable is ordered. This variable is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).
[Optional,if applicable]
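The encodings string format described above (value=label pairs separated by |) can be turned into a lookup table in a few lines. This is a sketch only: the regular expression is the one quoted for encodings with the JSON escaping removed, and the helper name parse_encodings is hypothetical.

```python
import re

# The encodings pattern quoted above, unescaped from its JSON form.
ENCODINGS_PATTERN = re.compile(r'^(?:.*?=.*?(?:\||$))+$')


def parse_encodings(text):
    '''Convert an encodings string such as '0=No|1=Yes' into a dict.'''
    if not ENCODINGS_PATTERN.match(text):
        raise ValueError('not a valid encodings string: {!r}'.format(text))
    mapping = {}
    for pair in text.split('|'):
        if not pair:
            continue  # tolerate a trailing pipe
        value, sep, label = pair.partition('=')
        if not sep:
            raise ValueError('missing "=" in pair: {!r}'.format(pair))
        mapping[value] = label
    return mapping


print(parse_encodings('0=No|1=Yes'))
print(parse_encodings('HW=Hello world|GBW=Good bye world|HM=Hi,Mike'))
```

The explicit check for "=" in each pair is stricter than the pattern alone, which is a design choice of this sketch rather than of the specification.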
missingValues root missingValuesType: stringA list of missing values specific to a variable.
[Optional, if applicable]
Must match regular expression:^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$
trueValues root trueValuesType: string For boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.
[Optional, if applicable]
Must match regular expression:^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$
Examples: \"Required|REQUIRED\"\n
\"required|Yes|Y|Checked\"\n
\"Checked\"\n
\"Required\"\nfalseValues root falseValuesType: string
For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.
Must match regular expression:^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$
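As an illustration of how trueValues and falseValues might be applied, the sketch below casts raw strings to booleans using two pipe-delimited lists. The trueValues string comes from the examples above; the falseValues string and the helper name cast_boolean are hypothetical.

```python
def cast_boolean(value, true_values, false_values):
    '''Cast a raw string to True/False using pipe-delimited value lists.

    Returns None when the value appears in neither list.
    '''
    trues = set(true_values.split('|'))
    falses = set(false_values.split('|'))
    if value in trues:
        return True
    if value in falses:
        return False
    return None


TRUE_VALUES = 'required|Yes|Y|Checked'
FALSE_VALUES = 'No|N|Unchecked'  # hypothetical example values

for raw in ['Checked', 'N', 'maybe']:
    print(raw, cast_boolean(raw, TRUE_VALUES, FALSE_VALUES))
```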
repo_link root repo_linkType: string A link to the variable as it exists on the home repository, if applicable
cde_id.source root cde_id.sourceType: string cde_id.id root cde_id.idType: string ontology_id.relation root ontology_id.relationType: string ontology_id.source root ontology_id.sourceType: string ontology_id.id root ontology_id.idType: string standardsMappings.type root standardsMappings.typeType: stringThe type of mapping linked to a published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]
Examples:\"cde\"\n
\"ontology\"\n
\"reference_list\"\nstandardsMappings.label root standardsMappings.labelType: string
A free text label of a mapping indicating a mapping(s) to a published set of standard variables such as the NIH Common Data Elements program.
[Autopopulated, if not filled]
Examples:\"substance use\"\n
\"chemical compound\"\n
\"promis\"\nstandardsMappings.url root standardsMappings.urlType: stringFormat: uri
The url that links out to the published, standardized mapping.
[Autopopulated, if not filled]
Example:\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\nstandardsMappings.source root standardsMappings.sourceType: string
The source of the standardized variable.
Example:\"TBD (will have controlled vocabulary)\"\nstandardsMappings.id root standardsMappings.idType: string
The id locating the individual mapping within the given source.
relatedConcepts.type root relatedConcepts.typeType: stringThe type of mapping to a published set of concepts related to the given field such as ontological information (eg., NCI thesaurus, bioportal etc)
[Autopopulated, if not filled]
relatedConcepts.label root relatedConcepts.labelType: stringA free text label of mapping to a published set of concepts related to the given field such as ontological information (eg., NCI thesaurus, bioportal etc)
[Autopopulated, if not filled]
relatedConcepts.url root relatedConcepts.urlType: stringFormat: uriThe url that links out to the published, standardized concept.
[Autopopulated, if not filled]
Example:\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\nrelatedConcepts.source root relatedConcepts.sourceType: string
The source of the related concept.
[Autopopulated, if not filled]
Example:\"TBD (will have controlled vocabulary)\"\nrelatedConcepts.id root relatedConcepts.idType: string
The id locating the individual mapping within the given source.
[Autopopulated, if not filled]
univarStats.median root univarStats.medianType: number univarStats.mean root univarStats.meanType: number univarStats.std root univarStats.stdType: number univarStats.min root univarStats.minType: number univarStats.max root univarStats.maxType: number univarStats.mode root univarStats.modeType: number univarStats.count root univarStats.countType: integerValue must be greater or equal to 0
Additional Properties of any type are allowed.
root additionalPropertiesType: objectGenerated using json-schema-for-humans on 2023-07-05 at 17:11:06 -0500
"},{"location":"vlmd/schemas/json-data-dictionary/","title":"JSON data dictionary","text":"Variable Level Metadata (Data Dictionaries) Variable Level Metadata (Data Dictionaries) Type: objectThis schema defines the variable level metadata for one data dictionary for a given study.Note a given study can have multiple data dictionaries
title Required root titleType: string description root descriptionType: string data_dictionary Required root data_dictionaryType: array of object Each item of this array must be: root data_dictionary HEAL Variable Level Metadata FieldsType: objectVariable level metadata individual fields integrated into the variable level metadata object within the HEAL platform metadata service.
Note, only name
and description
are required. Listed at the end of the description are suggested \"priority\" levels in brackets (e.g., []):
1. [Required]: Needs to be filled out to be valid.
2. [Highly recommended]: Greatly helps when using the data dictionary but is not required.
3. [Optional, if applicable]: May only be applicable to certain fields.
4. [Autopopulated, if not filled]: These fields are intended to be autopopulated from other fields but can be filled out if desired.
5. [Experimental]: These fields are not currently used but are in development.
module root data_dictionary HEAL Variable Level Metadata Fields moduleType: string
The section, form, survey instrument, set of measures or other broad category used to group variables.
Examples:\"Demographics\"\n
\"PROMIS\"\n
\"Substance use\"\n
\"Medical History\"\n
\"Sleep questions\"\n
\"Physical activity\"\nname Required root data_dictionary HEAL Variable Level Metadata Fields nameType: string
The name of a variable (i.e., field) as it appears in the data.
[Required]
title root data_dictionary HEAL Variable Level Metadata Fields titleType: stringThe human-readable title or label of the variable.
[Highly recommended]
Example:\"My Variable (for name of my_variable)\"\ndescription Required root data_dictionary HEAL Variable Level Metadata Fields descriptionType: string
An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).
[Required]
Examples:\"Definition\"\n
\"Question text (if a survey)\"\ntype root data_dictionary HEAL Variable Level Metadata Fields typeType: enum (of string)
A classification or category of a particular data element or property expected or allowed in the dataset.
number
(A numeric value with optional decimal places. (e.g., 3.14))
integer
(A whole number without decimal places. (e.g., 42))
string
(A sequence of characters. (e.g., \\\"test\\\"))
any
(Any type of data is allowed. (e.g., true))
boolean
(A binary value representing true or false. (e.g., true))
date
(A specific calendar date. (e.g., \\\"2023-05-25\\\"))
datetime
(A specific date and time, including timezone information. (e.g., \\\"2023-05-25T10:30:00Z\\\"))
time
(A specific time of day. (e.g., \\\"10:30:00\\\"))
year
(A specific year. (e.g., 2023))
yearmonth
(A specific year and month. (e.g., \\\"2023-05\\\"))
duration
(A length of time. (e.g., \\\"PT1H\\\"))
geopoint
(A pair of latitude and longitude coordinates. (e.g., [51.5074, -0.1278]))
A format taken from one of the frictionless specification schemas. For example, for tabular data, there is the Table Schema specification.
Each format is dependent on the type
specified. For example: If type
is \"string\", then see the String formats. If type
is one of the date-like formats, then see Date formats.
A format for a date variable (date
,time
,datetime
). \\n\\t* default: An ISO8601 format string. \\n\\t* any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies. \\n\\t* {PATTERN}: The value can be parsed according to {PATTERN}
, which MUST
follow the date formatting syntax of C / Python strftime.
\\nExamples:
%Y-%m-%d
(for date, e.g., 2023-05-25) %Y%m%d
(for date without dashes, e.g., 20230525) %Y-%m-%dT%H:%M:%S
(for datetime, e.g., 2023-05-25T10:30:45) %Y-%m-%dT%H:%M:%SZ
(for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z) %Y-%m-%dT%H:%M:%S%z
(for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300) %Y-%m-%dT%H:%M
(for datetime without seconds, e.g., 2023-05-25T10:30) %Y-%m-%dT%H
(for datetime without minutes and seconds, e.g., 2023-05-25T10) %H:%M:%S
(for time, e.g., 10:30:45) %H:%M:%SZ
(for time with UTC timezone, e.g., 10:30:45Z) %H:%M:%S%z
(for time with timezone offset, e.g., 10:30:45+0300)
The two types of formats for geopoint
(describing a geographic point).
A JSON array or a string parsable as a JSON array where each item is a number with the first as the latitude and the second as longitude.
root data_dictionary HEAL Variable Level Metadata Fields format anyOf Geopoint Format oneOf item 1Type: objectContains latitude and longitude with two keys (\"lat\" and \"long\") with number items mapped to each key.
root data_dictionary HEAL Variable Level Metadata Fields format anyOf geojsonType: enum (of string)The JSON object according to the geojson spec.
Must be one of:Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.
[Optional,if applicable]
enum root data_dictionary HEAL Variable Level Metadata Fields constraints enumType: arrayConstrains possible values to a set of values.
[Optional,if applicable]
pattern root data_dictionary HEAL Variable Level Metadata Fields constraints patternType: stringA regular expression pattern the data MUST conform to.
[Optional,if applicable]
maximum root data_dictionary HEAL Variable Level Metadata Fields constraints maximumType: integerSpecifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer etc). Note, this is different from the maxLength property.
[Optional,if applicable]
encodings root data_dictionary HEAL Variable Level Metadata Fields encodingsType: objectVariable value encodings provide a way to further annotate any value within any variable type, making values easier to understand.
Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.
Additionally, as another use case, this field provides a way to store categoricals that are stored as \"short\" labels (such as abbreviations).
[Optional,if applicable]
Examples:{\n\"0\": \"No\",\n\"1\": \"Yes\"\n}\n
{\n\"HW\": \"Hello world\",\n\"GBW\": \"Good bye world\",\n\"HM\": \"Hi, Mike\"\n}\nordered root data_dictionary HEAL Variable Level Metadata Fields orderedType: boolean
Indicates whether a categorical variable is ordered. This variable is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).
[Optional,if applicable]
missingValues root data_dictionary HEAL Variable Level Metadata Fields missingValuesType: arrayA list of missing values specific to a variable.
[Highly recommended]
trueValues root data_dictionary HEAL Variable Level Metadata Fields trueValuesType: array of stringFor boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.
[Optional, if applicable]
Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields trueValues trueValues itemsType: string Examples:\"Required\"\n
\"REQUIRED\"\n
\"required\"\n
\"Yes\"\n
\"Checked\\\"\"\nfalseValues root data_dictionary HEAL Variable Level Metadata Fields falseValuesType: array
For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.
repo_link root data_dictionary HEAL Variable Level Metadata Fields repo_linkType: stringA link to the variable as it exists on the home repository, if applicable
cde_id root data_dictionary HEAL Variable Level Metadata Fields cde_idType: array of object[FUTURE WARNING: WILL BE DEPRECATED] Use standardsMapping
. The source and id for the NIH Common Data Elements program.
[FUTURE WARNING: WILL BE DEPRECATED] - Use relatedConcepts
. Ontological information for the given variable as indicated by the source, id, and relation to the specified classification. One or more ontology classifications can be specified.
A published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]
Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings itemsType: object type root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items typeType: stringThe type of mapping linked to a published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]
Examples:\"cde\"\n
\"ontology\"\n
\"reference_list\"\nlabel root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items labelType: string
A free text label of a mapping indicating a mapping(s) to a published set of standard variables such as the NIH Common Data Elements program.
[Autopopulated, if not filled]
Examples:\"substance use\"\n
\"chemical compound\"\n
\"promis\"\nurl root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items urlType: stringFormat: uri
The url that links out to the published, standardized mapping.
[Autopopulated, if not filled]
Example:\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\nsource root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items sourceType: string
The source of the standardized variable.
Example:\"TBD (will have controlled vocabulary)\"\nid root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items idType: string
The id locating the individual mapping within the given source.
relatedConcepts root data_dictionary HEAL Variable Level Metadata Fields relatedConceptsType: array of objectMappings to a published set of concepts related to the given field such as ontological information (e.g., NCI thesaurus, bioportal, etc.) [Autopopulated, if not filled]
Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts itemsType: object type root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items typeType: stringThe type of mapping to a published set of concepts related to the given field such as ontological information (e.g., NCI thesaurus, bioportal, etc.)
[Autopopulated, if not filled]
label root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items labelType: stringA free text label of mapping to a published set of concepts related to the given field such as ontological information (e.g., NCI thesaurus, bioportal, etc.)
[Autopopulated, if not filled]
url root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items urlType: stringFormat: uriThe url that links out to the published, standardized concept.
[Autopopulated, if not filled]
Example:\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\nsource root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items sourceType: string
The source of the related concept.
[Autopopulated, if not filled]
Example:\"TBD (will have controlled vocabulary)\"\nid root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items idType: string
The id locating the individual mapping within the given source.
[Autopopulated, if not filled]
univarStats root data_dictionary HEAL Variable Level Metadata Fields univarStatsType: objectUnivariate statistics inferred from the data about the given variable
[Experimental]
median root data_dictionary HEAL Variable Level Metadata Fields univarStats medianType: number mean root data_dictionary HEAL Variable Level Metadata Fields univarStats meanType: number std root data_dictionary HEAL Variable Level Metadata Fields univarStats stdType: number min root data_dictionary HEAL Variable Level Metadata Fields univarStats minType: number max root data_dictionary HEAL Variable Level Metadata Fields univarStats maxType: number mode root data_dictionary HEAL Variable Level Metadata Fields univarStats modeType: number count root data_dictionary HEAL Variable Level Metadata Fields univarStats countType: integerValue must be greater or equal to 0
Additional Properties of any type are allowed.
root data_dictionary HEAL Variable Level Metadata Fields additionalPropertiesType: objectGenerated using json-schema-for-humans on 2023-07-03 at 09:08:41 -0500
"},{"location":"vlmd/start/","title":"Start
from a template","text":"Some folks may prefer to create their HEAL data dictionary from scratch. To support this, we have created a utility that creates either a json or csv template.
Warning
Currently, the command is template
but will change to start
to be consistent with the verb subcommand vocabulary.
csv
template","text":"The HEAL Data Utilities can also input a csv
HEAL data dictionary either from a manually filled out template or as an additional step after further annotation (e.g., from the csv
HEAL data dictionary output of the other file formats).
To create a template csv
version with 10 fields (variables):
vlmd template myhealdd.csv --numfields 10\n
from pathlib import Path\n\nfrom healdata_utils import write_vlmd_template\n\n# Write a csv template with 10 empty fields\nwrite_vlmd_template(Path(\"myhealdd.csv\"),numfields=10)\n
Click here to download an example of a filled out csv HEAL data dictionary template
"},{"location":"vlmd/start/#json-template","title":"json
template","text":"While the csv
HEAL data dictionary provides a tabular format for HEAL-compliant data dictionaries, ultimately, these csv data dictionary files are converted to a json file (the most common format to store and exchange data within web applications such as the HEAL Data Platform).
Another advantage of json
HEAL data dictionaries is that one can specify metadata describing the data dictionary as a whole (e.g., the description
and title
).
To create a template json
version with 10 fields (variables):
vlmd template myhealdd.json --numfields 10\n
from pathlib import Path\n\nfrom healdata_utils import write_vlmd_template\n\n# Write a json template with 10 empty fields\nwrite_vlmd_template(Path(\"myhealdd.json\"),numfields=10)\n
Click here to download an example of filled out json HEAL data dictionary template
"},{"location":"vlmd/validate/","title":"Validate
Check (validate) an existing HEAL data dictionary file","text":"Will indicate if the data dictionary complies with the HEAL specifications.
Command line interface (CLI)Pythonvlmd validate data/myhealcsvdd.csv\n\nvlmd validate data/myhealjsondd.json\n
from healdata_utils import validate_vlmd_csv,validate_vlmd_json\n\nvalidate_vlmd_csv(\"data/myhealcsvdd.csv\")\n\nvalidate_vlmd_json(\"data/myhealjsondd.json\")\n
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"HEAL Data Utilities","text":"The HEAL Data Utilities python package provides data packaging tools for the HEAL Data Ecosystem to facilitate data discovery, sharing, and harmonization on the HEAL Platform.
Currently, the focus of this repository is generating standardized variable level metadata (VLMD) in the form of data dictionaries. See the quick start section to get started without installing any of the prerequisites. (Click here for the Variable-level Metadata documentation section).
However, in the future, this will be expanded for all HEAL-specific data packaging functions (e.g., study- and file-level metadata and data).
"},{"location":"#quick-start","title":"Quick start","text":"Note
If using the quick start option, no prerequisites are required.
Double click on the vlmd
(or vlmd.exe
) executable or run the vlmd
executable without any arguments to quickly start using this tool. This \"quick start\" will walk you through step by step, prompting you for the various options.
Important
Standalone applications for different operating systems are available here. These allow you to run the vlmd
tool without needing to install anything else. Just (1) download, (2) unzip, and (3) double click on the vlmd
application icon.
While the HEAL Data Utilities should be compatible with most versions of Python, you can download the latest version of Python here and install it on your local computer. We recommend installing Python version 3.10.
"},{"location":"#installation","title":"Installation","text":"To install the latest official release of healdata-utils, from your computer's command prompt, run:
pip install healdata-utils
OR for the most up-to-date unreleased version run:
pip install git+https://github.com/norc-heal/healdata-utils.git
Note
Installing the unreleased version requires having git
software installed.
Variable level metadata (VLMD), in the form of standardized data dictionaries, provides an exciting opportunity:
For an example of this searchability in the context of study level metadata, see the platform's discovery page
When data is available, VLMD provides a way to validate the data as well.
Supports HEAL projects and goals such as the common data elements program
extract
: Extract the variable level metadata from an existing file with a specific type/format
start
: Start a data dictionary from an empty template
validate
: Check (validate) an existing HEAL data dictionary file against the HEAL specifications, for example after filling out a template or after annotating a data dictionary extracted from a different format.
Typical workflows for creating a HEAL-compliant data dictionary include:
Create your data dictionary
(a) Run the vlmd extract
command (or convert_to_vlmd
if in python) to generate a HEAL-compliant data dictionary via your desired input format
(b) Run the vlmd template
command to start from an empty template.
Add/annotate with additional information in your preferred HEAL data dictionary format (either json
or csv
).
csv
data dictionaryjson
data dictionaryRun the vlmd validate
command with your HEAL data dictionary as the input to validate.
Repeat (2) and (3) until you are ready to submit. Please note, currently only name
and description
are required.
Important
The main difference* between the CSV and JSON definitions lies in the way the data dictionaries are structured and the additional metadata included in the JSON data dictionary.
The CSV data dictionary is a plain tabular representation with no additional metadata, while the JSON data dictionary includes fields along with additional metadata in the form of a root description and title.
For more information on variable-level metadata properties (fields), see the csv
field specification and json
data dictionary specification.
Extract
VLMD from another data type and format","text":"The healdata-utils variable-level metadata (vlmd) tool accepts a variety of input file types and extracts HEAL-compliant data dictionaries (JSON and CSV formats). Additionally, exported validation (i.e., \"error\") reports tell the user a) whether the exported data dictionary is valid according to HEAL specifications and b) how to modify the data dictionary to make it HEAL-compliant.
Warning
Currently the python subcommand is convert
but will be changed to extract_to_vlmd
to be consistent with CLI. extract
was chosen to better reflect the functionality.
vlmd extract --inputtype spss myproject/myfile.sav\n
Note
To continue, it's recommended to go to the input types and formats. Also, for more details on the different flags/options, run vlmd --help
from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(input_filepath=\"myproject/myfile.sav\",inputtype=\"spss\")\n
Note
To continue, it's recommended to go to the input types and formats. For a complete set of options with convert_to_vlmd
see the docstring (if in a notebook, one can enter convert_to_vlmd?
)
This section lists the specific syntax for each supported input type for generating HEAL-compliant data dictionaries. Additional instructions on how to obtain the necessary input files/software are also provided.
Note
To further annotate your outputted data dictionaries, see the variable-level metadata field properties (with examples) for either the csv data dictionary
click here or the json data dictionary
click here.
Extract variable level metadata from your data:
Both the python and command line routes will result in a JSON and CSV version of the HEAL data dictionary in the output folder along with the validation reports in the errors
folder. See below:
errors/heal-csv-errors.json
: outputted validation report for the table in the csv file against the frictionless schema. If valid, this file will contain:
{\n\"valid\": true,\n\"errors\": []\n}\n
- errors/heal-json-errors.json
: outputted jsonschema validation report. {\n\"valid\": true,\n\"errors\": []\n}\n
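A script can check these reports before a data dictionary is submitted. The following is a minimal sketch, assuming the reports have the shape shown above (a `valid` flag and an `errors` list) and sit in an `errors` folder relative to the working directory. The helper names `report_is_valid` and `load_report` are hypothetical and are not part of healdata-utils.

```python
import json
from pathlib import Path


def report_is_valid(report):
    # Hypothetical helper: True only when the report is flagged valid
    # and its error list is empty.
    return bool(report.get('valid')) and not report.get('errors')


def load_report(path):
    # Read one exported validation report from disk.
    return json.loads(Path(path).read_text())


if __name__ == '__main__':
    # Report locations are an assumption based on the file names above.
    for name in ('errors/heal-csv-errors.json', 'errors/heal-json-errors.json'):
        path = Path(name)
        if path.exists():
            status = 'valid' if report_is_valid(load_report(path)) else 'needs fixes'
            print(name, status)
```

A report with a non-empty `errors` list is treated as failing even if `valid` is set, which errs on the side of re-checking the data dictionary.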
If no outputdir
is specified, the resulting HEAL-compliant data dictionaries will be named:
heal-csvtemplate-data-dictionary.csv
: This is the CSV data dictionaryheal-jsontemplate-data-dictionary.json
: This is the JSON version of the data dictionarycsv
Datasets","text":"CSV (comma-separated values) is the main open tabular data format for storage and exchange. It is easy to create and understand using basic text editors in addition to popular spreadsheet software like Google Sheets and Excel. Importantly, CSVs are simple and can be easily integrated into web applications and just about any software.
Currently, the HEAL Data Utilities vlmd
function can generate a minimal, HEAL-compliant data dictionary by inferring name
, type
, and enum
(i.e., possible values). After this minimal data dictionary is generated, the researcher can further annotate it with fields' description
and other optional properties in either the HEAL-compliant csv
- or json
-formatted data dictionary (see the HEAL data dictionary template sections below for more information).
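To make the idea of inferring `name`, `type`, and `enum` concrete, here is a rough sketch using only the Python standard library. It is not the inference logic used by healdata-utils. The function names `infer_type` and `infer_fields`, the `max_enum` threshold, and the nesting of `enum` under `constraints` are all assumptions made for illustration.

```python
import csv
import io


def infer_type(values):
    # Try the narrowest type first: integer, then number, else string.
    if not values:
        return 'string'
    for type_name, cast in (('integer', int), ('number', float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return 'string'


def infer_fields(csv_text, max_enum=5):
    # Build one minimal field description per csv column.
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = []
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows if row[name] not in ('', None)]
        field = {'name': name, 'type': infer_type(values)}
        distinct = sorted(set(values))
        # Treat a short list of repeated string values as an enum constraint.
        if field['type'] == 'string' and len(distinct) <= max_enum and len(distinct) < len(values):
            field['constraints'] = {'enum': distinct}
        fields.append(field)
    return fields
```

Each returned field still lacks a `description`, which is why the annotation step described above is needed before the data dictionary can validate.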
Excel workbooks contain tabular data tables across named worksheets.
This vlmd extraction tool provides the ability to extract vlmd from all of these worksheets either as a combined data dictionary or as multiple data dictionaries.
"},{"location":"vlmd/extract/exceldata/#run-the-vlmd-command","title":"Run thevlmd
command","text":"CLIPython vlmd extract --inputtype excel-data myexcelfile.xlsx\n
"},{"location":"vlmd/extract/exceldata/#to-output-multiple-sheets-as-separate-data-dictionaries","title":"To output multiple sheets as separate data dictionaries","text":"from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(input_filepath=\"myexcelfile.xlsx\",inputtype=\"excel-data\")\n
"},{"location":"vlmd/extract/exceldata/#to-extract-multiple-sheets-as-one-data-dictionary","title":"To extract multiple sheets as one data dictionary","text":"Note
Be careful about using the multiple_data_dicts=False
. In most instances, one sheet should correspond to one separate data table and thus have one corresponding data dictionary.
Note, this combines (i.e., concatenates) all data tables and then infers fields. This use case applies when sheets are viewed as \"chunks\" of one resource/dataset.
from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(\n input_filepath=\"myexcelfile.xlsx\",\n inputtype=\"excel-data\",\n multiple_data_dicts=False\n )\n
"},{"location":"vlmd/extract/exceldata/#to-extract-a-subset-of-sheets-as-one-data-dictionary","title":"To extract a subset of sheets as one data dictionary","text":"from healdata_utils import convert_to_vlmd\n\nconvert_to_vlmd(\n input_filepath=\"myexcelfile.xlsx\",\n inputtype=\"excel-data\",\n multiple_data_dicts=False,\n sheet_name=[\"mysheet1\",\"mysheet2\"]\n )\n
"},{"location":"vlmd/extract/frictionlessschema/","title":"Frictionless Table Schema","text":"While the vlmd specifications are designed (and still being developed) to support interoperability with the HEAL platform, minor naming translations may be needed. This function supports any of these translations (e.g., frictionless fields
name --> heal data_dictionary
)
Note, this conversion supports either yaml
or json
format (currently only tested with json
format but should work with yaml).
Below are the official frictionless table schema specifications, which overlap to a high degree with the HEAL variable level metadata specifications.
See here for the frictionless table schema specs
"},{"location":"vlmd/extract/frictionlessschema/#run-the-vlmd-command","title":"Run thevlmd
command","text":"vlmd extract --inputtype frictionless data/frictionless_dataset1.frictionless.schema.json\n
"},{"location":"vlmd/extract/redcapcsv/","title":"REDCap: Data Dictionary CSV Export","text":"For users collecting data in a REDCap data management system, HEAL-compliant data dictionaries can be generated directly from REDCap exports.
The REDCap data dictionary export provides variable-level metadata in a standardized, tabular format and is generally easy to produce. The HEAL Data Utilities leverage this familiar workflow and standardized format to enable HEAL researchers to generate a HEAL-compliant data dictionary.
"},{"location":"vlmd/extract/redcapcsv/#export-your-redcap-data-dictionary","title":"Export your Redcap data dictionary","text":"To download a REDCap CSV export, do the following*:
Data dictionary
page. A link to this page may be available on the project side bar (see image below) or in the Project Setup tab
at the top of your page.Data dictionary
page, click on Download the current data dictionary
to export the dictionary (see below).*there may be slight differences depending on your specific REDCap instance and version
"},{"location":"vlmd/extract/redcapcsv/#run-the-vlmd-command","title":"Run thevlmd
command","text":"vlmd extract --inputtype redcap input/example_redcap_demo.redcap.csv
"},{"location":"vlmd/extract/sas/","title":"SAS sas7bdat
(and sas7bcat
) files","text":"To accommodate SAS users, the HEAL Data Utilities supports the binary sas7bdat
file format, which contains the actual data values (observations/records). This file also includes variable metadata (variable names
and variable labels/ descriptions
).
The HEAL Data Utilities also provides the option to include a catalog file \u2013 sas7bcat
format - with the sas7bdat
. A sas7bcat
file contains variable value labels, or encodings
, that can be mapped onto the corresponding data from a sas7bdat
file.
sas7bdat
and a sas7bcat
file","text":"Many SAS users build formats and labels into their data processing and analysis scripts. In this section, we provide syntax that can be easily copy-pasted into these existing workflows to create sas7bdat
and sas7bcat
files to input into the vlmd
tool.
This script template can be run separately or inserted directly at the end of a SAS user's workflow.
Note
If inserted directly, remember to delete the lines with %INCLUDE
)
/*1. Read in data file without value labels and run full code. \n Note: The most important pieces to run here are the PROC FORMAT statement(s) and any data steps \n that assign formats and variable labels which are needed for the data dictionary. You may have defined variable labels and values in separate scripts for different analyses. In order to capture all your defined variable labels and values across scripts, you will need an %INCLUDE statement for each SAS script that defines unique variable labels or value labels.*/\n\n%INCLUDE \"<INSERT SAS SCRIPT HERE FILE PATH HERE>\"; /* THIS WILL RUN A SEPARATE SAS SCRIPT*/\n%INCLUDE \"<INSERT SAS SCRIPT HERE FILE PATH HERE>\"; /* THIS WILL RUN A SECOND SEPARATE SAS SCRIPT*/ \n\n/*2. Output the format catalog (sas7bcat) */\n/*2a. If you do not have an out directory, assign one to output the SAS catalog and data file. If you already have an out directory assigned, skip this step and replace \u201cout\u201d with your out directory libname in the flow*/\n\nlibname out \"<INSERT THE DESIRED LOCATION (FILE PATH) TO YOUR SAS7BCAT AND SAS7BDAT FILES HERE>\";\n\n/*2b. Output the format catalog.\n Note: The format catalog is automatically stored in work.formats. This step copies the format file to the \n out directory as a sas7bcat file.*/\nproc catalog cat=work.FORMATS;\n copy out=out.FORMATS;\n run;\n\n/*3. Output the data file (sas7bdat) */\ndata out.yourdata;\n set <INSERT THE NAME OF YOUR FINAL SAS DATASET HERE>;\n run;\n
The below SAS syntax is an example of how to use the template within your SAS workflow.
The below sample script creates all of our variable and value labels. Your workflow may include multiple SAS scripts with multiple format statements and may include analyses and other PROC calls for data exploration, but for demonstration purposes, this example only uses one script and focuses on defining the variable and value labels.
Example my_existing_sas_workflow.sas/*1. Read in input data */\nproc import datafile=\"myprojectfolder/input/mydata.csv\"\n out=raw\n dbms=csv replace;\n getnames=yes;\nrun;\n\n/*2. Set up proc format and apply formats and variable labels in data step */\n/*Create encodings (value labels); each VALUE statement ends with a semicolon*/\nproc format;\n VALUE YESNO\n 0 =\"No\"\n 1 =\"Yes\";\n\n VALUE PUBLIC\n 1='State mental health authority (SMHA)'\n 2='Other state government agency or department'\n 3='Regional/district authority or county, local, or municipal government'\n 4='Tribal government'\n 5='Indian Health Service'\n 6='Department of Veterans Affairs'\n 7='Other';\n\n VALUE FOCUS\n 1='Mental health treatment'\n 2='Substance abuse treatment'\n 3='Mix of mental health and substance abuse treatment (neither is primary)'\n 4='General health care'\n 5='Other service focus';\nrun;\n\n**Apply formats to dataset;\ndata processed;\n set raw;\n\n /*Assign formats*/\n format YOUNGADULTS TREATPSYCHOTHRPY TREATTRAUMATHRPY YESNO. FOCUS FOCUS. PUBLIC PUBLIC.;\n /*Add variable labels*/\n label YOUNGADULTS=\"Accepts young adults (aged 18-25 years old) for Tx\"\n TREATPSYCHOTHRPY=\"Facility offers individual psychotherapy\"\n TREATTRAUMATHRPY=\"Facility offers trauma therapy\"\n FOCUS=\"Primary treatment focus of facility\"\n PUBLIC=\"Public agency or department that operates facility\";\nrun;\n
This second script called my_output.sas
is the filled out template. Note the %INCLUDE
statement that runs my_existing_sas_workflow.sas
/*1. Read in data file without value labels and run full code. \n Note: The most important pieces to run here are the PROC FORMAT statement(s) and any data steps \n that assign formats and variable labels which are needed for the data dictionary. You may have defined variable labels and values in separate scripts for different analyses. In order to capture all your defined variable labels and values across scripts, you will need an %INCLUDE statement for each SAS script that defines unique variable labels or value labels.*/\n\n%INCLUDE \"myprojectfolder/my_existing_sas_workflow.sas\"; /* THIS WILL RUN A SEPARATE SAS SCRIPT*/\n\n/*2. Output the format catalog (sas7bcat) */\n/*2a. If you do not have an out directory, assign one to output the SAS catalog and data file.*/\nlibname out \"myprojectfolder/output\";\n\n/*2b. Output the format catalog.\n Note: The format catalog is automatically stored in work.formats. This step copies the format file to the \n out directory as a sas7bcat file.*/\nproc catalog cat=work.FORMATS;\n copy out=out.FORMATS;\n run;\n\n/*3. Output the data file (sas7bdat) to your output folder*/\ndata out.yourdata;\n set processed;\n run;\n
"},{"location":"vlmd/extract/sas/#run-the-vlmd-command","title":"Run the vlmd
command","text":"After creating the necessary sas7bdat
and sas7bcat
files, you can then run the vlmd
command. The tool will automatically detect the sas7bcat file if it is located in the same directory as your data file. If it is not detected, the command will run without the sas7bcat catalog file and the encodings
(i.e., value labels) will not be extracted from the catalog file.
vlmd extract --inputtype sas input/data.sas7bdat
"},{"location":"vlmd/extract/spss/","title":"SPSS .sav
files","text":"For SPSS users, the HEAL Data Utilities generates HEAL-compliant data dictionaries from SPSS's default file format for storing datasets: a SAV
file. It stores not only the data itself but also metadata such as variable names, variable labels, types, and value labels. The HEAL Data Utilities extracts these data and metadata to create HEAL-compliant data dictionaries.
vlmd
command","text":"vlmd extract --inputtype spss data/example_pyreadstat_output.sav
"},{"location":"vlmd/extract/stata/","title":"Stata .dta
files","text":"For Stata users, the HEAL Data Utilities generates HEAL-compliant data dictionaries through Stata's default file format: a DTA
file. DTA
files store not only the data itself but also metadata such as variable names, variable labels, types, and value labels.
vlmd
command","text":"vlmd extract --inputtype stata data/mydatafile.dta
"},{"location":"vlmd/schemas/","title":"HEAL data dictionary schemas","text":"Click on each data dictionary schema below to view information about each format's data dictionary properties (such as a description, examples, etc).
CSV fields
JSON data dictionary
Note
enum
type means that a field can only be one of a certain set of possible values.
These are the individual variable level metadata fields integrated into the variable level metadata object within the HEAL platform metadata service.
Note, only name
and description
are required. Listed at the end of the description are suggested \"priority\" levels in brackets (e.g., []): 1. [Required]: Needs to be filled out to be valid. 2. [Highly recommended]: Greatly helps in using the data dictionary but is not required. 3. [Optional, if applicable]: May only be applicable to certain fields. 4. [Autopopulated, if not filled]: These fields are intended to be autopopulated from other fields but can be filled out if desired. 5. [Experimental]: These fields are not currently used but are in development. module root moduleType: string
The section, form, survey instrument, set of measures or other broad category used to group variables.
Examples:\"Demographics\"\n
\"PROMIS\"\n
\"Substance use\"\n
\"Medical History\"\n
\"Sleep questions\"\n
\"Physical activity\"\nname Required root nameType: string
The name of a variable (i.e., field) as it appears in the data.
[Required]
title root titleType: stringThe human-readable title or label of the variable.
[Highly recommended]
Example:\"My Variable (for name of my_variable)\"\ndescription Required root descriptionType: string
An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).
[Required]
Examples:\"Definition\"\n
\"Question text (if a survey)\"\ntype root typeType: enum (of string)
A classification or category of a particular data element or property expected or allowed in the dataset.
number
(A numeric value with optional decimal places. (e.g., 3.14))integer
(A whole number without decimal places. (e.g., 42))string
(A sequence of characters. (e.g., \\\"test\\\"))any
(Any type of data is allowed. (e.g., true))boolean
(A binary value representing true or false. (e.g., true))date
(A specific calendar date. (e.g., \\\"2023-05-25\\\"))datetime
(A specific date and time, including timezone information. (e.g., \\\"2023-05-25T10:30:00Z\\\"))time
(A specific time of day. (e.g., \\\"10:30:00\\\"))year
(A specific year. (e.g., 2023))yearmonth
(A specific year and month. (e.g., \\\"2023-05\\\"))duration
(A length of time. (e.g., \\\"PT1H\\\"))geopoint
(A pair of latitude and longitude coordinates. (e.g., [51.5074, -0.1278]))A format taken from one of the frictionless specification schemas. For example, for tabular data, there is the Table Schema specification
Each format is dependent on the type
specified. For example: If type
is \"string\", then see the String formats. If type
is one of the date-like formats, then see Date formats.
A format for a date variable (date
,time
,datetime
). \\n\\t* default: An ISO8601 format string. \\n\\t* any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies. \\n\\t* {PATTERN}: The value can be parsed according to {PATTERN}
, which MUST
follow the date formatting syntax of C / Python strftime.
\\nExamples:
%Y-%m-%d
(for date, e.g., 2023-05-25) %Y%m%d
(for date without dashes, e.g., 20230525) %Y-%m-%dT%H:%M:%S
(for datetime, e.g., 2023-05-25T10:30:45) %Y-%m-%dT%H:%M:%SZ
(for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z) %Y-%m-%dT%H:%M:%S%z
(for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300) %Y-%m-%dT%H:%M
(for datetime without seconds, e.g., 2023-05-25T10:30) %Y-%m-%dT%H
(for datetime without minutes and seconds, e.g., 2023-05-25T10) %H:%M:%S
(for time, e.g., 10:30:45) %H:%M:%SZ
(for time with UTC timezone, e.g., 10:30:45Z) %H:%M:%S%z
(for time with timezone offset, e.g., 10:30:45+0300)
The two types of formats for geopoint
(describing a geographic point).
A JSON array or a string parsable as a JSON array where each item is a number with the first as the latitude and the second as longitude.
root format anyOf Geopoint Format oneOf item 1Type: objectContains latitude and longitude with two keys (\"lat\" and \"long\") with number items mapped to each key.
root format anyOf geojsonType: enum (of string)The JSON object according to the geojson spec.
Must be one of:Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.
[Optional, if applicable]
constraints.enum root constraints.enumType: stringConstrains possible values to a set of values.
[Optional, if applicable]
Must match regular expression:^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$
constraints.pattern root constraints.patternType: string A regular expression pattern the data MUST conform to.
[Optional, if applicable]
constraints.maximum root constraints.maximumType: integerSpecifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer, etc.). Note, this is different from the maxLength property.
[Optional, if applicable]
encodings root encodingsType: stringVariable value encodings provide a way to further annotate any value within any variable type, making values easier to understand.
Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.
Additionally, as another use case, this field provides a way to store categoricals that are stored as \"short\" labels (such as abbreviations).
[Optional, if applicable]
Must match regular expression:^(?:.*?=.*?(?:\\||$))+$
Examples: \"0=No|1=Yes\"\n
\"HW=Hello world|GBW=Good bye world|HM=Hi,Mike\"\nordered root orderedType: boolean
Indicates whether a categorical variable is ordered. This property is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).
[Optional, if applicable]
missingValues root missingValuesType: stringA list of missing values specific to a variable.
[Optional, if applicable]
Must match regular expression:^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$
trueValues root trueValuesType: string For boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.
[Optional, if applicable]
Must match regular expression:^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$
Examples: \"Required|REQUIRED\"\n
\"required|Yes|Y|Checked\"\n
\"Checked\"\n
\"Required\"\nfalseValues root falseValuesType: string
For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.
Must match regular expression:^(?:[^|]+\\||[^|]*)(?:[^|]*\\|)*[^|]*$
repo_link root repo_linkType: string A link to the variable as it exists on the home repository, if applicable
cde_id.source root cde_id.sourceType: string cde_id.id root cde_id.idType: string ontology_id.relation root ontology_id.relationType: string ontology_id.source root ontology_id.sourceType: string ontology_id.id root ontology_id.idType: string standardsMappings.type root standardsMappings.typeType: stringThe type of mapping linked to a published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]
Examples:\"cde\"\n
\"ontology\"\n
\"reference_list\"\nstandardsMappings.label root standardsMappings.labelType: string
A free text label of a mapping indicating a mapping(s) to a published set of standard variables such as the NIH Common Data Elements program.
[Autopopulated, if not filled]
Examples:\"substance use\"\n
\"chemical compound\"\n
\"promis\"\nstandardsMappings.url root standardsMappings.urlType: stringFormat: uri
The url that links out to the published, standardized mapping.
[Autopopulated, if not filled]
Example:\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\nstandardsMappings.source root standardsMappings.sourceType: string
The source of the standardized variable.
Example:\"TBD (will have controlled vocabulary)\"\nstandardsMappings.id root standardsMappings.idType: string
The id locating the individual mapping within the given source.
relatedConcepts.type root relatedConcepts.typeType: stringThe type of mapping to a published set of concepts related to the given field such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.)
[Autopopulated, if not filled]
relatedConcepts.label root relatedConcepts.labelType: stringA free text label of a mapping to a published set of concepts related to the given field such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.)
[Autopopulated, if not filled]
relatedConcepts.url root relatedConcepts.urlType: stringFormat: uriThe url that links out to the published, standardized concept.
[Autopopulated, if not filled]
Example:\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\nrelatedConcepts.source root relatedConcepts.sourceType: string
The source of the related concept.
[Autopopulated, if not filled]
Example:\"TBD (will have controlled vocabulary)\"\nrelatedConcepts.id root relatedConcepts.idType: string
The id locating the individual mapping within the given source.
[Autopopulated, if not filled]
univarStats.median root univarStats.medianType: number univarStats.mean root univarStats.meanType: number univarStats.std root univarStats.stdType: number univarStats.min root univarStats.minType: number univarStats.max root univarStats.maxType: number univarStats.mode root univarStats.modeType: number univarStats.count root univarStats.countType: integerValue must be greater than or equal to 0
Additional Properties of any type are allowed.
root additionalPropertiesType: objectGenerated using json-schema-for-humans on 2023-07-05 at 17:11:06 -0500
"},{"location":"vlmd/schemas/json-data-dictionary/","title":"JSON data dictionary","text":"Variable Level Metadata (Data Dictionaries) Variable Level Metadata (Data Dictionaries) Type: objectThis schema defines the variable level metadata for one data dictionary for a given study.Note a given study can have multiple data dictionaries
title Required root titleType: string description root descriptionType: string data_dictionary Required root data_dictionaryType: array of object Each item of this array must be: root data_dictionary HEAL Variable Level Metadata FieldsType: objectVariable level metadata individual fields integrated into the variable level metadata object within the HEAL platform metadata service.
Note, only name
and description
are required. Listed at the end of the description are suggested \"priority\" levels in brackets (e.g., []): 1. [Required]: Needs to be filled out to be valid. 2. [Highly recommended]: Greatly helps when using the data dictionary but is not required. 3. [Optional, if applicable]: May only be applicable to certain fields. 4. [Autopopulated, if not filled]: These fields are intended to be autopopulated from other fields but can be filled out if desired. 5. [Experimental]: These fields are not currently used but are in development. module root data_dictionary HEAL Variable Level Metadata Fields moduleType: string
The section, form, survey instrument, set of measures or other broad category used to group variables.
Examples:\"Demographics\"\n
\"PROMIS\"\n
\"Substance use\"\n
\"Medical History\"\n
\"Sleep questions\"\n
\"Physical activity\"\nname Required root data_dictionary HEAL Variable Level Metadata Fields nameType: string
The name of a variable (i.e., field) as it appears in the data.
[Required]
title root data_dictionary HEAL Variable Level Metadata Fields titleType: stringThe human-readable title or label of the variable.
[Highly recommended]
Example:\"My Variable (for name of my_variable)\"\ndescription Required root data_dictionary HEAL Variable Level Metadata Fields descriptionType: string
An extended description of the variable. This could be the definition of a variable or the question text (e.g., if a survey).
[Required]
Examples:\"Definition\"\n
\"Question text (if a survey)\"\ntype root data_dictionary HEAL Variable Level Metadata Fields typeType: enum (of string)
A classification or category of a particular data element or property expected or allowed in the dataset.
number
(A numeric value with optional decimal places. (e.g., 3.14))integer
(A whole number without decimal places. (e.g., 42))string
(A sequence of characters. (e.g., \\\"test\\\"))any
(Any type of data is allowed. (e.g., true))boolean
(A binary value representing true or false. (e.g., true))date
(A specific calendar date. (e.g., \\\"2023-05-25\\\"))datetime
(A specific date and time, including timezone information. (e.g., \\\"2023-05-25T10:30:00Z\\\"))time
(A specific time of day. (e.g., \\\"10:30:00\\\"))year
(A specific year. (e.g., 2023))yearmonth
(A specific year and month. (e.g., \\\"2023-05\\\"))duration
(A length of time. (e.g., \\\"PT1H\\\"))geopoint
(A pair of latitude and longitude coordinates. (e.g., [51.5074, -0.1278]))A format taken from one of the frictionless specification schemas. For example, for tabular data, there is the Table Schema specification
Each format is dependent on the type
specified. For example: If type
is \"string\", then see the String formats. If type
is one of the date-like formats, then see Date formats.
A format for a date variable (date
,time
,datetime
). \\n\\t* default: An ISO8601 format string. \\n\\t* any: Any parsable representation of a date/time/datetime. The implementing library can attempt to parse the datetime via a range of strategies. \\n\\t* {PATTERN}: The value can be parsed according to {PATTERN}
, which MUST
follow the date formatting syntax of C / Python strftime.
\\nExamples:
%Y-%m-%d
(for date, e.g., 2023-05-25) %Y%m%d
(for date without dashes, e.g., 20230525) %Y-%m-%dT%H:%M:%S
(for datetime, e.g., 2023-05-25T10:30:45) %Y-%m-%dT%H:%M:%SZ
(for datetime with UTC timezone, e.g., 2023-05-25T10:30:45Z) %Y-%m-%dT%H:%M:%S%z
(for datetime with timezone offset, e.g., 2023-05-25T10:30:45+0300) %Y-%m-%dT%H:%M
(for datetime without seconds, e.g., 2023-05-25T10:30) %Y-%m-%dT%H
(for datetime without minutes and seconds, e.g., 2023-05-25T10) %H:%M:%S
(for time, e.g., 10:30:45) %H:%M:%SZ
(for time with UTC timezone, e.g., 10:30:45Z) %H:%M:%S%z
(for time with timezone offset, e.g., 10:30:45+0300)
The two types of formats for geopoint
(describing a geographic point).
A JSON array or a string parsable as a JSON array where each item is a number with the first as the latitude and the second as longitude.
root data_dictionary HEAL Variable Level Metadata Fields format anyOf Geopoint Format oneOf item 1Type: objectContains latitude and longitude with two keys (\"lat\" and \"long\") with number items mapped to each key.
root data_dictionary HEAL Variable Level Metadata Fields format anyOf geojsonType: enum (of string)The JSON object according to the geojson spec.
Must be one of:Indicates the maximum length of an iterable (e.g., array, string, or object). For example, if 'Hello World' is the longest value of a categorical variable, this would be a maxLength of 11.
[Optional, if applicable]
enum root data_dictionary HEAL Variable Level Metadata Fields constraints enumType: arrayConstrains possible values to a set of values.
[Optional, if applicable]
pattern root data_dictionary HEAL Variable Level Metadata Fields constraints patternType: stringA regular expression pattern the data MUST conform to.
[Optional, if applicable]
maximum root data_dictionary HEAL Variable Level Metadata Fields constraints maximumType: integerSpecifies the maximum value of a field (e.g., maximum -- or most recent -- date, maximum integer, etc.). Note, this is different from the maxLength property.
[Optional, if applicable]
encodings root data_dictionary HEAL Variable Level Metadata Fields encodingsType: objectVariable value encodings provide a way to further annotate any value within any variable type, making values easier to understand.
Many analytic software programs (e.g., SPSS, Stata, and SAS) use numerical encodings and some algorithms only support numerical values. Encodings (and mappings) allow categorical values to be stored as numerical values.
Additionally, as another use case, this field provides a way to store categoricals that are stored as \"short\" labels (such as abbreviations).
[Optional, if applicable]
Examples:{\n\"0\": \"No\",\n\"1\": \"Yes\"\n}\n
{\n\"HW\": \"Hello world\",\n\"GBW\": \"Good bye world\",\n\"HM\": \"Hi, Mike\"\n}\nordered root data_dictionary HEAL Variable Level Metadata Fields orderedType: boolean
Indicates whether a categorical variable is ordered. This property is relevant for variables that have an ordered relationship but not necessarily a numerical relationship (e.g., Strongly disagree < Disagree < Neutral < Agree).
[Optional, if applicable]
missingValues root data_dictionary HEAL Variable Level Metadata Fields missingValuesType: arrayA list of missing values specific to a variable.
[Highly recommended]
trueValues root data_dictionary HEAL Variable Level Metadata Fields trueValuesType: array of stringFor boolean (true) variable (as defined in type field), this field allows a physical string representation to be cast as true (increasing readability of the field). It can include one or more values.
[Optional, if applicable]
Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields trueValues trueValues itemsType: string Examples:\"Required\"\n
\"REQUIRED\"\n
\"required\"\n
\"Yes\"\n
\"Checked\"\nfalseValues root data_dictionary HEAL Variable Level Metadata Fields falseValuesType: array
For boolean (false) variable (as defined in type field), this field allows a physical string representation to be cast as false (increasing readability of the field) that is not a standard false value. It can include one or more values.
repo_link root data_dictionary HEAL Variable Level Metadata Fields repo_linkType: stringA link to the variable as it exists on the home repository, if applicable
cde_id root data_dictionary HEAL Variable Level Metadata Fields cde_idType: array of object[FUTURE WARNING: WILL BE DEPRECATED] Use standardsMapping
. The source and id for the NIH Common Data Elements program.
[FUTURE WARNING: WILL BE DEPRECATED] - Use relatedConcepts
. Ontological information for the given variable as indicated by the source, id, and relation to the specified classification. One or more ontology classifications can be specified.
A published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]
Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings itemsType: object type root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items typeType: stringThe type of mapping linked to a published set of standard variables such as the NIH Common Data Elements program. [Autopopulated, if not filled]
Examples:\"cde\"\n
\"ontology\"\n
\"reference_list\"\nlabel root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items labelType: string
A free text label of a mapping indicating a mapping(s) to a published set of standard variables such as the NIH Common Data Elements program.
[Autopopulated, if not filled]
Examples:\"substance use\"\n
\"chemical compound\"\n
\"promis\"\nurl root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items urlType: stringFormat: uri
The url that links out to the published, standardized mapping.
[Autopopulated, if not filled]
Example:\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\nsource root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items sourceType: string
The source of the standardized variable.
Example:\"TBD (will have controlled vocabulary)\"\nid root data_dictionary HEAL Variable Level Metadata Fields standardsMappings standardsMappings items idType: string
The id locating the individual mapping within the given source.
relatedConcepts root data_dictionary HEAL Variable Level Metadata Fields relatedConceptsType: array of objectMappings to a published set of concepts related to the given field such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.) [Autopopulated, if not filled]
Each item of this array must be: root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts itemsType: object type root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items typeType: stringThe type of mapping to a published set of concepts related to the given field such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.)
[Autopopulated, if not filled]
label root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items labelType: stringA free text label of a mapping to a published set of concepts related to the given field such as ontological information (e.g., NCI Thesaurus, BioPortal, etc.)
[Autopopulated, if not filled]
url root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items urlType: stringFormat: uriThe url that links out to the published, standardized concept.
[Autopopulated, if not filled]
Example:\"https://cde.nlm.nih.gov/deView?tinyId=XyuSGdTTI\"\nsource root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items sourceType: string
The source of the related concept.
[Autopopulated, if not filled]
Example:\"TBD (will have controlled vocabulary)\"\nid root data_dictionary HEAL Variable Level Metadata Fields relatedConcepts relatedConcepts items idType: string
The id locating the individual mapping within the given source.
[Autopopulated, if not filled]
univarStats root data_dictionary HEAL Variable Level Metadata Fields univarStatsType: objectUnivariate statistics inferred from the data about the given variable
[Experimental]
median root data_dictionary HEAL Variable Level Metadata Fields univarStats medianType: number mean root data_dictionary HEAL Variable Level Metadata Fields univarStats meanType: number std root data_dictionary HEAL Variable Level Metadata Fields univarStats stdType: number min root data_dictionary HEAL Variable Level Metadata Fields univarStats minType: number max root data_dictionary HEAL Variable Level Metadata Fields univarStats maxType: number mode root data_dictionary HEAL Variable Level Metadata Fields univarStats modeType: number count root data_dictionary HEAL Variable Level Metadata Fields univarStats countType: integerValue must be greater than or equal to 0
Additional Properties of any type are allowed.
root data_dictionary HEAL Variable Level Metadata Fields additionalPropertiesType: objectGenerated using json-schema-for-humans on 2023-07-03 at 09:08:41 -0500
"},{"location":"vlmd/start/","title":"Start
from a template","text":"Some folks may prefer to create their HEAL data dictionary from scratch. To support this, we have created a utility that creates either a json or csv template.
Warning
Currently, the command is template
but will change to start
to be consistent with the verb subcommand vocabulary.
csv
template","text":"The HEAL Data Utilities can also accept a csv
HEAL data dictionary either from a manually filled out template or as an additional step after further annotation (e.g., from the csv
HEAL data dictionary output of the other file formats).
To create a template csv
version with 10 fields (variables):
vlmd template myhealdd.csv --numfields 10\n
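from pathlib import Path\nfrom healdata_utils import write_vlmd_template\n\ntmpdir = Path(\".\")  # assumed output directory; any writable folder works\nwrite_vlmd_template(tmpdir.joinpath(\"heal.csv\"),numfields=10)\n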
Click here to download an example of a filled out csv HEAL data dictionary template
"},{"location":"vlmd/start/#json-template","title":"json
template","text":"While the csv
HEAL data dictionary provides a tabular format for HEAL-compliant data dictionaries, ultimately, these csv data dictionary files are converted to a json file (the most common format to store and exchange data within web applications such as the HEAL Data Platform).
Another advantage of json
HEAL data dictionaries is that one can specify metadata describing the data dictionary as a whole (e.g., the description
and title
).
To create a template json
version with 10 fields (variables):
vlmd template myhealdd.json --numfields 10\n
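from pathlib import Path\nfrom healdata_utils import write_vlmd_template\n\ntmpdir = Path(\".\")  # assumed output directory; any writable folder works\nwrite_vlmd_template(tmpdir.joinpath(\"heal.json\"),numfields=10)\n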
Click here to download an example of a filled out json HEAL data dictionary template
"},{"location":"vlmd/validate/","title":"Validate
Check (validate) an existing HEAL data dictionary file","text":"Indicates whether the data dictionary complies with the HEAL specifications.
Command line interface (CLI)Pythonvlmd validate data/myhealcsvdd.csv\n\nvlmd validate data/myhealjsondd.json\n
from healdata_utils import validate_vlmd_csv,validate_vlmd_json\n\nvalidate_vlmd_csv(\"data/myhealcsvdd.csv\")\n\nvalidate_vlmd_json(\"data/myhealjsondd.json\")\n
"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 77def968778ef4a35e2ef438d57c1ec98c1226a5..71e91679102e7b2a40153c6612fedf0fa02319bc 100644
GIT binary patch
delta 15
Wcmcb>bb*OYzMF%iY2ijTFGc_-as-M1
delta 15
Wcmcb>bb*OYzMF%?YW_wxFGc_*xCB4|
diff --git a/vlmd/extract/exceldata/index.html b/vlmd/extract/exceldata/index.html
index 412a258..d92922d 100644
--- a/vlmd/extract/exceldata/index.html
+++ b/vlmd/extract/exceldata/index.html
@@ -942,20 +942,21 @@ from healdata_utils import convert_to_vlmd
convert_to_vlmd(
- filepath="myexcelfile.xlsx",
+ input_filepath="myexcelfile.xlsx",
inputtype="excel-data",
multiple_data_dicts=False
)
```python
-from healdata_utils import convert_to_vlmd
-convert_to_vlmd( - filepath="myexcelfile.xlsx", - inputtype="excel-data", - multiple_data_dicts=False, - sheet_name=["mysheet1","mysheet2"] - )
+from healdata_utils import convert_to_vlmd
+
+convert_to_vlmd(
+ input_filepath="myexcelfile.xlsx",
+ inputtype="excel-data",
+ multiple_data_dicts=False,
+ sheet_name=["mysheet1","mysheet2"]
+ )
+