Data Extraction
The standardized natural products vocabulary will support several use cases:
- Obtain all Latin binomials, common names, and synonyms of a natural product term.
- Obtain variations of natural product names used in spontaneous reporting systems (such as FAERS).
- Provide standardized terms that can be used to generate pharmacovigilance signals at different levels of granularity, including the botanical natural product name, specific genus or species name, constituent(s), and combination products.
- Obtain all natural products that are part of combination products containing one or more natural products.
- The end results are concept records added to the OMOP/OHDSI standard vocabulary representing natural products and application of the natural products within the pharmacovigilance workflows.
- Obtain the list of substances from the FDA UNII files. Substances in the UNII list correspond to structurally diverse substance UUIDs in the Global Substance Registration System (GSRS).
- Natural products (NP) and NP constituents are extracted from GSRS
- NP common names manually curated from GSRS and Health Canada
- NP spelling variations mapped from FAERS original drug strings to natural product names (manually and automatically).
- Create OMOP/OHDSI standard vocabulary with new 'napdi' concepts
- NP vocabulary concepts are mapped to RxNorm concepts used in FAERS
- (Update 2023) - New FAERS 'HERBALS' strings added to vocabulary after manual mapping
- (Update 2023) - Combination products included in vocabulary and mapped to preferred NP terms
- Concepts and relationships in latest vocabulary version
The Global Substance Registration System (G-SRS), developed by the Ginas Project, is software that assists agencies in registering and documenting information about substances found in medicines. It contains information about natural products (NPs) and their constituents. The database is available for download and local installation here.
In this project, we download and install G-SRS as a PostgreSQL database. G-SRS contains 6 types of substances referenced in the ISO 11238 standard: chemicals, mixtures, polymers, proteins, nucleic acids, and structurally diverse substances. Latin binomial names of NPs, such as Mitragyna speciosa (Kratom) and Cinnamomum verum (Cinnamon), are structurally diverse substances. We extract the substances using their Latin binomial names, parent substances, and parts of the substance (i.e. substances with Latin binomial names as parents).
Health Canada databases such as the Licensed Natural Health Products Database (LNHPD) contain natural product names and synonyms for Latin binomial names of the natural products.
- Code: https://github.com/dbmi-pitt/NaPDI-pv/tree/master/np-terminology-imports/common-names
- Input: a comprehensive list of Latin binomials to search
- Procedure: webprod_to_local_HTML.py obtains HTML output from the site by searching with the Latin binomials. local_HTML_to_common_name.py parses the HTML to output a JSON file with clean Latin binomial to common name mappings. The JSON file is converted to a TSV and loaded into the GSRS database in the same schema as the tables that pull NP data (see above). Currently, the table is named lb_to_common_names_tsv. The file can then be manually edited to add additional common names, or they can be added when the JSON is converted to TSV using convertToTsv.py.
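
The JSON-to-TSV step can be pictured with a minimal sketch. It is not the project's convertToTsv.py: the JSON shape (Latin binomial mapped to a list of common names), the column names, and the `extra` argument for manually curated names are all assumptions made for illustration.

```python
import csv
import io
import json


def json_to_tsv(json_text, extra=None):
    """Flatten {latin_binomial: [common_name, ...]} into TSV text.

    `extra` is a hypothetical hook for manually curated common names.
    """
    mapping = json.loads(json_text)
    for binomial, names in (extra or {}).items():
        mapping.setdefault(binomial, []).extend(names)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Assumed column names; the real lb_to_common_names_tsv layout may differ.
    writer.writerow(["latin_binomial", "common_name"])
    for binomial in sorted(mapping):
        seen = set()
        for name in mapping[binomial]:
            key = name.strip().lower()
            # Skip blanks and case-insensitive repeats of the same name.
            if key and key not in seen:
                seen.add(key)
                writer.writerow([binomial, name.strip()])
    return out.getvalue()


# Illustrative input only, built from the two examples named earlier on this page.
sample = json.dumps({
    "Mitragyna speciosa": ["Kratom", "kratom "],
    "Cinnamomum verum": ["Cinnamon"],
})
tsv_text = json_to_tsv(sample, extra={"Cinnamomum verum": ["Ceylon cinnamon"]})
print(tsv_text)
```

Merging curated names before writing keeps a single pass over the data, so hand-added names go through the same whitespace and duplicate handling as scraped ones.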
RxNorm is generally used to provide normalized names for clinical drugs and to map drug names in spontaneous reports to standardized codes. RxNorm also contains some natural product ingredients and drug forms, which this vocabulary uses to map natural product terms and identify spontaneous reports with standardized codes.
- Create table np_to_rxnorm with exact and substring matches for napdi vocabulary concepts (including constituents) to RxNorm terms - https://github.com/dbmi-pitt/np-terminology-imports/blob/main/scratch/np-vocabulary-mappings-rxnorm.sql.
- Filter to concepts that are used in FAERS reports, then manually annotate combination products (more below).
- Create table np_to_rxnorm_annotated and include in vocabulary workflow.
- Include RxNorm mappings with relationships 'napdi_np_maps_to' and 'napdi_const_maps_to'.
- Many NPs in the reference set and RxNorm mappings are actually combinations of one or more NPs or NP ingredients and are included in the latest vocabulary version. Combination products refer to any product containing one or more natural products (e.g. cinnamon garlic).
- All combination products are manually marked based on an annotation guide in the NP spelling variations, HERBALS strings, and RxNorm concepts.
- These are included in the vocabulary with concept class ID 'NaPDI NP Combination Product' and relationships 'napdi_pt_to_combo' and 'napdi_combo_to_pt'.
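
The exact and substring matching behind np_to_rxnorm is implemented in the SQL script linked above. The Python sketch below only illustrates the idea and is not a translation of that script. The function name, the case-insensitive comparison, and the sample strings (which are made up, not real RxNorm entries) are assumptions.

```python
def match_np_to_rxnorm(np_terms, rxnorm_terms):
    """Return (np_term, rxnorm_term, match_type) tuples.

    match_type is "exact" when the normalized strings are equal and
    "substring" when the NP term occurs inside a longer RxNorm string.
    """
    matches = []
    for np_term in np_terms:
        needle = np_term.strip().lower()
        if not needle:
            continue
        for rx_term in rxnorm_terms:
            haystack = rx_term.strip().lower()
            if haystack == needle:
                matches.append((np_term, rx_term, "exact"))
            elif needle in haystack:
                matches.append((np_term, rx_term, "substring"))
    return matches


# Hypothetical inputs for illustration.
np_terms = ["cinnamon", "garlic"]
rxnorm_terms = ["cinnamon", "cinnamon garlic", "garlic extract"]
matches = match_np_to_rxnorm(np_terms, rxnorm_terms)
for row in matches:
    print(row)
```

In this toy input, "cinnamon garlic" is a substring match for both NPs, which is the kind of candidate the manual combination-product annotation is meant to catch.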
- Update GSRS NPs, common names, and constituents in lb_to_common_names_tsv, test_srs_np, and test_srs_np_constituent (currently in the scratch_sanya_2023 schema of the GSRS database). See GSRS database query notes.
- Update the manually curated NP spelling variations in np_faers_reference_set (currently in the scratch_sanya schema of the CEM database).
- Log into the database in an admin role and drop all prior NAPDI vocabulary concepts, relationships, and concept relationship mappings (see above).
- Run the SQL script NP terminology ETL.
- Test that the vocabulary is working and makes sense using queries like the following:
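
One possible spot check, sketched under assumptions rather than taken from the project, counts concepts per NaPDI concept class and lists natural products that lack a preferred-term relationship. The tables are cut-down stand-ins for the OMOP concept and concept_relationship tables, loaded into an in-memory SQLite database so the sketch is self-contained. The concept names and IDs are placeholders, and the direction of napdi_pt (natural product to preferred term) follows the relationship table on this page.

```python
import sqlite3

check_conn = sqlite3.connect(":memory:")
check_conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    concept_class_id TEXT
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT
);
-- Placeholder rows; custom NaPDI concepts use negative IDs.
INSERT INTO concept VALUES
    (-1, 'np one', 'NaPDI Natural Product'),
    (-2, 'np two', 'NaPDI Natural Product'),
    (-11, 'pt one', 'NaPDI Preferred Term');
INSERT INTO concept_relationship VALUES (-1, -11, 'napdi_pt');
""")

# How many concepts exist in each NaPDI concept class?
class_counts = dict(check_conn.execute("""
    SELECT concept_class_id, COUNT(*)
    FROM concept
    WHERE concept_class_id LIKE 'NaPDI%'
    GROUP BY concept_class_id
"""))

# Which natural products have no preferred term attached?
missing_pt = [row[0] for row in check_conn.execute("""
    SELECT np.concept_name
    FROM concept np
    LEFT JOIN concept_relationship cr
        ON cr.concept_id_1 = np.concept_id
        AND cr.relationship_id = 'napdi_pt'
    WHERE np.concept_class_id = 'NaPDI Natural Product'
        AND cr.concept_id_2 IS NULL
    ORDER BY np.concept_name
""")]

print(class_counts)
print(missing_pt)
```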
NOTE: If an NP has multiple species, the Latin binomials (LBs) for the species can map to different preferred terms (PTs). For example, Glycyrrhiza uralensis, Glycyrrhiza glabra, and Glycyrrhiza inflata all map to different PTs. These mappings are correct in that they map to distinct common names found in GSRS. However, it means that extracting cases for a given NP with multiple species requires the following workflow: list all of the LBs, query the vocabulary for the PT of each, and use the PT concept IDs as a concept set for the NP moving forward. That is not too bad and is similar to how we work with drugs.
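
The multi-species workflow in this note can be sketched as a single lookup. The example uses an in-memory SQLite database with cut-down versions of the OMOP concept and concept_relationship tables. The concept IDs and preferred-term names are placeholders, not the real vocabulary contents, and the helper name `pt_concept_set` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    concept_class_id TEXT
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT
);
-- Placeholder IDs and preferred-term names, for illustration only.
INSERT INTO concept VALUES
    (-1, 'Glycyrrhiza uralensis', 'NaPDI Natural Product'),
    (-2, 'Glycyrrhiza glabra', 'NaPDI Natural Product'),
    (-3, 'Glycyrrhiza inflata', 'NaPDI Natural Product'),
    (-11, 'preferred term A', 'NaPDI Preferred Term'),
    (-12, 'preferred term B', 'NaPDI Preferred Term'),
    (-13, 'preferred term C', 'NaPDI Preferred Term');
INSERT INTO concept_relationship VALUES
    (-1, -11, 'napdi_pt'),
    (-2, -12, 'napdi_pt'),
    (-3, -13, 'napdi_pt');
""")


def pt_concept_set(connection, latin_binomials):
    """Return the distinct PT concept IDs for a list of Latin binomials."""
    placeholders = ",".join("?" for _ in latin_binomials)
    sql = f"""
        SELECT DISTINCT pt.concept_id
        FROM concept np
        JOIN concept_relationship cr
            ON cr.concept_id_1 = np.concept_id
            AND cr.relationship_id = 'napdi_pt'
        JOIN concept pt ON pt.concept_id = cr.concept_id_2
        WHERE np.concept_name IN ({placeholders})
        ORDER BY pt.concept_id
    """
    return [row[0] for row in connection.execute(sql, list(latin_binomials))]


pt_ids = pt_concept_set(
    conn,
    ["Glycyrrhiza uralensis", "Glycyrrhiza glabra", "Glycyrrhiza inflata"],
)
print(pt_ids)
```

The returned list then serves as the concept set for the NP in downstream case extraction.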
NOTE: NP constituent spelling variations do not currently exist in the vocabulary addition. So, for example, Cannabidiol is in the vocabulary but CBD is not. This means that any work with constituents will need to consider spelling variations and add those to the study workflow.
NOTE: Both constituents and spelling variations can match to multiple NP preferred names as per the reference set, so a user/programmer needs to make sure not to have duplicate counts when running queries that rely on either.
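
The duplicate-count risk can be illustrated with a small sketch. The report IDs, term strings, and PT labels are all hypothetical. The point is only that counts should be taken over distinct report IDs rather than summed across preferred terms.

```python
from collections import defaultdict


def reports_by_pt(rows, term_to_pts):
    """Group report IDs by preferred term, keeping each report once per PT."""
    by_pt = defaultdict(set)
    for report_id, term in rows:
        for pt in term_to_pts.get(term, ()):
            by_pt[pt].add(report_id)
    return by_pt


# Hypothetical (report_id, matched term) pairs and term-to-PT mappings.
rows = [(101, "term x"), (102, "term x"), (102, "term y"), (103, "term y")]
term_to_pts = {"term x": {"PT_A", "PT_B"}, "term y": {"PT_B"}}

by_pt = reports_by_pt(rows, term_to_pts)

# Summing per-PT counts counts reports 101 and 102 twice.
naive_total = sum(len(ids) for ids in by_pt.values())
# Taking the union of report IDs counts each report once.
distinct_total = len(set().union(*by_pt.values()))

print(naive_total, distinct_total)
```

In SQL, the equivalent safeguard is typically a COUNT(DISTINCT ...) over the report identifier instead of a plain COUNT(*) after joining through the mappings.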
Concept Class | Description |
---|---|
NaPDI Natural Product | Custom natural product terms in vocabulary (concept IDs < 0) |
NaPDI Preferred Term | Preferred term for each natural product |
NaPDI NP Spelling Variation | Curated spelling variations for each natural product term from FAERS |
NaPDI NP Constituent | Constituents of natural products extracted from GSRS |
NaPDI NP Combination Product | Combination products contain one or more natural products in the same term |
Relationship | Domain | Range |
---|---|---|
napdi_pt | Natural Product | NP Preferred Term |
napdi_is_pt_of | NP Preferred Term | Natural Product |
napdi_has_const | Natural Product | NP Constituent |
napdi_is_const_of | NP Constituent | Natural Product |
napdi_spell_vr | NP Spelling Variation | Natural Product |
napdi_is_spell_vr_of | Natural Product | NP Spelling Variation |
napdi_np_maps_to | NP Preferred Term | RxNorm code |
napdi_const_maps_to | NP Constituent | RxNorm code |
napdi_pt_to_combo | Preferred Term | NP Combination Product |
napdi_combo_to_pt | NP Combination Product | Preferred Term |