ICEES KG meta-KG and Biolink mappings #12

Open
karafecho opened this issue Nov 7, 2022 · 9 comments

@karafecho

karafecho commented Nov 7, 2022

This issue is to formally report a disconnect between the Biolink mappings that are included in the ICEES API all_features config files and those that support ICEES KG, as reported in the meta-KG. The approach that we've implemented to automate some of the work and leverage SRI services is not picking up certain intended Biolink mappings. For instance, AvgDailyPM2.5Exposure should map to biolink:ChemicalEntity and biolink:EnvironmentalExposure. To provide another example, TotalEDVisits should map to biolink:ClinicalIntervention.

@maximusunc
Contributor

Hi Kara. Here is what happens: we send the specified search term to Name Resolver, which returns the matching CURIEs. Once we have CURIEs, we get the corresponding Biolink categories from Node Normalizer. If we want specific Biolink categories, we will need to either update the search term so that it returns a CURIE with the desired categories, or hard-code the CURIE and/or Biolink categories in the all_features YAML file, or do some mix of both. What are your thoughts?
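
The lookup chain described above can be sketched roughly as follows. This is a minimal illustration, not the ICEES KG code: `resolve_name` and `normalize_curie` are hypothetical stand-ins for calls to Name Resolver and Node Normalizer, passed in as plain callables so that the sketch runs offline, and the stub data is abridged from the Node Norm output quoted later in this thread.

```python
from typing import Callable, Dict, List


def categories_for_search_term(
    search_term: str,
    resolve_name: Callable[[str], List[str]],
    normalize_curie: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map a search term to {curie: [biolink categories]}.

    resolve_name and normalize_curie are hypothetical stand-ins for the
    Name Resolver and Node Normalizer services.
    """
    curies = resolve_name(search_term)
    return {curie: normalize_curie(curie) for curie in curies}


# Stubbed services (assumption: real calls would go over HTTP).
def fake_resolver(term: str) -> List[str]:
    return {"benzene": ["PUBCHEM.COMPOUND:241"]}.get(term, [])


def fake_normalizer(curie: str) -> List[str]:
    return {
        "PUBCHEM.COMPOUND:241": ["biolink:SmallMolecule", "biolink:ChemicalEntity"],
    }.get(curie, [])


result = categories_for_search_term("benzene", fake_resolver, fake_normalizer)
print(result)
```

The point of the sketch is that the categories are entirely determined by whichever CURIE the search term resolves to, which is why changing the search term or hard-coding the CURIE are the two available levers.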

@karafecho
Author

karafecho commented Nov 9, 2022

Yeah, I understand the process, and I knew that some of the Biolink categories were being dropped when we started leveraging SRI services, but I wasn't really concerned until recently, when a use case arose. Specifically, ICEES KG is returning environmental exposures such as "benzene" in response to the first hop of Path A in the TCDC's workflow (see slide 10 here). This is introducing noise into the final answer set. As such, we would like to filter chemical exposures from the first hop using an exclude edge, but we cannot do that (I don't think) without attaching a Biolink category such as biolink:EnvironmentalExposure to non-drug biolink:ChemicalEntity nodes. I had played around with the search terms to see if SRI supported environmental exposures, but I don't think those are represented. In some sense, this is a data modeling issue, but I'd like to identify a quick fix that will resolve the current issue. I am completely open to suggestions.
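
To make the intended filter concrete, here is a minimal sketch of dropping first-hop nodes that carry an exposure category. It assumes each node is a dict with a `categories` list, loosely following the shape of a TRAPI knowledge-graph node. The node data and the `drop_nodes_with_category` helper are illustrative only, and this is not how an exclude edge is actually evaluated by a Translator service.

```python
from typing import Dict, List

Node = Dict[str, List[str]]


def drop_nodes_with_category(nodes: Dict[str, Node], category: str) -> Dict[str, Node]:
    """Return only the nodes that do NOT carry the given Biolink category."""
    return {
        curie: node
        for curie, node in nodes.items()
        if category not in node.get("categories", [])
    }


# Illustrative first-hop nodes. The exposure category on benzene is the
# hand-curated mapping proposed in this issue, not a Node Norm result.
first_hop = {
    "PUBCHEM.COMPOUND:241": {  # benzene
        "categories": ["biolink:ChemicalEntity", "biolink:EnvironmentalExposure"],
    },
    "PUBCHEM.COMPOUND:2083": {  # albuterol
        "categories": ["biolink:SmallMolecule", "biolink:ChemicalEntity"],
    },
}

kept = drop_nodes_with_category(first_hop, "biolink:EnvironmentalExposure")
print(sorted(kept))  # ['PUBCHEM.COMPOUND:2083']
```

Without the extra category on the benzene node, both nodes look like plain chemical entities and nothing distinguishes the exposure from the drug.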

@maximusunc
Contributor

It's up to you. From my end, I would just need to rerun the precompute script after you update the all_features file.

@karafecho
Author

karafecho commented Nov 10, 2022

Let's move forward with hard coding, as I think this will allow us to move in a more timely manner with the TCDC workflow and related Translator efforts. That said, let's hold off on running the precompute script until after it is updated to include new calculations (see #13, #14, #15, #16).

@karafecho
Author

To clarify, the all_features YAML files already contain most of the intended Biolink mappings, although I would like to make a few adjustments for consistency. Shouldn't take long.

@karafecho
Author

karafecho commented Nov 14, 2022

Update 11.14.2022:

This Node Norm endpoint returns the following output for three test inputs:

PUBCHEM.COMPOUND:2083 (albuterol)

    "type": [
      "biolink:SmallMolecule",
      "biolink:MolecularEntity",
      "biolink:ChemicalEntity",
      "biolink:PhysicalEssence",
      "biolink:ChemicalOrDrugOrTreatment",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:ChemicalEntityOrProteinOrPolypeptide",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:PhysicalEssenceOrOccurrent"
    ],

MESH:D052638 (particulate matter)

    "type": [
      "biolink:ComplexMolecularMixture",
      "biolink:ChemicalMixture",
      "biolink:ChemicalEntity",
      "biolink:PhysicalEssence",
      "biolink:ChemicalOrDrugOrTreatment",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:ChemicalEntityOrProteinOrPolypeptide",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:PhysicalEssenceOrOccurrent"
    ]

PUBCHEM.COMPOUND:241 (benzene)

    "type": [
      "biolink:SmallMolecule",
      "biolink:MolecularEntity",
      "biolink:ChemicalEntity",
      "biolink:PhysicalEssence",
      "biolink:ChemicalOrDrugOrTreatment",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:ChemicalEntityOrProteinOrPolypeptide",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:PhysicalEssenceOrOccurrent"
    ],

If I change the search terms by adding "exposure" for the last two variables above, here's what Node Norm outputs:

UMLS:C2136615 (airborne pollutant exposure)

    "type": [
      "biolink:PhenotypicFeature",
      "biolink:DiseaseOrPhenotypicFeature",
      "biolink:ThingWithTaxon",
      "biolink:BiologicalEntity",
      "biolink:NamedThing",
      "biolink:Entity"
    ]

NCIT:C36251 (benzene exposure)

    "type": [
      "biolink:PhenotypicFeature",
      "biolink:DiseaseOrPhenotypicFeature",
      "biolink:ThingWithTaxon",
      "biolink:BiologicalEntity",
      "biolink:NamedThing",
      "biolink:Entity"
    ],

So, Node Norm is now recognizing things like chemical exposures, BUT the mappings to biolink:ChemicalEntity are lost, AND the mappings to biolink:PhenotypicFeature seem a bit weird to me (especially when biolink:EnvironmentalExposure is an option) but are okay-ish.
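
For illustration, the difference between the two benzene results can be computed directly from the "type" lists quoted above. This is only a small comparison sketch; the variable names are my own.

```python
# "type" lists copied from the Node Norm output quoted above.
benzene = [
    "biolink:SmallMolecule",
    "biolink:MolecularEntity",
    "biolink:ChemicalEntity",
    "biolink:PhysicalEssence",
    "biolink:ChemicalOrDrugOrTreatment",
    "biolink:ChemicalEntityOrGeneOrGeneProduct",
    "biolink:ChemicalEntityOrProteinOrPolypeptide",
    "biolink:NamedThing",
    "biolink:Entity",
    "biolink:PhysicalEssenceOrOccurrent",
]
benzene_exposure = [
    "biolink:PhenotypicFeature",
    "biolink:DiseaseOrPhenotypicFeature",
    "biolink:ThingWithTaxon",
    "biolink:BiologicalEntity",
    "biolink:NamedThing",
    "biolink:Entity",
]

# Categories dropped, added, and kept when "exposure" is appended to the search term.
lost = [c for c in benzene if c not in benzene_exposure]
gained = [c for c in benzene_exposure if c not in benzene]
shared = [c for c in benzene if c in benzene_exposure]

print("shared:", shared)  # shared: ['biolink:NamedThing', 'biolink:Entity']
print("ChemicalEntity lost:", "biolink:ChemicalEntity" in lost)  # True
```

Only the two root categories survive the change of search term, so the two CURIEs share no meaningful classification.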

Decision: (1) Add biolink:EnvironmentalExposure mappings to exposures that Node Norm returns. (2) Ask Biolink team about the mappings for chemical exposures (second set of examples above) and other types of exposures. (3) Address any downstream normalization issues with ICEES output when/if they arise.

@karafecho
Author

Noting that the YAML files contain a number of Biolink mappings that are not supported by Node Norm. For instance:

{
  "UMLS:C0019993": {
    "id": {
      "identifier": "UMLS:C0019993",
      "label": "Hospitalization"
    },
    "equivalent_identifiers": [
      {
        "identifier": "UMLS:C0019993",
        "label": "Hospitalization"
      }
    ],
    "type": [
      "biolink:Activity",
      "biolink:ActivityAndBehavior",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:Occurrent",
      "biolink:PhysicalEssenceOrOccurrent"
    ]
  },
  "": null
}

I mapped "hospitalization" to biolink:ClinicalIntervention, which seems more appropriate than the Node Norm mappings that are returned.
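
A check like this could be scripted. The sketch below parses an abridged copy of the response above and reports which hand-curated categories Node Norm did not return; `unsupported_categories` is a hypothetical helper, not part of ICEES or Node Norm.

```python
import json
from typing import List

# Abridged Node Norm response for UMLS:C0019993, taken from the output above.
RESPONSE_TEXT = """
{
  "UMLS:C0019993": {
    "id": {"identifier": "UMLS:C0019993", "label": "Hospitalization"},
    "type": [
      "biolink:Activity",
      "biolink:ActivityAndBehavior",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:Occurrent",
      "biolink:PhysicalEssenceOrOccurrent"
    ]
  },
  "": null
}
"""


def unsupported_categories(response: dict, curie: str, curated: List[str]) -> List[str]:
    """List curated categories that Node Norm did not return for the CURIE."""
    entry = response.get(curie) or {}  # a null entry is treated as "no categories"
    returned = set(entry.get("type", []))
    return [category for category in curated if category not in returned]


response = json.loads(RESPONSE_TEXT)
missing = unsupported_categories(response, "UMLS:C0019993", ["biolink:ClinicalIntervention"])
print(missing)  # ['biolink:ClinicalIntervention']
```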

@karafecho
Author

karafecho commented Nov 15, 2022

Updated decision / action items [assigned to Kara]:

  • Supplement Node Norm Biolink category mappings with hand-curated mappings, defined within the all_features YAML files, which are more appropriate for certain ICEES KG variables.

  • Create a PR to merge the new YAML files after first validating them.

  • Post a ticket to the Biolink team in order to solicit their expert opinion on questionable Node Norm mappings. See Biolink Categories and Node Norm Mappings biolink/biolink-model#1156.
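
As a rough sketch of the first action item, supplementing could amount to an order-preserving union of the two category lists. The `merge_categories` helper is hypothetical, and putting the curated categories first is my own assumption rather than anything the precompute script is known to do.

```python
from typing import Iterable, List


def merge_categories(node_norm: Iterable[str], curated: Iterable[str]) -> List[str]:
    """Curated categories first, then Node Norm categories, without duplicates."""
    merged: List[str] = []
    for category in [*curated, *node_norm]:
        if category not in merged:
            merged.append(category)
    return merged


# Particulate matter: first three Node Norm categories quoted earlier in this thread,
# plus the hand-curated mappings requested in this issue.
node_norm = [
    "biolink:ComplexMolecularMixture",
    "biolink:ChemicalMixture",
    "biolink:ChemicalEntity",
]
curated = ["biolink:ChemicalEntity", "biolink:EnvironmentalExposure"]

merged = merge_categories(node_norm, curated)
print(merged)
```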

@karafecho
Author

karafecho commented Nov 15, 2022

Notes on supplemental Biolink mappings.

  1. Airborne pollutants were mapped to biolink:ChemicalEntity and biolink:EnvironmentalExposure. The first mapping is redundant with what Node Norm will return, but I think that's okay, as it preserves a record of how I mapped these variables by hand before ICEES was split into ICEES+ and ICEES KG and before ICEES KG began relying on SRI services, rather than human curation, for its Biolink mappings.
  2. All landfill, CAFO, and roadway variables (except for roadway type) were mapped to biolink:ComplexChemicalMixture and biolink:EnvironmentalExposure.
  3. All socio-economic exposures (ACS variables) were mapped to biolink:EnvironmentalExposure.
  4. All variables related to clinical interventions (e.g., hospitalization, hospital LOS, ventilation, convalescent plasma, supplemental oxygen) were mapped to, well, biolink:ClinicalIntervention.
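
The four notes above can be summarized as a small lookup table. This is only an illustrative sketch: the group names are hypothetical and do not reflect the actual layout of the all_features YAML files, and the roadway type exception in note 2 is not handled.

```python
from typing import Dict, List

# Hypothetical group names; the categories are the ones listed in the notes above.
SUPPLEMENTAL_CATEGORIES: Dict[str, List[str]] = {
    "airborne_pollutant": ["biolink:ChemicalEntity", "biolink:EnvironmentalExposure"],
    "landfill_cafo_roadway": ["biolink:ComplexChemicalMixture", "biolink:EnvironmentalExposure"],
    "socioeconomic_exposure": ["biolink:EnvironmentalExposure"],
    "clinical_intervention": ["biolink:ClinicalIntervention"],
}


def supplemental_categories(group: str) -> List[str]:
    """Return a copy of the curated categories for a variable group, or [] if none."""
    return list(SUPPLEMENTAL_CATEGORIES.get(group, []))


print(supplemental_categories("clinical_intervention"))  # ['biolink:ClinicalIntervention']
```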
