Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of ontology fields (e.g. for Organism) #17

Open
PapoutsoglouE opened this issue Jul 19, 2019 · 10 comments
Open

Use of ontology fields (e.g. for Organism) #17

PapoutsoglouE opened this issue Jul 19, 2019 · 10 comments
Assignees

Comments

@PapoutsoglouE
Copy link
Collaborator

PapoutsoglouE commented Jul 19, 2019

There have been discussions around the best way to use this field.
In MIAPPE, it is expected to be either:

  • an official NCBI taxon identifier, presented as e.g. NCBI:4577, or
  • an institutional identifier, e.g.WUR:1234
    (it is also possible to list crosses, e.g. WUR:1234 x WUR:5678)

CASE 1

For identifiers that actually exist in ontologies, it would really be a great idea to use ISA-Tab's ontology references. It would be convenient in the interface of the ISA Creator too. When searching for a suitable term, it would look like this:
image

And the corresponding part of the study table would look like this (in the ISA Creator):
image

In this case, it is pointing to http://purl.bioontology.org/ontology/NCBITAXON/4577.
This translates to the following raw text in ISA. In the Investigation file:

ONTOLOGY SOURCE REFERENCE
Term Source Name	"NCBITAXON"	
Term Source File	"http://data.bioontology.org/ontologies/NCBITAXON"	
Term Source Version	"8"	
Term Source Description	"National Center for Biotechnology Information (NCBI) Organismal Classification"	

And, more crucially, in the Study file:
image

The configuration for this field looks like this:

<field data-type="string" header="Characteristics[Organism]" is-file-field="false" is-forced-ontology="true" is-hidden="false" is-multiple-value="false" is-required="true">
    <description><![CDATA[(MIAPPE: Organism) An identifier for the organism at the species level. Use of the NCBI taxon ID is recommended.]]></description>
    <default-value><![CDATA[e.g. NCBI:4577]]></default-value>
</field>

CASE 2

If there is no ontology that can be used, and the user has to resort to an institutional identifier, the ISA Creator gives the following error when asked to validate the file:
image
(You can ignore the errors about the example dates - of course e.g. is not a valid date)

As @proccaserra has pointed out, this happens because of the colon (:) in the field. Indeed, if it is removed, there is no error related to this field:
image

In the study file, this translates to:
image
Which is probably not very useful. If validation was not an issue, we could however use something like the following:
image
But opening and validating the files with the ISA Creator results in a familiar error:
image
Indeed if the file above (with WUR:1234 in one field) is saved with the ISA Creator, the contents of the field get split into the Characteristics[Organism] and Term Source REF fields (1234 and WUR respectively) - the same result that the earlier attempt to input WUR:1234 into the ISA Creator had.

Even if it worked, we would be left with the following mapping issue, as MIAPPE's Organism field contents would end up in different places in each case, with a different format.

  • In case 1, the Characteristics[Organism] field must hold the label of the taxon (e.g. Zea mays) to pass validation, and the taxon ID is only visible in the Term Accession Number field, as part of the URI.
  • The middle section is for case 2, with the result from entering directly WUR:1234 in the ISA Creator.
  • In case 2 (bottom), it would be directly in the Characteristics[Organism] field, assuming that validation raised no errors.
    image

Only the top case in the screenshot gives no errors, but it is also unusable for institutional identifiers (since in many cases there are no ontologies, and other cases require crosses).

Current solution

Currently, the Characteristics[Organism] field is set to be a plain string field. This means that it does not use ISA's proper way of referring to ontology classes, but it also means that:

  • there are no validation issues with any identifiers, and
  • the Characteristics[Organism] field is always where these identifiers end up, and finally
  • There is no need to perform any extra steps to retrieve the identifier from the URI or any reference content that the URI points to.

Therefore, currently, the field's configuration is:

<field data-type="string" header="Characteristics[Organism]" is-file-field="false" is-forced-ontology="false" is-hidden="false" is-multiple-value="false" is-required="true">
    <description><![CDATA[(MIAPPE: Organism) An identifier for the organism at the species level. Use of the NCBI taxon ID is recommended.]]></description>
    <default-value><![CDATA[e.g. NCBI:4577]]></default-value>
</field>

Other options

Is there some other possible solution to the issues discussed above?
Otherwise, are there any strong opinions for choosing one option (field is an ontology term) over the other (field is plain string)?

@PapoutsoglouE
Copy link
Collaborator Author

@arendd, @DanFaria, @hcwi

@proccaserra
Copy link
Collaborator

@PapoutsoglouE

I went back to the thread, to the configuration and to the BRAPI messages.
I am afraid there is no solution to that issue within ISAcreator:
The ISA field "Characteristics[Organism]" if set to be under ontology control will accept a string and if marked up with an ontology term, ISAcreator will render that string with the following pattern
:<ontology term (i.e. a string> , typically when using NCBITaxonomy, something like NCBITaxon:Zea Mays

This is how things are rendered in the ISA user interface.
This is summarized in "Case1" (nothing new), which breaks for the reasons you have detailed.

This leaves us with the only workable option (Case2) for using ISAcreator and retaining the ability to accepting strings containing a colon (:) as input.

The one thing I would change if the value for the default value in the configuration, from:

to:

This is to avoid confusing users (in ISAcreator at least) thus indicated that the field ISA Characteristics[Organism] should hold a class label(string) rather a compact URI (curi)

This brings in fact another issue which is worth discussing:

It is the use of something akin to a curi such as NCBITAXON:4577 in an ISA field which is meant to hold the Ontology Class Label.

This is somehow a problem in the mapping we do between BRAPI TaxonID and ISA.Characteristics[organism]

I was checking BRAPI documentation on apiary and it seems that from v1.3, "species" is deprecated in favor of 'germplasmSpecies'

as per BRAPI documentation:
TaxonIds | array[object] | The list of IDs for this SPECIES from different sources. If present, NCBI Taxon should be always listed as "ncbiTaxon" preferably with a purl. The rank of this ID should be species.

ISA.Organism = concatenation of BRAPI Genus + BRAPI Species

This means we'd need to alter the 'create_isa_characteristic' function in brapi_to_isa_converter.py to access additional arguments to populate term_source and term_accession.
I'll try it out and push a fix to the repo.

Final note:
I have also tried to escape the colon (:) in ISAcreator (using "WUR:1245") and this caused an silent failure of the java ISA validator so not good either.

best

@PapoutsoglouE
Copy link
Collaborator Author

PapoutsoglouE commented Aug 22, 2019

The change you suggested isn't showing on github for me for some reason, though I did receive it via mail. But to double check that I understand everything properly:

<field data-type="string" header="Characteristics[Organism]" is-file-field="false" is-forced-ontology="false" is-hidden="false" is-multiple-value="false" is-required="true">
    <description><![CDATA[(MIAPPE: Organism) An identifier for the organism at the species level. Use of the NCBI taxon ID is recommended.]]></description>
    <default-value><![CDATA[NCBITAXON:Zea Mays]]></default-value>
</field>
  1. The field is a plain string field, so colons will cause no issues. But the Term Source REF and Term Accession Number columns can be added next to it anyway, and they can be used by people who want to use an ontology.
  2. So, when the Term Source REF is NCBITAXON, we know that the taxon ID is the last part of the URI in the Term Accession Number field:
    http://purl.bioontology.org/ontology/NCBITAXON/4577
    The contents of the Characteristics[Organism] field are actually not useful in this case, since there are dedicated Characteristics for both the genus and the species.
    So the MIAPPE equivalent in this case is NCBI:4577.
  3. Otherwise, when Term Source REF is empty, we can look directly in the Characteristics[Organism] to get the institutional identifier that MIAPPE uses (e.g. WUR:1234).

Do I understand all of points 1, 2 and 3 correctly?

Furthermore, do I understand correctly that (4) the above will cause no issues with validation, and the tools we have around this (like brapi2isa) can use the above logic for all cases?

@proccaserra
Copy link
Collaborator

@PapoutsoglouE I have tested the above validating an exemplar dataset and it did not throw errors.

Do I understand all of points 1, 2 and 3 correctly?
yes .

but I have a question for you:
What do you mean by 'MIAPPE equivalent' (e.g when you refer to http://purl.bioontology.org/ontology/NCBITAXON/4577 and NCBI:4577) ?

It seems that NCBI taxonomy 'official' shorthand is 'NCBITAXON' (to build resolvable URI under obofoundry.org in the purl format).

Does it mean that MIAPPE maintains a list of shorthand as well?

If not, it may be beneficial to align with what is being used by OLS, Bioportal.

thx!

@PapoutsoglouE
Copy link
Collaborator Author

PapoutsoglouE commented Aug 22, 2019

Good to hear I got that right!
About the equivalents, I just meant that MIAPPE recommends NCBI:xxxx, ie. it uses the NCBI prefix over NCBITAXON, or institutions can decide what prefix to use for themselves. But it does need to have the numerical ID (in the case of NCBI) somewhere.

On that point, I made the change you suggested (for the default-value field) to test it locally.

As intended, it makes it clear to ISA-Creator users that they should see a class label in the Characteristics[Organism] field. So now, when I load the configuration into the Creator and make a new study with default values, I see the following:
image

It is clear that the field should have a class label (genus + species) if we want to use the ontology term. But how do we do this?
Is there a way to make the Term Source REF and Term Accession Number columns show there, or a way to add them via the interface? This can then be documented in the README for the configuration. Otherwise, where would the NCBI ID itself go?

@proccaserra
Copy link
Collaborator

@PapoutsoglouE brilliant!
regarding your question 'how we do this'? the 2 fields Term Source REF & Term Accession Number are always hidden to the users in ISAcreator.

The option works fine when batch processing/converting from the BRAPI2ISA service we have just fixed with @bedroesb but if doing it manually, there are no simple ways.

One option would be to let users enter free text for that fields and then run the NCBO Annotator service with an ISA configuration set to be an ontology annotation field (but this is probably too complex already and impractical). I would have to try it out myself first

the other option is to follow what is found in BRAPI, i.e. markup species with the TaxonID . so the ISA.characteristic[organism] gets a string which isn't ontology-annotated but the ISA.Characteristic[species] would be from an ontology term. but we'd simply be moving the problem to 'species' then...

the last line (desperate?) would be to use an ISA.Comment[uri] field. It would be ignored by the validation service and would required a post processing script over the ISA to sanizatize it. again not ideal

@PapoutsoglouE
Copy link
Collaborator Author

My concern is not for people who are using BrAPI2ISA or directly manipulating the text files, but those who have chosen the Creator instead. After all, it is valued as the most user-friendly option for newcomers.
Please do test your first idea when you can!

The second option would indeed be moving the problem to species. Please do elaborate on the last line option though!


Another option that is now occurring to me would be to maintain the Characteristics[Organism] as it was, i.e. a string formatted as NCBI:xxxx, with no ontology references involved. We can then advise users to add another column for their own use, Characteristics[NCBI], which would be dedicated to NCBI taxon IDs and can therefore use proper references. It would be redundant with Organism though, and MIAPPE tools could just ignore it altogether.
The extra characteristic (qualitative) can be added easily through the interface (add characteristic button), and then I am able to double click and search for "Zea mays", for which NCBITAXON is really an option!
image
I cannot do the same with the Organism field, presumably because it is declared as a string.

This looks nice in the Creator:
image

And what actually gets saved is:
image

What do you think?

@DanFaria
Copy link

I think we are losing sight of our goal and straying too far from MIAPPE.
Unless it is literally impossible to do so, a MIAPPE field should correspond to a single ISA field, or we can't really call this an ISA-Tab implementation of MIAPPE.
And because MIAPPE cannot be updated for the time being (apart from minor corrections), we have to comply with it in its present state.

In MIAPPE, after much debate, we made the conscious decision of not enforcing NCBITaxon identifiers for the Organism field, because in practical cases you can have plant crosses or varieties that are not listed in the NCBITaxon. What is desired for Organism is an identifier that is both unique and univocally corresponding to the actual biological material. We explicitly discussed the option of having the NCBITaxon anyway (in addition to another more specific identifier when needed) but deemed it unnecessary.

In order to comply with MIAPPE, we have to respect this modeling decision and stick with a single field for Organism in ISA-Tab, which takes a unique identifier with the form String_prefix-colon-numeric_id. If, on the side of ISA, this means that we must configure the field as a String due to the issue with colons, and cannot use the ontology ref field because the use of NCBITaxon is not mandatory, then so be it. It means that we cannot use ISA functionalities to select an adequate taxon id or to validate that the field is one, but that is the price MIAPPE pays for flexibility. I reiterate, staying faithful to MIAPPE should be our first and foremost concern in the ISA implementation.

People using MIAPPE ISA-Tab are, of course, free to add fields as they need and see fit (MIAPPE is a minimum specification, not a comprehensive one) but the default configuration should not deviate from the MIAPPE checklist.

@DanFaria
Copy link

@proccaserra I noticed you went ahead and pushed a commit to v1.1 changing the example from NCBI:4577 to NCBITAXON:4577.

There is no question that the correct prefix to use is indeed NCBITaxon, and it should have been the one used in the example in the MIAPPE checklist. However, by the same rationale as in my reply above, it is critical that the MIAPPE ISA-Tab implementation remain faithful to the MIAPPE checklist, which should be the reference point for documentation on any field.

In the future, if you find any incongruence with the MIAPPE examples, please raise an issue about them on the MIAPPE checklist repository, so that they can be fixed there. Only after they are should they be fixed in the ISA-Tab implementation, in order to ensure that the latter remains faithful to the former.

Additionally, I noticed you removed the "e.g." from the examples, a change which I am not fond of. While the fields are intended as examples, as far as ISA-Tab is concerned, they are default values. In my view, it is important that such examples not appear to be valid values (which they do without the e.g.), as otherwise you risk people not filling in a mandatory field but still obtaining a valid MIAPPE ISA-Tab, because the field was automatically filled with the default value (which likely is erroneous for that dataset).

@PapoutsoglouE
Copy link
Collaborator Author

PapoutsoglouE commented Aug 27, 2019

Issue to fix NCBI prefix created: MIAPPE/MIAPPE#57
(merged existing PR and deleted branch in this repository)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants