Ontology export columns are incomplete #92

Closed
lgeistlinger opened this issue Jul 14, 2021 · 88 comments
Labels: bug (Something isn't working), enhancement (New feature or request), priority (necessary for early utility)

Comments

@lgeistlinger
Collaborator

lgeistlinger commented Jul 14, 2021

@tosfos I'd like to filter signatures exported from BugSigDB by review status for #80.
This currently does not seem possible, as the only column in the export CSV files that relates to review status is Revision editor.

However, the field Revision editor has value "WikiWorks743" for reviewed content (see eg https://bugsigdb.org/Study_255) as well as content that still needs to be reviewed (see eg https://bugsigdb.org/Study_400). Can this be changed so that this column is empty or NA in the export for content that still needs to be reviewed?

@tosfos
Collaborator

tosfos commented Jul 14, 2021

That sounds good. Should we also add code for the wiki to ignore the WikiWorks... users for these fields? Usually our edits are not very scientifically useful so we shouldn't be getting credit :).

@lwaldron
Member

Yeah that would make sense to hide the Wikiworks74 etc curations :)

@lgeistlinger
Collaborator Author

There are some more considerations to this though. From the discussion arising in #93, it becomes clear that we would actually like the exported column "Revision editor" to list the person who marked the content as reviewed as @ftzohra22 points out in #93 (comment).

@lgeistlinger lgeistlinger added the enhancement New feature or request label Jul 14, 2021
@lwaldron
Member

It seems "Revision editor" is correct, but "Reviewer" could be another column in the export.

@lgeistlinger
Collaborator Author

Agreed.

@lwaldron lwaldron added the priority necessary for early utility label Jul 29, 2021
@lwaldron
Member

Adding a priority label to this as we will make our first versioned data release (waldronlab/BugSigDBExports#4) as soon as it is resolved.

@lgeistlinger lgeistlinger changed the title from "Include review status in export" to "Additional export columns" Jul 31, 2021
@lgeistlinger
Collaborator Author

Moving @lwaldron's comment from #93 here:

I also noted that the "Complete/Incomplete" status would be worth exporting too. So the two items here are to include in the export

1. "Reviewed by" (missing or who marked study/experiment/signature as reviewed), and

2. Complete or Incomplete (or TRUE/FALSE)

@lgeistlinger
Collaborator Author

lgeistlinger commented Jul 31, 2021

I'd like to add two more columns in the export:

  1. The EFO ID of the corresponding condition investigated (such as EFO:0001075 for "ovarian carcinoma")
  2. The UBERON ID of the corresponding body site sampled (such as UBERON:0000341 for "throat")

as it arises from #55 (comment).
This will facilitate ontology-based queries downstream of the export in eg bugsigdbr and/or BugSigDBStats.

@tosfos
Collaborator

tosfos commented Aug 2, 2021

Sorry. I'm not clear on how to proceed with the WikiWorks user removal. Right now Curator and Revision Editor can list a WikiWorks user. Do we want to change either one or both? And if we are removing the WikiWorks user, is it OK if we end up with these fields as blank?

@tosfos
Collaborator

tosfos commented Aug 2, 2021

It seems "Revision editor" is correct, but "Reviewer" could be another column in the export.

Do we also want the "Reviewer" field to be prominently displayed next to the Revision Editor field (on the site itself)? Or is it OK that you need to hover over the (!) icon in order to see it?

@lwaldron
Member

lwaldron commented Aug 2, 2021

Sorry. I'm not clear on how to proceed with the WikiWorks user removal. Right now Curator and Revision Editor can list a WikiWorks user. Do we want to change either one or both? And if we are removing the WikiWorks user, is it OK if we end up with these fields as blank?

I don't feel strongly about this, but blank seems as informative as a WikiWorks user, so that is fine with me. I would want to search for those pages where both curator and revision editor are blank, and at least make sure they are reviewed. It doesn't seem worthwhile to dig out the pre-wiki curators of these pages again.

Do we also want the "Reviewer" field to be prominently displayed next to the Revision Editor field (on the site itself)? Or is it OK that you need to hover over the (!) icon in order to see it?

Again I don't feel strongly about this. Both seem OK, although displaying the Reviewer may be a bit better to recognize reviewing as an important contribution. FYI in case this helps inform a decision: we currently have many unreviewed pages, and although we will catch up some, reviewing is almost as much work as the original curation and it seems likely that curation will always outpace reviewing.

@tosfos
Collaborator

tosfos commented Aug 9, 2021

Complete or Incomplete (or TRUE/FALSE)

This part is now "Complete". The requested column can be found in any CSV.

@tosfos
Collaborator

tosfos commented Aug 10, 2021

It seems "Revision editor" is correct, but "Reviewer" could be another column in the export.

Done.

@tosfos
Collaborator

tosfos commented Aug 10, 2021

I'd like to add two more columns in the export:

1. The EFO ID of the corresponding condition investigated (such as [EFO:0001075](http://www.ebi.ac.uk/efo/EFO_0001075) for "ovarian carcinoma")

2. The UBERON ID of the corresponding body site sampled (such as [UBERON:0000341](http://purl.obolibrary.org/obo/UBERON_0000341) for "throat")

Please see if this looks OK in the new CSVs.

@lgeistlinger
Collaborator Author

lgeistlinger commented Aug 16, 2021

As noted in #94, it is problematic for downstream applications that these changes to the export files also cause the link locations to change. It would be preferable to have stable links, such as eg https://bugsigdb.org/export/experiments.csv, that always point to the latest version of these bulk export files.

@lgeistlinger
Collaborator Author

Columns State (for completion status) and Reviewer (for review status) look great.
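To illustrate the kind of filtering these two columns make possible, here is a minimal Python sketch. The column names State and Reviewer come from this thread; the sample rows, the cell value "Complete", and the helper name `reviewed_and_complete` are assumptions made for illustration, not the actual export contents.

```python
import csv
import io

# Hypothetical excerpt of an export file. Only the column names "State"
# and "Reviewer" are taken from the thread; all values are made up.
SAMPLE = """Study,State,Reviewer
Study A,Complete,SomeReviewer
Study B,Incomplete,
Study C,Complete,
"""

def reviewed_and_complete(csv_text):
    """Return the rows that have a reviewer and are marked complete."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in rows
        if row["Reviewer"].strip()
        and row["State"].strip().lower() == "complete"
    ]

kept = reviewed_and_complete(SAMPLE)
print([row["Study"] for row in kept])  # ['Study A']
```

An empty Reviewer cell is treated as "not yet reviewed", which matches the request above that the column be empty or NA for unreviewed content.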

@lgeistlinger
Collaborator Author

lgeistlinger commented Aug 16, 2021

EFO / UBERON ID: It looks like #55 (comment) precedes this, as these columns are currently mostly blank. We are missing eg the corresponding EFO ID for terms like "adenoma". This is true for the export files but also for the pages themselves (eg https://bugsigdb.org/Study_1/Experiment_1), where links to the EFO IDs / EFO pages are missing.

Also, IDs such as EFO:0001075 and UBERON:0000341 are currently abbreviated as 1075 and 341 in the export columns, but I think it would be preferable to have them exported fully spelled out.

@tosfos
Collaborator

tosfos commented Aug 18, 2021

I figured out what's happening. The leading zeros are being exported, as you can see in a text editor:
image

The spreadsheet application is interpreting these as numbers and removing the leading zeros, but it's not the application's fault. Semantic MediaWiki should be surrounding these fields with quotes, since they are stored internally as a text data type. This is likely a bug with Semantic MediaWiki. We'll research this and see what we can do.

@tosfos
Collaborator

tosfos commented Aug 19, 2021

My previous comment was incorrect. Semantic MediaWiki is doing everything right. The SMW CSV exporter simply uses the PHP native fputcsv to create its CSV files. And the CSV spec appears to say that numbers don't need to be wrapped in quotes. So the spreadsheet application is what is at fault since technically all CSV fields should be interpreted as text, but I guess it's doing its best to figure out what a user would expect.

If we want to work around this we would need to modify the extension to use our own custom CSV encoder, which would be a bit of a project and probably not a good idea.
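For anyone reading the files programmatically, the effect is easy to reproduce. The following is a small Python sketch, assuming an export row in which the ID cell holds only the zero-padded digits (the row itself is made up for illustration). Python's `csv` module returns every field as text, so the zeros survive until something converts the value to a number.

```python
import csv
import io

# Made-up export excerpt; the unquoted ID cell mimics what the thread
# describes seeing in a text editor.
raw = "Condition,EFO ID\novarian carcinoma,0001075\n"

header, row = list(csv.reader(io.StringIO(raw)))

# csv.reader yields strings, so the leading zeros are preserved.
as_text = row[1]

# Coercing the field to a number is the step that drops them, which is
# what a spreadsheet application does when it guesses the column type.
as_number = int(row[1])

print(as_text)    # 0001075
print(as_number)  # 1075
```

Tools that infer column types on import need to be told to treat the ID columns as text to get the first behavior.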

@lgeistlinger
Collaborator Author

Right, the IDs are indeed exported with leading zeros (my bad), but would it be possible to export them with leading "EFO:" for condition and leading "UBERON:" for body site? Not a problem if not, as we can also add them downstream in our R application. It would just be more convenient and more straightforward to use for everyone who doesn't use the R application but rather works on the exported files themselves.

@lwaldron
Member

"EFO:" and "UBERON:" are also not a bad idea to include if it's not difficult because there are also other potentially relevant ontologies. We could in theory mix ontologies and use downstream tools to map them, and even if we never do that, it's more communicative to someone downloading the file who isn't familiar with its contents.

@lgeistlinger
Collaborator Author

It's a great point. As we already observed (#55 (comment)), EFO is actually an umbrella ontology, meaning that not all EFO IDs start with "EFO:"; some start with "CHEBI:", "Orphanet:", "HP:", "MONDO:", ...
In numbers, only 11,338 out of 27,175 terms in the EFO start with "EFO:". We thus indeed need the prefixes as available from the Term ID <-> Term name mappings.
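A short Python sketch of why the prefix has to come from the mapping rather than from the column name. The two real IDs are the examples quoted earlier in this thread; the third entry, the dictionary `TERM_IDS`, and both helper names are hypothetical and only serve the illustration.

```python
# Hypothetical excerpt of a term name -> term ID mapping.
TERM_IDS = {
    "ovarian carcinoma": "EFO:0001075",       # example from this thread
    "throat": "UBERON:0000341",               # example from this thread
    "hypothetical condition": "MONDO:9999999" # made-up placeholder entry
}

def full_id(term_name, mapping=TERM_IDS):
    """Look up the fully prefixed ID for a term name ('' if unknown)."""
    return mapping.get(term_name.strip().lower(), "")

def prefix_of(term_id):
    """Return the ontology prefix, i.e. the part before the first colon."""
    return term_id.split(":", 1)[0]

for name in ("ovarian carcinoma", "hypothetical condition"):
    term_id = full_id(name)
    print(name, term_id, prefix_of(term_id))
```

Prepending a fixed "EFO:" to every value in the condition column would mislabel the placeholder MONDO entry, whereas the lookup returns whatever prefix the mapping records.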

@lgeistlinger
Collaborator Author

Interesting, I saw that too. I don't think we would have a particular use case for that nor do I expect curators to really make use of it, as I expect all curated articles to be in English.

@tosfos
Collaborator

tosfos commented Nov 9, 2021

Right now, we have separate columns for Uberon and EFO IDs. And we have icons for Uberon, EFO and ORPHANET. But I'm seeing URLs for MONDO, OBI, HP and others. How should we handle these? For example, see here. Is the icon correct? And which column would we use for the ID? Should there only be one ID column in the export?

@lgeistlinger
Collaborator Author

lgeistlinger commented Nov 10, 2021

Thanks @tosfos. As explained here, we'd like to have all IDs imported from the condition.csv to be exported in the EFO ID column; and all IDs imported from the bodysite.csv to be exported in the UBERON ID column.

For body site, this will always be UBERON IDs. But for condition this will be all kinds of EFO IDs (not all start with "EFO:", but also "MONDO:", "ORPHANET", ... or even "UBERON", as some UBERON terms are part of the EFO) - and those we'd like to export in the EFO ID column. Thanks!

@tosfos
Collaborator

tosfos commented Nov 10, 2021

Right. Sorry that I keep getting confused about this.

@tosfos
Collaborator

tosfos commented Nov 10, 2021

It seems like certain terms exist in both the Body site and the Condition CSVs. Is that something that should be possible? See, for example, blood.

@lgeistlinger
Collaborator Author

Yes, this is correct, there are certain UBERON terms contained in the EFO. We haven't seen examples of curators picking up certain body sites (such as blood) as conditions, but it is technically possible.

Screen Shot 2021-11-11 at 11 23 00 AM

Screen Shot 2021-11-11 at 11 23 27 AM

@tosfos
Collaborator

tosfos commented Nov 11, 2021

Got it. There is an issue then on the MediaWiki side. We can't have 2 different pages that share the same title. And if we try to manufacture 2 different titles like "Blood (condition)" and "Blood (body site)" it makes it more complicated to match the correct Glossary page with the correct term (which is just entered as "Blood").

Would it make sense that the Condition and Body site definitions of a single term could be different? Or would they have 2 different URLs? Or would the "Condition" and "Body site" definitions always be the same? If the 2 Glossary pages would be alike in every way, what we can do is allow a single Glossary page to belong to both categories.

@lgeistlinger
Collaborator Author

lgeistlinger commented Nov 11, 2021

I think if this simplifies things, we can just remove the UBERON terms from the condition.csv - as their use for condition is largely hypothetical and not something that we've observed in practice when curating > 500 articles.

@lgeistlinger
Collaborator Author

lgeistlinger commented Nov 11, 2021

Or would the "Condition" and "Body site" definitions always be the same? If the 2 Glossary pages would be alike in every way, what we can do is allow a single Glossary page to belong to both categories.

I think that would also be a perfectly fine solution - as the answer is yes to the question whether the definitions (and corresponding glossary pages) would be identical.

@tosfos
Collaborator

tosfos commented Nov 14, 2021

Some more questions regarding the aliases.

If someone enters an alias, the wiki tries to find the Main Term (Glossary page title) in the Glossary. That is, a Glossary term that has this alias listed. Once a match is found, it displays the Main Term instead of the Alias that was entered. It also finds ALL of the aliases for this main term and inserts them in the page (but hides them) so that a search for any alias of this term will find this page.

A similar behavior applies if someone actually enters a term that IS a Main Term in the Glossary. The wiki will display this Main Term but also look up all aliases for this term and store them in the page.

Please let me know if any of the above is incorrect behavior.

We are running into an issue in that sometimes the same alias is listed on more than one Glossary page. For example, gastric carcinoma and stomach neoplasm both list "Gastric Cancer" as an alias. And to complicate things, gastric cancer is also a Glossary Main Term with its own page.

What should the wiki do in this situation? Right now it tries to find EVERY term that matches "Gastric Cancer" and it lists all 3 of them on the page like:

Condition: gastric cancer , gastric carcinoma , stomach neoplasm

And it also stores the full list of all aliases from all 3 Glossary pages within the page. Is this correct?

I'm thinking it would be better to show the Main Term only. And, if a matching Main Term is not found, then only show the Main Term of the first Glossary page that has a matching alias. But under all circumstances only one Condition (or Body Site) should be displayed. What do you think?

And also, how should we deal with storing aliases then? Do we want to cast the net as wide as possible and find every single matching Alias? Or just the aliases of the first matching term that was found? (Note that there is nothing special about the first matching term. It's probably the first one the wiki will find, alphabetically.)
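The trade-off described above can be sketched in a few lines of Python. This is not the wiki's implementation: the glossary below contains only the three pages and the aliases mentioned in this thread, and the dictionary layout and function names are assumptions made for illustration.

```python
# Main term -> aliases, limited to what is mentioned in this thread.
GLOSSARY = {
    "gastric cancer": ["stomach cancer"],
    "gastric carcinoma": ["Gastric Cancer"],
    "stomach neoplasm": ["Gastric Cancer"],
}

def matching_main_terms(entered, glossary=GLOSSARY):
    """Every main term that equals the entry or lists it as an alias."""
    key = entered.strip().lower()
    return sorted(
        main for main, aliases in glossary.items()
        if key == main.lower() or key in (a.lower() for a in aliases)
    )

def stored_aliases(main_terms, glossary=GLOSSARY):
    """Union of the aliases of the given main terms, for search."""
    return sorted({a.lower() for m in main_terms for a in glossary[m]})

all_matches = matching_main_terms("Gastric Cancer")
first_only = all_matches[:1]

print(all_matches)                  # all three main terms
print(stored_aliases(all_matches))  # wide net
print(stored_aliases(first_only))   # aliases of the first match only
```

With every match kept, a search for any alias of any of the three pages finds the experiment; keeping only the first match narrows both what is displayed and what is searchable.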

@lgeistlinger
Collaborator Author

lgeistlinger commented Nov 15, 2021

I believe we should follow here the behavior on the EFO site itself, and push curators upon entering an alias into the condition form towards selecting a main term. That means if a curator enters a synonym, autocomplete options presented would be corresponding main terms for that alias only. (That means I would here indeed depart from allowing curators to be able to enter an alias).

Using the "gastric cancer" example, and looking into which autocomplete options become available when typing "gastric cancer" into the search field, it seems that EFO provides corresponding main terms (potentially mapped via synonyms as for "stomach neoplasm").

Screen Shot 2021-11-14 at 6 49 37 PM

This is also true when specifically typing "stomach cancer", which is an alias for "gastric cancer" (and potentially also others).

Screen Shot 2021-11-14 at 6 50 34 PM

@tosfos
Collaborator

tosfos commented Nov 17, 2021

Interesting idea. Let me see what we can do.

But we already have the pages we imported which already contain aliases. How should we deal with these existing pages?

@lgeistlinger
Collaborator Author

But we already have the pages we imported which already contain aliases. How should we deal with these existing pages?

Sorry for the delay in replying. I am not sure I fully understand the practical implications here. Why would the imported pages pose a problem?

@lgeistlinger
Collaborator Author

lgeistlinger commented Nov 25, 2021

I also just took another look into the exported experiments.csv and was happy to see that the gaps in the EFO ID / UBERON ID columns are getting smaller. However, we seem to still have some issues for valid conditions and body sites that were part of the import.

Conditions with empty EFO ID column, although they represent valid glossary terms:

Genital neoplasm, female (there seems to be a hiccup due to the ", " notation, see eg https://bugsigdb.org/Study_305)

Body sites with empty UBERON ID column, although they represent valid glossary terms:

breast (is turned into "obsolete_mammary gland", see eg https://bugsigdb.org/Study_9)
lung (is turned into "lung neoplasm", see eg https://bugsigdb.org/Study_5)
nasal cavity (is turned into "obsolete_olfactory pit", see eg https://bugsigdb.org/Study_302)
gingiva (is turned into "obsolete_gum", see eg https://bugsigdb.org/Study_284)
hypopharynx (is turned into "obsolete_laryngopharynx", see eg https://bugsigdb.org/Study_394)
ovary (is turned into "obsolete_animal ovary", see eg https://bugsigdb.org/Study_517)

P.S.: Editing conditions and body sites in the experiment form works great, but autocomplete options take a while to load. Any chance this can be faster than that?

@tosfos
Collaborator

tosfos commented Nov 29, 2021

Genital neoplasm, female (there seems to be a hiccup due to the ", " notation, see eg https://bugsigdb.org/Study_305)

Yes, it looks like the comma is the culprit. Will fix.

Body sites with empty UBERON ID column, although they represent valid glossary terms:

It looks to me like the likely culprit is that the wiki is finding a Glossary term that is listed as a Condition even though this is the Body site field. We'll limit the search and see if that fixes all of these.
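A rough Python sketch of what limiting the search could look like. The glossary entries, the assumption that "lung neoplasm" lists "lung" as an alias, and the `lookup` function are all hypothetical; they only illustrate how an unrestricted search can return a Condition page for a Body site field.

```python
# Hypothetical glossary entries. The alias on "lung neoplasm" is an
# assumption chosen to reproduce the symptom reported above.
GLOSSARY = [
    {"term": "lung neoplasm", "category": "Condition", "aliases": ["lung"]},
    {"term": "lung", "category": "Body site", "aliases": []},
]

def lookup(entered, category=None, glossary=GLOSSARY):
    """Return the first matching main term, optionally within one category."""
    key = entered.strip().lower()
    for entry in glossary:
        if category is not None and entry["category"] != category:
            continue
        names = [entry["term"], *entry["aliases"]]
        if key in (name.lower() for name in names):
            return entry["term"]
    return None

print(lookup("lung"))                        # lung neoplasm
print(lookup("lung", category="Body site"))  # lung
```

Passing the field's own category keeps a body site entry from resolving to a condition page, and the reverse.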

Editing conditions and body sites in the experiment form works great, but autocomplete options take a while to load.

Previously all the values were being loaded at the time the form was first loaded, which meant that autocomplete was very quick. Once we imported the full set of these fields' values, we had to change the autocomplete method we were using. Instead of loading them all at the time the form was being loaded (which would no longer be viable), we're now loading them via AJAX while the user types in each field. It is slow though. Let me see what we can do.

@tosfos
Collaborator

tosfos commented Nov 29, 2021

Why would the imported pages pose a problem?

The wiki treats all data the same, regardless of whether it comes from the CSV import or is manually inserted by a user. So for the imported pages, the wiki will perform the same alias lookup. In other words, if a Main term wasn't entered in the CSV, the wiki will search on-the-fly to find a matching Main term that has the entered term as one of its aliases. So what happens if multiple matching Main terms are found that all have the entered term as one of their aliases? We need to figure out which Main term should be displayed and stored in the Condition/Body site field. Does that make sense?

@lgeistlinger
Collaborator Author

Yes, that makes sense. Let me start by saying that I don't think showing multiple matching terms is a bad solution. In the end, these are coming in via existing aliases, and there are certainly situations where showing multiple main terms describes the condition better than just one main term. For example, we found "(one sort of) cancer" / "(one sort of) carcinoma" being used rather interchangeably in the literature, and instead of having two sets of signatures (ie one for "(one sort of) cancer" and one for "(one sort of) carcinoma"), it simplifies downstream analysis if those come grouped together right from the start.

That said, do you internally distinguish between main terms and aliases? Or do you throw them all into one pot when comparing against the conditions / body sites that were entered by the curators into the experiment forms? If you distinguish, one solution might be to first check whether there is a main term that exactly matches the condition / body site entered by a curator on the experiment form. If so, then take the main term. If there is no exactly matching main term, go the alias route. Here there might indeed be multiple matching main terms with no single best solution, and showing all of them would be just fine, I think.

Taking the example of "urinary tract infection" as in #107: here we have an exactly matching main term, urinary tract infection, which could be displayed. There is not necessarily a need for going down the alias route and also bringing in "bacterial urinary tract infection".
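The two-step strategy proposed here (exact main term first, aliases only as a fallback) could look roughly like this in Python. It is a sketch under assumptions: the glossary layout, the alias listed for "bacterial urinary tract infection", the shared alias "gastric tumor", and the `resolve` function are all invented for illustration.

```python
# Main term -> aliases. The aliases are assumptions for illustration.
GLOSSARY = {
    "urinary tract infection": [],
    "bacterial urinary tract infection": ["urinary tract infection"],
    "gastric carcinoma": ["gastric tumor"],  # hypothetical alias
    "stomach neoplasm": ["gastric tumor"],   # hypothetical alias
}

def resolve(entered, glossary=GLOSSARY):
    """Prefer an exact main-term match; otherwise return all alias matches."""
    key = entered.strip().lower()

    # Step 1: an exactly matching main term wins outright.
    exact = [main for main in glossary if main.lower() == key]
    if exact:
        return exact

    # Step 2: fall back to aliases, keeping every matching main term.
    return sorted(
        main for main, aliases in glossary.items()
        if key in (alias.lower() for alias in aliases)
    )

print(resolve("urinary tract infection"))  # exact match only
print(resolve("gastric tumor"))            # both alias matches
```

The first call never reaches the alias step, so "bacterial urinary tract infection" is not brought in; the second has no exact match and therefore shows both main terms.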

@tosfos
Collaborator

tosfos commented Nov 30, 2021

Yes, it looks like the comma is the culprit. Will fix.

This should now be fixed.

@tosfos
Collaborator

tosfos commented Dec 1, 2021

That said, do you internally distinguish between main terms and aliases? Or do you throw them all into one pot when comparing against the conditions / body sites that were entered by the curators into the experiment forms?

Right now I don't think there is any internal difference between entered main terms and aliases. If an alias is entered, we just store the main term anyway.

How should we proceed then? (It sounded like your instructions were only intended if we did distinguish.)

@lgeistlinger
Collaborator Author

lgeistlinger commented Dec 1, 2021

I think if the aforementioned issues can be fixed, we are actually in a good spot with the solution that you've put in place.

@tosfos
Collaborator

tosfos commented Dec 8, 2021

Everything looks better now. Any filled body site field now has a matching UBERON ID. The only conditions with missing EFO ID columns are:

  1. hepatitis, alcoholic
  2. antibiotic exposure
  3. antibiotic treatment
  4. tumor grade
  5. viral lung infection
  6. gonorrhea
  7. healthy
  8. perinatal antibiotics

@lgeistlinger
Collaborator Author

lgeistlinger commented Dec 9, 2021

Looks like we are getting close!

Taking a look at the exported experiments.csv under https://bugsigdb.org/Help:Export, I took the following actions:

  1. hepatitis, alcoholic -> non-empty EFO ID column (EFO:1001345)
  2. antibiotic exposure -> changed to "antimicrobial agent"
  3. antibiotic treatment -> changed to "antimicrobial agent"
  4. tumor grade -> changed to "breast cancer"
  5. viral lung infection -> changed to "viral pneumonia"
  6. gonorrhea -> ok
  7. healthy -> ok
  8. perinatal antibiotics -> changed to "antimicrobial agent"

In addition, I am seeing a handful of instances / experiments where a valid glossary term has an empty EFO ID column, although it is non-empty for the same glossary term for other instances / experiments:

Instances of conditions with empty EFO ID column, although they represent valid glossary terms and are matched to the corresponding glossary term for other instances / experiments:

obesity (Study 46, Experiment 1)
hashimoto's thyroiditis (Study 332, Experiment 1)
response to allogeneic hematopoietic stem cell transplant (Study 381, Experiment 2)
diet (Study 131, Experiment 4)
antimicrobial agent (Study 320, Experiment 5)
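A check of this kind can be automated against the exported file. The Python sketch below is only an illustration: the EFO ID column name comes from this thread, while the other column names, the sample rows, the placeholder ID, and the function name are assumptions.

```python
import csv
import io
from collections import defaultdict

# Made-up excerpt; "EFO:PLACEHOLDER" stands in for a real ID.
SAMPLE = """Study,Experiment,Condition,EFO ID
Study A,Experiment 1,obesity,
Study B,Experiment 1,obesity,EFO:PLACEHOLDER
Study C,Experiment 1,healthy,
"""

def inconsistent_rows(csv_text):
    """Rows with an empty EFO ID whose condition has an ID elsewhere."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Collect the IDs seen for each condition across all rows.
    ids_by_condition = defaultdict(set)
    for row in rows:
        term_id = row["EFO ID"].strip()
        if term_id:
            ids_by_condition[row["Condition"].strip().lower()].add(term_id)

    # Report rows that lack an ID although the same condition has one.
    return [
        (row["Study"], row["Experiment"], row["Condition"])
        for row in rows
        if not row["EFO ID"].strip()
        and row["Condition"].strip().lower() in ids_by_condition
    ]

flagged = inconsistent_rows(SAMPLE)
print(flagged)  # [('Study A', 'Experiment 1', 'obesity')]
```

Conditions that never carry an ID, such as "healthy" in the sample, are left out, so the list contains only entries that are matched in some experiments but not in others.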

@tosfos
Collaborator

tosfos commented Dec 16, 2021

Thanks! Probably just needs a data rebuild after some new changes we made. We'll check.

@lgeistlinger
Collaborator Author

Great, thanks!

@lgeistlinger
Collaborator Author

Just did a fresh export of the experiments.csv - and everything looked good, no more missing items! I believe we can close this.

@tosfos
Collaborator

tosfos commented Dec 20, 2021

Great!

@lgeistlinger lgeistlinger unpinned this issue Dec 29, 2021