Ontology export columns are incomplete #92
Comments
That sounds good. Should we also add code for the wiki to ignore the WikiWorks... users for these fields? Usually our edits are not very scientifically useful so we shouldn't be getting credit :). |
Yeah that would make sense to hide the Wikiworks74 etc curations :) |
There are some more considerations to this though. From the discussion arising in #93, it becomes clear that we would actually like the exported column "Revision editor" to list the person who marked the content as reviewed as @ftzohra22 points out in #93 (comment). |
It seems "Revision editor" is correct, but "Reviewer" could be another column in the export. |
Agreed. |
Adding a priority label to this as we will make our first versioned data release (waldronlab/BugSigDBExports#4) as soon as it is resolved. |
Moving @lwaldron's comment from #93 here:
|
I'd like to add two more columns in the export:
as it arises from #55 (comment). |
Sorry. I'm not clear on how to proceed with the WikiWorks user removal. Right now Curator and Revision Editor can list a WikiWorks user. Do we want to change either one or both? And if we are removing the WikiWorks user, is it OK if we end up with these fields as blank? |
Do we also want the "Reviewer" field to be prominently displayed next to the Revision Editor field (on the site itself)? Or is it OK that you need to hover over the (!) icon in order to see it? |
I don't feel strongly about this, but blank seems equally informative as WikiWorks user, so fine with me. I would want to search for those pages where both curator and revision editor are blank, and at least make sure they are reviewed. It doesn't seem worthwhile to dig out the pre-wiki curators of these pages again.
Again I don't feel strongly about this. Both seem OK, although displaying the Reviewer may be a bit better to recognize reviewing as an important contribution. FYI in case this helps inform a decision: we currently have many unreviewed pages, and although we will catch up some, reviewing is almost as much work as the original curation and it seems likely that curation will always outpace reviewing. |
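For anyone working on the exported files directly, the blanking and the follow-up search could be sketched roughly as below. This is only an illustration in Python, not the wiki's code: the column names "Curator" and "Revision editor" are taken from this thread, while the user-name pattern, the helper names and the sample rows are assumptions.

```python
import re
from typing import Dict, List

# Assumption: WikiWorks accounts look like "WikiWorks743" or "Wikiworks74".
WIKIWORKS_USER = re.compile(r"wikiworks\d*", re.IGNORECASE)


def blank_wikiworks(user: str) -> str:
    """Return an empty string for WikiWorks accounts, the name otherwise."""
    user = user.strip()
    return "" if WIKIWORKS_USER.fullmatch(user) else user


def rows_without_credit(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Rows where both Curator and Revision editor end up blank."""
    return [
        row
        for row in rows
        if not blank_wikiworks(row["Curator"])
        and not blank_wikiworks(row["Revision editor"])
    ]


# Hypothetical rows, for illustration only.
rows = [
    {"Page": "A", "Curator": "WikiWorks743", "Revision editor": "Wikiworks74"},
    {"Page": "B", "Curator": "Some Curator", "Revision editor": "WikiWorks743"},
]
needs_check = rows_without_credit(rows)
```

Here `needs_check` contains only page "A", which is the kind of page that would then be checked for review status.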
This part is now "Complete". The requested column can be found in any CSV. |
Done. |
Please see if this looks OK in the new CSVs. |
As noted in #94, it is problematic for downstream applications that these changes to the export files also cause the link locations to change. It would be preferable if we would have stable links such as eg |
Columns |
EFO / UBERON ID: It looks like #55 (comment) precedes this, as these columns are currently mostly blank. We are missing eg the corresponding EFO ID for terms like "adenoma". This is true for the export files but also for the pages themselves (eg https://bugsigdb.org/Study_1/Experiment_1), where links to the EFO IDs / EFO pages are missing. Also, IDs such as |
My previous comment was incorrect. Semantic MediaWiki is doing everything right. The SMW CSV exporter simply uses the PHP native fputcsv to create its CSV files. And the CSV spec appears to say that numbers don't need to be wrapped in quotes. So the spreadsheet application is what is at fault since technically all CSV fields should be interpreted as text, but I guess it's doing its best to figure out what a user would expect. If we want to work around this we would need to modify the extension to use our own custom CSV encoder, which would be a bit of a project and probably not a good idea. |
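To illustrate the point about unquoted numeric fields, here is a minimal Python sketch (not the SMW exporter; the column name and the ID value are made up). A reader that keeps every field as text preserves the leading zeros, while converting the field to a number, which is roughly what a spreadsheet's type guessing does, drops them.

```python
import csv
import io

# Hypothetical export snippet: the ID is written without quotes,
# as an encoder like fputcsv would write it.
sample = "Condition,EFO ID\nsome condition,0000123\n"

rows = list(csv.DictReader(io.StringIO(sample)))

efo_id_text = rows[0]["EFO ID"]    # the csv module returns every field as str
efo_id_number = int(efo_id_text)   # numeric conversion loses the leading zeros

print(repr(efo_id_text))
print(efo_id_number)
```

So the file itself is fine; downstream tools only need to be told to read these ID columns as text.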
Right, the IDs are indeed exported with leading zeros (my bad), but would it be possible to export them with leading "EFO:" for condition and leading "UBERON:" for body site? Not a problem if not, as we can also add them downstream in our R application. It would just be more convenient and more straightforward to use for everyone who doesn't use the R application but rather works on the exported files themselves. |
"EFO:" and "UBERON:" are also not a bad idea to include if it's not difficult because there are also other potentially relevant ontologies. We could in theory mix ontologies and use downstream tools to map them, and even if we never do that, it's more communicative to someone downloading the file who isn't familiar with its contents. |
It's a great point. As we already observed (#55 (comment)), EFO actually already is an umbrella ontology, meaning not all EFO IDs start with "EFO:", but also "CHEBI:", "Orphanet:", "HP:", and "MONDO:", ... |
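If the prefixes end up being added downstream rather than in the export, a small helper along these lines might be enough. It is a sketch under assumptions: the function name and the sample IDs are hypothetical, and it simply treats any ID that already contains a ":" as prefixed, so IDs from other ontologies under the EFO umbrella are left alone.

```python
def with_prefix(raw_id: str, default_prefix: str) -> str:
    """Prepend default_prefix unless the ID is blank or already prefixed."""
    raw_id = raw_id.strip()
    if not raw_id:
        return ""        # keep blank fields blank
    if ":" in raw_id:
        return raw_id    # e.g. "MONDO:..." or "HP:..." stays untouched
    return f"{default_prefix}:{raw_id}"


# Hypothetical IDs, for illustration only.
print(with_prefix("0000123", "EFO"))
print(with_prefix("MONDO:0000001", "EFO"))
print(with_prefix("0000456", "UBERON"))
```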
Interesting, I saw that too. I don't think we would have a particular use case for that nor do I expect curators to really make use of it, as I expect all curated articles to be in English. |
Right now, we have separate columns for Uberon and EFO IDs. And we have icons for Uberon, EFO and ORPHANET. But I'm seeing URLS for MONDO, OBI, HP and others. How should we handle these? For example, see here. Is the icon correct? And which column would we use for the ID? Should there only be one ID column in the export? |
Thanks @tosfos. As explained here, we'd like to have all IDs imported from the condition.csv to be exported in the For body site, this will always be UBERON IDs. But for condition this will be all kinds of EFO IDs (not all start with "EFO:", but also "MONDO:", "ORPHANET", ... or even "UBERON", as some UBERON terms are part of the EFO) - and those we'd like to export in the EFO ID column. Thanks! |
Right. Sorry that I keep getting confused about this. |
It seems like certain terms exist in both the Body site and the Condition CSVs. Is that something that should be possible? See, for example, blood. |
Got it. There is an issue then on the MediaWiki side. We can't have 2 different pages that share the same title. And if we try to manufacture 2 different titles like "Blood (condition)" and "Blood (body site)" it makes it more complicated to match the correct Glossary page with the correct term (which is just entered as "Blood"). Would it make sense that the Condition and Body site definitions of a single term could be different? Or would have 2 different URLs? Or would the "Condition" and "Body site" definitions always be the same? If the 2 Glossary pages would be alike in every way, what we can do is allow a single Glossary page to belong to both categories. |
I think if this simplifies things, we can just remove the UBERON terms from the condition.csv - as their use for condition is largely hypothetical and not something that we've observed in practice when curating > 500 articles. |
I think that would also be a perfectly fine solution - as the answer is yes to the question whether the definitions (and corresponding glossary pages) would be identical. |
Some more questions regarding the aliases. If someone enters an alias, the wiki tries to find the Main Term (Glossary page title) in the Glossary, that is, a Glossary term that has this alias listed. Once a match is found, it displays the Main Term instead of the alias that was entered. It also finds ALL of the aliases for this Main Term and inserts them in the page (but hides them) so that a search for any alias of this term will find this page. A similar behavior applies if someone actually enters a term that IS a Main Term in the Glossary: the wiki will display this Main Term but also look up all aliases for this term and store them in the page. Please let me know if any of the above is incorrect behavior.
We are running into an issue in that sometimes the same alias is listed on more than one Glossary page. For example, gastric carcinoma and stomach neoplasm both list "Gastric Cancer" as an alias. And to complicate things, gastric cancer is also a Glossary Main Term with its own page. What should the wiki do in this situation? Right now it tries to find EVERY term that matches "Gastric Cancer" and it lists all 3 of them on the page like:
Condition: gastric cancer, gastric carcinoma, stomach neoplasm
And it also stores the full list of all aliases from all 3 Glossary pages within the page. Is this correct? I'm thinking it would be better to show the Main Term only. And, if a matching Main Term is not found, then only show the Main Term of the first Glossary page that has a matching alias. But under all circumstances only one Condition (or Body Site) should be displayed. What do you think?
And also, how should we deal with storing aliases then? Do we want to cast the net as wide as possible and find every single matching alias? Or just the aliases of the first matching term that was found? (Note that there is nothing special about the first matching term. It's probably the first one the wiki will find, alphabetically.) |
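To get a feel for how often an alias is shared between Glossary pages, something like the following could be run over the glossary data. This is only a sketch: the dictionary layout (Main Term mapped to its aliases) and the function name are assumptions, not how the wiki stores the Glossary, and the entries are reduced to the example from this comment.

```python
from collections import defaultdict
from typing import Dict, List


def ambiguous_aliases(glossary: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each alias listed on more than one Glossary page to its Main Terms."""
    index = defaultdict(set)
    for main_term, aliases in glossary.items():
        for alias in aliases:
            index[alias.casefold()].add(main_term)   # compare case-insensitively
    return {
        alias: sorted(main_terms)
        for alias, main_terms in index.items()
        if len(main_terms) > 1
    }


# Assumed layout: Main Term -> aliases (alias lists shortened for illustration).
glossary = {
    "gastric cancer": [],
    "gastric carcinoma": ["Gastric Cancer"],
    "stomach neoplasm": ["Gastric Cancer"],
}
shared = ambiguous_aliases(glossary)
```

With this input, `shared` reports "gastric cancer" as an alias listed under both gastric carcinoma and stomach neoplasm.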
I believe we should follow here the behavior on the EFO site itself, and push curators upon entering an alias into the condition form towards selecting a main term. That means if a curator enters a synonym, autocomplete options presented would be corresponding main terms for that alias only. (That means I would here indeed depart from allowing curators to be able to enter an alias). Using the "gastric cancer" example, and looking into which autocomplete options become available when typing "gastric cancer" into the search field, it seems that EFO provides corresponding main terms (potentially mapped via synonyms as for "stomach neoplasm"). This is also true when specifically typing "stomach cancer", which is an alias for "gastric cancer" (and potentially also others). |
Interesting idea. Let me see what we can do. But we already have the pages we imported which already contain aliases. How should we deal with these existing pages? |
Sorry for the delay in replying. I am not sure I fully understand the practical implications here. Why would the imported pages pose a problem? |
I also just took another look into the exported experiments.csv and was happy to see that the gaps in the EFO ID / UBERON ID columns are getting smaller. However, we seem to still have some issues for valid conditions and body sites that were part of the import.
Conditions with empty EFO ID column, although they represent valid glossary terms: Genital neoplasm, female (there seems to be a hiccup due to the ", " notation, see eg https://bugsigdb.org/Study_305)
Body sites with empty UBERON ID column, although they represent valid glossary terms: breast (is turned into "obsolete_mammary gland", see eg https://bugsigdb.org/Study_9)
P.S.: Editing conditions and body sites in the experiment form works great, but autocomplete options take a while to load. Any chance this can be made faster? |
Yes, it looks like the comma is the culprit. Will fix.
It looks to me like the likely culprit is that the wiki is finding a Glossary term that is listed as a Condition even though this is the Body site field. We'll limit the search and see if that fixes all of these.
Previously all the values were being loaded at the time the form was first loaded, which meant that autocomplete was very quick. Once we imported the full set of these fields' values, we had to change the autocomplete method we were using. Instead of loading them all at the time the form was being loaded (which would no longer be viable), we're now loading them via AJAX while the user types in each field. It is slow though. Let me see what we can do. |
The wiki treats all data the same, regardless of whether it comes from the CSV import or is manually inserted by a user. So for the imported pages, the wiki will perform the same alias lookup. In other words, if a Main term wasn't entered in the CSV, the wiki will search on-the-fly to find a matching Main term that has the entered term as one of its aliases. So what happens if multiple matching Main terms are found that all have the entered term as one of their aliases? We need to figure out which Main term should be displayed and stored in the Condition/Body site field. Does that make sense? |
Yes that makes sense. Let me start with saying that I don't think showing multiple matching terms is a bad solution. In the end, these are coming in via existing aliases and there are certainly situations where showing multiple main terms better describes the condition than just one main term. For example, we found "(one sort of) cancer" / "(one sort of) carcinoma" being used rather interchangeably in the literature, and instead of having two sets of signatures (ie one for "(one sort of) cancer" and one for "(one sort of) carcinoma"), it simplifies downstream analysis if those come grouped together right from the start.
That said, do you internally distinguish between main terms and aliases? Or do you throw them all into one pot when comparing against the conditions / body sites that were entered by the curators into the experiment forms? If you distinguish, one solution might be to first check whether there is a main term that exactly matches the condition / body site entered by a curator on the experiment form. If so, then take the main term. If there is no exactly matching main term, go the alias route. Here there might indeed be multiple matching main terms and no best solution, and showing all would be just fine I think.
Taking the example of "urinary tract infection" as in #107: here we have an exactly matching main term, urinary tract infection, which could be displayed. There is not necessarily a need for going down the alias route and also bringing in "bacterial urinary tract infection". |
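The two-step lookup proposed above could look roughly like this. It is a sketch in Python rather than the wiki's implementation: the glossary layout (main term mapped to its aliases), the function name and the alias lists are assumptions made for illustration.

```python
from typing import Dict, List


def resolve_terms(entered: str, glossary: Dict[str, List[str]]) -> List[str]:
    """Return the main term(s) to display for what a curator entered."""
    key = entered.strip().casefold()

    # Step 1: an exactly matching main term wins outright.
    for main_term in glossary:
        if main_term.casefold() == key:
            return [main_term]

    # Step 2: alias route; every main term listing this alias is returned.
    return sorted(
        main_term
        for main_term, aliases in glossary.items()
        if key in (alias.casefold() for alias in aliases)
    )


# Assumed layout and illustrative alias lists (not the real Glossary content).
glossary = {
    "gastric cancer": ["stomach cancer"],
    "gastric carcinoma": ["Gastric Cancer"],
    "stomach neoplasm": ["Gastric Cancer", "stomach cancer"],
}

exact = resolve_terms("Gastric Cancer", glossary)
via_alias = resolve_terms("stomach cancer", glossary)
```

In this toy glossary, "Gastric Cancer" resolves to the single main term gastric cancer, while "stomach cancer" has no exact main term and therefore brings back both main terms that list it as an alias.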
This should now be fixed. |
Right now I don't think there is any internal difference between entered main terms and aliases. If an alias is entered, we just store the main term anyway. How should we proceed then? (It sounded like your instructions were only intended if we did distinguish.) |
I think if the aforementioned issues can be fixed, we are actually in a good spot with the solution that you've put in place. |
Everything looks better now. Any filled body site field now has a matching UBERON ID. The only conditions with missing EFO ID columns are:
|
Looks like we are getting close! Taking a look at the exported
In addition, I am seeing a handful of instances / experiments where a valid glossary term has an empty EFO ID column, although it is non-empty for the same glossary term for other instances / experiments. Instances of conditions with empty EFO ID column, although they represent valid glossary terms and are matched to the corresponding glossary term for other instances / experiments: obesity (Study 46, Experiment 1) |
Thanks! Probably just needs a data rebuild after some new changes we made. We'll check. |
Great, thanks! |
Just did a fresh export of the experiments.csv - and everything looked good, no more missing items! I believe we can close this. |
Great! |
@tosfos I'd like to filter signatures exported from BugSigDB by review status for #80.
This currently seems not possible, as the only column on review status that we have in the export CSV files is "Revision editor". However, the field "Revision editor" has the value "WikiWorks743" for reviewed content (see eg https://bugsigdb.org/Study_255) as well as for content that still needs to be reviewed (see eg https://bugsigdb.org/Study_400). Can this be changed so that this column is empty or NA in the export for content that still needs to be reviewed?