Display of Accession in search results - duplicates? #30

Open
goldturtle opened this issue Jun 18, 2019 · 11 comments
Comments

@vanaukenk commented on Aug 17, 2015

What determines which paper accession is displayed in the search results?

A keyword search on 'mut-7' lists both WBPaper ID and PMIDs, even though the PMIDs have corresponding WBPaper IDs.

Sometimes the same sentence appears listed under each accession separately, but the sentence actually has a different score depending on the accession.

As an example, a search with 'MUT-7' and the 'mf enz activ assay' and 'mf enz activ verbs' categories lists, as the third and fourth entries, the same sentence with scores of 0.611 and 0.592, respectively, for WBPaper00024699 and PMID 15653635.

Thx.
--Kimberly

@vanaukenk commented on Feb 29, 2016

Duplicate papers are still being returned with searches; one of the entries has all of the relevant IDs, the other only the PMID. The search scores are different for each entry.

@goldturtle commented on Feb 29, 2016

This smells like the paper is in the PMCOA corpus as well as the C. elegans corpus, but Yuling would need to investigate. As the papers are tokenized differently, the score differs.

M.

@vanaukenk commented on Feb 29, 2016

Yes, that makes sense. When there are duplicates, though, which one should be returned? Note also that the PMID-only papers display formatting when you click on the arrow to see the sentences:

@vanaukenk commented on Feb 29, 2016

For testing purposes, this is the search that I performed to get these results:
Search scope: sentence
Keywords: DYN-1
Categories (match all): MFEA assay terms, MFEA verbs

@vanaukenk commented on Jul 1, 2016

What is the current status of this issue wrt the C. elegans corpus? Screenshots of searches of the C. elegans corpus still display duplicate papers. Did we decide that we would go with the PMCOA version if it existed, and if not, then use the PDF version of the paper?
Also, a related issue: how do we want to handle supplemental files? It doesn't look like PMCOA contains the supplemental material, but I don't know if that's universally true.
When they are available, some labeling of supplemental materials would help indicate where the additional results are from.
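
For discussion, here is a minimal sketch of the "prefer the PMCOA version if it exists, otherwise use the PDF version" rule asked about above. It is not Textpresso's actual code: the `Hit` record, the `dedupe_hits` function, and the corpus labels are hypothetical, and it assumes both copies of a paper share a PMID, as in the duplicate entries reported earlier in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search result entry (hypothetical structure, for illustration only)."""
    pmid: str        # PubMed ID, assumed to be present on both copies of a paper
    corpus: str      # assumed labels: "PMCOA" or "C. elegans"
    accession: str   # accession shown in the result list
    score: float     # search score; differs per corpus because tokenization differs


def dedupe_hits(hits, preferred_corpus="PMCOA"):
    """Keep one hit per PMID, preferring the copy from preferred_corpus.

    If no copy comes from the preferred corpus, the first copy seen is kept.
    """
    kept = {}
    for hit in hits:
        current = kept.get(hit.pmid)
        if current is None:
            kept[hit.pmid] = hit
        elif hit.corpus == preferred_corpus and current.corpus != preferred_corpus:
            kept[hit.pmid] = hit
    # Return the surviving hits ordered by descending score.
    return sorted(kept.values(), key=lambda h: h.score, reverse=True)


# The two entries from the original report; which corpus each one came from
# is an assumption made here for the example.
hits = [
    Hit("15653635", "C. elegans", "WBPaper00024699", 0.611),
    Hit("15653635", "PMCOA", "PMID 15653635", 0.592),
]
unique_hits = dedupe_hits(hits)
```

One consequence worth noting: with this rule the surviving entry would carry only the PMID unless the WBPaper ID is copied over from the dropped duplicate, so the accession display question raised at the top of this issue would still need a decision.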

@vanaukenk

@goldturtle @valearna

I was doing some searches of the C. elegans Textpresso site and it looks like the duplicate paper problem is becoming even more pervasive:

image

I don't see this on the main Textpresso site, although the search results are very different there from the C. elegans site, as expected. Am I searching the correct site?

goldturtle commented Apr 15, 2021 via email

@vanaukenk

Okay, that makes more sense now. Thanks for pointing that out.

Should the results for checking 'C. elegans' AND 'C. elegans supplementals' be the same, then, as just selecting 'C. elegans and Supplementals'? I didn't see that in the search that I'm doing, but maybe there is another reason for that?

Perhaps the default literature setting should be to check just 'C. elegans and supplementals' and then users could narrow that to either category if they want to. We could see what people think on the Textpresso call.

textpresso commented Apr 15, 2021 via email

@vanaukenk

Okay, got it. Thanks.

@goldturtle
Contributor Author

The default literature is now C. elegans for people who are not logged in.

3 participants