-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🚸 Improve search #2135
🚸 Improve search #2135
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2135 +/- ##
==========================================
+ Coverage 92.53% 92.63% +0.10%
==========================================
Files 55 55
Lines 6467 6508 +41
==========================================
+ Hits 5984 6029 +45
+ Misses 483 479 -4 ☔ View full report in Codecov by Sentry. |
This single example looks fantastic! I'm just worried that it will deteriorate other cases. Can you run this against @Koncopd's benchmarking framework? And then we have a before after comparison that includes a wider array of search cases and a report that underlies the decision published to LaminHub. |
9b5a18d
to
177b2f8
Compare
Updated screenshots with the examples @Koncopd had, let me know if there's anything else I can test. |
@sunnyosun @falexwolf here is the benchmark for this PR #2141 |
These are great improvements! I wish there was good way in laminhub to document such changes but I guess there isn't for now. |
@fredericenard, please also take a look here. And then we'll discuss in the benchmarking PR how we proceed with organizing code across lamindb, bionty and laminhub. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing that's totally not covered is longer search phrases along the lines of "CD8-positie cytokine T cell" as highlighted at the very top of the original issue: #1708
So, I believe we should add such a case.
I also know that @sunnyosun had many more cases in her original benchmark: https://github.com/laminlabs/lamindb-benchmarks/blob/main/docs/2023/010-rapidfuzz-search.ipynb
Added more searches to the benchmark. |
Fixed synonyms. The only thing is that it's getting a bit slower now 0.7-0.8s per search (before 0.2-0.3s) because of all the different layers. (us-west-2 instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really cool!
I know that no review of me was requested but I thought I'd leave a couple of comments still. Feel free to ignore them!
- Would it be possible to add 1-3 sentences of high level motivation and explanation of this new layered algorithm to the docstring or where appropriate, please?
- We have the nice search benchmark now. Is there a way to add slightly more sophisticated tests to ensure that no changes of this code will lead to regressions of the search performance? @Koncopd benchmarking notebook is great but it's not regularly run, is it?
These things can also be added later if you think that they're useful...
) | ||
|
||
def tokenize_search_string(search_str: str) -> list[str]: | ||
# Split the string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Split the string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self documenting code here.
Ohh and can we get rid of this now? |
Yes, removing will be the goal. @Koncopd is making a push to consolidate all 3 search algorithms into one clearly documented and benchmarked solution. Sunny's PR here has good ideas but it's not suitable for the hub due to performance. So, the hope is Sergei can replicate the UX with server-side code (essentially correctly using postgres plugins). We can still use Sunny's code for dataframes and sqlite; it'll give the same results but just run slower which is OK in that context. |
Key improvements:
startswith
and isolated phrases (e.g. "naive B cell", "B cell, ..." over "club cell" when searching "b cell")Note:
centrocyte
appears in the "b cell" search because there's a perfect match of "B cell" in the description, same for "t cell" results.