🚸 Improve search #2135

Closed
wants to merge 12 commits into from

sunnyosun
Member

@sunnyosun sunnyosun commented Nov 6, 2024

Key improvements:

  • prioritize startswith and isolated phrases (e.g. "naive B cell", "B cell, ..." over "club cell" when searching "b cell")
  • sort by shorter names first
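The two rules above can be illustrated with a minimal sketch. This is not the code from this PR: the function name `rank_matches` and the three-tier scheme are assumptions made for illustration, and it ranks only on names (it ignores descriptions and synonyms, which the actual search also considers).

```python
import re


def rank_matches(query: str, names: list[str]) -> list[str]:
    """Hypothetical ranking: startswith, then isolated phrase, then substring.

    Ties within a tier are broken by shorter names first.
    """
    q = query.strip().lower()
    if not q:
        return []
    # The query counts as "isolated" when it is not glued to other word characters.
    isolated = re.compile(rf"(?<!\w){re.escape(q)}(?!\w)")

    def tier(name: str) -> int:
        n = name.lower()
        if n.startswith(q):
            return 0  # e.g. "B cell, ..."
        if isolated.search(n):
            return 1  # e.g. "naive B cell"
        if q in n:
            return 2  # e.g. "club cell" (the "b" belongs to "club")
        return 3  # no match

    matches = [n for n in names if tier(n) < 3]
    return sorted(matches, key=lambda n: (tier(n), len(n), n.lower()))


names = ["club cell", "naive B cell", "B cell, CD19-positive", "B cell", "centrocyte"]
print(rank_matches("b cell", names))
# ['B cell', 'B cell, CD19-positive', 'naive B cell', 'club cell']
```

In this sketch "centrocyte" is dropped because only names are inspected; in the PR it shows up through a match in the description.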

Note: centrocyte appears in the "b cell" search because there's a perfect match of "B cell" in the description, same for "t cell" results.

Screenshots (not reproduced here): six example searches, each compared across LaminDB before, LaminDB after, and Hub.

codecov bot commented Nov 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.63%. Comparing base (57fbd29) to head (2ee90cb).
Report is 29 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2135      +/-   ##
==========================================
+ Coverage   92.53%   92.63%   +0.10%     
==========================================
  Files          55       55              
  Lines        6467     6508      +41     
==========================================
+ Hits         5984     6029      +45     
+ Misses        483      479       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions bot commented Nov 6, 2024

@github-actions github-actions bot temporarily deployed to pull request November 6, 2024 11:34 Inactive
@falexwolf
Member

This single example looks fantastic!

I'm just worried that it will deteriorate other cases.

Can you run this against @Koncopd's benchmarking framework?

Then we'd have a before/after comparison that covers a wider array of search cases, plus a report underlying the decision, published to LaminHub.

@sunnyosun
Member Author

Updated screenshots with the examples @Koncopd had, let me know if there's anything else I can test.

@Koncopd Koncopd mentioned this pull request Nov 7, 2024
@Koncopd
Member

Koncopd commented Nov 7, 2024

@sunnyosun sunnyosun linked an issue Nov 7, 2024 that may be closed by this pull request
@falexwolf falexwolf changed the title ⚡️ Improve search 🚸 Improve search Nov 7, 2024
@falexwolf
Member

These are great improvements!

I wish there were a good way in LaminHub to document such changes, but I guess there isn't for now.

@falexwolf
Member

@fredericenard, please also take a look here.

And then we'll discuss in the benchmarking PR how we proceed with organizing code across lamindb, bionty and laminhub.

@falexwolf falexwolf self-requested a review November 7, 2024 14:28
Member

@falexwolf falexwolf left a comment

One thing that's not covered at all is longer search phrases along the lines of "CD8-positive cytokine T cell", as highlighted at the very top of the original issue: #1708

So, I believe we should add such a case.

I also know that @sunnyosun had many more cases in her original benchmark: https://github.com/laminlabs/lamindb-benchmarks/blob/main/docs/2023/010-rapidfuzz-search.ipynb

@Koncopd
Member

Koncopd commented Nov 7, 2024

Added more searches to the benchmark.
`test_search_synonyms` fails now, so I commented it out in the benchmark.

@sunnyosun
Member Author

sunnyosun commented Nov 7, 2024

Fixed synonyms. The only downside is that search is now a bit slower: 0.7-0.8s per search (previously 0.2-0.3s), because of all the different layers. Measured on a us-west-2 instance.
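For anyone wanting to reproduce a per-search latency figure like the one above, here is a minimal timing sketch. It is not the benchmark used in this thread: `mean_latency` is a hypothetical helper, and `search` stands for whatever callable runs a single query (for example a wrapper around the registry search being tested).

```python
import time
from typing import Callable


def mean_latency(
    search: Callable[[str], object], queries: list[str], repeats: int = 3
) -> float:
    """Return the mean wall-clock seconds per call of `search` over all queries."""
    if not queries or repeats < 1:
        raise ValueError("need at least one query and one repeat")
    start = time.perf_counter()
    for _ in range(repeats):
        for query in queries:
            search(query)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(queries))
```

Running it once on the base branch and once on this branch with the same query list gives a like-for-like comparison; against a remote instance, network latency will be a large part of the number.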

Member

@Zethson Zethson left a comment

This is really cool!

I know that no review was requested from me, but I thought I'd leave a couple of comments anyway. Feel free to ignore them!

  1. Would it be possible to add 1-3 sentences of high level motivation and explanation of this new layered algorithm to the docstring or where appropriate, please?
  2. We have the nice search benchmark now. Is there a way to add slightly more sophisticated tests to ensure that future changes to this code don't cause regressions in search quality? @Koncopd's benchmarking notebook is great, but it isn't run regularly, is it?

These things can also be added later if you think that they're useful...
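One possible shape for such a regression check, as a sketch only: the helper names `ranked_before` and `failing_cases` and the `CASES` list are hypothetical, and `search` is assumed to be any callable that returns result names in ranked order. The single case listed is the "b cell" example from the PR description.

```python
from typing import Callable

# Hypothetical regression cases: (query, name expected higher, name expected lower).
CASES = [
    ("b cell", "naive B cell", "club cell"),
]


def ranked_before(results: list[str], better: str, worse: str) -> bool:
    """True if `better` is in the results and ranks ahead of `worse` (or `worse` is absent)."""
    if better not in results:
        return False
    if worse not in results:
        return True
    return results.index(better) < results.index(worse)


def failing_cases(
    search: Callable[[str], list[str]],
    cases: list[tuple[str, str, str]] = CASES,
) -> list[tuple[str, str, str]]:
    """Return every case whose expected ordering `search` violates."""
    return [
        case for case in cases if not ranked_before(search(case[0]), case[1], case[2])
    ]
```

A test in the regular suite could then assert that `failing_cases(...)` returns an empty list, so an ordering regression fails CI instead of waiting for a manual notebook run.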

)

def tokenize_search_string(search_str: str) -> list[str]:
    # Split the string
Member

Suggested change: remove the `# Split the string` comment.

Member

The code is self-documenting here.

@Zethson
Member

Zethson commented Nov 8, 2024

Ohh and can we get rid of this now?
https://docs.lamin.ai/query-search

@falexwolf
Member

Yes, removing will be the goal.

@Koncopd is making a push to consolidate all 3 search algorithms into one clearly documented and benchmarked solution.

Sunny's PR here has good ideas, but it's not suitable for the hub due to performance. So the hope is that Sergei can replicate the UX with server-side code (essentially by correctly using Postgres plugins). We can still use Sunny's code for dataframes and SQLite; it'll give the same results but run slower, which is OK in that context.

Successfully merging this pull request may close these issues.

Search is much better on the UI than in the open-source package
4 participants