Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update lshforest.py #215

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

IbraheemTaha
Copy link

@IbraheemTaha IbraheemTaha commented Aug 8, 2023

In the "query" function, I changed the "results" variable type from set into dictionary. Since the order is important, set data structure does not preserve the order unlike dictionary in python.

The use is only for the dictionary keys which is eventually returned as a list.

Example:

data_sets = [
[1, 2, 3, 4, 5],
[4, 5, 6, 7, 8],
[2, 3, 5, 6, 7]
]

query_set = [1, 2, 3, 4, 5, 6]

The actual Jaccard between query_set and each of the data sets as follows:
(0) 0.8333333333333334
(1) 0.375
(2) 0.5714285714285714

Therefore the returned order based on the actual Jaccard score is [0,2,1]
However, the original implementation with set data structure returns [0,1,2], although the search function is correct.

After my enhancement, the returned list is [0,2,1] (more accurate than before).

This will significantly improve the implementation of LSH-Forest.

In the "query" function, I changed the "results" variable type from set into dictionary. Since the order is important, set data structure does not preserve the order unlike dictionary. 

The use is only for the dictionary keys which is eventually returned as a list.
@IbraheemTaha
Copy link
Author

Here is a result of small experiment to show the improvement of the results of MAP@K where the truth values are from Jaccard Similarity scores.
Old vs New

@ekzhu ekzhu self-requested a review August 9, 2023 16:33
@ekzhu
Copy link
Owner

ekzhu commented Aug 9, 2023

Thank you! Your observation indicates that we should order the results by increasing order of their first appearance. What people used to do typically after retrieving the top-k results is to recalculate the estimated Jaccard of the returned results using MinHash stored on the side, and then re-rank the results according to the estimated Jaccard.

Does the new implementation remove the needs to re-rank the results? What is the performance implication of using dictionary keys over set? There is a benchmark suite in benchmark/indexes/jaccard where we attempt to measure these things using real datasets.

@IbraheemTaha
Copy link
Author

Thank you!
Yes, in the paper there are two phases of prefix trees searching 1) top-down 2) bottom-up. So, if the search is done correctly (which I have observed so far), the more similar the stored MinHash is, the quicker it will appear in the search, thus matching the prefix of the query MinHash.

I am not able to answer your questions for now, because I have not done the tests yet.

Best!

@ekzhu
Copy link
Owner

ekzhu commented Aug 17, 2023

@IbraheemTaha this is interesting. Is there a specific guideline around what is the smallest prefix at which point we should stop probing the hash table? I guess at some point the accuracy will be quite bad.

@IbraheemTaha
Copy link
Author

@ekzhu I cannot tell for now. However, the search proposed in the paper is done in two phases.

  1. Top-Down: for all trees (tables) this search going down to find the leaf with the largest prefix match with the query, saying reaches level x.

  2. Bottom-Up: moving up from level x to the root, again for all trees (tables), collecting all points (similar items), of course these are put in a set so no duplications (or dictionaries keys as in my improvement).

If your question is how to stop the search before reaching out the root, then there should be kind of a flag to stop based on the number of points in the results meanwhile searching and stop when the result set equals to k, more precisely in the function _query.

I hope this answers your question.

@ekzhu
Copy link
Owner

ekzhu commented Mar 11, 2024

@IbraheemTaha Sorry for the extremely belated response. My question was how does your change compare to the approach of retrieving the top-k and then re-rank using the MinHash of the results? There is a recent PR #234 added new functionality to retrieve MinHash from LSHForest index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants