Update lshforest.py #215

IbraheemTaha · 2023-08-08T18:38:38Z

In the "query" function, I changed the "results" variable type from set into dictionary. Since the order is important, set data structure does not preserve the order unlike dictionary in python.

The use is only for the dictionary keys which is eventually returned as a list.

Example:

data_sets = [
[1, 2, 3, 4, 5],
[4, 5, 6, 7, 8],
[2, 3, 5, 6, 7]
]

query_set = [1, 2, 3, 4, 5, 6]

The actual Jaccard between query_set and each of the data sets as follows:
(0) 0.8333333333333334
(1) 0.375
(2) 0.5714285714285714

Therefore the returned order based on the actual Jaccard score is [0,2,1]
However, the original implementation with set data structure returns [0,1,2], although the search function is correct.

After my enhancement, the returned list is [0,2,1] (more accurate than before).

This will significantly improve the implementation of LSH-Forest.

In the "query" function, I changed the "results" variable type from set into dictionary. Since the order is important, set data structure does not preserve the order unlike dictionary. The use is only for the dictionary keys which is eventually returned as a list.

IbraheemTaha · 2023-08-08T18:47:45Z

Here is a result of small experiment to show the improvement of the results of MAP@K where the truth values are from Jaccard Similarity scores.

ekzhu · 2023-08-09T16:45:04Z

Thank you! Your observation indicates that we should order the results by increasing order of their first appearance. What people used to do typically after retrieving the top-k results is to recalculate the estimated Jaccard of the returned results using MinHash stored on the side, and then re-rank the results according to the estimated Jaccard.

Does the new implementation remove the needs to re-rank the results? What is the performance implication of using dictionary keys over set? There is a benchmark suite in benchmark/indexes/jaccard where we attempt to measure these things using real datasets.

IbraheemTaha · 2023-08-09T18:43:19Z

Thank you!
Yes, in the paper there are two phases of prefix trees searching 1) top-down 2) bottom-up. So, if the search is done correctly (which I have observed so far), the more similar the stored MinHash is, the quicker it will appear in the search, thus matching the prefix of the query MinHash.

I am not able to answer your questions for now, because I have not done the tests yet.

Best!

ekzhu · 2023-08-17T18:23:09Z

@IbraheemTaha this is interesting. Is there a specific guideline around what is the smallest prefix at which point we should stop probing the hash table? I guess at some point the accuracy will be quite bad.

IbraheemTaha · 2023-09-26T15:32:23Z

@ekzhu I cannot tell for now. However, the search proposed in the paper is done in two phases.

Top-Down: for all trees (tables) this search going down to find the leaf with the largest prefix match with the query, saying reaches level x.
Bottom-Up: moving up from level x to the root, again for all trees (tables), collecting all points (similar items), of course these are put in a set so no duplications (or dictionaries keys as in my improvement).

If your question is how to stop the search before reaching out the root, then there should be kind of a flag to stop based on the number of points in the results meanwhile searching and stop when the result set equals to k, more precisely in the function _query.

I hope this answers your question.

ekzhu · 2024-03-11T04:31:45Z

@IbraheemTaha Sorry for the extremely belated response. My question was how does your change compare to the approach of retrieving the top-k and then re-rank using the MinHash of the results? There is a recent PR #234 added new functionality to retrieve MinHash from LSHForest index.

Update lshforest.py

1049285

In the "query" function, I changed the "results" variable type from set into dictionary. Since the order is important, set data structure does not preserve the order unlike dictionary. The use is only for the dictionary keys which is eventually returned as a list.

ekzhu self-requested a review August 9, 2023 16:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update lshforest.py #215

Update lshforest.py #215

IbraheemTaha commented Aug 8, 2023 •

edited

Loading

IbraheemTaha commented Aug 8, 2023

ekzhu commented Aug 9, 2023 •

edited

Loading

IbraheemTaha commented Aug 9, 2023

ekzhu commented Aug 17, 2023

IbraheemTaha commented Sep 26, 2023

ekzhu commented Mar 11, 2024

Update lshforest.py #215

Are you sure you want to change the base?

Update lshforest.py #215

Conversation

IbraheemTaha commented Aug 8, 2023 • edited Loading

IbraheemTaha commented Aug 8, 2023

ekzhu commented Aug 9, 2023 • edited Loading

IbraheemTaha commented Aug 9, 2023

ekzhu commented Aug 17, 2023

IbraheemTaha commented Sep 26, 2023

ekzhu commented Mar 11, 2024

IbraheemTaha commented Aug 8, 2023 •

edited

Loading

ekzhu commented Aug 9, 2023 •

edited

Loading