[PB-2628, BR-492]: fix/Improve file search ranking #394

Merged
merged 4 commits into from
Sep 18, 2024

Conversation

jzunigax2
Contributor

The query used for fuzzy search was not finding the correct file in cases where many files existed with very similar name patterns, for example VID_YYYYMMDD_XXXXX. Even when searching for the exact file name, the file was not included in the 5 search results returned.

  • Added the first normalization option to the ts_rank params, which should slightly penalize longer file names with overly generic names. (more here)
  • Added a check to prioritize exact file name matches over any other result
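As a rough illustration of the two changes listed above, here is a minimal sketch of how such an ordering clause could be assembled. It is not the code from this PR: the helper name `buildSearchOrderBy` and the `:term` bind parameter are assumptions, and only the column names (`name`, `tokenized_name`) are taken from the query plans in this thread. It relies on the documented `ts_rank(vector, query, normalization)` form, where normalization `1` divides the rank by 1 + the logarithm of the document length.

```typescript
// Hypothetical helper (not from the PR): builds the ORDER BY fragment for the search query.
// `:term` is an assumed bind parameter name; `name` and `tokenized_name` appear in the plans.
function buildSearchOrderBy(normalization: number = 1): string {
  // ts_rank's normalization argument is an integer bitmask, so reject anything else.
  if (!Number.isInteger(normalization) || normalization < 0) {
    throw new Error("normalization must be a non-negative integer");
  }
  return [
    // 1) Exact file name matches always sort first.
    "CASE WHEN name = :term THEN 1 ELSE 0 END DESC",
    // 2) Full-text rank, normalized by document length when normalization = 1.
    `ts_rank(tokenized_name, to_tsquery(:term), ${normalization}) DESC`,
    // 3) Trigram similarity (pg_trgm) as the final tie-breaker.
    "similarity(name, :term) DESC",
  ].join(", ");
}

const orderBy: string = buildSearchOrderBy();
console.log(`ORDER BY ${orderBy} LIMIT 5`);
```

Keeping the exact-match term first means a file whose name equals the search term is returned even when hundreds of near-identical names receive similar rank and similarity scores.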

[Screenshots: search results before / after]

@jzunigax2 jzunigax2 self-assigned this Sep 16, 2024
@jzunigax2 jzunigax2 changed the title fix: Improve file search ranking and include creation timestamps [PB-2628]: fix/Improve file search ranking and include creation timestamps Sep 17, 2024
@jzunigax2 jzunigax2 changed the title [PB-2628]: fix/Improve file search ranking and include creation timestamps [PB-2628]: fix/Improve file search ranking Sep 17, 2024
@sg-gs
Member

sg-gs commented Sep 17, 2024

Hey, @jzunigax2 would you mind running EXPLAIN ANALYZE before and after with around 250k example records for the same user to see the overall performance impact of this modification?

@jzunigax2
Contributor Author

Hey, @jzunigax2 would you mind running EXPLAIN ANALYZE before and after with around 250k example records for the same user to see the overall performance impact of this modification?

@sg-gs, this is with 250k records

Before:

QUERY PLAN
Limit (cost=40836.62..40842.71 rows=5 width=246) (actual time=1948.695..1959.511 rows=5 loops=1)
-> Nested Loop Left Join (cost=40836.62..142345.44 rows=83342 width=246) (actual time=1948.694..1959.508 rows=5 loops=1)
-> Gather Merge (cost=40836.20..50542.75 rows=83342 width=167) (actual time=1948.594..1959.254 rows=5 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=39836.18..39923.00 rows=34726 width=167) (actual time=1938.160..1938.220 rows=212 loops=3)
Sort Key: (NULLIF(ts_rank("LookUpModel".tokenized_name, to_tsquery('VID_20240126_WA46479_83954'::text)), '0'::double precision)), (similarity(("LookUpModel".name)::text, 'VID_20240126_WA46479_83954'::text)) DESC
Sort Method: external merge Disk: 15816kB
Worker 0: Sort Method: external merge Disk: 15672kB
Worker 1: Sort Method: external merge Disk: 15664kB
-> Hash Left Join (cost=11.35..34368.19 rows=34726 width=167) (actual time=1.263..1838.299 rows=83334 loops=3)
Hash Cond: ("LookUpModel".item_id = folder.uuid)
-> Parallel Seq Scan on look_up "LookUpModel" (cost=0.00..34265.69 rows=34726 width=163) (actual time=1.018..941.407 rows=83334 loops=3)
Filter: (((user_id)::text = '84be5af3-48cc-49c6-8e06-da0922540093'::text) AND ((to_tsquery('VID_20240126_WA46479_83954'::text) @@ tokenized_name) OR (similarity((name)::text, 'VID_20240126_WA46479_83954'::text) > '0'::double precision)))
Rows Removed by Filter: 111
-> Hash (cost=10.60..10.60 rows=60 width=20) (actual time=0.086..0.088 rows=109 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Seq Scan on folders folder (cost=0.00..10.60 rows=60 width=20) (actual time=0.024..0.066 rows=109 loops=3)
-> Index Scan using files_uuid_key on files file (cost=0.42..0.84 rows=1 width=87) (actual time=0.028..0.029 rows=1 loops=5)
Index Cond: (uuid = "LookUpModel".item_id)
Planning Time: 0.890 ms
Execution Time: 1963.199 ms

After:

QUERY PLAN
Limit (cost=40836.62..40842.73 rows=5 width=250) (actual time=1896.840..1907.764 rows=5 loops=1)
-> Nested Loop Left Join (cost=40836.62..142553.80 rows=83342 width=250) (actual time=1896.839..1907.760 rows=5 loops=1)
-> Gather Merge (cost=40836.20..50542.75 rows=83342 width=167) (actual time=1896.740..1907.508 rows=5 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=39836.18..39923.00 rows=34726 width=167) (actual time=1889.491..1889.577 rows=208 loops=3)
Sort Key: (CASE WHEN (("LookUpModel".name)::text = 'VID_20240126_WA46479_83954'::text) THEN 1 ELSE 0 END) DESC, (NULLIF(ts_rank("LookUpModel".tokenized_name, to_tsquery('VID_20240126_WA46479_83954'::text)), '1'::double precision)), (similarity(("LookUpModel".name)::text, 'VID_20240126_WA46479_83954'::text)) DESC
Sort Method: external merge Disk: 16128kB
Worker 0: Sort Method: external merge Disk: 15816kB
Worker 1: Sort Method: external merge Disk: 16152kB
-> Hash Left Join (cost=11.35..34368.19 rows=34726 width=167) (actual time=0.537..1775.411 rows=83334 loops=3)
Hash Cond: ("LookUpModel".item_id = folder.uuid)
-> Parallel Seq Scan on look_up "LookUpModel" (cost=0.00..34265.69 rows=34726 width=163) (actual time=0.386..905.687 rows=83334 loops=3)
Filter: (((user_id)::text = '84be5af3-48cc-49c6-8e06-da0922540093'::text) AND ((to_tsquery('VID_20240126_WA46479_83954'::text) @@ tokenized_name) OR (similarity((name)::text, 'VID_20240126_WA46479_83954'::text) > '0'::double precision)))
Rows Removed by Filter: 111
-> Hash (cost=10.60..10.60 rows=60 width=20) (actual time=0.067..0.068 rows=109 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Seq Scan on folders folder (cost=0.00..10.60 rows=60 width=20) (actual time=0.017..0.050 rows=109 loops=3)
-> Index Scan using files_uuid_key on files file (cost=0.42..0.84 rows=1 width=87) (actual time=0.023..0.023 rows=1 loops=5)
Index Cond: (uuid = "LookUpModel".item_id)
Planning Time: 6.087 ms
Execution Time: 1911.745 ms
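
To put a number on the difference between the two plans above, the following sketch pulls the `Execution Time` line out of each plan and compares them. The helper name `executionTimeMs` is hypothetical, and a single run of each query is only a rough indication, not a benchmark.

```typescript
// Hypothetical helper: extracts the "Execution Time: <n> ms" value from EXPLAIN ANALYZE text.
function executionTimeMs(plan: string): number {
  const match = /Execution Time:\s*([\d.]+)\s*ms/.exec(plan);
  const value = match?.[1];
  if (value === undefined) {
    throw new Error("no Execution Time line found in plan");
  }
  return parseFloat(value);
}

// Final lines of the two plans posted in this thread.
const beforePlan = "Planning Time: 0.890 ms\nExecution Time: 1963.199 ms";
const afterPlan = "Planning Time: 6.087 ms\nExecution Time: 1911.745 ms";

const beforeMs = executionTimeMs(beforePlan);
const afterMs = executionTimeMs(afterPlan);
const deltaMs = afterMs - beforeMs;
const deltaPct = (deltaMs / beforeMs) * 100;

console.log(`delta: ${deltaMs.toFixed(3)} ms (${deltaPct.toFixed(1)}%)`);
```

The "After" run is about 51 ms (roughly 2.6%) faster here, which is well within the run-to-run variation one would expect for a query of this size, so the extra sort key does not appear to add measurable cost.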

@jzunigax2
Contributor Author

Also, @apsantiso passed me another error report related to this search functionality. After taking a look, it turns out that the more specific a search term got, the worse the search results became. This was due to ordering by rank ASC: the lower the ts_rank output, the worse the match. Simply ordering by rank DESC greatly improves search results.

@sg-gs with this, the new exactMatch check could be unnecessary, but as seen in the execution plans it has little impact on performance. Let me know if I should revert or keep it.
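
The effect of the sort direction described above can be shown with a small in-memory sketch. The rank values and every file name except the searched one are made up for illustration, and `topResults` is a hypothetical helper, not part of the codebase.

```typescript
interface RankedRow {
  name: string;
  rank: number; // stand-in for the ts_rank output: higher means a better match
}

type Direction = "ASC" | "DESC";

// Hypothetical helper: sorts by rank in the given direction and keeps the first `limit` names.
function topResults(rows: RankedRow[], direction: Direction, limit: number): string[] {
  const sign = direction === "ASC" ? 1 : -1;
  return [...rows]
    .sort((a, b) => sign * (a.rank - b.rank))
    .slice(0, limit)
    .map((row) => row.name);
}

// Illustrative data only: ranks are invented, not measured.
const sampleRows: RankedRow[] = [
  { name: "VID_20240126_WA46479_83954", rank: 0.9 },
  { name: "VID_20240126_WA00001_11111", rank: 0.4 },
  { name: "VID_20230101_WA00002_22222", rank: 0.1 },
];

console.log(topResults(sampleRows, "ASC", 2)); // weakest matches come first
console.log(topResults(sampleRows, "DESC", 2)); // strongest matches come first
```

With a LIMIT of 5 on the real query, ascending order fills the result set with the weakest matches, which would explain why more specific search terms produced worse results.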

@jzunigax2 jzunigax2 changed the title [PB-2628]: fix/Improve file search ranking [PB-2628, BR-492]: fix/Improve file search ranking Sep 17, 2024

sonarcloud bot commented Sep 18, 2024

@sg-gs
Member

sg-gs commented Sep 18, 2024

There is virtually no impact on performance. As long as it passes QA, this can be merged. Great job @jzunigax2

@sg-gs sg-gs added the bug Something isn't working label Sep 18, 2024
@sg-gs sg-gs merged commit 4de1781 into master Sep 18, 2024
10 of 11 checks passed
@sg-gs sg-gs deleted the fix/fuzzy-search-relevance branch September 18, 2024 14:13
Labels
bug Something isn't working ready-for-preview