[PB-2628, BR-492]: fix/Improve file search ranking #394

jzunigax2 · 2024-09-16T22:44:16Z

The fuzzy search was not finding the correct file in cases where many files existed with very similar name patterns, for example VID_YYYYMMDD_XXXXX. Even when searching for the exact file name, the file was not included in the 5 search results returned.

Added the first normalization option to the ts_rank params, which should slightly penalize longer file names with overly generic names. (more here)
Added a check to prioritize exact file name matches over any other result.
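
As a rough illustration of the two changes listed above, here is a minimal TypeScript sketch of the intended ordering. It is not the code from this PR: the `SearchRow` shape, the `tokenCount` field and the sample scores are hypothetical, and `rank` and `similarity` stand in for whatever PostgreSQL's ts_rank and similarity would return. The normalization follows the wording of the PostgreSQL documentation for option 1 (divide by 1 + the logarithm of the document length); the server's internal formula may differ in detail. Ranks are sorted highest first, as the later "order by rank desc" commit does.

```typescript
// Hypothetical row shape: `rank` and `similarity` are stand-ins for the
// values PostgreSQL would compute with ts_rank(...) and similarity(...).
interface SearchRow {
  name: string;
  tokenCount: number; // assumed length of the tokenized name
  rank: number;
  similarity: number;
}

// Normalization option 1 as described in the PostgreSQL docs:
// rank / (1 + log(document length)). Natural log is an assumption here.
function normalizeRank(rank: number, tokenCount: number): number {
  return rank / (1 + Math.log(tokenCount));
}

// Comparator: exact name matches first, then higher normalized rank,
// then higher similarity.
function compareRows(query: string): (a: SearchRow, b: SearchRow) => number {
  return (a: SearchRow, b: SearchRow): number => {
    const exactA = a.name === query ? 1 : 0;
    const exactB = b.name === query ? 1 : 0;
    if (exactA !== exactB) return exactB - exactA;
    const rankA = normalizeRank(a.rank, a.tokenCount);
    const rankB = normalizeRank(b.rank, b.tokenCount);
    if (rankA !== rankB) return rankB - rankA;
    return b.similarity - a.similarity;
  };
}

// Returns the top `limit` rows without mutating the input array.
function search(rows: SearchRow[], query: string, limit: number = 5): SearchRow[] {
  return [...rows].sort(compareRows(query)).slice(0, limit);
}
```

With this ordering, a row whose name equals the query is returned first even if a near-duplicate has a higher raw rank, and between two rows with the same raw rank the one with the shorter tokenized name wins.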

Before / After

sg-gs · 2024-09-17T08:24:01Z

Hey, @jzunigax2 would you mind running EXPLAIN ANALYZE before and after with around 250k example records for the same user to see the overall performance impact of this modification?

jzunigax2 · 2024-09-17T21:01:11Z

@sg-gs, this is with 250k records

Before:

QUERY PLAN
Limit (cost=40836.62..40842.71 rows=5 width=246) (actual time=1948.695..1959.511 rows=5 loops=1)
  -> Nested Loop Left Join (cost=40836.62..142345.44 rows=83342 width=246) (actual time=1948.694..1959.508 rows=5 loops=1)
        -> Gather Merge (cost=40836.20..50542.75 rows=83342 width=167) (actual time=1948.594..1959.254 rows=5 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              -> Sort (cost=39836.18..39923.00 rows=34726 width=167) (actual time=1938.160..1938.220 rows=212 loops=3)
                    Sort Key: (NULLIF(ts_rank("LookUpModel".tokenized_name, to_tsquery('VID_20240126_WA46479_83954'::text)), '0'::double precision)), (similarity(("LookUpModel".name)::text, 'VID_20240126_WA46479_83954'::text)) DESC
                    Sort Method: external merge Disk: 15816kB
                    Worker 0: Sort Method: external merge Disk: 15672kB
                    Worker 1: Sort Method: external merge Disk: 15664kB
                    -> Hash Left Join (cost=11.35..34368.19 rows=34726 width=167) (actual time=1.263..1838.299 rows=83334 loops=3)
                          Hash Cond: ("LookUpModel".item_id = folder.uuid)
                          -> Parallel Seq Scan on look_up "LookUpModel" (cost=0.00..34265.69 rows=34726 width=163) (actual time=1.018..941.407 rows=83334 loops=3)
                                Filter: (((user_id)::text = '84be5af3-48cc-49c6-8e06-da0922540093'::text) AND ((to_tsquery('VID_20240126_WA46479_83954'::text) @@ tokenized_name) OR (similarity((name)::text, 'VID_20240126_WA46479_83954'::text) > '0'::double precision)))
                                Rows Removed by Filter: 111
                          -> Hash (cost=10.60..10.60 rows=60 width=20) (actual time=0.086..0.088 rows=109 loops=3)
                                Buckets: 1024 Batches: 1 Memory Usage: 14kB
                                -> Seq Scan on folders folder (cost=0.00..10.60 rows=60 width=20) (actual time=0.024..0.066 rows=109 loops=3)
        -> Index Scan using files_uuid_key on files file (cost=0.42..0.84 rows=1 width=87) (actual time=0.028..0.029 rows=1 loops=5)
              Index Cond: (uuid = "LookUpModel".item_id)
Planning Time: 0.890 ms
Execution Time: 1963.199 ms

After:

QUERY PLAN
Limit (cost=40836.62..40842.73 rows=5 width=250) (actual time=1896.840..1907.764 rows=5 loops=1)
  -> Nested Loop Left Join (cost=40836.62..142553.80 rows=83342 width=250) (actual time=1896.839..1907.760 rows=5 loops=1)
        -> Gather Merge (cost=40836.20..50542.75 rows=83342 width=167) (actual time=1896.740..1907.508 rows=5 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              -> Sort (cost=39836.18..39923.00 rows=34726 width=167) (actual time=1889.491..1889.577 rows=208 loops=3)
                    Sort Key: (CASE WHEN (("LookUpModel".name)::text = 'VID_20240126_WA46479_83954'::text) THEN 1 ELSE 0 END) DESC, (NULLIF(ts_rank("LookUpModel".tokenized_name, to_tsquery('VID_20240126_WA46479_83954'::text)), '1'::double precision)), (similarity(("LookUpModel".name)::text, 'VID_20240126_WA46479_83954'::text)) DESC
                    Sort Method: external merge Disk: 16128kB
                    Worker 0: Sort Method: external merge Disk: 15816kB
                    Worker 1: Sort Method: external merge Disk: 16152kB
                    -> Hash Left Join (cost=11.35..34368.19 rows=34726 width=167) (actual time=0.537..1775.411 rows=83334 loops=3)
                          Hash Cond: ("LookUpModel".item_id = folder.uuid)
                          -> Parallel Seq Scan on look_up "LookUpModel" (cost=0.00..34265.69 rows=34726 width=163) (actual time=0.386..905.687 rows=83334 loops=3)
                                Filter: (((user_id)::text = '84be5af3-48cc-49c6-8e06-da0922540093'::text) AND ((to_tsquery('VID_20240126_WA46479_83954'::text) @@ tokenized_name) OR (similarity((name)::text, 'VID_20240126_WA46479_83954'::text) > '0'::double precision)))
                                Rows Removed by Filter: 111
                          -> Hash (cost=10.60..10.60 rows=60 width=20) (actual time=0.067..0.068 rows=109 loops=3)
                                Buckets: 1024 Batches: 1 Memory Usage: 14kB
                                -> Seq Scan on folders folder (cost=0.00..10.60 rows=60 width=20) (actual time=0.017..0.050 rows=109 loops=3)
        -> Index Scan using files_uuid_key on files file (cost=0.42..0.84 rows=1 width=87) (actual time=0.023..0.023 rows=1 loops=5)
              Index Cond: (uuid = "LookUpModel".item_id)
Planning Time: 6.087 ms
Execution Time: 1911.745 ms
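
To put the two execution times side by side, here is a small TypeScript sketch that computes the difference. The numbers are copied from the plans above; the helper name `relativeChange` is only illustrative.

```typescript
// Execution times reported by EXPLAIN ANALYZE in the two plans above.
const beforeMs: number = 1963.199;
const afterMs: number = 1911.745;

// Relative change of `after` with respect to `before`
// (negative means the "after" run was faster).
function relativeChange(before: number, after: number): number {
  return (after - before) / before;
}

const deltaMs: number = afterMs - beforeMs;
const change: number = relativeChange(beforeMs, afterMs);
const changePercent: string = (change * 100).toFixed(2) + '%';
```

The "after" run comes out about 51 ms (roughly 2.6%) faster, while planning time went from 0.890 ms to 6.087 ms. Each plan was captured once, so differences of this size may simply be run-to-run variation.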

jzunigax2 · 2024-09-17T21:28:11Z

Also, @apsantiso passed me another error report related to this search functionality. After taking a look, it turns out that the more specific a search term got, the worse the search results became. This was due to ordering by rank ASC: it seems that the lower the ts_rank output, the worse the match. Simply ordering by rank DESC greatly improves the search results.
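
The effect of the sort direction can be sketched in TypeScript with made-up scores. This is an illustration only: the `Ranked` shape, the file names and the rank values are hypothetical stand-ins for ts_rank output, where a higher value is taken to mean a better match.

```typescript
// Hypothetical stand-in for a row with its ts_rank score.
interface Ranked {
  name: string;
  rank: number;
}

// Sorts a copy of the rows by rank in the given direction and keeps
// the first `limit` entries, mirroring ORDER BY rank ... LIMIT 5.
function topResults(rows: Ranked[], direction: 'ASC' | 'DESC', limit: number = 5): Ranked[] {
  const sign = direction === 'ASC' ? 1 : -1;
  return [...rows].sort((a, b) => sign * (a.rank - b.rank)).slice(0, limit);
}

// Seven made-up candidates; the first one is the best match.
const sample: Ranked[] = [
  { name: 'best_match', rank: 0.9 },
  { name: 'close_match', rank: 0.5 },
  { name: 'candidate_c', rank: 0.4 },
  { name: 'candidate_d', rank: 0.3 },
  { name: 'candidate_e', rank: 0.2 },
  { name: 'candidate_f', rank: 0.1 },
  { name: 'candidate_g', rank: 0.05 },
];

const ascending = topResults(sample, 'ASC').map((r) => r.name);
const descending = topResults(sample, 'DESC').map((r) => r.name);
```

With ASC, the five returned rows are the five lowest-ranked candidates and `best_match` is cut off by the limit. With DESC it comes first.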

@sg-gs with this change, the new exactMatch check could be unnecessary, but as seen in the execution plans it has little impact on performance. Let me know whether I should revert it or keep it.

sonarcloud · 2024-09-18T13:13:10Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

sg-gs · 2024-09-18T13:22:48Z

There is virtually no impact on performance. As long as it passes QA, this can be merged. Great job @jzunigax2

jzunigax2 added 2 commits September 16, 2024 13:02

fix: Improve file search ranking and include creation timestamps

6a11b67

fix: remove createdAt references

53639f9

jzunigax2 requested a review from sg-gs September 16, 2024 22:44

jzunigax2 self-assigned this Sep 16, 2024

jzunigax2 requested a review from apsantiso as a code owner September 16, 2024 22:44

jzunigax2 had a problem deploying to development September 16, 2024 22:44 — with GitHub Actions Failure

jzunigax2 temporarily deployed to development September 16, 2024 22:44 — with GitHub Actions Inactive

github-actions bot added the ready-for-preview label Sep 16, 2024

jzunigax2 changed the title ~~fix: Improve file search ranking and include creation timestamps~~ [PB-2628]: fix/Improve file search ranking and include creation timestamps Sep 17, 2024

jzunigax2 changed the title ~~[PB-2628]: fix/Improve file search ranking and include creation timestamps~~ [PB-2628]: fix/Improve file search ranking Sep 17, 2024

fix: order by rank desc

fbf4abf

jzunigax2 had a problem deploying to development September 17, 2024 21:09 — with GitHub Actions Failure

jzunigax2 temporarily deployed to development September 17, 2024 21:09 — with GitHub Actions Inactive

jzunigax2 changed the title ~~[PB-2628]: fix/Improve file search ranking~~ [PB-2628, BR-492]: fix/Improve file search ranking Sep 17, 2024

Merge branch 'master' into fix/fuzzy-search-relevance

f7cc6df

jzunigax2 had a problem deploying to development September 18, 2024 13:10 — with GitHub Actions Failure

jzunigax2 temporarily deployed to development September 18, 2024 13:10 — with GitHub Actions Inactive

sg-gs approved these changes Sep 18, 2024

sg-gs added the bug Something isn't working label Sep 18, 2024

sg-gs merged commit 4de1781 into master Sep 18, 2024
10 of 11 checks passed

sg-gs deleted the fix/fuzzy-search-relevance branch September 18, 2024 14:13