Move products_tags to Postgres to improve performance #8676
Comments
May want to consider using Citus. https://github.com/citusdata/citus
Evaluating Citus vs DuckDB. First, exported all the tags into a CSV file, then imported into Postgres.
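The exact import commands aren't preserved above; a minimal sketch, assuming the CSV has product_id, tag_type and value columns (the real column layout isn't shown in the thread):

```sql
-- Hypothetical table layout; the actual CSV columns aren't shown in the thread.
CREATE TABLE product_tags (
    product_id TEXT,
    tag_type   TEXT,
    value      TEXT
);

-- Bulk-load with COPY (here via psql's \copy), far faster than row-by-row INSERTs.
\copy product_tags FROM 'product_tags.csv' WITH (FORMAT csv, HEADER true)
```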
Import into Postgres took 10 minutes, 52 seconds. For DuckDB:
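The DuckDB import command isn't preserved either; a sketch using the same CSV:

```sql
-- DuckDB can create the table straight from the CSV, inferring column types.
CREATE TABLE product_tags AS
SELECT * FROM read_csv_auto('product_tags.csv');
```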
Queries:
Query 1: DuckDB 16s, Citus 39s
Query 2: DuckDB 50s, Citus 87s

The Citus figures are a lot slower than before. Investigating...
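The two benchmarked queries themselves aren't shown in the thread; a plausible shape for this kind of aggregate, using the hypothetical product_tags layout above ('categories', 'countries' and 'en:france' are illustrative values, not taken from the issue):

```sql
-- Query 1 (sketch): count products per tag value for one tag type.
SELECT value, count(*) AS product_count
FROM product_tags
WHERE tag_type = 'categories'
GROUP BY value
ORDER BY product_count DESC;

-- Query 2 (sketch): the same aggregate filtered on a second tag type via a
-- self-join, roughly what a multi-facet query looks like.
SELECT t1.value, count(*) AS product_count
FROM product_tags t1
JOIN product_tags t2 ON t2.product_id = t1.product_id
WHERE t1.tag_type = 'categories'
  AND t2.tag_type = 'countries'
  AND t2.value = 'en:france'
GROUP BY t1.value
ORDER BY product_count DESC;
```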
Tried inserting into Citus in a different order:
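The exact ordering isn't preserved; a sketch of what a different insert order could mean for Citus, where loading in distribution-key order lets each shard fill in long sequential runs (the choice of product_id as distribution column is an assumption):

```sql
-- Citus sketch: distribute the table across shards before loading.
-- create_distributed_table() is Citus's standard API for this.
SELECT create_distributed_table('product_tags', 'product_id');

-- Load rows pre-sorted by the distribution column so writes hit each shard
-- sequentially instead of hopping between shards row by row.
\copy product_tags FROM 'product_tags_sorted_by_product_id.csv' WITH (FORMAT csv, HEADER true)
```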
For DuckDB, re-exported the data sorted by tag_type and value and re-imported into a fresh database. Import took 2 minutes, 56 seconds.
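A sketch of that sorted re-export and re-import, assuming the export is done from the existing DuckDB database (the actual tooling used isn't stated):

```sql
-- Re-export sorted by tag_type and value so similar rows sit together,
-- which compresses better and speeds up scans filtered by tag_type.
COPY (SELECT * FROM product_tags ORDER BY tag_type, value)
TO 'product_tags_sorted.csv' (FORMAT csv, HEADER true);

-- In a fresh DuckDB database, re-import the sorted file.
CREATE TABLE product_tags AS
SELECT * FROM read_csv_auto('product_tags_sorted.csv');
```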
Tried another approach with Postgres: create a table for each tag type, starting with just the two being used, then re-run Query 1 and Query 2 against them.
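A sketch of the per-tag-type layout (table and column names are illustrative, not from the issue):

```sql
-- One table per tag type: the tag_type column becomes implicit, rows are
-- narrower, and each query scans only the tag type it actually needs.
CREATE TABLE product_categories_tags (
    product_id TEXT,
    value      TEXT
);

CREATE TABLE product_countries_tags (
    product_id TEXT,
    value      TEXT
);

-- Query 1 against the narrower table no longer filters on tag_type at all:
SELECT value, count(*) AS product_count
FROM product_categories_tags
GROUP BY value
ORDER BY product_count DESC;

-- Query 2 becomes a join between the two per-type tables:
SELECT c.value, count(*) AS product_count
FROM product_categories_tags c
JOIN product_countries_tags o ON o.product_id = c.product_id
WHERE o.value = 'en:france'
GROUP BY c.value
ORDER BY product_count DESC;
```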
Things to do:
Description
Many aggregate queries time out, even when using the products_tags collection in MongoDB. Initial tests indicate that Postgres would offer much better performance for this type of query.
Acceptance criteria
What would a demo look like
Show improved query performance
Notes
Strategy is to create a new openfoodfacts-data repo to wrap the data storage functionality. This would periodically query MongoDB for changed products (needs to check products_obsolete too) and update a cache in Postgres.
Can hopefully use the minion database for this, but might want to consider something separate.
Future work could include joining to taxonomies in the query to avoid the need for tag matching in Perl.
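A sketch of what that taxonomy join might look like; the taxonomies table and its columns are assumptions for illustration, not anything defined in this issue:

```sql
-- Hypothetical taxonomy join: resolve canonical tag names in SQL instead of
-- post-processing the tag list in Perl.
SELECT tax.canonical_name, count(*) AS product_count
FROM product_tags pt
JOIN taxonomies tax
  ON tax.tag_type = pt.tag_type
 AND tax.tag_id   = pt.value
WHERE pt.tag_type = 'categories'
GROUP BY tax.canonical_name
ORDER BY product_count DESC;
```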
Tasks
Part of epic