optimize map marker fetching #1171

Draft. Wants to merge 3 commits into main.

@CollinBeczak CollinBeczak commented Feb 25, 2025

This pull request refactors the code for the tasks/box and tasks/markers endpoints, optimizing the underlying SQL query for better performance. The original query combined spatial filtering with other conditions directly in the WHERE clause, often resulting in inefficient execution plans—such as full-table scans or the creation of a large bitmap hash map spanning the entire tasks table within the specified map bounds. This inefficiency is evident in the staging query, which took approximately 1.1 seconds to execute and processed an excessive number of rows during the spatial filter, as shown in its execution plan.

Example of old query in staging:

SELECT tasks.id, tasks.name, tasks.parent_id, c.name, tasks.instruction, tasks.status, tasks.mapped_on,
          tasks.completed_time_spent, tasks.completed_by,
          tasks.bundle_id, tasks.is_bundle_primary, tasks.cooperative_work_json::TEXT as cooperative_work,
          task_review.review_status, task_review.review_requested_by, task_review.reviewed_by, task_review.reviewed_at,
          task_review.review_started_at, task_review.meta_review_status, task_review.meta_reviewed_by,
          task_review.meta_reviewed_at, task_review.additional_reviewers,
          ST_AsGeoJSON(tasks.location) AS location, priority,
          CASE WHEN task_review.review_started_at IS NULL
                THEN 0
                ELSE EXTRACT(epoch FROM (task_review.reviewed_at - task_review.review_started_at)) END
          AS reviewDuration
    FROM tasks
    
        INNER JOIN challenges c ON c.id = tasks.parent_id
        INNER JOIN projects p ON p.id = c.parent_id
        LEFT OUTER JOIN task_review ON task_review.task_id = tasks.id
      
   WHERE (tasks.id NOT IN (SELECT item_id FROM locked
                           WHERE item_id = tasks.id AND item_type = 2 AND user_id != 1))
     AND (c.deleted = false AND p.deleted = false)
     AND (tasks.location && ST_MakeEnvelope(-115.57617187500001, 32.54681317351517, -87.5390625, 52.908902047770255, 4326))
     AND (tasks.status IN (0,1,2,3,4,5,6,9))
     AND ((tasks.id IN (SELECT task_id FROM task_review
                        WHERE task_review.task_id = tasks.id
                          AND task_review.review_status IN (0,1,2,3,4,5,6,7,-1))
           OR NOT tasks.id IN (SELECT task_id FROM task_review task_review
                               WHERE task_review.task_id = tasks.id)))
     AND ((tasks.id IN (SELECT task_id FROM task_review
                        WHERE (task_review.task_id = tasks.id)
                          AND ((task_review.meta_review_status IN (0,1,2,3,5,6,7,-2,-1)
                                OR task_review.meta_review_status IS NULL)))
           OR NOT tasks.id IN (SELECT task_id FROM task_review task_review
                               WHERE task_review.task_id = tasks.id)))
     AND (tasks.priority IN (0,1,2))
     AND (c.id IN (40400))
   LIMIT 1001;

This is its execution plan (note the number of rows read by the location filter):
[Screenshot 2025-02-27 at 1 22 25 PM]

It takes roughly 1.1 seconds to complete.
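
For anyone who wants to reproduce the plan rather than read it from the screenshot, PostgreSQL's `EXPLAIN (ANALYZE, BUFFERS)` prints the per-node row counts. This is a minimal sketch, assuming a PostgreSQL/PostGIS session against staging. The select list, lock filter, and review filters are trimmed for brevity, so the numbers will not match the full query exactly.

```sql
-- Sketch only: a trimmed version of the original query.
-- EXPLAIN ANALYZE really executes the statement, so run it on staging.
EXPLAIN (ANALYZE, BUFFERS)
SELECT tasks.id, ST_AsGeoJSON(tasks.location) AS location
FROM tasks
INNER JOIN challenges c ON c.id = tasks.parent_id
INNER JOIN projects p ON p.id = c.parent_id
WHERE c.deleted = false
  AND p.deleted = false
  AND tasks.location && ST_MakeEnvelope(-115.57617187500001, 32.54681317351517,
                                        -87.5390625, 52.908902047770255, 4326)
  AND tasks.status IN (0,1,2,3,4,5,6,9)
  AND tasks.priority IN (0,1,2)
  AND c.id IN (40400)
LIMIT 1001;
```

The actual `rows=` value on the node that evaluates the `tasks.location && ...` condition is the figure to compare between the old and new queries.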

This is the new query:

WITH filtered_tasks AS (
          SELECT tasks.id
          FROM tasks
          INNER JOIN challenges c ON c.id = tasks.parent_id
          INNER JOIN projects p ON p.id = c.parent_id
          LEFT OUTER JOIN task_review ON task_review.task_id = tasks.id
          WHERE (tasks.id NOT IN (SELECT item_id FROM locked
                                  WHERE item_id = tasks.id AND item_type = 2 AND user_id != 1))
            AND (c.deleted = false AND p.deleted = false)
            AND (tasks.status IN (0,1,2,3,4,5,6,9))
            AND ((tasks.id IN (SELECT task_id FROM task_review
                               WHERE task_review.task_id = tasks.id
                                 AND task_review.review_status IN (0,1,2,3,4,5,6,7,-1))
                  OR NOT tasks.id IN (SELECT task_id FROM task_review task_review
                                      WHERE task_review.task_id = tasks.id)))
            AND ((tasks.id IN (SELECT task_id FROM task_review
                               WHERE (task_review.task_id = tasks.id)
                                 AND ((task_review.meta_review_status IN (0,1,2,3,5,6,7,-2,-1)
                                       OR task_review.meta_review_status IS NULL)))
                  OR NOT tasks.id IN (SELECT task_id FROM task_review task_review
                                      WHERE task_review.task_id = tasks.id)))
            AND (tasks.priority IN (0,1,2))
            AND (c.id IN (40400))
        )
        SELECT  tasks.id,
    tasks.name,
    tasks.parent_id,
    c.name,
    tasks.instruction,
    tasks.status,
    tasks.mapped_on,
    tasks.completed_time_spent,
    tasks.completed_by,
    tasks.bundle_id,
    tasks.is_bundle_primary,
    tasks.cooperative_work_json::TEXT as cooperative_work,
    task_review.review_status,
    task_review.review_requested_by,
    task_review.reviewed_by,
    task_review.reviewed_at,
    task_review.review_started_at,
    task_review.meta_review_status,
    task_review.meta_reviewed_by,
    task_review.meta_reviewed_at,
    task_review.additional_reviewers,
    ST_AsGeoJSON(tasks.location) AS location,
    priority,
    CASE 
        WHEN task_review.review_started_at IS NULL THEN 0
        ELSE EXTRACT(epoch FROM (task_review.reviewed_at - task_review.review_started_at))
    END AS reviewDuration
        FROM filtered_tasks
        INNER JOIN tasks ON tasks.id = filtered_tasks.id
        INNER JOIN challenges c ON c.id = tasks.parent_id
        INNER JOIN projects p ON p.id = c.parent_id
        LEFT OUTER JOIN task_review ON task_review.task_id = tasks.id
        WHERE tasks.location && ST_MakeEnvelope(-115.57617187500001, 32.54681317351517, -87.5390625, 52.908902047770255, 4326)
        LIMIT 1001
      ;

This is its execution plan:
[Screenshot 2025-02-27 at 1 22 25 PM]

It takes roughly 150-200 ms.

The refactored approach introduces a Common Table Expression (CTE) to streamline the process. By pre-filtering tasks based on non-spatial conditions (e.g., status, priority, and review criteria) in the CTE, the query reduces the dataset before applying the spatial filter. This significantly cuts down the rows processed by the spatial operation, leading to a more efficient execution plan. The updated query, now completing in 150–200 milliseconds, leverages a CTE scan to ensure the spatial filter (ST_MakeEnvelope) operates only on the pre-filtered subset of tasks, avoiding the bloated bitmap index issue seen previously. The new execution plan confirms this optimization, showing a marked reduction in rows read compared to the original.
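
One caveat worth checking: from PostgreSQL 12 onward, a non-recursive, side-effect-free CTE that is referenced only once can be inlined into the outer query, in which case it no longer guarantees that the non-spatial filters run first. The server version is not stated in this PR, so the following is only a sketch under that assumption. `AS MATERIALIZED` (PostgreSQL 12+ syntax) asks the planner to compute the CTE separately. The lock and review filters and most of the select list are omitted for brevity.

```sql
-- Sketch only: assumes PostgreSQL 12 or later; on older versions
-- CTEs are always materialized and the keyword is a syntax error.
WITH filtered_tasks AS MATERIALIZED (
  SELECT tasks.id
  FROM tasks
  INNER JOIN challenges c ON c.id = tasks.parent_id
  INNER JOIN projects p ON p.id = c.parent_id
  WHERE c.deleted = false
    AND p.deleted = false
    AND tasks.status IN (0,1,2,3,4,5,6,9)
    AND tasks.priority IN (0,1,2)
    AND c.id IN (40400)
)
SELECT tasks.id, ST_AsGeoJSON(tasks.location) AS location
FROM filtered_tasks
INNER JOIN tasks ON tasks.id = filtered_tasks.id
WHERE tasks.location && ST_MakeEnvelope(-115.57617187500001, 32.54681317351517,
                                        -87.5390625, 52.908902047770255, 4326)
LIMIT 1001;
```

If the plan already shows a `CTE Scan` node, the CTE is being materialized and the keyword changes nothing; it mainly makes the intent explicit and guards against a different plan after an upgrade.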

@CollinBeczak CollinBeczak marked this pull request as ready for review February 27, 2025 19:57
@CollinBeczak

Here is a more extreme example, one that actually occurs in a specific workflow at the moment. When the entire map is selected (bounds -180, -85, 180, 85), the query gets much slower.

SELECT tasks.id, tasks.name, tasks.parent_id, c.name, tasks.instruction, tasks.status, tasks.mapped_on,
          tasks.completed_time_spent, tasks.completed_by,
          tasks.bundle_id, tasks.is_bundle_primary, tasks.cooperative_work_json::TEXT as cooperative_work,
          task_review.review_status, task_review.review_requested_by, task_review.reviewed_by, task_review.reviewed_at,
          task_review.review_started_at, task_review.meta_review_status, task_review.meta_reviewed_by,
          task_review.meta_reviewed_at, task_review.additional_reviewers,
          ST_AsGeoJSON(tasks.location) AS location, priority,
          CASE WHEN task_review.review_started_at IS NULL
                THEN 0
                ELSE EXTRACT(epoch FROM (task_review.reviewed_at - task_review.review_started_at)) END
          AS reviewDuration
    FROM tasks
    
        INNER JOIN challenges c ON c.id = tasks.parent_id
        INNER JOIN projects p ON p.id = c.parent_id
        LEFT OUTER JOIN task_review ON task_review.task_id = tasks.id
      
   WHERE (tasks.id NOT IN (SELECT item_id FROM locked
                           WHERE item_id = tasks.id AND item_type = 2 AND user_id != 1))
     AND (c.deleted = false AND p.deleted = false)
     AND (tasks.location && ST_MakeEnvelope(-180, -85, 180, 85, 4326))
     AND (tasks.status IN (0,1,2,3,4,5,6,9))
     AND ((tasks.id IN (SELECT task_id FROM task_review
                        WHERE task_review.task_id = tasks.id
                          AND task_review.review_status IN (0,1,2,3,4,5,6,7,-1))
           OR NOT tasks.id IN (SELECT task_id FROM task_review task_review
                               WHERE task_review.task_id = tasks.id)))
     AND ((tasks.id IN (SELECT task_id FROM task_review
                        WHERE (task_review.task_id = tasks.id)
                          AND ((task_review.meta_review_status IN (0,1,2,3,5,6,7,-2,-1)
                                OR task_review.meta_review_status IS NULL)))
           OR NOT tasks.id IN (SELECT task_id FROM task_review task_review
                               WHERE task_review.task_id = tasks.id)))
     AND (tasks.priority IN (0,1,2))
     AND (c.id IN (40400))
   LIMIT 1001;

This takes 7 seconds to complete on the smaller staging environment. Here are the rows read:
[Screenshot 2025-02-27 at 2 22 28 PM]

As you can see, more than 40 million rows are read because of how the filters are combined in the WHERE clause, which greatly slows the query down.
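
Part of that cost may come from the correlated `IN` / `NOT IN` subqueries, since the outer query already LEFT JOINs `task_review`. This is an untested guess, not something measured in this PR. The sketch below expresses the same conditions against the joined row, assuming `task_review` holds at most one row per `task_id` (if a task can have several review rows, the two forms are not equivalent). The select list is trimmed for brevity.

```sql
-- Sketch only. Assumes at most one task_review row per task and that
-- item_id, item_type and user_id are columns of locked.
SELECT tasks.id, ST_AsGeoJSON(tasks.location) AS location
FROM tasks
INNER JOIN challenges c ON c.id = tasks.parent_id
INNER JOIN projects p ON p.id = c.parent_id
LEFT OUTER JOIN task_review ON task_review.task_id = tasks.id
WHERE NOT EXISTS (SELECT 1 FROM locked
                  WHERE locked.item_id = tasks.id
                    AND locked.item_type = 2
                    AND locked.user_id != 1)
  AND c.deleted = false
  AND p.deleted = false
  AND tasks.location && ST_MakeEnvelope(-180, -85, 180, 85, 4326)
  AND tasks.status IN (0,1,2,3,4,5,6,9)
  -- either no review row exists, or its status is in the allowed list
  AND (task_review.task_id IS NULL
       OR task_review.review_status IN (0,1,2,3,4,5,6,7,-1))
  -- either no review row exists, or its meta status is allowed or unset
  AND (task_review.task_id IS NULL
       OR task_review.meta_review_status IN (0,1,2,3,5,6,7,-2,-1)
       OR task_review.meta_review_status IS NULL)
  AND tasks.priority IN (0,1,2)
  AND c.id IN (40400)
LIMIT 1001;
```

Whether this helps would need to be confirmed with `EXPLAIN (ANALYZE, BUFFERS)` on staging before relying on it.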

The new query using the CTE scan is much faster. Its performance is virtually unchanged between the narrowed-down map bounds and the global map bounds, taking only 200 ms for the same results instead of 7 seconds:

WITH filtered_tasks AS (
          SELECT tasks.id
          FROM tasks
          INNER JOIN challenges c ON c.id = tasks.parent_id
          INNER JOIN projects p ON p.id = c.parent_id
          LEFT OUTER JOIN task_review ON task_review.task_id = tasks.id
          WHERE (tasks.id NOT IN (SELECT item_id FROM locked
                                  WHERE item_id = tasks.id AND item_type = 2 AND user_id != 1))
            AND (c.deleted = false AND p.deleted = false)
            AND (tasks.status IN (0,1,2,3,4,5,6,9))
            AND ((tasks.id IN (SELECT task_id FROM task_review
                               WHERE task_review.task_id = tasks.id
                                 AND task_review.review_status IN (0,1,2,3,4,5,6,7,-1))
                  OR NOT tasks.id IN (SELECT task_id FROM task_review task_review
                                      WHERE task_review.task_id = tasks.id)))
            AND ((tasks.id IN (SELECT task_id FROM task_review
                               WHERE (task_review.task_id = tasks.id)
                                 AND ((task_review.meta_review_status IN (0,1,2,3,5,6,7,-2,-1)
                                       OR task_review.meta_review_status IS NULL)))
                  OR NOT tasks.id IN (SELECT task_id FROM task_review task_review
                                      WHERE task_review.task_id = tasks.id)))
            AND (tasks.priority IN (0,1,2))
            AND (c.id IN (40400))
        )
        SELECT  tasks.id,
    tasks.name,
    tasks.parent_id,
    c.name,
    tasks.instruction,
    tasks.status,
    tasks.mapped_on,
    tasks.completed_time_spent,
    tasks.completed_by,
    tasks.bundle_id,
    tasks.is_bundle_primary,
    tasks.cooperative_work_json::TEXT as cooperative_work,
    task_review.review_status,
    task_review.review_requested_by,
    task_review.reviewed_by,
    task_review.reviewed_at,
    task_review.review_started_at,
    task_review.meta_review_status,
    task_review.meta_reviewed_by,
    task_review.meta_reviewed_at,
    task_review.additional_reviewers,
    ST_AsGeoJSON(tasks.location) AS location,
    priority,
    CASE 
        WHEN task_review.review_started_at IS NULL THEN 0
        ELSE EXTRACT(epoch FROM (task_review.reviewed_at - task_review.review_started_at))
    END AS reviewDuration
        FROM filtered_tasks
        INNER JOIN tasks ON tasks.id = filtered_tasks.id
        INNER JOIN challenges c ON c.id = tasks.parent_id
        INNER JOIN projects p ON p.id = c.parent_id
        LEFT OUTER JOIN task_review ON task_review.task_id = tasks.id
        WHERE tasks.location && ST_MakeEnvelope(-115.57617187500001, 32.54681317351517, -87.5390625, 52.908902047770255, 4326)
        LIMIT 1001
      ;
[Screenshot 2025-02-27 at 2 25 14 PM]


@jake-low jake-low left a comment

Great writeup. Had a look at the code, LGTM.

@CollinBeczak CollinBeczak marked this pull request as draft February 28, 2025 23:06
@CollinBeczak

Converting this to a draft because I uncovered an underlying issue with this approach. The earlier queries were only fast because the challenge ID filter was applied before the location filter. For queries that lack the challenge ID filter, this approach is significantly slower. To address this, I need to find a way to prioritize only the challenge ID filter over the location filter.
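
To see the effect described above, one option is to check which indexes the planner has available on `tasks` and then run the CTE form without the challenge ID filter. This is a sketch, assuming PostgreSQL. `pg_indexes` is a standard system view, but which indexes actually exist on `tasks` is not stated in this PR, and the lock and review filters are omitted for brevity.

```sql
-- List the indexes on tasks (for example, whether location and
-- parent_id are indexed).
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'tasks';

-- Same CTE shape as the new query, but without the c.id filter.
-- If the CTE is materialized, it may have to collect every task that
-- matches status and priority before the bounding box is applied.
EXPLAIN (ANALYZE, BUFFERS)
WITH filtered_tasks AS (
  SELECT tasks.id
  FROM tasks
  INNER JOIN challenges c ON c.id = tasks.parent_id
  INNER JOIN projects p ON p.id = c.parent_id
  WHERE c.deleted = false
    AND p.deleted = false
    AND tasks.status IN (0,1,2,3,4,5,6,9)
    AND tasks.priority IN (0,1,2)
)
SELECT tasks.id, ST_AsGeoJSON(tasks.location) AS location
FROM filtered_tasks
INNER JOIN tasks ON tasks.id = filtered_tasks.id
WHERE tasks.location && ST_MakeEnvelope(-115.57617187500001, 32.54681317351517,
                                        -87.5390625, 52.908902047770255, 4326)
LIMIT 1001;
```

Comparing this plan with the one that includes `c.id IN (40400)` should show how much of the earlier speedup came from the challenge ID filter rather than from the CTE itself.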
