perf(weave): push heavy conditions into WHERE for calls stream query #3501

Open · wants to merge 7 commits into master

Conversation

@gtarpenning (Member) commented Jan 28, 2025

Description

Push heavy conditions into the WHERE clause before aggregating in the calls query. In testing, this did not reduce query duration, but it decreased max memory usage by 10x.

When tested in the ClickHouse console on one of the historically impossibly-bad queries, this change allows it to actually complete, although it still takes 20+ seconds...

Technically, the two queries are not equivalent. In prod, grouping before filtering lets us include additional rows that share a call_id but whose dynamic fields don't match the filters. I think the aggregation functions built into the table (going from call_parts to calls_merged) mitigate most of the common cases of duplicate rows (like deleted_at or display_name).
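
For context, a rough sketch of the table shape being described; the names and column types below are hypothetical and simplified, not the actual weave DDL. Start and end events land as separate rows, and the AggregatingMergeTree engine collapses them into one row per call during background merges:

CREATE TABLE calls_merged_demo
(
    project_id  String,
    id          String,
    -- start and end events populate different columns; `any` keeps
    -- the first value seen when parts are merged
    started_at  SimpleAggregateFunction(any, Nullable(DateTime64(3))),
    output_dump SimpleAggregateFunction(any, Nullable(String))
)
ENGINE = AggregatingMergeTree
ORDER BY (project_id, id);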

Example difference between master and branch query structure:
Master

WITH filtered_calls AS (...)
SELECT ...
FROM calls_merged
WHERE calls_merged.project_id = '<>'
  AND (calls_merged.id IN filtered_calls)
GROUP BY (calls_merged.project_id, calls_merged.id)
HAVING position(JSON_VALUE(any(calls_merged.output_dump), '$."prompt"'), 'ripples') > 0
ORDER BY any(calls_merged.started_at) DESC

Branch

WITH filtered_calls AS (...)
SELECT ...
FROM calls_merged
WHERE calls_merged.project_id = '<>'
  AND (calls_merged.id IN filtered_calls)
  AND position(JSON_VALUE(calls_merged.output_dump, '$."prompt"'), 'ripples') > 0
GROUP BY (calls_merged.project_id, calls_merged.id)
ORDER BY any(calls_merged.started_at) DESC

Testing

Back-to-back testing in a local environment with 20,000 calls with very large payloads:
[Screenshot: query timing and memory stats, 2025-01-27]

Query used to generate the stats above:

SELECT 
    event_time, 
    query_duration_ms / 1000 AS query_time_sec,
    read_rows,
    read_bytes / 1024 / 1024 AS read_mb,
    memory_usage / 1024 / 1024 AS max_memory_mb
FROM system.query_log
WHERE query LIKE '--%ripples%'
  AND type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;
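
(The LIKE pattern works because each test query was prefixed with a SQL comment containing a marker string; the exact tag below is illustrative, not from the PR:)

-- calls stream test: ripples
SELECT ... FROM calls_merged ...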

Master:
[Screenshot: filter-timing-master-2]

Branch:
[Screenshot: filter-timing-branch-3]

@gtarpenning gtarpenning changed the title mvp, bad techers but technically working mvp, bad tekkers but technically working Jan 28, 2025
@gtarpenning gtarpenning changed the title mvp, bad tekkers but technically working mvp, bad tekkers but technically working (?) Jan 28, 2025
# Compare the formatted SQL first so a failure shows a readable diff
exp_formatted = sqlparse.format(exp_query, reindent=True)
found_formatted = sqlparse.format(query, reindent=True)

assert exp_formatted == found_formatted
assert exp_params == params
@gtarpenning (Member, Author) commented on the snippet above:

easier to debug in this order

@gtarpenning gtarpenning changed the title mvp, bad tekkers but technically working (?) perf(weave): push heavy conditions into WHERE for calls stream query Jan 28, 2025
@gtarpenning gtarpenning marked this pull request as ready for review January 28, 2025 20:50
@gtarpenning gtarpenning requested a review from a team as a code owner January 28, 2025 20:50
@gtarpenning gtarpenning requested a review from tssweeney January 28, 2025 22:21
@tssweeney (Collaborator) left a comment:

Before diving into the code, I have a concern:

I believe this will fail to return rows that have not yet been merged by the AMT (AggregatingMergeTree). Referencing the example in the description: what happens if the start and end events are not merged? The result will contain the unmerged rows without their start events!

@gtarpenning (Member, Author) replied:

> Before diving into the code, I have a concern:
>
> I believe this will fail to return rows that have not yet been merged by the AMT (AggregatingMergeTree). Referencing the example in the description: what happens if the start and end events are not merged? The result will contain the unmerged rows without their start events!

Hmm, looking at the query plan I do think this is correct, although I'm not sure how often this will happen in practice. Some dumb workarounds immediately come to mind, like always including all the start events when conditioning on an end event, and vice versa @tssweeney
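
A minimal sketch of the failure mode under discussion, reusing the hypothetical calls_merged_demo table from the description (illustrative only; SYSTEM STOP MERGES just keeps the two parts from merging during the experiment):

SYSTEM STOP MERGES calls_merged_demo;

-- Start and end events arrive as separate inserts, i.e. separate parts:
INSERT INTO calls_merged_demo VALUES ('p1', 'c1', now64(3), NULL);
INSERT INTO calls_merged_demo VALUES ('p1', 'c1', NULL, '{"prompt": "ripples"}');

-- The branch query's row-level WHERE drops the start row before
-- aggregation, so any(started_at) never sees a non-NULL value:
SELECT id, any(started_at) AS started_at
FROM calls_merged_demo
WHERE position(JSON_VALUE(output_dump, '$."prompt"'), 'ripples') > 0
GROUP BY id;
-- -> started_at comes back NULL for the unmerged call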

@gtarpenning (Member, Author) added:

@tssweeney We could also force merges by using FINAL... In large projects this will very likely be more performant than the GROUP BY... In local testing, using a query that filters down to 200 rows, FINAL used 6x less memory than GROUP BY.

I'm still not exactly sure of the best way to repro the conditions that would lead to the issue; merges are hard to predict... And the aggregation functions appear in my testing to actually be working as expected (i.e., the query planner reports unmerged parts of the table, but filtering on the inputs still always returns the outputs as well). I'll use QA tomorrow.
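
For reference, a sketch of the FINAL variant mentioned above (assumed shape against the demo table, not the actual generated query). FINAL merges row versions at read time, so the row-level filter sees fully merged calls and the unmerged-parts problem goes away:

SELECT id, started_at, output_dump
FROM calls_merged_demo FINAL
WHERE project_id = 'p1'
  AND position(JSON_VALUE(output_dump, '$."prompt"'), 'ripples') > 0
ORDER BY started_at DESC;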

assert res[0].inputs["param"]["value1"] == "hello"

# Does the query return the output?
assert res[0].output["d"] == 5
@gtarpenning (Member, Author) commented on the test above:

This test highlights the error case that the query creates.
