Bug 1990742 - Ingest perfherder_data from JSON artifacts instead of parsing logs #8997
base: master
Conversation
treeherder/log_parser/tasks.py (Outdated)

    return artifact_list

    def post_perfherder_artifacts(job_log):
@junngo I think it would be better for us to put this into a separate area. This folder seems to be specifically for parsing logs, but we're parsing JSONs instead. What do you think about having this task defined here in the perf directory? https://github.com/mozilla/treeherder/blob/505ad6b4047f77fc3ecdea63e57881116340d0fb/treeherder/perf/tasks.py
@gmierz Splitting the code is a great idea. Creating a separate file under the perf directory [0] looks good to me. It feels more cohesive to put it there, since the log parsing [1] also lives in its own folder.
Let me know if you would prefer a different directory location.
    with make_request(self.url, stream=True) as response:
I added the new file based on your feedback. It seems more suitable since the JSON artifact isn’t part of the log parsing process :)
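The fetch-and-parse step that such a file could perform might be sketched as follows. Only the perfherder-data.json schema (top-level `framework` and `suites` keys) follows the real PERFHERDER_DATA payload; the function name and return shape are assumptions, not code from this PR:

```python
import json

def extract_perfherder_data(artifact_text):
    """Parse a perfherder-data.json body into (framework_name, suites).

    Hypothetical helper: the "framework" and "suites" keys mirror the
    PERFHERDER_DATA: payload schema; everything else is an assumption.
    """
    data = json.loads(artifact_text)
    framework = data.get("framework", {}).get("name")
    suites = data.get("suites", [])
    return framework, suites
```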
treeherder/etl/perf.py
Outdated
    existing_replicates = set(
        PerformanceDatumReplicate.objects.filter(
            performance_datum=subtest_datum
        ).values_list("value", flat=True)
    )
I'm guessing this is happening because of duplicate ingestion tasks (log and JSON). I think we should find a way to default to using the JSON if it exists, and ignore the data we find in the logs. Maybe we could start with a list of tests for trying this out? I'm thinking these tasks, since the data they produce is not useful, so any failures won't be problematic: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=regress&revision=6bd2ea6b9711dc7739d8ee7754b9330b11d0719d&selectedTaskRun=K87CGE6IT1GHl6wD4Skbyw.0
Exactly, log parsing and the JSON artifact ingestion are both active right now, so I handled the duplication.
I'll revert that, add an allowlist, and only call _load_perf_datum for allowlisted tests when needed.
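The duplicate handling being reverted here (the set of existing replicates in the diff above) amounts to a set-membership filter. A minimal sketch, modeling the ORM query as a plain iterable of existing values (the function name is hypothetical):

```python
def filter_new_replicates(existing_values, incoming_values):
    """Keep only replicate values not already stored for the datum.

    Hypothetical sketch of the dedup shown in the diff above; the real
    code queried PerformanceDatumReplicate through the Django ORM.
    """
    existing = set(existing_values)
    return [value for value in incoming_values if value not in existing]
```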
Force-pushed from 34855c7 to 26bc32d (Compare)
        "awsy": ["ALL"],
        "build_metrics": ["decision", "compiler warnings"],
        "browsertime": ["constant-regression"],
    }
The job is processed if at least one suite name matches the allowlist (e.g. compiler warnings).
This list is just a sample; we'll gradually expand it to broaden JSON artifact usage.
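A minimal sketch of that matching rule, using the sample dict from the diff above and assuming "ALL" acts as a wildcard for every suite in the framework (the function name is hypothetical):

```python
# Sample allowlist from the diff in this thread; "ALL" is assumed to
# match every suite in the framework.
PERFHERDER_JSON_ALLOWLIST = {
    "awsy": ["ALL"],
    "build_metrics": ["decision", "compiler warnings"],
    "browsertime": ["constant-regression"],
}

def should_ingest_from_json(framework, suite_names):
    """Return True if at least one suite name matches the allowlist."""
    allowed = PERFHERDER_JSON_ALLOWLIST.get(framework)
    if not allowed:
        return False
    if "ALL" in allowed:
        return True
    return any(name in allowed for name in suite_names)
```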
[0] https://firefoxci.taskcluster-artifacts.net/KZ6krBACTcyC1_q_tUejTA/0/public/build/perfherder-data-building.json
I have a list of frameworks generated locally by Django code. [0]
note:

    # treeherder/etl/jobs.py
    parse_logs.apply_async(queue=queue, args=[job.id, [job_log.id], priority])
I considered splitting the queues, but decided to keep using the existing ones to avoid code duplication and increased complexity.
Currently, Treeherder ingests performance data (PERFHERDER_DATA:) by parsing raw logs. This patch supports reading the data from the perfherder-data.json artifact instead. For now, both the existing log parsing and the new JSON ingestion run in parallel to maintain compatibility.

Bugzilla: https://bugzilla.mozilla.org/show_bug.cgi?id=1990742
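The two parallel ingestion paths can be contrasted in a small sketch. PERFHERDER_DATA: is the real log-line marker; the function names here are hypothetical, not code from this PR:

```python
import json

LOG_PREFIX = "PERFHERDER_DATA: "

def perf_data_from_log_line(line):
    """Old path: strip the marker from a raw log line, then parse JSON."""
    if not line.startswith(LOG_PREFIX):
        return None
    return json.loads(line[len(LOG_PREFIX):])

def perf_data_from_artifact(artifact_body):
    """New path: the perfherder-data.json artifact is already pure JSON."""
    return json.loads(artifact_body)
```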