Got a couple of errors when attempting a full-scale test on all May 12 data:
Just pushed a commit that should fix the second one. Still investigating the first.
Moving back into draft mode for now. Still working out some bugs and don't want to distract from the urgent summary pipeline issues.
The workers don't seem to like having to handle 1B+ rows at a time for the requests table. We could try a partitioning approach similar to the one used for response_bodies in the non-summary pipeline.
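A minimal sketch of what that kind of partitioning could look like, assuming rows are bucketed by a stable hash of the page URL. The names `partition_index` and `split_rows` and the `page` field are hypothetical and not taken from this PR. The partition count of 4 mirrors the `NUM_PARTITIONS` constant in the diff.

```python
import hashlib

NUM_PARTITIONS = 4  # mirrors the constant in the diff; tune as needed


def partition_index(key, num_partitions=NUM_PARTITIONS):
    """Map a string key to a stable partition number in [0, num_partitions)."""
    # hashlib gives the same result on every worker, unlike the built-in
    # hash(), which is randomized per process for strings.
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_partitions


def split_rows(rows, key_field='page', num_partitions=NUM_PARTITIONS):
    """Bucket row dicts into num_partitions lists (hypothetical helper)."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[partition_index(row[key_field], num_partitions)].append(row)
    return buckets
```

Each bucket could then be processed and written separately, so no single step has to hold the whole requests table at once.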
Tweaked the encode/decode settings for custom metrics JSON stringification. Hopefully that fixes the exceptions, but I added error handling if not. So AFAIK the pipeline should work and I'm running a mobile job now. I'll hold off on running the desktop job to give @giancarloaf time on the Dataflow workers later tonight. ran with
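For context, a rough sketch of the kind of encode/decode round trip visible in the diff (`'surrogatepass'` on encode, `'replace'` on decode), wrapped in error handling. The helper name `to_json` and the `ensure_ascii=False` and `separators` arguments are assumptions for illustration, not necessarily what the PR uses.

```python
import json


def to_json(obj):
    """Serialize obj to a JSON string that can always be encoded as UTF-8.

    Hypothetical helper: returns None when obj cannot be serialized.
    """
    try:
        text = json.dumps(obj, separators=(',', ':'), ensure_ascii=False)
    except (TypeError, ValueError):
        return None
    # Lone surrogates cannot be encoded as strict UTF-8. 'surrogatepass'
    # lets them through on encode, and 'replace' swaps the resulting
    # invalid bytes for U+FFFD on decode.
    return text.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
```

With this shape, a custom metric containing a stray surrogate yields a string with replacement characters instead of raising when the row is written.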
* adds the `--input_file` parameter
    return get_requests(har, client, crawl_date)


def get_requests(har, client, crawl_date):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns
        'utf-8', 'surrogatepass').decode('utf-8', 'replace')


def from_json(string):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns
    return report_json


def partition_requests(har, client, crawl_date, index):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns
if metadata:
    # The page URL from metadata is more accurate.
    # See https://github.com/HTTPArchive/data-pipeline/issues/48
    page_url = metadata.get('tested_url', page_url)
Check notice
Code scanning / CodeQL
Unused local variable
NUM_PARTITIONS = 4


def partition_step(function, har, client, crawl_date, index):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns
    }]


def get_custom_metrics(page, wptid):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns
    return custom_metrics_json


def get_features(page, wptid):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns
    return list(technologies.values())


def get_lighthouse_reports(har, wptid):
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns
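The repeated CodeQL notice flags functions where some paths return a value and others fall off the end, returning `None` implicitly. A small sketch of the pattern and one way to address it, using a hypothetical `parse_report` function rather than code from this PR:

```python
import json


def parse_report_implicit(string):
    """Flagged style: the except branch falls through and returns None."""
    try:
        return json.loads(string)
    except ValueError:
        pass  # implicit `return None` here triggers the notice


def parse_report_explicit(string):
    """Same behavior, but every path states its return value."""
    try:
        return json.loads(string)
    except ValueError:
        return None
```

Both versions behave identically. The explicit form only makes the `None` result visible to readers and to the scanner.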
I had to fill nulls for the request table's [...]. Added a quality of life improvement as well. You can use the [...]. We will have to prepopulate these files manually for now (i.e. [...]
LGTM, let's merge after the checks are done
New Dataflow pipeline to take HAR files and write them into the `all.pages` and `all.requests` tables per #15