This repository has been archived by the owner on Dec 18, 2024. It is now read-only.

Add import_all pipeline #75

Merged
rviscomi merged 21 commits into main from all-dataset on Jun 30, 2022

Conversation

@rviscomi (Member) commented May 26, 2022

New Dataflow pipeline to take HAR files and write them into the all.pages and all.requests tables per #15

@rviscomi marked this pull request as ready for review May 26, 2022 17:43
@rviscomi added the enhancement (New feature or request) label May 26, 2022
@rviscomi requested a review from @giancarloaf May 27, 2022 14:11
@rviscomi (Member, Author)

Got a couple of errors when attempting a full-scale test on all May 12 data:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 135046: surrogates not allowed [while running 'MapPages-ptransform-153'] (Logs)

AttributeError: 'NoneType' object has no attribute 'get' [while running 'MapRequests-ptransform-133'] (Logs)

Just pushed a commit that should fix the second one. Still investigating the first.

@rviscomi marked this pull request as draft May 28, 2022 05:39
@rviscomi (Member, Author)

Moving back into draft mode for now. Still working out some bugs and don't want to distract from the urgent summary pipeline issues.

@rviscomi (Member, Author)

Job failed

The workers don't seem to like handling 1B+ rows at a time for the requests table. We could try a partitioning approach similar to the one used for response_bodies in the non-summary pipeline.
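
A partitioning approach like the one suggested could look roughly like this. This is a hypothetical sketch, not the pipeline's actual code: keying on the page URL is an assumption, though the diff does contain `NUM_PARTITIONS = 4`. In a Beam pipeline, such a function would be plugged into something like `beam.Partition`.

```python
# Hypothetical sketch: deterministic hash partitioning for request rows,
# so no single worker has to materialize all 1B+ requests at once.
import hashlib

NUM_PARTITIONS = 4

def partition_index(page_url, num_partitions=NUM_PARTITIONS):
    """Map a page URL to a stable partition index in [0, num_partitions)."""
    # md5 gives a stable hash across workers (unlike Python's hash()).
    digest = hashlib.md5(page_url.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_partitions
```

Using a content hash rather than Python's built-in `hash()` matters here: `hash()` is salted per process, so it would assign different partitions on different workers.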

@rviscomi (Member, Author) commented May 30, 2022

Tweaked the encode/decode settings for custom metrics JSON stringification. Hopefully that fixes the exceptions, but I added error handling if not. So AFAIK the pipeline should work and I'm running a mobile job now. I'll hold off on running the desktop job to give @giancarloaf time on the Dataflow workers later tonight.

ran with crawls/android-May_12_2022
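
The encode/decode tweak described above can be sketched roughly as follows. The function name and call site are assumptions; the diff shows an `.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')` round trip, which lets lone surrogates (like the `'\ud83d'` from the earlier UnicodeEncodeError) through on encode and then replaces them on decode, while leaving valid text untouched.

```python
import json

def to_json_safe(obj):
    """Sketch of the surrogate-safe JSON stringification (name illustrative).

    Lone surrogates (e.g. half of a truncated emoji in a HAR payload)
    cannot be encoded as strict UTF-8; 'surrogatepass' lets them through
    on encode, and 'replace' turns them into U+FFFD on decode.
    """
    s = json.dumps(obj, ensure_ascii=False)
    return s.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')
```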

Code scanning / CodeQL notices on the diff (modules/import_all.py):

- Near `def get_requests(har, client, crawl_date):` — Explicit returns mixed with implicit (fall-through) returns. Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
- Near `def from_json(string):` (the `.encode('utf-8', 'surrogatepass').decode('utf-8', 'replace')` change) — same mixed-returns notice.
- Near `def partition_requests(har, client, crawl_date, index):` — same mixed-returns notice.
- At `page_url = metadata.get('tested_url', page_url)` (the metadata page-URL fix; see https://github.com/HTTPArchive/data-pipeline/issues/48) — Unused local variable: the value assigned to 'page_url' is never used.
- Near `def partition_step(function, har, client, crawl_date, index):` (after `NUM_PARTITIONS = 4`) — same mixed-returns notice.
- Near `def get_custom_metrics(page, wptid):` — same mixed-returns notice.
- Near `def get_features(page, wptid):` — same mixed-returns notice.
- Near `def get_lighthouse_reports(har, wptid):` — same mixed-returns notice.
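
The repeated CodeQL notice can be illustrated with a small sketch (the function name is borrowed from the diff, but both bodies here are hypothetical):

```python
import json

def from_json_flagged(string):
    """Shape CodeQL flags: one path returns a value, the other falls through."""
    try:
        return json.loads(string)
    except ValueError:
        pass  # implicit `return None` here triggers the notice

def from_json_explicit(string):
    """Same behavior, but every path returns explicitly."""
    try:
        return json.loads(string)
    except ValueError:
        return None
```

Both functions behave identically; the notice is about readability and intent, since an implicit `return None` may be an accident rather than a choice.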
@giancarloaf (Collaborator)

I had to fill nulls for the request table's url field with empty strings (e.g. "") for now, but the job is working and tables are populated in the experimental_gc_all dataset.

https://console.cloud.google.com/dataflow/jobs/us-west1/2022-06-28_19_37_27-17401719508463575964?project=httparchive
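
The null-fill workaround might look like this minimal sketch (the row shape and helper name are assumptions, not the pipeline's actual code):

```python
def fill_url(row):
    """Coerce a missing or None `url` field to "" before the BigQuery write."""
    row['url'] = row.get('url') or ''
    return row
```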

Added a quality of life improvement as well. You can use the --input-file parameter to supply a file containing a list of HAR files for processing. This speeds up execution by skipping the 2-3 hours spent listing files on GCS with the --input parameter.

We will have to prepopulate these files manually for now (i.e. gsutil ls gs://httparchive/crawls/...), but in the future, we could generate them from the first run of every pipeline so they will be available for subsequent re-runs.
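
Reading such a prepopulated manifest could be as simple as this (hypothetical helper; the real pipeline's parsing may differ):

```python
def read_manifest(text):
    """Parse a manifest (one gs:// HAR path per line, e.g. from `gsutil ls`),
    skipping blank lines and surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```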

@rviscomi marked this pull request as ready for review June 30, 2022 00:58
@rviscomi (Member, Author)

LGTM, let's merge after the checks are done

@rviscomi merged commit 28aa69c into main Jun 30, 2022
@rviscomi deleted the all-dataset branch June 30, 2022 01:37
@rviscomi mentioned this pull request Jun 30, 2022