This repository was archived by the owner on Dec 18, 2024. It is now read-only.

Rerun 2022_05_01 #44

Closed
@rviscomi

Description

  • Make a copy of the first crawl's data on GCS so it is not overwritten @giancarloaf
  • Make a copy of the first crawl's data on BigQuery @giancarloaf
  • Enable crawling 1 level of secondary pages @pmeenan
  • Add the ability to distinguish between primary/secondary pages in the Dataflow pipeline @giancarloaf (waiting on #12, "Add the ability to test secondary pages")
  • Add metadata to identify the original test URL to the HAR @pmeenan
  • Update the Dataflow pipeline to parse the page URL, pageid, and requestid fields from the metadata above @giancarloaf
  • Flush Pub/Sub queue before starting the crawl @giancarloaf
  • Restart the summary pipeline @giancarloaf
  • Start the second crawl using the same URLs as before @pmeenan
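
As a rough illustration of the metadata-parsing and primary/secondary steps above, here is a minimal Python sketch. It is not the pipeline's actual code: the `_metadata` key, its `tested_url`, `page_id`, and `crawl_depth` fields, and the `<page_id>-<index>` request ID scheme are all assumptions made for the example.

```python
from typing import Any, Dict, List

# Hypothetical key under which the crawler stores test metadata in the HAR.
METADATA_KEY = "_metadata"


def parse_test_metadata(har: Dict[str, Any]) -> Dict[str, Any]:
    """Return the original test URL, page ID, and page type for one HAR."""
    meta = har.get(METADATA_KEY) or {}
    # Assumption: depth 0 is a primary page, depth 1 is a secondary page.
    depth = int(meta.get("crawl_depth", 0))
    return {
        "page_url": meta.get("tested_url"),
        "page_id": meta.get("page_id"),
        "is_secondary_page": depth > 0,
    }


def request_ids(har: Dict[str, Any], page_id: Any) -> List[str]:
    """Build one hypothetical request ID per HAR entry: '<page_id>-<index>'."""
    entries = har.get("log", {}).get("entries", [])
    return [f"{page_id}-{index}" for index in range(len(entries))]
```

A HAR without the metadata object is treated as a primary page with no URL or ID, so older files from the first crawl would still parse.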
