This repository was archived by the owner on Dec 18, 2024. It is now read-only.
Rerun 2022_05_01 #44
Closed
Description
- Make a copy of the first crawl's data on GCS so that it is not overwritten @giancarloaf
- Make a copy of the first crawl's data on BigQuery @giancarloaf
- Enable crawling 1 level of secondary pages @pmeenan
- Add the ability to distinguish between primary/secondary pages in the Dataflow pipeline @giancarloaf (waiting on "Add the ability to test secondary pages" #12)
- Add metadata to identify the original test URL to the HAR @pmeenan
- Update the Dataflow pipeline to parse the page `url`, `pageid`, and `requestid` fields from the metadata above @giancarloaf
- Flush Pub/Sub queue before starting the crawl @giancarloaf
- Restart the summary pipeline @giancarloaf
- Start the second crawl using the same URLs as before @pmeenan
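
For the metadata-parsing and primary/secondary tasks above, a minimal sketch of what the pipeline step might look like. This is an illustration only: the `_metadata` key and the `tested_url`, `page_id`, and `crawl_depth` field names are hypothetical placeholders, not the schema actually written to the HAR, and `parse_page_metadata` is not a function from the real pipeline.

```python
from typing import Any, Dict, Optional


def parse_page_metadata(har: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Extract identifying fields from a HAR's page-level metadata.

    ASSUMPTION: the "_metadata" key and its field names ("tested_url",
    "page_id", "crawl_depth") are hypothetical placeholders.
    """
    pages = har.get("log", {}).get("pages", [])
    if not pages:
        return None

    metadata = pages[0].get("_metadata")
    if not metadata:
        return None

    # ASSUMPTION: depth 0 marks a primary page, anything deeper is secondary.
    crawl_depth = metadata.get("crawl_depth", 0)
    return {
        "url": metadata.get("tested_url"),
        "pageid": metadata.get("page_id"),
        "is_primary": crawl_depth == 0,
    }


# Usage with a hand-written HAR fragment for a secondary page.
example_har = {
    "log": {
        "pages": [
            {
                "_metadata": {
                    "tested_url": "https://example.com/about",
                    "page_id": 7,
                    "crawl_depth": 1,
                }
            }
        ]
    }
}
result = parse_page_metadata(example_har)
```

Returning `None` when the metadata is missing would let the pipeline skip HARs from the first crawl, which predate the added metadata, instead of failing on them.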