This repository has been archived by the owner on Dec 18, 2024. It is now read-only.

Rerun 2022_05_01 #44

Closed · 9 tasks done

rviscomi opened this issue May 10, 2022 · 7 comments
Comments

rviscomi (Member) commented May 10, 2022

  • Make a copy of the first crawl's data on GCS so it isn't overwritten @giancarloaf (see the sketch at the end of this comment)
  • Make a copy of the first crawl's data on BigQuery @giancarloaf (covered in the same sketch below)
  • Enable crawling 1 level of secondary pages @pmeenan
  • Add the ability to distinguish between primary/secondary pages in the Dataflow pipeline @giancarloaf (waiting on "Add the ability to test secondary pages" #12)
  • Add metadata to identify the original test URL to the HAR @pmeenan
  • Update the Dataflow pipeline to parse the page url, pageid, and requestid fields from the metadata above @giancarloaf
  • Flush the Pub/Sub queue before starting the crawl @giancarloaf
  • Restart the summary pipeline @giancarloaf
  • Start the second crawl using the same URLs as before @pmeenan

Anything else?
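For reference, a minimal sketch of the two backup steps using the Google Cloud Python clients. The bucket, prefix, and table IDs below are assumptions for illustration, not the real HTTP Archive resource names.

```python
# Sketch only: bucket, prefix, and table IDs are assumed, not the real names.
from google.cloud import bigquery, storage

storage_client = storage.Client()
src_bucket = storage_client.bucket("httparchive")         # assumed crawl bucket
dst_bucket = storage_client.bucket("httparchive-backup")  # assumed backup bucket

# 1. Copy the first crawl's HAR files on GCS so the rerun can't overwrite them.
for blob in storage_client.list_blobs(src_bucket, prefix="crawls/chrome-May_1_2022/"):
    src_bucket.copy_blob(blob, dst_bucket, new_name=blob.name)

# 2. Copy the first crawl's summary tables on BigQuery.
bq_client = bigquery.Client()
bq_client.copy_table(
    "httparchive.summary_pages.2022_05_01_desktop",         # assumed source table
    "httparchive.backup.2022_05_01_desktop_summary_pages",  # assumed destination
).result()  # block until the copy job finishes
```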

pmeenan (Member) commented May 10, 2022

The metadata has a tested_url field that carries the page URL independent of anything the agent might do (the fixes to report the URL correctly are also in place, but the metadata is the safest source).
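A hedged sketch of how the Dataflow pipeline could prefer that field when resolving the page URL; exactly where the metadata object sits inside the HAR is an assumption here:

```python
import json


def resolve_page_url(har_bytes: bytes) -> str | None:
    """Prefer the injected metadata's tested_url over the agent-reported URL.

    The lookup path below (log.pages[0]._metadata.tested_url) is an assumption
    for illustration; adjust it to the real HAR layout.
    """
    har = json.loads(har_bytes)
    pages = har.get("log", {}).get("pages", [])
    page = pages[0] if pages else {}
    metadata = page.get("_metadata", {})
    # Fall back to the agent-reported page entry only if the metadata is missing.
    return metadata.get("tested_url") or page.get("_URL") or page.get("title")
```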

rviscomi (Member, Author) commented:

Lighthouse is updating to 9.6 today. Is it possible to update the test agents before the crawl reruns?

pmeenan (Member) commented May 11, 2022 via email

rviscomi (Member, Author) commented May 12, 2022

@pmeenan: the crawl should be ready to restart when you see this in the morning (Thursday the 12th).

@giancarloaf and I went through the remaining TODO items at the top of this issue and we should be good to go. I left the "flush Pub/Sub queue" item unchecked because we were still seeing some lingering messages coming through from the GCS backup of the first May crawl. @giancarloaf will be monitoring the Pub/Sub messages tonight to ensure that the queue is completely flushed by morning. (If not, at worst we'll have some summary data from both crawls in BQ, which we can clear out in SQL as needed.)
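One way to flush the lingering messages, assuming they don't need to be reprocessed, is to seek the subscription to the current time, which marks everything published before that instant as acknowledged. A sketch with assumed project and subscription names:

```python
# Sketch: project and subscription names are assumptions.
import datetime

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("httparchive", "har-summary-sub")

# Seeking to "now" acknowledges every message published before this instant,
# effectively draining the backlog without processing it.
subscriber.seek(
    request={
        "subscription": subscription,
        "time": datetime.datetime.now(datetime.timezone.utc),
    }
)
```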

Update: the dashboard is still showing many messages coming through:

[screenshot: Pub/Sub dashboard showing messages still streaming in]

Update: still going strong as of 7am... I don't think we're able to start the crawl until that settles down :(

Update: a rogue process kept moving HAR files between crawls/ subdirectories and triggering Pub/Sub messages. @giancarloaf killed the process and the noise has subsided. We should be good to start the crawl.
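If summary rows from both crawls do end up in BigQuery (the contingency mentioned above), the cleanup could look something like the sketch below; the table ID and the column used to tell the two crawls apart are assumptions:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumed table ID and crawl-identifying column; adjust to the real schema.
query = """
    DELETE FROM `httparchive.summary_pages.2022_05_12_desktop`
    WHERE date < '2022-05-12'
"""
client.query(query).result()  # wait for the DELETE to complete
```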

pmeenan (Member) commented May 12, 2022

@rviscomi @giancarloaf, it looks like the android and chrome May 1 crawls/ directories have tests in them (from the rogue process moving things around?). Do they need to be moved into the backup folder first?

rviscomi mentioned this issue May 12, 2022
giancarloaf (Collaborator) commented:

> @rviscomi @giancarloaf, it looks like the android and chrome May 1 crawls/ directories have tests in them (from the rogue process moving things around?). Do they need to be moved into the backup folder first?

Yep, this is currently in progress using the worker VM. Rick is seeing a very slow transfer rate (~100K files per hour) and has decided it would be best to start a new crawl under a different name, to be renamed later.
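For reference, a sketch of the kind of move involved, parallelized with a thread pool since a sequential copy was the bottleneck; bucket names and prefixes are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("httparchive")  # assumed bucket name


def move_to_backup(blob: storage.Blob) -> None:
    # Server-side copy into an assumed backup prefix, then delete the original.
    new_name = blob.name.replace("crawls/", "crawls_backup/", 1)
    bucket.copy_blob(blob, bucket, new_name=new_name)
    blob.delete()


blobs = list(client.list_blobs(bucket, prefix="crawls/chrome-May_1_2022/"))
with ThreadPoolExecutor(max_workers=32) as pool:
    # Consuming the iterator surfaces any per-object errors.
    list(pool.map(move_to_backup, blobs))
```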

I will also be restarting the streaming pipeline to incorporate changes from #49 merged earlier today.

rviscomi (Member, Author) commented:

Closing this out. We're rerunning the crawl with today's date to avoid overwriting any of the previous data.
