Some reports have failed for 2022_05_01 #601

Closed
github-actions bot opened this issue May 28, 2022 · 12 comments

@github-actions
Contributor

Incorrect Status code 404 found for https://cdn.httparchive.org/reports/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/drupal/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/drupal/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/magento/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/magento/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/wordpress/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/wordpress/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top10k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top10k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top100k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top100k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1m/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1m/2022_05_01/vulnJs.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/drupal/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/drupal/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/magento/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/magento/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/wordpress/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/wordpress/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top10k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top10k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top100k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top100k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1m/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1m/a11yButtonName.json

See latest log in GitHub Actions
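
The two message shapes in the log above suggest two kinds of check: a status check on each per-date report file, and a check that the crawl date appears in the body of each undated file. A minimal sketch of that logic, assuming the status code and response body have already been fetched; the function name `check_report` and its signature are hypothetical and not taken from the actual workflow:

```python
from typing import Optional


def check_report(url: str, status: int, body: str, crawl_date: str) -> Optional[str]:
    """Return a failure message for one report URL, or None if it looks healthy.

    Hypothetical helper: it only mirrors the two message shapes in the log.
    """
    if status != 200:
        return f"Incorrect Status code {status} found for {url}"
    # Per-date files carry the crawl date in their path; undated files are
    # assumed to mention the crawl date somewhere in the response body.
    if f"/{crawl_date}/" not in url and crawl_date not in body:
        return f"{crawl_date} not found in body for {url}"
    return None
```

Keeping the check a pure function of status and body means it can be exercised without any network access.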

@tunetheweb
Member

Note we're still missing summary_pages.2022_05_01_mobile.

@rviscomi I've lost track of where we ended up with the first May run. Is that gone now? Or is it still available?

Note, this is not a rush and can wait until your return. But when the June crawl finishes, the reports will try to run for any missing data (i.e. May, mid-May and June), and so might fill in weird values if the May runs are in such a bad state.

@rviscomi
Member

Yeah 2022_05_01 should only be the home pages from the 2022_05_12 crawl. The action item in HTTPArchive/data-pipeline#72 (comment) is to regenerate the summary tables for the 2022_05_12 mobile crawl, so once that's complete we can filter it down to the home page data and alias to 2022_05_01 for the reporting.

@tunetheweb
Member

Ah OK. So we've thrown away 2022_05_01 completely and back-populated it from 2022_05_12?

The only thing is that the 2022_05_12 dataset will still show up as another point in the graph (with the same values as 2022_05_01 if they are a complete copy).

Or is the plan to drop the 2022_05_12 tables after back-populating 2022_05_01?

@rviscomi
Member

Yeah the first run of 2022_05_01 is no longer around because it had bad url values. We're backdating home page data from the 2022_05_12 crawl instead. More info about the migration plan to secondary pages in this issue: HTTPArchive/data-pipeline#51. In short:

  • dated tables like 2022_05_01 should only contain home pages for consistency with historical data/queries
  • only the all.pages and all.requests tables should combine home and secondary pages
  • the all pipeline is still WIP so 2022_05_12 is sticking around to give everyone early access to secondary page data
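
As a rough illustration of the backdating described above (filter the 2022_05_12 crawl down to home pages, then alias it to 2022_05_01), here is a minimal sketch over in-memory rows. The field names `is_home_page` and `date`, and the helper `backdate_home_pages`, are assumptions for illustration only; the real tables and pipeline may look quite different:

```python
from typing import Dict, Iterable, List


def backdate_home_pages(rows: Iterable[Dict], alias_date: str) -> List[Dict]:
    """Keep only home-page rows and relabel them with alias_date.

    'is_home_page' and 'date' are assumed field names, not the real schema.
    """
    return [
        {**row, "date": alias_date}  # copy the row, overriding its date
        for row in rows
        if row.get("is_home_page")
    ]


# Toy stand-in for the 2022_05_12 crawl: one home page, one secondary page.
crawl_2022_05_12 = [
    {"url": "https://example.com/", "is_home_page": True, "date": "2022_05_12"},
    {"url": "https://example.com/about", "is_home_page": False, "date": "2022_05_12"},
]
dated_2022_05_01 = backdate_home_pages(crawl_2022_05_12, "2022_05_01")
```

The source rows are left untouched, which matches the idea that the 2022_05_12 data sticks around alongside the backdated copy.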

@rviscomi
Member

rviscomi commented Jun 2, 2022

@tunetheweb would you be able to regenerate the reports now that the 2022_05_01 tables are all set up?

@tunetheweb
Member

Do the 20220512 tables still exist with home + secondary pages? If so they will be included as well, and we probably don’t want that until we come up with a strategy for how to include them.

Or I can explicitly run just 20220501 for now, and we’ll just need to decide on this before 20220601.

@rviscomi
Member

rviscomi commented Jun 2, 2022

Yeah the 05_12 tables still exist with secondary pages. IIUC by setting the YYYY_MM_DD param, generate_reports.sh will cap the timeseries at 05_01. When 06_01 runs, we'll hopefully have moved the 05_12 data into the new all dataset.
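
To make the capping idea concrete: if the date parameter acts as an upper bound, any later table (such as 05_12) drops out of the timeseries. A minimal sketch of that behaviour, assuming dates are plain YYYY_MM_DD strings; `cap_timeseries` is a hypothetical name, and this shows the idea only, not how generate_reports.sh is implemented:

```python
from typing import List


def cap_timeseries(dates: List[str], cap: str) -> List[str]:
    """Return the dates at or before cap, in ascending order.

    YYYY_MM_DD strings are zero-padded, so plain string comparison
    orders them chronologically.
    """
    return sorted(d for d in dates if d <= cap)


available = ["2022_04_01", "2022_05_01", "2022_05_12"]
print(cap_timeseries(available, "2022_05_01"))  # ['2022_04_01', '2022_05_01']
```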

@tunetheweb
Member

Yep. Will kick that off in about an hour or so.

@tunetheweb
Member

That's running now. Will check in on it tomorrow am.

@tunetheweb
Member

That's all complete now.

@pmeenan any thoughts on the massive performance improvements on these graphs for mobile: https://httparchive.org/reports/loading-speed

@pmeenan
Member

pmeenan commented Jun 3, 2022

My first guess/worry was going to be the CPU throttling but looking at the mobile template for tests, the throttling is still configured for 8x and the bandwidth is set to the 4G speeds.

I'm going to see if I can spot check a few pages to see if anything jumps out. Could be that the CPU throttling in Chrome broke again (or that I did something wrong on the switch to the new pipeline).

@tunetheweb
Member

OK, let’s close this issue and open a new one for that. I’ll do that now.
