This repository has been archived by the owner on Dec 18, 2024. It is now read-only.

Omit secondary pages from canonical BQ tables #51

Closed
4 tasks done
Tracked by #12
rviscomi opened this issue May 13, 2022 · 2 comments

@rviscomi
Member

rviscomi commented May 13, 2022

Until we complete the BQ dataset migration in #15, BQ users shouldn't have to think about handling secondary pages.

We should continue to store primary page data in the canonical BQ datasets:

  • summary_pages
  • summary_requests
  • pages
  • requests
  • response_bodies
  • lighthouse
  • technologies
  • blink_features
  • core_web_vitals
  • latest

The rollout process of secondary pages will be:

@rviscomi rviscomi added this to the M2: Utilizing capacity milestone May 13, 2022
@rviscomi rviscomi self-assigned this May 13, 2022
@rviscomi
Member Author

rviscomi commented May 20, 2022

I'm running queries to generate home-page-only tables for the 2022_05_01 crawl. Here are some examples:

CREATE OR REPLACE TABLE
  `httparchive.pages.2022_05_01_desktop`
AS
SELECT
  *
FROM
  `httparchive.pages.2022_05_12_desktop`
WHERE
  JSON_VALUE(payload, '$._metadata.crawl_depth') = '0';

CREATE OR REPLACE TABLE
  `httparchive.requests.2022_05_01_mobile`
AS
SELECT
  *
FROM
  `httparchive.requests.2022_05_12_mobile`
WHERE
  page IN (
    SELECT
      url
    FROM
      `httparchive.pages.2022_05_01_mobile`);
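
One way to sanity-check a filtered table is to count its rows by crawl depth. This is only a sketch, not a query that was run as part of this issue. It assumes the filtered pages table keeps the same `payload` column and `$._metadata.crawl_depth` field used in the filter above; the `num_pages` alias is illustrative.

```sql
-- Sketch: after filtering, the only crawl_depth value left should be '0'.
-- Assumes `payload` still carries `_metadata.crawl_depth`, as in the filter query.
SELECT
  JSON_VALUE(payload, '$._metadata.crawl_depth') AS crawl_depth,
  COUNT(0) AS num_pages  -- illustrative alias
FROM
  `httparchive.pages.2022_05_01_desktop`
GROUP BY
  crawl_depth
ORDER BY
  crawl_depth
```

If any row comes back with a depth other than '0', secondary pages are still present in that table.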

Progress on filtering the 2022_05_01 tables down to only home pages:

  • pages desktop
  • pages mobile
  • requests desktop
  • requests mobile
  • lighthouse desktop
  • lighthouse mobile
  • technologies desktop
  • technologies mobile
  • response_bodies desktop
  • response_bodies mobile

The 2022_05_12 crawl still contains all secondary crawl data.

@giancarloaf
Collaborator

#51 is blocking #12
