
Add the ability to test secondary pages #12

Closed
4 tasks done
rviscomi opened this issue Feb 1, 2022 · 13 comments

@rviscomi
Member

rviscomi commented Feb 1, 2022

@rviscomi
Member Author

rviscomi commented Mar 4, 2022

The custom metric that discovers the secondary page URLs and orders them is here:

https://github.com/HTTPArchive/custom-metrics/blob/main/dist/crawl_links.js

@rviscomi
Member Author

Update: the custom metric is live and we're ready to start crawling the secondary pages.

@pmeenan could you describe how the number of secondary pages to test will vary by site rank? Also where does that logic live?

The _metadata field in the HAR file contains a crawl_depth field where 0 represents a primary page and 1 represents a secondary page. In the future, we might crawl tertiary pages (URLs discovered on a secondary page) and this value would be 2, and so on. The link_depth field is a 0-based index in the ordered list of candidate URLs according to the custom metric's heuristics. Links with the biggest clickable area are listed first.

"_metadata": {
  "rank": 10000000,
  "page_id": 1491685,
  "layout": "Desktop",
  "crawl_depth": 0,
  "link_depth": 0
},

How should we distinguish secondary pages from primary pages in the summary datasets? A few options:

  • Create a new BOOL isRootPage field and set it to crawl_depth == 0
  • Create a new INT64 crawlDepth field
  • Create a new STRING metadata field and set it to the JSON-encoded value of _metadata (my current preference)

We would also need a way to distinguish secondary pages from each other. If we didn't put the entire metadata object in BQ, we could create a new INT64 linkDepth field, assuming 0 has no special meaning besides being the biggest link by area.
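For a rough sense of how each option would read in a query, filtering down to primary pages would look something like this (table and column names are placeholders until the schema is settled):

SELECT
  url
FROM
  `httparchive.summary_pages.2022_05_01_mobile`
WHERE
  -- Option 1:
  isRootPage
  -- Option 2: crawlDepth = 0
  -- Option 3: JSON_VALUE(metadata, '$.crawl_depth') = '0'
  --           (JSON_VALUE returns a STRING, so compare against '0' or cast to INT64)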

Do we need to add similar top-level data to the non-summary tables: pages, requests, response_bodies, lighthouse, technologies?

pages already contains the _metadata field in the HAR payload, but a top-level field might still make sense if we want to use BQ clustering on something like crawlDepth. It's possible to join URLs with the metadata in the summary_pages dataset, but joining can be cumbersome; this might be a tolerable workaround temporarily.
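For example, if the metadata only landed in the summary tables, pulling it alongside the pages data might look roughly like this (assuming a new metadata column as in option 3 above):

SELECT
  pages.url,
  JSON_VALUE(summary.metadata, '$.crawl_depth') AS crawl_depth
FROM
  `httparchive.pages.2022_05_01_mobile` AS pages
JOIN
  `httparchive.summary_pages.2022_05_01_mobile` AS summary
ON
  pages.url = summary.url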

Another question is, should we segregate secondary test results from the monthly BQ tables until we have time to test the results? We're planning to reorganize the datasets anyway in #15 so I think it's worth keeping the secondary data separate until we have a clearer idea of how everything will work and give users enough notice before their queries start to break.

To keep it separate, we could add the new fields to distinguish secondary pages as discussed above, but write the results to new experimental datasets. When the crawl is done, we can move the primary results into the "legacy" datasets. This also has the benefit of delineating when the streaming data is finished and ready to query.

Example query:

CREATE TABLE
  `httparchive.summary_pages.2022_05_01_mobile`
AS
SELECT
  * EXCEPT (metadata)
FROM
  `httparchive.experimental_summary_pages.2022_05_01_mobile`
WHERE
  JSON_VALUE(metadata, '$.crawl_depth') = '0'

Both Dataflow pipelines that write to BQ would need to point to the experimental datasets.

@pmeenan @giancarloaf @tunetheweb @paulcalvano Any other ideas/considerations?

@pmeenan
Member

pmeenan commented May 10, 2022

@rviscomi The crawl logic code is here. Currently all ranks get the same treatment, but the logic can easily be modified to use a lookup table that maps different configs based on rank.

Right now it is set to not crawl child pages at all, just in case, so we don't send child links to the pipeline before it is ready.

As far as exposing the info on the tables, I'm a fan of a top-level bool isRootPage in all of the tables, which will make queries trivial without having to parse the JSON metadata. I don't think the full depth details need to be pulled out specifically, but the metadata should also be included as a separate field so some of that deeper analysis can be done (or at least be available in the pages table for joining based on page ID).
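Something like this, where the boolean handles the common filtering and the JSON metadata stays available for deeper analysis (column names are placeholders):

SELECT
  url,
  JSON_VALUE(metadata, '$.link_depth') AS link_depth
FROM
  `httparchive.summary_pages.2022_05_01_mobile`
WHERE
  NOT isRootPage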

@rviscomi
Member Author

Got it, thanks. So we'll be testing at most one secondary page at first.

> or at least available in the pages table for joining based on page ID

All pages' URLs should be unique, right? So we should be able to join based on that rather than a numeric ID.

@pmeenan
Member

pmeenan commented May 10, 2022

Correct, the page URL should be fine for joining as well.

@rviscomi
Member Author

Proposed a schema for the new tables in #15 (comment)

@pmeenan is there any reference from secondary page results back to the primary/parent page? We could probably use the uniqueness of the origin as a way to tell that a group of pages are all derived from the same root, but I'm wondering whether this is (or could be) more explicitly labelled in the metadata.
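For example, the origin-based grouping I have in mind would be something like this, which only holds as long as a root page and its secondary pages always share a host (and assumes the metadata lands in the summary tables):

SELECT
  NET.HOST(url) AS site,
  ARRAY_AGG(url ORDER BY JSON_VALUE(metadata, '$.crawl_depth')) AS crawled_pages
FROM
  `httparchive.summary_pages.2022_05_01_mobile`
GROUP BY
  site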

@pmeenan
Member

pmeenan commented May 10, 2022

Sure. I can add a parent_page_id and parent_page_url to the metadata.

@rviscomi
Member Author

SGTM. Is it over-engineering if we add root_page_[id,url] fields as well, for the page corresponding to the 0th crawl depth? I'm thinking about whether only referencing the parent would effectively create a linked list, which would be hard to resolve back to the root in SQL.
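To make the concern concrete: with only parent references, resolving a page back to its root takes one self-join per level of crawl depth. A sketch against a hypothetical pages table with page and parent_page columns (however we end up surfacing parent_page_url):

SELECT
  child.page,
  COALESCE(grandparent.page, parent.page, child.page) AS root_page
FROM
  pages AS child
LEFT JOIN
  pages AS parent
ON
  child.parent_page = parent.page
LEFT JOIN
  pages AS grandparent
ON
  parent.parent_page = grandparent.page

Every additional level of crawl depth needs another join, whereas an explicit root page field would make it a simple GROUP BY.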

@pmeenan
Member

pmeenan commented May 10, 2022

Done. Added parent_page_id, parent_page_url and parent_page_test_id to the test metadata (so we can link the data but also get back the exact WPT test that the crawl came from).

@pmeenan
Member

pmeenan commented May 10, 2022

Sure, on it.

@pmeenan
Member

pmeenan commented May 10, 2022

OK, there are root* versions of all three now as well (not populated unless it is a child page): root_page_id, root_page_url, and root_page_test_id.
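So for a secondary page, the _metadata should end up looking roughly like this (the values and test ID placeholders are illustrative only):

"_metadata": {
  "rank": 10000000,
  "page_id": 1491686,
  "layout": "Desktop",
  "crawl_depth": 1,
  "link_depth": 0,
  "parent_page_id": 1491685,
  "parent_page_url": "https://example.com/",
  "parent_page_test_id": "<parent WPT test ID>",
  "root_page_id": 1491685,
  "root_page_url": "https://example.com/",
  "root_page_test_id": "<root WPT test ID>"
},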

@rviscomi
Member Author

rviscomi commented May 10, 2022

Nit: could we set the root fields to the page's info when it's the root? That would simplify queries that group by the root page. Borrowing the schema from #15 (comment), a query might look like this:

WITH wordpress AS (
  SELECT
    client,
    JSON_VALUE(summary, '$.metadata.root_page_url') AS root_page,
    LOGICAL_OR(technology = 'WordPress') AS uses_wordpress
  FROM
    `httparchive.har.pages`,
    UNNEST(technologies)
  WHERE
    date = '2022-04-01'
  GROUP BY
    client,
    root_page)

SELECT
  client,
  COUNTIF(uses_wordpress) / COUNT(0) AS pct_wordpress_websites
FROM
  wordpress
GROUP BY
  client

Otherwise it's a bit of extra logic to coalesce the root back to the page URL:

COALESCE(JSON_VALUE(summary, '$.metadata.root_page_url'), page) AS root_page,

Not a huge deal, but I'm curious if we can simplify it. Hope it's not too contrived; I'm trying to think through how use cases that aggregate at the site level might work.

@pmeenan
Member

pmeenan commented May 10, 2022

Sure, done.
