
Add the ability to test secondary pages #12

Closed
4 tasks done
rviscomi opened this issue Feb 1, 2022 · 13 comments

@rviscomi
Member

rviscomi commented Feb 1, 2022

@rviscomi
Member Author

rviscomi commented Mar 4, 2022

The custom metric that discovers the secondary page URLs and orders them is here:

https://github.com/HTTPArchive/custom-metrics/blob/main/dist/crawl_links.js

@rviscomi
Member Author

Update: the custom metric is live and we're ready to start crawling the secondary pages.

@pmeenan could you describe how the number of secondary pages to test will vary by site rank? Also where does that logic live?

The _metadata field in the HAR file contains a crawl_depth field where 0 represents a primary page and 1 represents a secondary page. In the future, we might crawl tertiary pages (URLs discovered on a secondary page) and this value would be 2, and so on. The link_depth field is a 0-based index in the ordered list of candidate URLs according to the custom metric's heuristics. Links with the biggest clickable area are listed first.

"_metadata": {
  "rank": 10000000,
  "page_id": 1491685,
  "layout": "Desktop",
  "crawl_depth": 0,
  "link_depth": 0
},

How should we distinguish secondary pages from primary pages in the summary datasets? A few options:

  • Create a new BOOL isRootPage field and set it to crawl_depth == 0
  • Create a new INT64 crawlDepth field
  • Create a new STRING metadata field and set it to the JSON-encoded value of _metadata (my current preference)

We would also need a way to distinguish secondary pages from each other. If we didn't put the entire metadata object in BQ, we could create a new INT64 linkDepth field, assuming 0 has no special meaning besides being the biggest link by area.
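For a rough sense of how each option would read in a query, filtering down to primary pages would look something like this (table and column names are placeholders until the schema is settled):

SELECT
  url
FROM
  `httparchive.summary_pages.2022_05_01_mobile`
WHERE
  -- Option 1:
  isRootPage
  -- Option 2: crawlDepth = 0
  -- Option 3: JSON_VALUE(metadata, '$.crawl_depth') = '0'
  --           (JSON_VALUE returns a STRING, so compare against '0' or cast to INT64)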

Do we need to add similar top-level data to the non-summary tables: pages, requests, response_bodies, lighthouse, technologies?

pages already contains the _metadata field in the HAR payload, but a top-level field might still make sense if we want to use BQ clustering on something like crawlDepth. It's possible to join URLs with the metadata in the summary_pages dataset, but joining can be cumbersome; this might be a tolerable workaround temporarily.
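For example, if the metadata only landed in the summary tables, pulling it alongside the pages data might look roughly like this (assuming a new metadata column as in option 3 above):

SELECT
  pages.url,
  JSON_VALUE(summary.metadata, '$.crawl_depth') AS crawl_depth
FROM
  `httparchive.pages.2022_05_01_mobile` AS pages
JOIN
  `httparchive.summary_pages.2022_05_01_mobile` AS summary
ON
  pages.url = summary.url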

Another question is, should we segregate secondary test results from the monthly BQ tables until we have time to test the results? We're planning to reorganize the datasets anyway in #15 so I think it's worth keeping the secondary data separate until we have a clearer idea of how everything will work and give users enough notice before their queries start to break.

To keep it separate, we could add the new fields to distinguish secondary pages as discussed above, but write the results to new experimental datasets. When the crawl is done, we can move the primary results into the "legacy" datasets. This also has the benefit of delineating when the streaming data is finished and ready to query.

Example query:

CREATE TABLE
  `httparchive.summary_pages.2022_05_01_mobile`
AS
SELECT
  * EXCEPT (metadata)
FROM
  `httparchive.experimental_summary_pages.2022_05_01_mobile`
WHERE
  JSON_VALUE(metadata, '$.crawl_depth') = '0'

Both Dataflow pipelines that write to BQ would need to point to the experimental datasets.

@pmeenan @giancarloaf @tunetheweb @paulcalvano Any other ideas/considerations?

@pmeenan
Member

pmeenan commented May 10, 2022

@rviscomi The crawl logic code is here. Currently all ranks get the same treatment, but the logic can easily be modified to use a lookup table that maps different configs based on rank.

Right now it is set to not crawl child pages at all, just in case, so we don't send child links to the pipeline before it is ready.

As far as exposing the info on the tables, I'm a fan of a top-level bool isRootPage in all of the tables, which will make queries trivial without having to parse the JSON metadata. I don't think the full depth details need to be pulled out specifically, but the metadata should also be included as a separate field so some of that deeper analysis can be done (or at least be available in the pages table for joining based on page ID).
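Something like this, where the boolean handles the common filtering and the JSON metadata stays available for deeper analysis (column names are placeholders):

SELECT
  url,
  JSON_VALUE(metadata, '$.link_depth') AS link_depth
FROM
  `httparchive.summary_pages.2022_05_01_mobile`
WHERE
  NOT isRootPage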

@rviscomi
Member Author

Got it, thanks. So we'll be testing at most one secondary page at first.

> or at least available in the pages table for joining based on page ID

All pages' URLs should be unique, right? So we should be able to join based on that rather than a numeric ID.

@pmeenan
Member

pmeenan commented May 10, 2022

Correct, the page URL should be fine for joining as well.

@rviscomi
Member Author

Proposed a schema for the new tables in #15 (comment)

@pmeenan is there any reference from secondary page results back to the primary/parent page? We could probably use the uniqueness of the origin as a way to tell that a group of pages are all derived from the same root, but I'm wondering whether this is (or could be) more explicitly labelled in the metadata.
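For example, the origin-based grouping I have in mind would be something like this, which only holds as long as a root page and its secondary pages always share a host (and assumes the metadata lands in the summary tables):

SELECT
  NET.HOST(url) AS site,
  ARRAY_AGG(url ORDER BY JSON_VALUE(metadata, '$.crawl_depth')) AS crawled_pages
FROM
  `httparchive.summary_pages.2022_05_01_mobile`
GROUP BY
  site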

@pmeenan
Member

pmeenan commented May 10, 2022

Sure. I can add a parent_page_id and parent_page_url to the metadata.

@rviscomi
Member Author

SGTM. Is it over-engineering if we add root_page_[id,url] fields as well, for the page corresponding to the 0th crawl depth? I'm thinking about whether only referencing the parent would effectively create a linked list, which would be hard to resolve back to the root in SQL.
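To make the concern concrete: with only parent references, resolving a page back to its root takes one self-join per level of crawl depth. A sketch against a hypothetical pages table with page and parent_page columns (however we end up surfacing parent_page_url):

SELECT
  child.page,
  COALESCE(grandparent.page, parent.page, child.page) AS root_page
FROM
  pages AS child
LEFT JOIN
  pages AS parent
ON
  child.parent_page = parent.page
LEFT JOIN
  pages AS grandparent
ON
  parent.parent_page = grandparent.page

Every additional level of crawl depth needs another join, whereas an explicit root page field would make it a simple GROUP BY.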

@pmeenan
Member

pmeenan commented May 10, 2022

Done. Added parent_page_id, parent_page_url and parent_page_test_id to the test metadata (so we can link the data but also get back the exact WPT test that the crawl came from).

@pmeenan
Member

pmeenan commented May 10, 2022

Sure, on it.

@pmeenan
Member

pmeenan commented May 10, 2022

OK, there are root* versions of all three now as well (not populated unless it is a child page): root_page_id, root_page_url, and root_page_test_id.
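So for a secondary page, the _metadata should end up looking roughly like this (the values and test ID placeholders are illustrative only):

"_metadata": {
  "rank": 10000000,
  "page_id": 1491686,
  "layout": "Desktop",
  "crawl_depth": 1,
  "link_depth": 0,
  "parent_page_id": 1491685,
  "parent_page_url": "https://example.com/",
  "parent_page_test_id": "<parent WPT test ID>",
  "root_page_id": 1491685,
  "root_page_url": "https://example.com/",
  "root_page_test_id": "<root WPT test ID>"
},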

@rviscomi
Member Author

rviscomi commented May 10, 2022

Nit: could we set the root fields to the page's info when it's the root? That would simplify queries that group by the root page. Borrowing the schema from #15 (comment), a query might look like this:

WITH wordpress AS (
  SELECT
    client,
    JSON_VALUE(summary, '$.metadata.root_page_url') AS root_page,
    LOGICAL_OR(technology = 'WordPress') AS uses_wordpress
  FROM
    `httparchive.har.pages`,
    UNNEST(technologies)
  WHERE
    date = '2022-04-01'
  GROUP BY
    client,
    root_page)

SELECT
  client,
  COUNTIF(uses_wordpress) / COUNT(0) AS pct_wordpress_websites
FROM
  wordpress
GROUP BY
  client

Otherwise it's a bit of extra logic to coalesce the root back to the page URL:

COALESCE(JSON_VALUE(summary, '$.metadata.root_page_url'), page) AS root_page,

Not a huge deal, but I'm curious if we can simplify it. Hope it's not too contrived; I'm trying to think through how use cases that aggregate at the site level might work.

@pmeenan
Member

pmeenan commented May 10, 2022

Sure, done.
