# Add the ability to test secondary pages #12
The custom metric that discovers the secondary page URLs and orders them is here: https://github.com/HTTPArchive/custom-metrics/blob/main/dist/crawl_links.js
Update: the custom metric is live and we're ready to start crawling the secondary pages. @pmeenan could you describe how the number of secondary pages to test will vary by site rank? Also, where does that logic live?

The `_metadata` object currently looks like this:

```json
"_metadata": {
  "rank": 10000000,
  "page_id": 1491685,
  "layout": "Desktop",
  "crawl_depth": 0,
  "link_depth": 0
}
```

How should we distinguish secondary pages from primary pages in the summary datasets? A few options come to mind, e.g. putting the entire metadata object in BQ. We would also need a way to distinguish secondary pages from each other; if we didn't put the entire metadata object in BQ, we could create a new INT64 field. Do we need to add similar top-level data to the non-summary tables?
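For illustration only (not from the original thread), a minimal sketch of the INT64 approach, assuming a hypothetical top-level `crawl_depth` column on the summary tables:

```sql
-- Sketch: assumes a hypothetical top-level crawl_depth INT64 column
-- (0 = primary page, >0 = secondary page).
SELECT
  url,
  crawl_depth
FROM
  `httparchive.summary_pages.2022_05_01_mobile`
WHERE
  crawl_depth > 0  -- secondary pages only
```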
Another question: should we segregate secondary test results from the monthly BQ tables until we have time to test the results? We're planning to reorganize the datasets anyway in #15, so I think it's worth keeping the secondary data separate until we have a clearer idea of how everything will work, and giving users enough notice before their queries start to break.

To keep it separate, we could add the new fields to distinguish secondary pages as discussed above, but write the results to new experimental datasets. When the crawl is done, we can move the primary results into the "legacy" datasets. This also has the benefit of delineating when the streaming data is finished and ready to query. Example query:

```sql
CREATE TABLE
  `httparchive.summary_pages.2022_05_01_mobile`
AS
SELECT
  * EXCEPT (metadata)
FROM
  `httparchive.experimental_summary_pages.2022_05_01_mobile`
WHERE
  CAST(JSON_VALUE(metadata, '$.crawl_depth') AS INT64) = 0
```

Both Dataflow pipelines that write to BQ would need to point to the experimental datasets.

@pmeenan @giancarloaf @tunetheweb @paulcalvano Any other ideas/considerations?
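For completeness, a companion sketch (not from the thread) of how the secondary results could be split out at the same time; the `_secondary` table name is hypothetical:

```sql
-- Sketch: the inverse filter, materializing secondary pages into a
-- separate (hypothetical) table using the same metadata structure.
CREATE TABLE
  `httparchive.experimental_summary_pages.2022_05_01_mobile_secondary`
AS
SELECT
  *
FROM
  `httparchive.experimental_summary_pages.2022_05_01_mobile`
WHERE
  CAST(JSON_VALUE(metadata, '$.crawl_depth') AS INT64) > 0
```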
@rviscomi The crawl logic code is here. Currently all ranks get the same treatment, but the logic can easily be modified to use a lookup table for mapping different configs based on rank. Right now it is set to not crawl at all, just in case, so we don't send child links to the pipeline before it is ready. As far as exposing the info on the tables, I'm a fan of a top-level bool.
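As a sketch of how that bool might be used (the actual field name was cut off above, so `is_root_page` is a placeholder):

```sql
-- Sketch: is_root_page is a hypothetical top-level BOOL column;
-- the real field name isn't preserved in this thread.
SELECT
  COUNT(0) AS num_secondary_pages
FROM
  `httparchive.experimental_summary_pages.2022_05_01_mobile`
WHERE
  NOT is_root_page
```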
Got it, thanks. So we'll be testing at most one secondary page at first.

All pages' URLs should be unique, right? So we should be able to join based on that and not a numeric ID.
Correct, the page URL should be fine for joining as well.
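For instance, a URL-based join across two per-page tables might look like this sketch (the table and column names are assumptions for illustration, not the final dataset layout):

```sql
-- Sketch: joining per-page tables on URL instead of a numeric page ID.
SELECT
  p.url,
  t.app AS technology
FROM
  `httparchive.experimental_summary_pages.2022_05_01_mobile` AS p
JOIN
  `httparchive.experimental_technologies.2022_05_01_mobile` AS t
ON
  p.url = t.url
```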
Proposed a schema for the new tables in #15 (comment). @pmeenan is there any reference from secondary page results back to the primary/parent page? We could probably use the uniqueness of the origin as a way to tell that a group of pages are all derived from the same root, but I'm wondering if this is (or could be) more explicitly labelled in the metadata.
Sure. I can add a parent_page_id and parent_page_url to the metadata.
SGTM. Is it over-engineering if we add root_page_[id,url] fields as well for the page corresponding to the 0th crawl depth? I'm thinking about whether only referencing the parent would effectively create a linked list, which would be hard to resolve back to the root in SQL.
Done. Added the parent fields to the metadata.
Sure, on it.
OK, there are root* versions of all three now as well (not populated unless it is a child page).
Nit: could we set the root fields to the page's own info when it is the root page? That would simplify queries that group by the root page. Borrowing the schema from #15 (comment), a query might look like this:

```sql
WITH wordpress AS (
  SELECT
    client,
    JSON_VALUE(summary, '$.metadata.root_page_url') AS root_page,
    LOGICAL_OR(technology = 'WordPress') AS uses_wordpress
  FROM
    `httparchive.har.pages`,
    UNNEST(technologies)
  WHERE
    date = '2022-04-01'
  GROUP BY
    client,
    root_page
)

SELECT
  client,
  COUNTIF(uses_wordpress) / COUNT(0) AS pct_wordpress_websites
FROM
  wordpress
GROUP BY
  client
```

Otherwise it's a bit of extra logic to coalesce the root back to the page URL:

```sql
COALESCE(JSON_VALUE(summary, '$.metadata.root_page_url'), page) AS root_page,
```

Not a huge deal, but curious if we can simplify it. Hope it's not too contrived; I'm trying to think of how use cases that aggregate at the site level might work.
Sure, done.