Document blink_features tables (#860)
* description

* description

* links

* update

* fix

* Update docs/gettingstarted_bigquery.md

* Apply suggestions from code review

---------

Co-authored-by: Barry Pollard <[email protected]>
max-ostapenko and tunetheweb authored May 19, 2024
1 parent 359bfd9 commit d61e2c6
Showing 1 changed file with 19 additions and 10 deletions: docs/gettingstarted_bigquery.md

In order to access the HTTP Archive via BigQuery, you'll need a Google account. To document this process for new visitors, this example uses a new Google account that has never logged into any Google Cloud services.

1. Navigate to the [Google Cloud Projects Page](https://console.cloud.google.com/welcome) and log in with your Google account if prompted. If this is your first time accessing Google Cloud, you may be prompted to accept the terms of service. Once you are logged in, you'll see a page like this -

<img src="images/google-cloud-welcome.png" width="630" alt="Google Cloud Welcome">

Some of the types of tables you'll find useful when getting started are described below.

### Summary Tables

* [`summary_pages`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2ssummary_pages) tables:
* Each row contains details about a single page including timings, # of requests, types of requests and sizes.
* Information about the page load such as # of domains, redirects, errors, HTTPS requests, CDN, etc.
* Summary of different caching parameters.
* Each page URL is associated with a "pageid".

* [`summary_requests`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2ssummary_requests) tables:
* Every single object loaded by all of the pages.
* Each object has a requestid and a pageid. The pageid can be used to JOIN the corresponding summary_pages table, as in the example query below.
* Information about the object, and how it was loaded.
Expand All @@ -89,36 +89,47 @@ Some of the types of tables you'll find useful when getting started are describe

The HTTP Archive stores detailed information about each page load in [HAR (HTTP Archive) files](https://en.wikipedia.org/wiki/.har). Each HAR file is JSON formatted and contains detailed performance data about a web page. The [specification for this format](https://w3c.github.io/web-performance/specs/HAR/Overview.html) is produced by the Web Performance Working Group of the W3C. The HTTP Archive splits each HAR file into multiple BigQuery tables, which are described below.

* [`pages`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2spages) tables
* HAR extract for each page url.
* Table contains a url and a JSON-encoded HAR file for the document.
* These tables are large (~13GB as of Aug 2018).
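
A minimal sketch of reading these payloads with BigQuery's JSON functions; the `$.title` path is an assumption, so inspect a sample payload first (and note that `LIMIT` does not reduce the bytes scanned):

```sql
-- Sketch: pull one field out of the JSON-encoded HAR payload.
-- The '$.title' path is an assumption based on the HAR format.
SELECT
  url,
  JSON_EXTRACT_SCALAR(payload, '$.title') AS page_title
FROM `httparchive.pages.2024_05_01_desktop`
LIMIT 10
```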

* [`requests`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2srequests) tables:
* HAR extract for each resource.
* Table contains a document url, resource url and a JSON-encoded HAR extract for each resource.
* These tables are very large (810GB as of Aug 2018).
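
A hedged sketch of querying the HAR extracts, assuming each payload follows the HAR entry format (verify `$.response.status` against a sample row; expect a full scan of this table to process hundreds of GB):

```sql
-- Sketch: count requests by HTTP status code from the HAR extract.
-- '$.response.status' follows the HAR spec but is unverified here.
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.response.status') AS status,
  COUNT(0) AS num_requests
FROM `httparchive.requests.2024_05_01_desktop`
GROUP BY status
ORDER BY num_requests DESC
```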

* [`response_bodies`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2sresponse_bodies) tables:
* HAR extract containing response bodies for each request.
* Table contains a document url, resource url and a JSON-encoded HAR extract containing the first 2MB of each response body.
* Payloads are truncated at 2MB, and there is a column to indicate whether the payload was truncated.
* These tables are extremely large (2.5TB as of Aug 2018).
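
A minimal sketch of searching response bodies, assuming `page`, `url`, `body`, and `truncated` columns (a query against this multi-TB table is expensive, so keep it targeted):

```sql
-- Sketch: find pages whose response body contains a given pattern.
-- Column names are assumptions; check the schema before running.
SELECT
  page,
  url
FROM `httparchive.response_bodies.2024_05_01_desktop`
WHERE REGEXP_CONTAINS(body, r'<amp-img')
  AND truncated = false
LIMIT 10
```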

* [`lighthouse`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2slighthouse) tables:
* Results from a [Lighthouse](https://developers.google.com/web/tools/lighthouse/) audit of a page.
* Table contains a url, and a JSON-encoded copy of the Lighthouse report.
* Lighthouse was initially only run on mobile, but as of May 2021 also runs as part of the desktop crawl.
* These tables are very large (2.3TB for mobile only as of May 2021).
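
A sketch of pulling one value out of the report, assuming the standard Lighthouse report layout where `$.categories.performance.score` holds a 0-1 score:

```sql
-- Sketch: extract the Lighthouse performance score for each page.
-- Assumes the standard report JSON layout; scores range from 0 to 1.
SELECT
  url,
  CAST(JSON_EXTRACT_SCALAR(report, '$.categories.performance.score') AS FLOAT64) AS perf_score
FROM `httparchive.lighthouse.2024_05_01_mobile`
WHERE report IS NOT NULL
LIMIT 10
```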

### Other Tables

* [`technologies`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2stechnologies) tables:
* Information about the technologies detected on each page (using [Wappalyzer rules](https://github.com/HTTPArchive/wappalyzer)).
* Table contains a url and a list of names and categories for technologies detected on the page.
* This data is also available in the HAR of the `pages` table but is extracted into the `technologies` table for easy lookup.
* These tables are small (15GB as of May 2024).
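
A sketch of a lookup against this table; the `app` and `category` columns and the category value are assumptions, so check the schema first:

```sql
-- Sketch: count pages per detected technology in one category.
-- The app/category columns and the category value are assumptions.
SELECT
  app,
  COUNT(DISTINCT url) AS num_pages
FROM `httparchive.technologies.2024_05_01_desktop`
WHERE category = 'JavaScript frameworks'
GROUP BY app
ORDER BY num_pages DESC
LIMIT 10
```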

* [`blink_features.features`](https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1shttparchive!2sblink_features!3sfeatures) tables:
* Information about the [Blink features](https://chromestatus.com/roadmap) detected on each page. See also the summary `blink_features.usage` table below.
* Table contains a url and Blink feature names detected on the page.
* This data is also available in the HAR of the `pages` table but is extracted into the `blink_features` tables for easy lookup.
* This table is ~300GB per platform as of May 2024.
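
A hedged sketch of finding pages that use one feature, assuming `yyyymmdd`, `client`, `feature`, and `url` columns inferred from the description above (the feature name here is only illustrative):

```sql
-- Sketch: sample pages using a given Blink feature.
-- Column names, filter formats, and the feature name are assumptions.
SELECT
  url
FROM `httparchive.blink_features.features`
WHERE yyyymmdd = '2024-05-01'
  AND client = 'desktop'
  AND feature = 'CSSGridLayout'
LIMIT 10
```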

* [`blink_features.usage`](https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1shttparchive!2sblink_features!3susage) table:
* Summary information about the [Blink features](https://chromestatus.com/roadmap) detected on each page.
* Table contains num_urls, pct_urls, and sample URLs for each feature.
* This data is also available in the HAR of the `pages` table but is extracted into the `blink_features` tables for easy lookup.
* This table is 944MB as of May 2024.
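
A sketch of ranking features by adoption, assuming `yyyymmdd` and `client` filter columns alongside the documented `feature`, `num_urls`, and `pct_urls` fields (confirm the date format against the schema):

```sql
-- Sketch: top Blink features by share of pages in one crawl.
-- The yyyymmdd/client filter columns and formats are assumptions.
SELECT
  feature,
  num_urls,
  pct_urls
FROM `httparchive.blink_features.usage`
WHERE yyyymmdd = '2024-05-01'
  AND client = 'desktop'
ORDER BY pct_urls DESC
LIMIT 10
```
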
## Some Example Queries to Get Started Exploring the Data

The [HTTP Archive Discuss section](https://discuss.httparchive.org/) has lots of useful examples and discussion on how to analyze this data.
To explore more interactive examples, read the [HTTP Archive Guided Tour](./guided_tour.md).

If you want to explore deeper you have everything you need - infrastructure, documentation, community. Enjoy exploring this data and feel free to share your results and ask questions on the [HTTP Archive Discuss section](https://discuss.httparchive.org/).

