Document blink_features tables (#860)
* description

* description

* links

* update

* fix

* Update docs/gettingstarted_bigquery.md

* Apply suggestions from code review

---------

Co-authored-by: Barry Pollard <[email protected]>
max-ostapenko and tunetheweb authored May 19, 2024
1 parent 359bfd9 commit d61e2c6
Showing 1 changed file with 19 additions and 10 deletions: docs/gettingstarted_bigquery.md

In order to access the HTTP Archive via BigQuery, you'll need a Google account. To document this process for new visitors, this example uses a new Google account that has never logged into any Google Cloud services.

1. Navigate to the [Google Cloud Projects Page](https://console.cloud.google.com/welcome) and log in with your Google account if prompted. If this is your first time accessing Google Cloud, you may be prompted to accept the terms of service. Once you are logged in, you'll see a page like this -

<img src="images/google-cloud-welcome.png" width="630" alt="Google Cloud Welcome">

Some of the types of tables you'll find useful when getting started are described below.

### Summary Tables

* [`summary_pages`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2ssummary_pages) tables:
* Each row contains details about a single page including timings, # of requests, types of requests and sizes.
* Information about the page load such as # of domains, redirects, errors, HTTPS requests, CDN, etc.
* Summary of different caching parameters.
* Each page URL is associated with a "pageid".

* [`summary_requests`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2ssummary_requests) tables:
* Every single object loaded by all of the pages.
* Each object has a requestid and a pageid. The pageid can be used to JOIN the corresponding summary_pages table, as in the example query below.
* Information about the object, and how it was loaded.
Expand All @@ -89,36 +89,47 @@ Some of the types of tables you'll find useful when getting started are describe

The HTTP Archive stores detailed information about each page load in [HAR (HTTP Archive) files](https://en.wikipedia.org/wiki/.har). Each HAR file is JSON formatted and contains detailed performance data about a web page. The [specification for this format](https://w3c.github.io/web-performance/specs/HAR/Overview.html) is produced by the Web Performance Working Group of the W3C. The HTTP Archive splits each HAR file into multiple BigQuery tables, which are described below.

* [`pages`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2spages) tables
* HAR extract for each page url.
* Table contains a url and a JSON-encoded HAR file for the document.
* These tables are large (~13GB as of Aug 2018).
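
A minimal sketch of reading these payloads with BigQuery's JSON functions; the `$.title` path is an assumption, so inspect a sample payload first (and note that `LIMIT` does not reduce the bytes scanned):

```sql
-- Sketch: pull one field out of the JSON-encoded HAR payload.
-- The '$.title' path is an assumption based on the HAR format.
SELECT
  url,
  JSON_EXTRACT_SCALAR(payload, '$.title') AS page_title
FROM `httparchive.pages.2024_05_01_desktop`
LIMIT 10
```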

* [`requests`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2srequests) tables:
* HAR extract for each resource.
* Table contains a document url, resource url and a JSON-encoded HAR extract for each resource.
* These tables are very large (810GB as of Aug 2018).
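
A hedged sketch of querying the HAR extracts, assuming each payload follows the HAR entry format (verify `$.response.status` against a sample row; expect a full scan of this table to process hundreds of GB):

```sql
-- Sketch: count requests by HTTP status code from the HAR extract.
-- '$.response.status' follows the HAR spec but is unverified here.
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.response.status') AS status,
  COUNT(0) AS num_requests
FROM `httparchive.requests.2024_05_01_desktop`
GROUP BY status
ORDER BY num_requests DESC
```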

* [`response_bodies`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2sresponse_bodies) tables:
* HAR extract containing response bodies for each request.
* Table contains a document url, resource url and a JSON-encoded HAR extract containing the first 2MB of each response body.
* Payloads are truncated at 2MB, and there is a column to indicate whether the payload was truncated.
* These tables are extremely large (2.5TB as of Aug 2018).
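
A minimal sketch of searching response bodies, assuming `page`, `url`, `body`, and `truncated` columns (a query against this multi-TB table is expensive, so keep it targeted):

```sql
-- Sketch: find pages whose response body contains a given pattern.
-- Column names are assumptions; check the schema before running.
SELECT
  page,
  url
FROM `httparchive.response_bodies.2024_05_01_desktop`
WHERE REGEXP_CONTAINS(body, r'<amp-img')
  AND truncated = false
LIMIT 10
```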

* [`lighthouse`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2slighthouse) tables:
* Results from a [Lighthouse](https://developers.google.com/web/tools/lighthouse/) audit of a page.
* Table contains a url, and a JSON-encoded copy of the Lighthouse report.
* Lighthouse was initially only run on mobile, but as of May 2021 also runs as part of the desktop crawl.
* These tables are very large (2.3TB for mobile only as of May 2021).
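
A sketch of pulling one value out of the report, assuming the standard Lighthouse report layout where `$.categories.performance.score` holds a 0-1 score:

```sql
-- Sketch: extract the Lighthouse performance score for each page.
-- Assumes the standard report JSON layout; scores range from 0 to 1.
SELECT
  url,
  CAST(JSON_EXTRACT_SCALAR(report, '$.categories.performance.score') AS FLOAT64) AS perf_score
FROM `httparchive.lighthouse.2024_05_01_mobile`
WHERE report IS NOT NULL
LIMIT 10
```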

### Other Tables

* [`technologies`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1shttparchive!2stechnologies) tables:
* Information about the technologies detected on each page (using [Wappalyzer rules](https://github.com/HTTPArchive/wappalyzer)).
* Table contains a url and a list of names and categories for technologies detected on the page.
* This data is also available in the HAR of the `pages` table but is extracted into the `technologies` table for easy lookup.
* These tables are small (15GB as of May 2024).
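
A sketch of a lookup against this table; the `app` and `category` columns and the category value are assumptions, so check the schema first:

```sql
-- Sketch: count pages per detected technology in one category.
-- The app/category columns and the category value are assumptions.
SELECT
  app,
  COUNT(DISTINCT url) AS num_pages
FROM `httparchive.technologies.2024_05_01_desktop`
WHERE category = 'JavaScript frameworks'
GROUP BY app
ORDER BY num_pages DESC
LIMIT 10
```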

* [`blink_features.features`](https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1shttparchive!2sblink_features!3sfeatures) tables:
* Information about the [Blink features](https://chromestatus.com/roadmap) detected on each page. See also the summary `blink_features.usage` table below.
* Table contains a url and Blink feature names detected on the page.
* This data is also available in the HAR of the `pages` table but is extracted into the `blink_features` tables for easy lookup.
* This table is ~300GB per platform as of May 2024.
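
A hedged sketch of finding pages that use one feature, assuming `yyyymmdd`, `client`, `feature`, and `url` columns inferred from the description above (the feature name here is only illustrative):

```sql
-- Sketch: sample pages using a given Blink feature.
-- Column names, filter formats, and the feature name are assumptions.
SELECT
  url
FROM `httparchive.blink_features.features`
WHERE yyyymmdd = '2024-05-01'
  AND client = 'desktop'
  AND feature = 'CSSGridLayout'
LIMIT 10
```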

* [`blink_features.usage`](https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1shttparchive!2sblink_features!3susage) table:
* Summary information about the [Blink features](https://chromestatus.com/roadmap) detected on each page.
* Table contains num_urls, pct_urls, and sample URLs for each feature.
* This data is also available in the HAR of the `pages` table but is extracted into the `blink_features` tables for easy lookup.
* This table is 944MB as of May 2024.
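
A sketch of ranking features by adoption, assuming `yyyymmdd` and `client` filter columns alongside the documented `feature`, `num_urls`, and `pct_urls` fields (confirm the date format against the schema):

```sql
-- Sketch: top Blink features by share of pages in one crawl.
-- The yyyymmdd/client filter columns and formats are assumptions.
SELECT
  feature,
  num_urls,
  pct_urls
FROM `httparchive.blink_features.usage`
WHERE yyyymmdd = '2024-05-01'
  AND client = 'desktop'
ORDER BY pct_urls DESC
LIMIT 10
```
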
## Some Example Queries to Get Started Exploring the Data

The [HTTP Archive Discuss section](https://discuss.httparchive.org/) has lots of useful examples and discussion on how to analyze this data.
To explore more interactive examples, read the [HTTP Archive Guided Tour](./guided_tour.md).

If you want to explore deeper you have everything you need - infrastructure, documentation, community. Enjoy exploring this data and feel free to share your results and ask questions on the [HTTP Archive Discuss section](https://discuss.httparchive.org/).

