Commit
Merge branch 'main' into tech_meta
max-ostapenko authored Oct 14, 2024
2 parents 4a18a19 + 4e477fb commit c00307d
Showing 26 changed files with 791 additions and 184 deletions.
Binary file added .DS_Store
Binary file not shown.
3 changes: 3 additions & 0 deletions .gitignore
@@ -1 +1,4 @@
node_modules/
.df-credentials.json
tf/.terraform/
tf/temp
72 changes: 48 additions & 24 deletions README.md
@@ -1,47 +1,61 @@
# HTTP Archive BigQuery pipeline with Dataform

This repo handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.

## Pipelines

The pipelines run in the Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the `main` branch is used for each triggered pipeline run.

### Crawl results

Tag: `crawl_results_all`

- httparchive.all.pages
- httparchive.all.parsed_css
- httparchive.all.requests

### Core Web Vitals Technology Report

Tag: `cwv_tech_report`

- httparchive.core_web_vitals.technologies

Consumers:

- [HTTP Archive Tech Report](https://httparchive.org/reports/techreport/landing)

### Blink Features Report

Tag: `blink_features_report`

- httparchive.blink_features.features
- httparchive.blink_features.usage

Consumers:

- chromestatus.com - [example](https://chromestatus.com/metrics/feature/timeline/popularity/2089)

### Legacy crawl results (to be deprecated)

Tag: `crawl_results_legacy`

- httparchive.lighthouse.YYYY_MM_DD_client
- httparchive.pages.YYYY_MM_DD_client
- httparchive.requests.YYYY_MM_DD_client
- httparchive.response_bodies.YYYY_MM_DD_client
- httparchive.summary_pages.YYYY_MM_DD_client
- httparchive.summary_requests.YYYY_MM_DD_client
- httparchive.technologies.YYYY_MM_DD_client

## Schedules

1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription

    Tags: ["crawl_results_all", "blink_features_report", "crawl_results_legacy"]

2. [bq-poller-cwv-tech-report](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-east4/bq-poller-cwv-tech-report?authuser=7&project=httparchive) Scheduler

    Tags: ["cwv_tech_report"]

### Triggering workflows

@@ -57,6 +71,16 @@

### Dataform development workspace hints

1. In workflow settings vars:
    1. Set `env_name: dev` to process sampled data in the dev workspace (see the constants sketch after this list).
    2. Change the `today` variable to a month in the past. This can be helpful for testing pipelines based on `chrome-ux-report` data.
2. The `definitions/extra/test_env.js` script helps to set up the tables required to run pipelines in the dev workspace. It is disabled by default.
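The `env_name` and `today` vars feed the shared constants (`current_month`, `dev_TABLESAMPLE`, `dev_rank_filter`) referenced throughout the definitions in this commit. As a minimal sketch only — the file name, sampling rate, and rank cutoff are assumptions, not the repo's actual `includes/constants.js` — the wiring could look roughly like this:

// includes/constants.js — hypothetical sketch; names mirror the usages in this commit.
const env_name = dataform.projectConfig.vars.env_name; // e.g. "dev"
const today = dataform.projectConfig.vars.today;       // e.g. "2024-10-01"

// The crawl month being processed; assumed here to be taken directly from `today`.
const current_month = today;

// In the dev workspace, sample the source tables and keep only higher-ranked pages
// so pipeline runs stay cheap; in production both snippets collapse to empty strings.
const is_dev = env_name === "dev";
const dev_TABLESAMPLE = is_dev ? "TABLESAMPLE SYSTEM (0.001 PERCENT)" : "";
const dev_rank_filter = is_dev ? "AND rank <= 10000" : "";

module.exports = { env_name, today, current_month, dev_TABLESAMPLE, dev_rank_filter };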

### Error Monitoring

Issues within the pipeline are tracked by the following alerts:

1. the event trigger processing fails - [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/3950167380893746326?authuser=7&project=httparchive)
2. a job in the workflow fails - [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/7137542315653007241?authuser=7&project=httparchive)

Error notifications are sent to the [#10x-infra](https://httparchive.slack.com/archives/C030V4WAVL3) Slack channel.
64 changes: 46 additions & 18 deletions definitions/extra/test_env.js
@@ -1,26 +1,54 @@
const date = constants.current_month;

// Tables copied into `<dataset>_dev` datasets as sampled fixtures for the dev workspace.
const resources_list = [{
    datasetId: "all",
    tableId: "pages"
  }, {
    datasetId: "all",
    tableId: "requests"
  },
  // {datasetId: "all", tableId: "parsed_css"},
  // {datasetId: "core_web_vitals", tableId: "technologies"},
];

resources_list.forEach(resource => {
  operate(
    `test_table ${resource.datasetId}_${resource.tableId}`, {
      hasOutput: true
    }
  ).queries(`
CREATE SCHEMA IF NOT EXISTS ${resource.datasetId}_dev;

DROP TABLE IF EXISTS ${resource.datasetId}_dev.dev_${resource.tableId};

CREATE TABLE IF NOT EXISTS ${resource.datasetId}_dev.dev_${resource.tableId} AS
SELECT *
FROM \`${resource.datasetId}.${resource.tableId}\` ${constants.dev_TABLESAMPLE}
WHERE date = '${date}'
`);
})

operate("test_table blink_features_dev_dev_usage", {
  hasOutput: true,
}).queries(`
CREATE SCHEMA IF NOT EXISTS blink_features_dev;

CREATE TABLE IF NOT EXISTS blink_features_dev.dev_usage AS
SELECT *
FROM blink_features.usage ${constants.dev_TABLESAMPLE}
WHERE yyyymmdd = '${date}';
`)

operate("test_table blink_features_dev_dev_features", {
  hasOutput: true,
}).queries(`
CREATE SCHEMA IF NOT EXISTS blink_features_dev;

DROP TABLE IF EXISTS blink_features_dev.dev_features;

CREATE TABLE IF NOT EXISTS blink_features_dev.dev_features AS
SELECT *
FROM blink_features.features ${constants.dev_TABLESAMPLE}
WHERE yyyymmdd = DATE '${date}';
`)
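Nothing in this commit verifies the sampled fixtures after they are created. Purely as an illustration — the assertion name is made up, and `all_dev.dev_pages` is the table the loop above would produce for the `all.pages` entry — a Dataform assertion could flag an empty fixture before downstream runs:

// Hypothetical sanity check (not part of this commit): fail if the sampled
// dev copy of all.pages ended up empty.
assert("test_env_pages_not_empty").query(`
SELECT 'all_dev.dev_pages is empty' AS error
FROM all_dev.dev_pages
HAVING COUNT(*) = 0
`)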
44 changes: 28 additions & 16 deletions definitions/output/all/pages.js
@@ -1,33 +1,45 @@
publish("pages", {
type: "incremental",
protected: true,
schema: "all",
bigquery: {
partitionBy: "date",
clusterBy: ["client", "is_root_page", "rank"],
requirePartitionFilter: true
},
tags: ["crawl_results_all"],
type: "incremental",
protected: true,
schema: "all",
bigquery: {
partitionBy: "date",
clusterBy: ["client", "is_root_page", "rank"],
requirePartitionFilter: true
},
tags: ["crawl_results_all"],
}).preOps(ctx => `
DELETE FROM ${ctx.self()}
WHERE date = '${constants.current_month}';
`).query(ctx => `
SELECT *
FROM ${ctx.ref("crawl_staging", "pages")} ${constants.dev_TABLESAMPLE}
WHERE date = '${constants.current_month}' AND client = 'desktop' AND is_root_page = TRUE
FROM ${ctx.ref("crawl_staging", "pages")}
WHERE date = '${constants.current_month}'
AND client = 'desktop'
AND is_root_page = TRUE
${constants.dev_rank_filter}
`).postOps(ctx => `
INSERT INTO ${ctx.self()}
SELECT *
FROM ${ctx.ref("crawl_staging", "pages")} ${constants.dev_TABLESAMPLE}
WHERE date = '${constants.current_month}' AND client = 'desktop' AND is_root_page = FALSE;
FROM ${ctx.ref("crawl_staging", "pages")}
WHERE date = '${constants.current_month}'
AND client = 'desktop'
AND is_root_page = FALSE
${constants.dev_rank_filter};
INSERT INTO ${ctx.self()}
SELECT *
FROM ${ctx.ref("crawl_staging", "pages")} ${constants.dev_TABLESAMPLE}
WHERE date = '${constants.current_month}' AND client = 'mobile' AND is_root_page = TRUE;
WHERE date = '${constants.current_month}'
AND client = 'mobile'
AND is_root_page = TRUE
${constants.dev_rank_filter};
INSERT INTO ${ctx.self()}
SELECT *
FROM ${ctx.ref("crawl_staging", "pages")} ${constants.dev_TABLESAMPLE}
WHERE date = '${constants.current_month}' AND client = 'mobile' AND is_root_page = FALSE
FROM ${ctx.ref("crawl_staging", "pages")}
WHERE date = '${constants.current_month}'
AND client = 'mobile'
AND is_root_page = FALSE
${constants.dev_rank_filter};
`)
30 changes: 17 additions & 13 deletions definitions/output/all/parsed_css.js
@@ -1,23 +1,27 @@
publish("parsed_css", {
type: "incremental",
protected: true,
schema: "all",
bigquery: {
partitionBy: "date",
clusterBy: ["client", "is_root_page", "rank", "page"],
requirePartitionFilter: true
},
tags: ["crawl_results_all"],
type: "incremental",
protected: true,
schema: "all",
bigquery: {
partitionBy: "date",
clusterBy: ["client", "is_root_page", "rank", "page"],
requirePartitionFilter: true
},
tags: ["crawl_results_all"],
}).preOps(ctx => `
DELETE FROM ${ctx.self()}
WHERE date = '${constants.current_month}';
`).query(ctx => `
SELECT *
FROM ${ctx.ref("crawl_staging", "parsed_css")} ${constants.dev_TABLESAMPLE}
WHERE date = '${constants.current_month}' AND client = 'desktop'
FROM ${ctx.ref("crawl_staging", "parsed_css")}
WHERE date = '${constants.current_month}'
AND client = 'desktop'
${constants.dev_rank_filter}
`).postOps(ctx => `
INSERT INTO ${ctx.self()}
SELECT *
FROM ${ctx.ref("crawl_staging", "parsed_css")} ${constants.dev_TABLESAMPLE}
WHERE date = '${constants.current_month}' AND client = 'mobile'
FROM ${ctx.ref("crawl_staging", "parsed_css")}
WHERE date = '${constants.current_month}'
AND client = 'mobile'
${constants.dev_rank_filter};
`)
