
Upgrade report data pipelines #30

Merged 81 commits on Dec 9, 2024
48a3b35
demo report
max-ostapenko Nov 16, 2024
f752d9f
fix local package
max-ostapenko Nov 16, 2024
172e120
crawl reports tag triggered
max-ostapenko Nov 16, 2024
39ae950
Merge branch 'reports' of https://github.com/HTTPArchive/dataform int…
max-ostapenko Nov 17, 2024
eb76476
timeseries added
max-ostapenko Nov 17, 2024
4a2d145
split tables
max-ostapenko Nov 17, 2024
a8e2137
lint
max-ostapenko Nov 17, 2024
4250677
tech report tables
max-ostapenko Nov 20, 2024
607c6b2
check tech report sql
max-ostapenko Nov 21, 2024
aba1af4
Merge branch 'main' into main
max-ostapenko Nov 21, 2024
791c0e8
Merge branch 'reports' into reports
max-ostapenko Nov 21, 2024
3fff267
missing declaration
max-ostapenko Nov 21, 2024
3be0274
formatting
max-ostapenko Nov 21, 2024
4c361a9
Merge branch 'reports' into reports
max-ostapenko Nov 21, 2024
b804540
preOps
max-ostapenko Nov 21, 2024
cc04bb6
dataset change
max-ostapenko Nov 23, 2024
9c8e567
cwv_tech_report tested
max-ostapenko Nov 23, 2024
45f1095
Merge branch 'main' into main
max-ostapenko Nov 23, 2024
e38d4f0
Merge branch 'reports' into reports
max-ostapenko Nov 23, 2024
1185758
tech_reports moved
max-ostapenko Nov 25, 2024
9591bf0
exporter function draft
max-ostapenko Nov 25, 2024
4acdf05
fix depependencies
max-ostapenko Nov 25, 2024
9a1f13e
rename
max-ostapenko Nov 25, 2024
b2cd6b4
dataset renamed
max-ostapenko Nov 25, 2024
02d1db7
storage exp draft
max-ostapenko Nov 25, 2024
e090ebc
date column for histograms
max-ostapenko Nov 25, 2024
256fd88
dev flag
max-ostapenko Nov 25, 2024
c3f75d2
gsc export tested
max-ostapenko Nov 25, 2024
6ae6e72
pubsub sink prepared
max-ostapenko Nov 26, 2024
537aa60
export fn deployed
max-ostapenko Nov 26, 2024
b5b625b
order incompatible with partitions
max-ostapenko Nov 26, 2024
486ec2e
monitoring
max-ostapenko Nov 26, 2024
3fbf2bb
lint
max-ostapenko Nov 26, 2024
c34c57a
event parsing draft
max-ostapenko Nov 26, 2024
1db9ff6
cleanup before inserts
max-ostapenko Nov 26, 2024
f8bc51a
event parsing
max-ostapenko Nov 26, 2024
08d9fa6
partitioned exports
max-ostapenko Nov 26, 2024
1a2188d
exclude scripts
max-ostapenko Nov 26, 2024
0e11edf
firestore export draft
max-ostapenko Nov 27, 2024
65d310d
Merge branch 'main' into reports
max-ostapenko Nov 27, 2024
d46b68e
optional description
max-ostapenko Nov 28, 2024
8d316ce
single dataset
max-ostapenko Dec 2, 2024
941e157
move
max-ostapenko Dec 2, 2024
4b34849
incremental operations
max-ostapenko Dec 2, 2024
dd38945
docs update
max-ostapenko Dec 2, 2024
d64a316
firestore dict tested
max-ostapenko Dec 3, 2024
73c3100
reports tested
max-ostapenko Dec 3, 2024
d786036
full sql export
max-ostapenko Dec 3, 2024
3d6657d
trigger params
max-ostapenko Dec 3, 2024
b4dd900
Merge branch 'reports' into reports
max-ostapenko Dec 3, 2024
76b3d5f
hashed doc ids
max-ostapenko Dec 3, 2024
0819a7b
more resources and timeout
max-ostapenko Dec 3, 2024
eee1311
extend timeout
max-ostapenko Dec 3, 2024
9b317ca
gzip
max-ostapenko Dec 5, 2024
33df626
event example
max-ostapenko Dec 5, 2024
58b6c3c
esm
max-ostapenko Dec 6, 2024
da4718c
more parallelization improvements
max-ostapenko Dec 6, 2024
8e042a3
Merge branch 'main' into main
max-ostapenko Dec 6, 2024
1a1caa6
Merge branch 'reports' into reports
max-ostapenko Dec 6, 2024
7baa4df
tested batch reports
max-ostapenko Dec 8, 2024
687750a
Merge branch 'reports' into reports
max-ostapenko Dec 8, 2024
d61666b
testing fast deletion
max-ostapenko Dec 8, 2024
ab581df
deletion tested
max-ostapenko Dec 8, 2024
2f5bed6
limit concurrency
max-ostapenko Dec 8, 2024
a78999b
retries
max-ostapenko Dec 8, 2024
85a5690
wait to resolve
max-ostapenko Dec 8, 2024
2d116dd
tested deployed version
max-ostapenko Dec 9, 2024
9fd868a
cleanup for test merge
max-ostapenko Dec 9, 2024
e859d29
cwv-tech-report to prod db
max-ostapenko Dec 9, 2024
0c81fb2
note to unwrap pubsub payloads
max-ostapenko Dec 9, 2024
dbe38a1
cleanup
max-ostapenko Dec 9, 2024
dc5732e
lint
max-ostapenko Dec 9, 2024
ae875d9
Merge branch 'main' into reports
max-ostapenko Dec 9, 2024
a4eba5a
Merge branch 'main' into reports
max-ostapenko Dec 9, 2024
963ebfa
revisited template builder
max-ostapenko Dec 9, 2024
91822e1
cleanup
max-ostapenko Dec 9, 2024
e0de181
tf 6.13
max-ostapenko Dec 9, 2024
87909ef
lint
max-ostapenko Dec 9, 2024
24a9bac
renamed
max-ostapenko Dec 9, 2024
ade5867
aligned timeout with prod
max-ostapenko Dec 9, 2024
f2b56f0
simplify tags
max-ostapenko Dec 9, 2024
1 change: 0 additions & 1 deletion .github/workflows/linter.yaml
@@ -33,4 +33,3 @@ jobs:
VALIDATE_JSCPD: false
VALIDATE_JAVASCRIPT_PRETTIER: false
VALIDATE_MARKDOWN_PRETTIER: false
VALIDATE_GITHUB_ACTIONS: false
1 change: 1 addition & 0 deletions .gitignore
@@ -3,4 +3,5 @@ node_modules/

# Terraform
infra/tf/.terraform/
infra/tf/tmp/
**/*.zip
11 changes: 2 additions & 9 deletions Makefile
@@ -1,14 +1,7 @@
FN_NAME = dataform-trigger

.PHONY: *

start:
npx functions-framework --target=$(FN_NAME) --source=./infra/dataform-trigger/ --signature-type=http --port=8080 --debug

tf_plan:
terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan \
-var="FUNCTION_NAME=$(FN_NAME)"
terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan

tf_apply:
terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve \
-var="FUNCTION_NAME=$(FN_NAME)"
terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve
42 changes: 7 additions & 35 deletions README.md
@@ -16,7 +16,7 @@ Tag: `crawl_complete`

### Core Web Vitals Technology Report

Tag: `cwv_tech_report`
Tag: `crux_ready`

- httparchive.core_web_vitals.technologies

@@ -26,7 +26,7 @@ Consumers:

### Blink Features Report

Tag: `blink_features_report`
Tag: `crawl_complete`

- httparchive.blink_features.features
- httparchive.blink_features.usage
@@ -35,30 +35,15 @@ Consumers:

- chromestatus.com - [example](https://chromestatus.com/metrics/feature/timeline/popularity/2089)

### Legacy crawl results (to be deprecated)

Tag: `crawl_results_legacy`

- httparchive.all.pages
- httparchive.all.parsed_css
- httparchive.all.requests
- httparchive.lighthouse.YYYY_MM_DD_client
- httparchive.pages.YYYY_MM_DD_client
- httparchive.requests.YYYY_MM_DD_client
- httparchive.response_bodies.YYYY_MM_DD_client
- httparchive.summary_pages.YYYY_MM_DD_client
- httparchive.summary_requests.YYYY_MM_DD_client
- httparchive.technologies.YYYY_MM_DD_client

## Schedules

1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription

Tags: ["crawl_complete", "blink_features_report", "crawl_results_legacy"]
Tags: ["crawl_complete"]

2. [bq-poller-cwv-tech-report](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-east4/bq-poller-cwv-tech-report?authuser=7&project=httparchive) Scheduler

Tags: ["cwv_tech_report"]
Tags: ["crux_ready"]

### Triggering workflows

@@ -72,20 +57,7 @@ In order to unify the workflow triggering mechanism, we use [a Cloud Run function
2. Make adjustments to the dataform configuration files and manually run a workflow to verify.
3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.
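The trigger flow described above can be sketched in plain Node. This is a hypothetical illustration, not the actual code in `infra/dataform-trigger/`: the trigger names and tag mapping mirror the Schedules section, while the `parseEvent`/`tagsFor` helper names are made up for this example. PubSub push payloads wrap the event base64-encoded under `message.data` and must be unwrapped before use (see the "note to unwrap pubsub payloads" commit).

```javascript
// Trigger-name → Dataform tags, mirroring the Schedules section above.
const TRIGGERS = {
  'crawl-complete': ['crawl_complete'],
  'bq-poller-cwv-tech-report': ['crux_ready']
}

// Unwrap a PubSub push body into a plain event object.
// Direct (non-PubSub) invocations pass the event through unchanged.
function parseEvent (body) {
  if (body.message && body.message.data) {
    return JSON.parse(Buffer.from(body.message.data, 'base64').toString())
  }
  return body
}

// Resolve the Dataform tags to run for an incoming event, or null if unknown.
function tagsFor (body) {
  const event = parseEvent(body)
  return TRIGGERS[event.name] || null
}

module.exports = { parseEvent, tagsFor }
```

A real handler would pass the resolved tags to a Dataform workflow invocation; only the unwrap-then-dispatch shape is shown here.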

### Dataform development workspace hints

1. In workflow settings vars:

- set `env_name: dev` to process sampled data in dev workspace.
- change `today` variable to a month in the past. May be helpful for testing pipelines based on `chrome-ux-report` data.

2. `definitions/extra/test_env.sqlx` script helps to setup the tables required to run pipelines when in dev workspace. It's disabled by default.

### Error Monitoring

The issues within the pipeline are being tracked using the following alerts:

1. the event trigger processing fails - [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/570799173843203905?authuser=7&project=httparchive)
2. a job in the workflow fails - [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/16526940745374967367?authuser=7&project=httparchive)
#### Workspace hints

Error notifications are sent to [#10x-infra](https://httparchive.slack.com/archives/C030V4WAVL3) Slack channel.
1. In `workflow_settings.yaml` set `env_name: dev` to process sampled data.
2. In `includes/constants.js` set `today` or other variables to a custom value.
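A minimal sketch of what `includes/constants.js` might look like, based on the names referenced in the diffs (`constants.currentMonth`, `constants.fnPastMonth`, `constants.devRankFilter`); the `env_name` handling and the exact sampling filter are assumptions, not the real file.

```javascript
// includes/constants.js — hypothetical sketch; the real file may differ.
const envName = 'dev' // 'dev' processes sampled data, 'prod' the full crawl

// Current crawl month, first day of the month (YYYY-MM-01).
const currentMonth = '2024-12-01'

// Returns the first day of the month preceding the given YYYY-MM-DD date.
function fnPastMonth (month) {
  const [year, m] = month.split('-').map(Number)
  const past = new Date(Date.UTC(year, m - 2, 1)) // m is 1-based, so -2 steps back one month
  return past.toISOString().slice(0, 10)
}

// In dev, restrict queries to a small sample of high-ranked origins
// (the actual filter used by the pipelines may differ).
const devRankFilter = envName === 'dev' ? 'AND rank <= 1000' : ''

module.exports = { envName, currentMonth, fnPastMonth, devRankFilter }
```

Overriding `currentMonth` here (the `today`-style variable above) lets you replay a pipeline for a past month, e.g. when testing against older `chrome-ux-report` data.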
Expand Up @@ -5,3 +5,8 @@ for (const table of stagingTables) {
name: table
})
}

declare({
schema: 'wappalyzer',
name: 'apps'
})
2 changes: 1 addition & 1 deletion definitions/output/blink_features/features.js
@@ -6,7 +6,7 @@ publish('features', {
partitionBy: 'yyyymmdd',
clusterBy: ['client', 'rank']
},
tags: ['blink_features_report']
tags: ['crawl_complete']
}).preOps(ctx => `
DELETE FROM ${ctx.self()}
WHERE yyyymmdd = DATE '${constants.currentMonth}';
2 changes: 1 addition & 1 deletion definitions/output/blink_features/usage.js
@@ -2,7 +2,7 @@ publish('usage', {
schema: 'blink_features',
type: 'incremental',
protected: true,
tags: ['blink_features_report']
tags: ['crawl_complete']
}).preOps(ctx => `
DELETE FROM ${ctx.self()}
WHERE yyyymmdd = REPLACE('${constants.currentMonth}', '-', '');
65 changes: 28 additions & 37 deletions definitions/output/core_web_vitals/technologies.js
@@ -9,17 +9,25 @@ publish('technologies', {
clusterBy: ['geo', 'app', 'rank', 'client'],
requirePartitionFilter: true
},
tags: ['cwv_tech_report'],
tags: ['crux_ready'],
dependOnDependencyAssertions: true
}).preOps(ctx => `
DELETE FROM ${ctx.self()}
WHERE date = '${pastMonth}';

CREATE TEMP FUNCTION IS_GOOD(good FLOAT64, needs_improvement FLOAT64, poor FLOAT64) RETURNS BOOL AS (
CREATE TEMP FUNCTION IS_GOOD(
good FLOAT64,
needs_improvement FLOAT64,
poor FLOAT64
) RETURNS BOOL AS (
SAFE_DIVIDE(good, good + needs_improvement + poor) >= 0.75
);

CREATE TEMP FUNCTION IS_NON_ZERO(good FLOAT64, needs_improvement FLOAT64, poor FLOAT64) RETURNS BOOL AS (
CREATE TEMP FUNCTION IS_NON_ZERO(
good FLOAT64,
needs_improvement FLOAT64,
poor FLOAT64
) RETURNS BOOL AS (
good + needs_improvement + poor > 0
);
`).query(ctx => `
@@ -28,17 +36,15 @@ WITH geo_summary AS (
CAST(REGEXP_REPLACE(CAST(yyyymm AS STRING), r'(\\d{4})(\\d{2})', r'\\1-\\2-01') AS DATE) AS date,
* EXCEPT (country_code),
\`chrome-ux-report\`.experimental.GET_COUNTRY(country_code) AS geo
FROM
${ctx.ref('chrome-ux-report', 'materialized', 'country_summary')}
FROM ${ctx.ref('chrome-ux-report', 'materialized', 'country_summary')}
WHERE
yyyymm = CAST(FORMAT_DATE('%Y%m', '${pastMonth}') AS INT64) AND
device IN ('desktop', 'phone')
UNION ALL
SELECT
* EXCEPT (yyyymmdd, p75_fid_origin, p75_cls_origin, p75_lcp_origin, p75_inp_origin),
'ALL' AS geo
FROM
${ctx.ref('chrome-ux-report', 'materialized', 'device_summary')}
FROM ${ctx.ref('chrome-ux-report', 'materialized', 'device_summary')}
WHERE
date = '${pastMonth}' AND
device IN ('desktop', 'phone')
@@ -81,20 +87,17 @@ crux AS (
IS_GOOD(fast_ttfb, avg_ttfb, slow_ttfb) AS good_ttfb,
IS_NON_ZERO(fast_inp, avg_inp, slow_inp) AS any_inp,
IS_GOOD(fast_inp, avg_inp, slow_inp) AS good_inp
FROM
geo_summary,
FROM geo_summary,
UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS _rank
WHERE
rank <= _rank
WHERE rank <= _rank
),

technologies AS (
SELECT
technology.technology AS app,
client,
page AS url
FROM
${ctx.ref('crawl', 'pages')},
FROM ${ctx.ref('crawl', 'pages')},
UNNEST(technologies) AS technology
WHERE
date = '${pastMonth}'
@@ -106,8 +109,7 @@ UNION ALL
'ALL' AS app,
client,
page AS url
FROM
${ctx.ref('crawl', 'pages')}
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${pastMonth}'
${constants.devRankFilter}
@@ -117,21 +119,18 @@ categories AS (
SELECT
technology.technology AS app,
ARRAY_TO_STRING(ARRAY_AGG(DISTINCT category IGNORE NULLS ORDER BY category), ', ') AS category
FROM
${ctx.ref('crawl', 'pages')},
FROM ${ctx.ref('crawl', 'pages')},
UNNEST(technologies) AS technology,
UNNEST(technology.categories) AS category
WHERE
date = '${pastMonth}'
${constants.devRankFilter}
GROUP BY
app
GROUP BY app
UNION ALL
SELECT
'ALL' AS app,
ARRAY_TO_STRING(ARRAY_AGG(DISTINCT category IGNORE NULLS ORDER BY category), ', ') AS category
FROM
${ctx.ref('crawl', 'pages')},
FROM ${ctx.ref('crawl', 'pages')},
UNNEST(technologies) AS technology,
UNNEST(technology.categories) AS category
WHERE
@@ -153,8 +152,7 @@ summary_stats AS (
SAFE.FLOAT64(lighthouse.categories.performance.score) AS performance,
SAFE.FLOAT64(lighthouse.categories.pwa.score) AS pwa,
SAFE.FLOAT64(lighthouse.categories.seo.score) AS seo
FROM
${ctx.ref('crawl', 'pages')}
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${pastMonth}'
${constants.devRankFilter}
Expand All @@ -174,16 +172,11 @@ lab_data AS (
AVG(performance) AS performance,
AVG(pwa) AS pwa,
AVG(seo) AS seo
FROM
summary_stats
JOIN
technologies
USING
(client, url)
JOIN
categories
USING
(app)
FROM summary_stats
JOIN technologies
USING (client, url)
JOIN categories
USING (app)
GROUP BY
client,
root_page_url,
@@ -232,10 +225,8 @@ SELECT
SAFE_CAST(APPROX_QUANTILES(bytesJS, 1000)[OFFSET(500)] AS INT64) AS median_bytes_js,
SAFE_CAST(APPROX_QUANTILES(bytesImg, 1000)[OFFSET(500)] AS INT64) AS median_bytes_image

FROM
lab_data
JOIN
crux
FROM lab_data
JOIN crux
USING
(client, root_page_url)
GROUP BY
49 changes: 49 additions & 0 deletions definitions/output/reports/cwv_tech_adoption.js
@@ -0,0 +1,49 @@
const pastMonth = constants.fnPastMonth(constants.currentMonth)

publish('cwv_tech_adoption', {
schema: 'reports',
type: 'incremental',
protected: true,
bigquery: {
partitionBy: 'date',
clusterBy: ['rank', 'geo']
},
tags: ['crux_ready']
}).preOps(ctx => `
CREATE TEMPORARY FUNCTION GET_ADOPTION(
records ARRAY<STRUCT<
client STRING,
origins INT64
>>)
RETURNS STRUCT<
desktop INT64,
mobile INT64
>
LANGUAGE js AS '''
return Object.fromEntries(
records.map(({client, origins}) => {
return [client, origins]
}))
''';

DELETE FROM ${ctx.self()}
WHERE date = '${pastMonth}';
`).query(ctx => `
/* {"dataform_trigger": "report_cwv_tech_complete", "date": "${pastMonth}", "name": "adoption", "type": "report"} */
SELECT
date,
app AS technology,
rank,
geo,
GET_ADOPTION(ARRAY_AGG(STRUCT(
client,
origins
))) AS adoption
FROM ${ctx.ref('core_web_vitals', 'technologies')}
WHERE date = '${pastMonth}'
GROUP BY
date,
app,
rank,
geo
`)
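The `GET_ADOPTION` JavaScript UDF above pivots per-client rows (`desktop`, `mobile`) into a single struct keyed by client. The same logic in plain Node, for illustration only — BigQuery runs the UDF body in its own JS sandbox, so this standalone function is merely a mirror of it:

```javascript
// Mirror of the GET_ADOPTION UDF body: turn [{client, origins}, ...]
// into an object keyed by client name.
function getAdoption (records) {
  return Object.fromEntries(
    records.map(({ client, origins }) => [client, origins])
  )
}

module.exports = { getAdoption }
```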
51 changes: 51 additions & 0 deletions definitions/output/reports/cwv_tech_categories.js
@@ -0,0 +1,51 @@
const pastMonth = constants.fnPastMonth(constants.currentMonth)

publish('cwv_tech_categories', {
schema: 'reports',
type: 'table',
tags: ['crux_ready']
}).query(ctx => `
/* {"dataform_trigger": "report_cwv_tech_complete", "name": "categories", "type": "dict"} */
WITH pages AS (
SELECT
root_page,
technologies
FROM ${ctx.ref('crawl', 'pages')}
WHERE
date = '${pastMonth}' AND
client = 'mobile'
${constants.devRankFilter}
),
categories AS (
SELECT
category,
COUNT(DISTINCT root_page) AS origins
FROM pages,
UNNEST(technologies) AS t,
UNNEST(t.categories) AS category
GROUP BY category
),
technologies AS (
SELECT
category,
technology,
COUNT(DISTINCT root_page) AS origins
FROM pages,
UNNEST(technologies) AS t,
UNNEST(t.categories) AS category
GROUP BY
category,
technology
)

SELECT
category,
categories.origins,
ARRAY_AGG(technology ORDER BY technologies.origins DESC) AS technologies
FROM categories
JOIN technologies
USING (category)
GROUP BY
category,
categories.origins
ORDER BY categories.origins DESC
`)