Upgrade report data pipelines #30

Merged
merged 81 commits into from
Dec 9, 2024
Changes from 5 commits
Commits
48a3b35
demo report
max-ostapenko Nov 16, 2024
f752d9f
fix local package
max-ostapenko Nov 16, 2024
172e120
crawl reports tag triggered
max-ostapenko Nov 16, 2024
39ae950
Merge branch 'reports' of https://github.com/HTTPArchive/dataform int…
max-ostapenko Nov 17, 2024
eb76476
timeseries added
max-ostapenko Nov 17, 2024
4a2d145
split tables
max-ostapenko Nov 17, 2024
a8e2137
lint
max-ostapenko Nov 17, 2024
4250677
tech report tables
max-ostapenko Nov 20, 2024
607c6b2
check tech report sql
max-ostapenko Nov 21, 2024
aba1af4
Merge branch 'main' into main
max-ostapenko Nov 21, 2024
791c0e8
Merge branch 'reports' into reports
max-ostapenko Nov 21, 2024
3fff267
missing declaration
max-ostapenko Nov 21, 2024
3be0274
formatting
max-ostapenko Nov 21, 2024
4c361a9
Merge branch 'reports' into reports
max-ostapenko Nov 21, 2024
b804540
preOps
max-ostapenko Nov 21, 2024
cc04bb6
dataset change
max-ostapenko Nov 23, 2024
9c8e567
cwv_tech_report tested
max-ostapenko Nov 23, 2024
45f1095
Merge branch 'main' into main
max-ostapenko Nov 23, 2024
e38d4f0
Merge branch 'reports' into reports
max-ostapenko Nov 23, 2024
1185758
tech_reports moved
max-ostapenko Nov 25, 2024
9591bf0
exporter function draft
max-ostapenko Nov 25, 2024
4acdf05
fix depependencies
max-ostapenko Nov 25, 2024
9a1f13e
rename
max-ostapenko Nov 25, 2024
b2cd6b4
dataset renamed
max-ostapenko Nov 25, 2024
02d1db7
storage exp draft
max-ostapenko Nov 25, 2024
e090ebc
date column for histograms
max-ostapenko Nov 25, 2024
256fd88
dev flag
max-ostapenko Nov 25, 2024
c3f75d2
gsc export tested
max-ostapenko Nov 25, 2024
6ae6e72
pubsub sink prepared
max-ostapenko Nov 26, 2024
537aa60
export fn deployed
max-ostapenko Nov 26, 2024
b5b625b
order incompatible with partitions
max-ostapenko Nov 26, 2024
486ec2e
monitoring
max-ostapenko Nov 26, 2024
3fbf2bb
lint
max-ostapenko Nov 26, 2024
c34c57a
event parsing draft
max-ostapenko Nov 26, 2024
1db9ff6
cleanup before inserts
max-ostapenko Nov 26, 2024
f8bc51a
event parsing
max-ostapenko Nov 26, 2024
08d9fa6
partitioned exports
max-ostapenko Nov 26, 2024
1a2188d
exclude scripts
max-ostapenko Nov 26, 2024
0e11edf
firestore export draft
max-ostapenko Nov 27, 2024
65d310d
Merge branch 'main' into reports
max-ostapenko Nov 27, 2024
d46b68e
optional description
max-ostapenko Nov 28, 2024
8d316ce
single dataset
max-ostapenko Dec 2, 2024
941e157
move
max-ostapenko Dec 2, 2024
4b34849
incremental operations
max-ostapenko Dec 2, 2024
dd38945
docs update
max-ostapenko Dec 2, 2024
d64a316
firestore dict tested
max-ostapenko Dec 3, 2024
73c3100
reports tested
max-ostapenko Dec 3, 2024
d786036
full sql export
max-ostapenko Dec 3, 2024
3d6657d
trigger params
max-ostapenko Dec 3, 2024
b4dd900
Merge branch 'reports' into reports
max-ostapenko Dec 3, 2024
76b3d5f
hashed doc ids
max-ostapenko Dec 3, 2024
0819a7b
more resources and timeout
max-ostapenko Dec 3, 2024
eee1311
extend timeout
max-ostapenko Dec 3, 2024
9b317ca
gzip
max-ostapenko Dec 5, 2024
33df626
event example
max-ostapenko Dec 5, 2024
58b6c3c
esm
max-ostapenko Dec 6, 2024
da4718c
more parallelization improvements
max-ostapenko Dec 6, 2024
8e042a3
Merge branch 'main' into main
max-ostapenko Dec 6, 2024
1a1caa6
Merge branch 'reports' into reports
max-ostapenko Dec 6, 2024
7baa4df
tested batch reports
max-ostapenko Dec 8, 2024
687750a
Merge branch 'reports' into reports
max-ostapenko Dec 8, 2024
d61666b
testing fast deletion
max-ostapenko Dec 8, 2024
ab581df
deletion tested
max-ostapenko Dec 8, 2024
2f5bed6
limit concurrency
max-ostapenko Dec 8, 2024
a78999b
retries
max-ostapenko Dec 8, 2024
85a5690
wait to resolve
max-ostapenko Dec 8, 2024
2d116dd
tested deployed version
max-ostapenko Dec 9, 2024
9fd868a
cleanup for test merge
max-ostapenko Dec 9, 2024
e859d29
cwv-tech-report to prod db
max-ostapenko Dec 9, 2024
0c81fb2
note to unwrap pubsub payloads
max-ostapenko Dec 9, 2024
dbe38a1
cleanup
max-ostapenko Dec 9, 2024
dc5732e
lint
max-ostapenko Dec 9, 2024
ae875d9
Merge branch 'main' into reports
max-ostapenko Dec 9, 2024
a4eba5a
Merge branch 'main' into reports
max-ostapenko Dec 9, 2024
963ebfa
revisited template builder
max-ostapenko Dec 9, 2024
91822e1
cleanup
max-ostapenko Dec 9, 2024
e0de181
tf 6.13
max-ostapenko Dec 9, 2024
87909ef
lint
max-ostapenko Dec 9, 2024
24a9bac
renamed
max-ostapenko Dec 9, 2024
ade5867
aligned timeout with prod
max-ostapenko Dec 9, 2024
f2b56f0
simplify tags
max-ostapenko Dec 9, 2024
16 changes: 16 additions & 0 deletions definitions/output/reports/dynamic_publisher.js
@@ -0,0 +1,16 @@
const configs = new reports.HTTPArchiveReports()
const params = {
  date: constants.currentMonth,
  rankFilter: constants.devRankFilter
}
@max-ostapenko max-ostapenko Nov 17, 2024

Query parameters: so far I have found only date.
We need to list all the required parameters and add queries to test them with.
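
One possible way to enumerate the parameters a query needs, sketched under the assumption that report queries use the `{{ key }}` placeholder syntax handled by fillTemplate in includes/constants.js. The helper names listPlaceholders and missingParams and the sample query are hypothetical and not part of this PR:

```javascript
// Hypothetical helper: collect the unique placeholder names used in a template.
function listPlaceholders (template) {
  const names = [...template.matchAll(/{{(.*?)}}/g)].map(m => m[1].trim())
  return [...new Set(names)]
}

// Hypothetical helper: placeholders that have no value in params.
function missingParams (template, params) {
  return listPlaceholders(template).filter(name => !(name in params))
}

// Sample query for illustration only (not taken from the reports config).
const sampleQuery = "SELECT * FROM pages WHERE date = '{{ date }}' {{rankFilter}} AND client = '{{ client }}'"
const required = listPlaceholders(sampleQuery)
const missing = missingParams(sampleQuery, { date: '2024-11-01', rankFilter: '' })
console.log(required, missing)
```

Running a check like this over every configured query would produce the list of required parameters and flag any that the publisher does not yet supply.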


const metrics = configs.listMetrics()
metrics.forEach(metric => {
  metric.SQL.forEach(sql => {
    publish(sql.type, {
      type: 'table',
      schema: 'reports',
      tags: ['crawl_reports']
    }).query(ctx => constants.fillTemplate(sql.query, params))
@max-ostapenko max-ostapenko Nov 17, 2024

In the reports_* datasets we could store intermediate aggregated data: it is easier to check for data issues in BigQuery than in GCS.
A Cloud Function would then pick up fresh row batches and save them to GCS.

Currently it is configured to have one table per metric per chart type, e.g. httparchive.reports_timeseries.totalBytes.
Alternatively, we could store all the metrics for one chart type in a single table clustered by metric, but that seems a bit more complicated to maintain and query.

  })
})
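
The review comment above describes one table per metric per chart type. A minimal sketch of how such targets could be derived from the metric list follows. The listTargets helper, the sample metrics, and the `reports_<type>` dataset naming are assumptions for illustration; the diff itself publishes to the `reports` schema under the name `sql.type`:

```javascript
// Hypothetical metrics, shaped like the objects returned by listMetrics() in includes/reports.js.
const sampleMetrics = [
  { id: 'totalBytes', SQL: [{ type: 'timeseries', query: 'SELECT 1' }, { type: 'histogram', query: 'SELECT 2' }] },
  { id: 'reqTotal', SQL: [{ type: 'timeseries', query: 'SELECT 3' }] }
]

// Assumption: one dataset per chart type ("reports_<type>") and one table per metric.
function listTargets (metrics) {
  return metrics.flatMap(metric =>
    metric.SQL.map(sql => ({
      schema: `reports_${sql.type}`,
      name: metric.id,
      query: sql.query
    }))
  )
}

const targets = listTargets(sampleMetrics)
targets.forEach(t => console.log(`${t.schema}.${t.name}`))
```

Each target then maps to a single publish call, so two metrics that share a chart type do not write to the same table name.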
15 changes: 12 additions & 3 deletions includes/constants.js
@@ -1,7 +1,9 @@
 const today = (dataform.projectConfig.vars.today ? dataform.projectConfig.vars.today : new Date().toISOString()).substring(0, 10)
 const currentMonth = today.substring(0, 8) + '01'
-const fnDateUnderscored = (dateStr) => dateStr.replaceAll('-', '_')
-const fnPastMonth = (monthISOstring) => {
+function fnDateUnderscored (dateStr) {
+  return dateStr.replaceAll('-', '_')
+}
+function fnPastMonth (monthISOstring) {
   const monthDate = new Date(monthISOstring)
   monthDate.setMonth(monthDate.getMonth() - 1)
   return monthDate.toISOString().substring(0, 10)
@@ -17,6 +19,12 @@ const [
     'AND rank <= 10000'
   ]
   : ['', '']
+function fillTemplate (template, params) {
+  return template.replace(/{{(.*?)}}/g, (match, key) => {
+    const trimmedKey = key.trim()
+    return trimmedKey in params ? params[trimmedKey] : match
+  })
+}
 
 module.exports = {
   today,
@@ -26,5 +34,6 @@ module.exports = {
   clients,
   booleans,
   devTABLESAMPLE,
-  devRankFilter
+  devRankFilter,
+  fillTemplate
 }
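
To illustrate how the fillTemplate helper added in this file behaves, here is a small standalone sketch. The function body is copied from the diff; the template string and parameter values are made up for the example:

```javascript
// Copied from the diff: replaces {{ key }} placeholders with values from params.
function fillTemplate (template, params) {
  return template.replace(/{{(.*?)}}/g, (match, key) => {
    const trimmedKey = key.trim()
    // Unknown keys are left untouched rather than replaced with "undefined".
    return trimmedKey in params ? params[trimmedKey] : match
  })
}

// Hypothetical query template and parameters, for illustration only.
const template = "SELECT * FROM pages WHERE date = '{{ date }}' {{rankFilter}} {{ unknown }}"
const params = { date: '2024-11-01', rankFilter: 'AND rank <= 10000' }

const filled = fillTemplate(template, params)
console.log(filled)
```

Whitespace inside the braces is ignored, and a placeholder with no matching key stays in the output, which makes a missing parameter visible in the generated SQL instead of silently producing an empty string.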
62 changes: 62 additions & 0 deletions includes/reports.js
@@ -0,0 +1,62 @@
const { config } = require('./reports_config')

class HTTPArchiveReports {
  constructor () {
    this.config = config
  }

  listReports () {
    const reportIds = this.config._reports

    const reports = reportIds.map(reportId => {
      const report = this.getReport(reportId)
      return report
    })

    console.log('reports', reports)

    return reports
  }

  getReport (reportId) {
    const report = this.config[reportId]
    return {
      id: reportId,
      ...report
    }
  }

  listMetrics (reportId) {
    if (reportId === undefined) {
      const metrics = Object.keys(this.config._metrics).map(metricId => {
        const metric = this.getMetric(metricId)
        return metric
      }).filter(metric => metric.SQL)

      return metrics
    } else {
      const report = this.getReport(reportId)
      const metricIds = report.metrics

      const metrics = metricIds.map(metricId => {
        const metric = this.getMetric(metricId)
        return metric
      }).filter(metric => metric.SQL)

      return metrics
    }
  }

  getMetric (metricId) {
    const metric = this.config._metrics[metricId]

    return {
      id: metricId,
      ...metric
    }
  }
}

module.exports = {
  HTTPArchiveReports
}
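
The file reports_config.js is not shown in this diff, so the following is only a sketch of the lookup logic against a made-up config. The config shape (`_reports`, `_metrics`, and a per-report `metrics` array) is inferred from how the class reads it; the report and metric names are hypothetical:

```javascript
// Hypothetical config; only the shape is inferred from the class in the diff.
const sampleConfig = {
  _reports: ['pageWeight'],
  _metrics: {
    totalBytes: { SQL: [{ type: 'timeseries', query: 'SELECT 1' }] },
    legacyMetric: {} // no SQL property, so listMetrics() filters it out
  },
  pageWeight: { name: 'Page Weight', metrics: ['totalBytes', 'legacyMetric'] }
}

// Condensed restatement of getMetric() and listMetrics() from the class.
const getMetric = metricId => ({ id: metricId, ...sampleConfig._metrics[metricId] })
const listMetrics = reportId => {
  const metricIds = reportId === undefined
    ? Object.keys(sampleConfig._metrics)
    : sampleConfig[reportId].metrics
  return metricIds.map(getMetric).filter(metric => metric.SQL)
}

const allMetricIds = listMetrics().map(metric => metric.id)
const reportMetricIds = listMetrics('pageWeight').map(metric => metric.id)
console.log(allMetricIds, reportMetricIds)
```

In both cases only metrics that define SQL are returned, which is what the dynamic publisher relies on when it iterates over `metric.SQL`.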