Commit ef54451

Upgrade report data pipelines (#30)
* demo report
* fix local package
* crawl reports tag triggered
* timeseries added
* split tables
* lint
* tech report tables
* check tech report sql
* missing declaration
* formatting
* preOps
* dataset change
* cwv_tech_report tested
* tech_reports moved
* exporter function draft
* fix depependencies
* rename
* dataset renamed
* storage exp draft
* date column for histograms
* dev flag
* gsc export tested
* pubsub sink prepared
* export fn deployed
* order incompatible with partitions
* monitoring
* lint
* event parsing draft
* cleanup before inserts
* event parsing
* partitioned exports
* exclude scripts
* firestore export draft
* optional description
* single dataset
* move
* incremental operations
* docs update
* firestore dict tested
* reports tested
* full sql export
* trigger params
* hashed doc ids
* more resources and timeout
* extend timeout
* gzip
* event example
* esm
* more parallelization improvements
* tested batch reports
* testing fast deletion
* deletion tested
* limit concurrency
* retries
* wait to resolve
* tested deployed version
* cleanup for test merge
* cwv-tech-report to prod db
* note to unwrap pubsub payloads
* cleanup
* lint
* revisited template builder
* cleanup
* tf 6.13
* lint
* renamed
* aligned timeout with prod
* simplify tags
1 parent 5f0c2ed commit ef54451

41 files changed (+4113, -201 lines)

.github/workflows/linter.yaml (-1)

@@ -33,4 +33,3 @@ jobs:
 VALIDATE_JSCPD: false
 VALIDATE_JAVASCRIPT_PRETTIER: false
 VALIDATE_MARKDOWN_PRETTIER: false
-VALIDATE_GITHUB_ACTIONS: false

.gitignore (+1)

@@ -3,4 +3,5 @@ node_modules/
 
 # Terraform
 infra/tf/.terraform/
+infra/tf/tmp/
 **/*.zip

Makefile (+2 -9)

@@ -1,14 +1,7 @@
-FN_NAME = dataform-trigger
-
 .PHONY: *
 
-start:
-	npx functions-framework --target=$(FN_NAME) --source=./infra/dataform-trigger/ --signature-type=http --port=8080 --debug
-
 tf_plan:
-	terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan \
-		-var="FUNCTION_NAME=$(FN_NAME)"
+	terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan
 
 tf_apply:
-	terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve \
-		-var="FUNCTION_NAME=$(FN_NAME)"
+	terraform -chdir=infra/tf init && terraform -chdir=infra/tf apply -auto-approve

README.md (+7 -35)

@@ -16,7 +16,7 @@ Tag: `crawl_complete`
 
 ### Core Web Vitals Technology Report
 
-Tag: `cwv_tech_report`
+Tag: `crux_ready`
 
 - httparchive.core_web_vitals.technologies
 
@@ -26,7 +26,7 @@ Consumers:
 
 ### Blink Features Report
 
-Tag: `blink_features_report`
+Tag: `crawl_complete`
 
 - httparchive.blink_features.features
 - httparchive.blink_features.usage
@@ -35,30 +35,15 @@ Consumers:
 
 - chromestatus.com - [example](https://chromestatus.com/metrics/feature/timeline/popularity/2089)
 
-### Legacy crawl results (to be deprecated)
-
-Tag: `crawl_results_legacy`
-
-- httparchive.all.pages
-- httparchive.all.parsed_css
-- httparchive.all.requests
-- httparchive.lighthouse.YYYY_MM_DD_client
-- httparchive.pages.YYYY_MM_DD_client
-- httparchive.requests.YYYY_MM_DD_client
-- httparchive.response_bodies.YYYY_MM_DD_client
-- httparchive.summary_pages.YYYY_MM_DD_client
-- httparchive.summary_requests.YYYY_MM_DD_client
-- httparchive.technologies.YYYY_MM_DD_client
-
 ## Schedules
 
 1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription
 
-Tags: ["crawl_complete", "blink_features_report", "crawl_results_legacy"]
+Tags: ["crawl_complete"]
 
 2. [bq-poller-cwv-tech-report](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-east4/bq-poller-cwv-tech-report?authuser=7&project=httparchive) Scheduler
 
-Tags: ["cwv_tech_report"]
+Tags: ["crux_ready"]
 
 ### Triggering workflows
 
@@ -72,20 +57,7 @@ In order to unify the workflow triggering mechanism, we use [a Cloud Run functio
 2. Make adjustments to the dataform configuration files and manually run a workflow to verify.
 3. Push all your changes to a dev branch & open a PR with the link to the BigQuery artifacts generated in the test workflow.
 
-### Dataform development workspace hints
-
-1. In workflow settings vars:
-
-   - set `env_name: dev` to process sampled data in dev workspace.
-   - change `today` variable to a month in the past. May be helpful for testing pipelines based on `chrome-ux-report` data.
-
-2. `definitions/extra/test_env.sqlx` script helps to setup the tables required to run pipelines when in dev workspace. It's disabled by default.
-
-### Error Monitoring
-
-The issues within the pipeline are being tracked using the following alerts:
-
-1. the event trigger processing fails - [Dataform Trigger Function Error](https://console.cloud.google.com/monitoring/alerting/policies/570799173843203905?authuser=7&project=httparchive)
-2. a job in the workflow fails - [Dataform Workflow Invocation Failed](https://console.cloud.google.com/monitoring/alerting/policies/16526940745374967367?authuser=7&project=httparchive)
+#### Workspace hints
 
-Error notifications are sent to [#10x-infra](https://httparchive.slack.com/archives/C030V4WAVL3) Slack channel.
+1. In `workflow_settings.yaml` set `env_name: dev` to process sampled data.
+2. In `includes/constants.js` set `today` or other variables to a custom value.
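Note: the workspace hints point at `includes/constants.js`, which exports the values the pipeline definitions interpolate (`currentMonth`, `fnPastMonth` and `devRankFilter` all appear in the diffs below). A minimal sketch of a dev-time override, assuming those export names; only `today` is named in the README, and the rank filter value here is illustrative, not the repository's actual dev filter.

// includes/constants.js — hypothetical dev override sketch
const today = '2024-06-01' // pin the crawl month under test instead of the real date

const currentMonth = today

// First day of the month preceding `month`, as YYYY-MM-DD.
const fnPastMonth = (month) => {
  const [y, m] = month.split('-').map(Number)
  return new Date(Date.UTC(y, m - 2, 1)).toISOString().slice(0, 10)
}

// Extra predicate appended to WHERE clauses; keeps dev queries on sampled data (illustrative value).
const devRankFilter = 'AND rank <= 1000'

module.exports = { today, currentMonth, fnPastMonth, devRankFilter }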

definitions/sources/httparchive.js → definitions/declarations/httparchive.js (+5)

@@ -5,3 +5,8 @@ for (const table of stagingTables) {
     name: table
   })
 }
+
+declare({
+  schema: 'wappalyzer',
+  name: 'apps'
+})
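The added `declare()` registers the existing `wappalyzer.apps` table as a source, so downstream definitions can pull it through the dependency graph with `ctx.ref()` instead of hard-coding the table path. A minimal consumer sketch; the publish name and column names below are illustrative and not part of this commit.

// Hypothetical consumer of the declared wappalyzer.apps source.
publish('example_tech_lookup', {
  schema: 'reports',
  type: 'table'
}).query(ctx => `
SELECT
  name,
  categories
FROM ${ctx.ref('wappalyzer', 'apps')}
`)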

definitions/output/blink_features/features.js (+1 -1)

@@ -6,7 +6,7 @@ publish('features', {
     partitionBy: 'yyyymmdd',
     clusterBy: ['client', 'rank']
   },
-  tags: ['blink_features_report']
+  tags: ['crawl_complete']
 }).preOps(ctx => `
 DELETE FROM ${ctx.self()}
 WHERE yyyymmdd = DATE '${constants.currentMonth}';

definitions/output/blink_features/usage.js (+1 -1)

@@ -2,7 +2,7 @@ publish('usage', {
   schema: 'blink_features',
   type: 'incremental',
   protected: true,
-  tags: ['blink_features_report']
+  tags: ['crawl_complete']
 }).preOps(ctx => `
 DELETE FROM ${ctx.self()}
 WHERE yyyymmdd = REPLACE('${constants.currentMonth}', '-', '');
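Both Blink Features tables keep the same delete-then-append pattern while only the tag changes: the table is `incremental` and `protected`, and the `preOps` DELETE clears just the month being processed, so a repeated `crawl_complete` trigger overwrites that month instead of duplicating rows. A generic sketch of the pattern, with a placeholder table name and query body:

// Sketch of the idempotent monthly-append pattern used above; names are placeholders.
publish('example_monthly_table', {
  schema: 'blink_features',
  type: 'incremental',   // append to the existing table on each run
  protected: true,       // never rebuilt from scratch by a full refresh
  tags: ['crawl_complete']
}).preOps(ctx => `
-- Remove any rows already written for the month being reprocessed.
DELETE FROM ${ctx.self()}
WHERE yyyymmdd = DATE '${constants.currentMonth}';
`).query(ctx => `
-- Month-scoped SELECT producing the rows for '${constants.currentMonth}'.
SELECT DATE '${constants.currentMonth}' AS yyyymmdd
`)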

definitions/output/core_web_vitals/technologies.js (+28 -37)

@@ -9,17 +9,25 @@ publish('technologies', {
     clusterBy: ['geo', 'app', 'rank', 'client'],
     requirePartitionFilter: true
   },
-  tags: ['cwv_tech_report'],
+  tags: ['crux_ready'],
   dependOnDependencyAssertions: true
 }).preOps(ctx => `
 DELETE FROM ${ctx.self()}
 WHERE date = '${pastMonth}';
 
-CREATE TEMP FUNCTION IS_GOOD(good FLOAT64, needs_improvement FLOAT64, poor FLOAT64) RETURNS BOOL AS (
+CREATE TEMP FUNCTION IS_GOOD(
+  good FLOAT64,
+  needs_improvement FLOAT64,
+  poor FLOAT64
+) RETURNS BOOL AS (
   SAFE_DIVIDE(good, good + needs_improvement + poor) >= 0.75
 );
 
-CREATE TEMP FUNCTION IS_NON_ZERO(good FLOAT64, needs_improvement FLOAT64, poor FLOAT64) RETURNS BOOL AS (
+CREATE TEMP FUNCTION IS_NON_ZERO(
+  good FLOAT64,
+  needs_improvement FLOAT64,
+  poor FLOAT64
+) RETURNS BOOL AS (
   good + needs_improvement + poor > 0
 );
 `).query(ctx => `
@@ -28,17 +36,15 @@ WITH geo_summary AS (
     CAST(REGEXP_REPLACE(CAST(yyyymm AS STRING), r'(\\d{4})(\\d{2})', r'\\1-\\2-01') AS DATE) AS date,
     * EXCEPT (country_code),
     \`chrome-ux-report\`.experimental.GET_COUNTRY(country_code) AS geo
-  FROM
-    ${ctx.ref('chrome-ux-report', 'materialized', 'country_summary')}
+  FROM ${ctx.ref('chrome-ux-report', 'materialized', 'country_summary')}
   WHERE
     yyyymm = CAST(FORMAT_DATE('%Y%m', '${pastMonth}') AS INT64) AND
     device IN ('desktop', 'phone')
 UNION ALL
   SELECT
     * EXCEPT (yyyymmdd, p75_fid_origin, p75_cls_origin, p75_lcp_origin, p75_inp_origin),
     'ALL' AS geo
-  FROM
-    ${ctx.ref('chrome-ux-report', 'materialized', 'device_summary')}
+  FROM ${ctx.ref('chrome-ux-report', 'materialized', 'device_summary')}
   WHERE
     date = '${pastMonth}' AND
     device IN ('desktop', 'phone')
@@ -81,20 +87,17 @@ crux AS (
     IS_GOOD(fast_ttfb, avg_ttfb, slow_ttfb) AS good_ttfb,
     IS_NON_ZERO(fast_inp, avg_inp, slow_inp) AS any_inp,
     IS_GOOD(fast_inp, avg_inp, slow_inp) AS good_inp
-  FROM
-    geo_summary,
+  FROM geo_summary,
     UNNEST([1000, 10000, 100000, 1000000, 10000000, 100000000]) AS _rank
-  WHERE
-    rank <= _rank
+  WHERE rank <= _rank
 ),
 
 technologies AS (
   SELECT
     technology.technology AS app,
     client,
     page AS url
-  FROM
-    ${ctx.ref('crawl', 'pages')},
+  FROM ${ctx.ref('crawl', 'pages')},
     UNNEST(technologies) AS technology
   WHERE
     date = '${pastMonth}'
@@ -106,8 +109,7 @@ UNION ALL
     'ALL' AS app,
     client,
     page AS url
-  FROM
-    ${ctx.ref('crawl', 'pages')}
+  FROM ${ctx.ref('crawl', 'pages')}
   WHERE
     date = '${pastMonth}'
     ${constants.devRankFilter}
@@ -117,21 +119,18 @@ categories AS (
   SELECT
     technology.technology AS app,
     ARRAY_TO_STRING(ARRAY_AGG(DISTINCT category IGNORE NULLS ORDER BY category), ', ') AS category
-  FROM
-    ${ctx.ref('crawl', 'pages')},
+  FROM ${ctx.ref('crawl', 'pages')},
     UNNEST(technologies) AS technology,
     UNNEST(technology.categories) AS category
   WHERE
     date = '${pastMonth}'
     ${constants.devRankFilter}
-  GROUP BY
-    app
+  GROUP BY app
 UNION ALL
   SELECT
     'ALL' AS app,
     ARRAY_TO_STRING(ARRAY_AGG(DISTINCT category IGNORE NULLS ORDER BY category), ', ') AS category
-  FROM
-    ${ctx.ref('crawl', 'pages')},
+  FROM ${ctx.ref('crawl', 'pages')},
     UNNEST(technologies) AS technology,
     UNNEST(technology.categories) AS category
   WHERE
@@ -153,8 +152,7 @@ summary_stats AS (
     SAFE.FLOAT64(lighthouse.categories.performance.score) AS performance,
     SAFE.FLOAT64(lighthouse.categories.pwa.score) AS pwa,
     SAFE.FLOAT64(lighthouse.categories.seo.score) AS seo
-  FROM
-    ${ctx.ref('crawl', 'pages')}
+  FROM ${ctx.ref('crawl', 'pages')}
   WHERE
     date = '${pastMonth}'
     ${constants.devRankFilter}
@@ -174,16 +172,11 @@ lab_data AS (
     AVG(performance) AS performance,
     AVG(pwa) AS pwa,
     AVG(seo) AS seo
-  FROM
-    summary_stats
-  JOIN
-    technologies
-  USING
-    (client, url)
-  JOIN
-    categories
-  USING
-    (app)
+  FROM summary_stats
+  JOIN technologies
+  USING (client, url)
+  JOIN categories
+  USING (app)
   GROUP BY
     client,
     root_page_url,
@@ -232,10 +225,8 @@ SELECT
   SAFE_CAST(APPROX_QUANTILES(bytesJS, 1000)[OFFSET(500)] AS INT64) AS median_bytes_js,
   SAFE_CAST(APPROX_QUANTILES(bytesImg, 1000)[OFFSET(500)] AS INT64) AS median_bytes_image
 
-FROM
-  lab_data
-JOIN
-  crux
+FROM lab_data
+JOIN crux
 USING
   (client, root_page_url)
 GROUP BY
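The reshaped `IS_GOOD` helper encodes the usual Core Web Vitals pass rule: a metric counts as good for an origin when the good bucket holds at least 75% of the distribution, while `IS_NON_ZERO` guards against empty distributions. A plain JavaScript mirror of the two temp SQL functions, for reference only:

// JS mirror of the IS_GOOD / IS_NON_ZERO temp functions defined in preOps above.
const isNonZero = (good, needsImprovement, poor) =>
  good + needsImprovement + poor > 0

// SAFE_DIVIDE returns NULL on a zero denominator, mirrored here as null.
const isGood = (good, needsImprovement, poor) => {
  const total = good + needsImprovement + poor
  return total > 0 ? good / total >= 0.75 : null
}

console.log(isGood(0.80, 0.15, 0.05)) // true  — 80% of loads are good
console.log(isGood(0.60, 0.30, 0.10)) // false — only 60% are good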
New file (+49) — publishes `cwv_tech_adoption` to schema `reports`

@@ -0,0 +1,49 @@
+const pastMonth = constants.fnPastMonth(constants.currentMonth)
+
+publish('cwv_tech_adoption', {
+  schema: 'reports',
+  type: 'incremental',
+  protected: true,
+  bigquery: {
+    partitionBy: 'date',
+    clusterBy: ['rank', 'geo']
+  },
+  tags: ['crux_ready']
+}).preOps(ctx => `
+CREATE TEMPORARY FUNCTION GET_ADOPTION(
+  records ARRAY<STRUCT<
+    client STRING,
+    origins INT64
+  >>)
+RETURNS STRUCT<
+  desktop INT64,
+  mobile INT64
+>
+LANGUAGE js AS '''
+return Object.fromEntries(
+  records.map(({client, origins}) => {
+    return [client, origins]
+  }))
+''';
+
+DELETE FROM ${ctx.self()}
+WHERE date = '${pastMonth}';
+`).query(ctx => `
+/* {"dataform_trigger": "report_cwv_tech_complete", "date": "${pastMonth}", "name": "adoption", "type": "report"} */
+SELECT
+  date,
+  app AS technology,
+  rank,
+  geo,
+  GET_ADOPTION(ARRAY_AGG(STRUCT(
+    client,
+    origins
+  ))) AS adoption
+FROM ${ctx.ref('core_web_vitals', 'technologies')}
+WHERE date = '${pastMonth}'
+GROUP BY
+  date,
+  app,
+  rank,
+  geo
+`)
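`GET_ADOPTION` is a BigQuery JavaScript UDF that pivots the per-client rows coming out of `core_web_vitals.technologies` into a single `{desktop, mobile}` struct per technology/rank/geo group. Its body, runnable in plain Node for illustration:

// Same logic as the GET_ADOPTION UDF body defined in preOps above.
const getAdoption = (records) =>
  Object.fromEntries(records.map(({ client, origins }) => [client, origins]))

console.log(getAdoption([
  { client: 'desktop', origins: 1234 },
  { client: 'mobile', origins: 5678 }
]))
// → { desktop: 1234, mobile: 5678 }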
New file (+51) — publishes `cwv_tech_categories` to schema `reports`

@@ -0,0 +1,51 @@
+const pastMonth = constants.fnPastMonth(constants.currentMonth)
+
+publish('cwv_tech_categories', {
+  schema: 'reports',
+  type: 'table',
+  tags: ['crux_ready']
+}).query(ctx => `
+/* {"dataform_trigger": "report_cwv_tech_complete", "name": "categories", "type": "dict"} */
+WITH pages AS (
+  SELECT
+    root_page,
+    technologies
+  FROM ${ctx.ref('crawl', 'pages')}
+  WHERE
+    date = '${pastMonth}' AND
+    client = 'mobile'
+    ${constants.devRankFilter}
+),categories AS (
+  SELECT
+    category,
+    COUNT(DISTINCT root_page) AS origins
+  FROM pages,
+    UNNEST(technologies) AS t,
+    UNNEST(t.categories) AS category
+  GROUP BY category
+),
+technologies AS (
+  SELECT
+    category,
+    technology,
+    COUNT(DISTINCT root_page) AS origins
+  FROM pages,
+    UNNEST(technologies) AS t,
+    UNNEST(t.categories) AS category
+  GROUP BY
+    category,
+    technology
+)
+
+SELECT
+  category,
+  categories.origins,
+  ARRAY_AGG(technology ORDER BY technologies.origins DESC) AS technologies
+FROM categories
+JOIN technologies
+USING (category)
+GROUP BY
+  category,
+  categories.origins
+ORDER BY categories.origins DESC
+`)
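Both new report definitions open their query with a `/* {"dataform_trigger": ...} */` JSON comment; given the exporter and event-parsing work listed in the commit message, this appears to be routing metadata for downstream export jobs. A hedged sketch of how such a comment could be read back from the SQL text — the actual exporter code is not part of this diff and may parse it differently:

// Hypothetical helper: extract the leading JSON comment from a report query.
function parseReportMetadata (sql) {
  const match = sql.match(/^\s*\/\*\s*(\{.*?\})\s*\*\//s)
  return match ? JSON.parse(match[1]) : null
}

const query = `/* {"dataform_trigger": "report_cwv_tech_complete", "name": "categories", "type": "dict"} */
WITH pages AS (SELECT 1)
SELECT * FROM pages`

console.log(parseReportMetadata(query))
// → { dataform_trigger: 'report_cwv_tech_complete', name: 'categories', type: 'dict' }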
