Skip to content

Commit 85fd06d

Browse files
Split BigQuery export to Cloud Run Job (#37)
* split to cloud run job * some docs * lint * linter off * path moved * clean before build * cleanup * no wrapper subscription * local test * cleanup * wrapped subscription events
1 parent 31cd181 commit 85fd06d

19 files changed

+1680
-565
lines changed

.github/workflows/linter.yaml

+1
Original file line numberDiff line numberDiff line change
@@ -33,3 +33,4 @@ jobs:
3333
VALIDATE_JSCPD: false
3434
VALIDATE_JAVASCRIPT_PRETTIER: false
3535
VALIDATE_MARKDOWN_PRETTIER: false
36+
VALIDATE_CHECKOV: false

Makefile

+6
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,11 @@
11
.PHONY: *
22

3+
clean:
4+
rm -rf ./infra/dataform-trigger/node_modules
5+
rm -rf ./infra/dataform-export/node_modules
6+
rm -rf ./infra/bigquery-export/node_modules
7+
rm -rf infra/tf/tmp/*
8+
39
tf_plan:
410
terraform -chdir=infra/tf init -upgrade && terraform -chdir=infra/tf plan
511

infra/README.md

+57-37
Original file line numberDiff line numberDiff line change
@@ -1,12 +1,10 @@
11
# Infrastructure for the HTTP Archive data pipeline
22

3-
## Cloud function for triggering Dataform workflows
3+
## Cloud Function for triggering Dataform workflows
44

55
[dataformTrigger](https://console.cloud.google.com/functions/details/us-central1/dataformTrigger?env=gen2&authuser=7&project=httparchive) Cloud Run Function
66

7-
This function may be triggered by a PubSub message or Cloud Scheduler and triggers a Dataform workflow based on the trigger configuration provided.
8-
9-
### Trigger configuration
7+
This function may be triggered by a PubSub message or Cloud Scheduler and invokes a Dataform workflow based on the provided configuration.
108

119
Trigger types:
1210

@@ -16,7 +14,7 @@ Trigger types:
1614

1715
See [available trigger configurations](https://github.com/HTTPArchive/dataform/blob/main/src/index.js#L4).
1816

19-
Request body example with trigger name:
17+
Request body example:
2018

2119
```json
2220
{
@@ -26,7 +24,7 @@ Request body example with trigger name:
2624
}
2725
```
2826

29-
Trigger for local development:
27+
Request example for local development:
3028

3129
```bash
3230
curl -X POST http://localhost:8080/ \
@@ -38,25 +36,23 @@ curl -X POST http://localhost:8080/ \
3836
}'
3937
```
4038

41-
## Cloud Function for report data exports
39+
## Cloud Function for triggering data exports
4240

4341
[exportReport](https://console.cloud.google.com/functions/details/us-central1/bqExport?env=gen2&authuser=7&project=httparchive) Cloud Run Function
4442

45-
This function exports reports data to GCS or Firestore.
43+
This function triggers a job to export data to GCS or Firestore.
4644

47-
### Export configuration
45+
Request body example:
4846

4947
```json
5048
{
51-
"message": {
52-
"protoPayload": {
53-
"serviceData": {
54-
"jobCompletedEvent": {
55-
"job": {
56-
"jobConfiguration": {
57-
"query": {
58-
"query": "/* {\"dataform_trigger\": \"report_cwv_tech_complete\", \"date\": \"2024-11-01\", \"name\": \"technologies\", \"type\": \"dict\"} *\/"
59-
}
49+
"protoPayload": {
50+
"serviceData": {
51+
"jobCompletedEvent": {
52+
"job": {
53+
"jobConfiguration": {
54+
"query": {
55+
"query": "/* {\"dataform_trigger\": \"report_cwv_tech_complete\", \"date\": \"2024-11-01\", \"name\": \"technologies\", \"type\": \"dict\"} *\/"
6056
}
6157
}
6258
}
@@ -66,21 +62,19 @@ This function exports reports data to GCS or Firestore.
6662
}
6763
```
6864

69-
Trigger for local development:
65+
Request example for local development:
7066

7167
```bash
7268
curl -X POST http://localhost:8080/ \
7369
-H "Content-Type: application/json" \
7470
-d '{
75-
"message": {
76-
"protoPayload": {
77-
"serviceData": {
78-
"jobCompletedEvent": {
79-
"job": {
80-
"jobConfiguration": {
81-
"query": {
82-
"query": "/* {\"dataform_trigger\": \"report_complete\", \"date\": \"2024-11-01\", \"name\": \"bytesTotal\", \"type\": \"timeseries\"} *\/"
83-
}
71+
"protoPayload": {
72+
"serviceData": {
73+
"jobCompletedEvent": {
74+
"job": {
75+
"jobConfiguration": {
76+
"query": {
77+
"query": "/* {\"dataform_trigger\": \"report_complete\", \"date\": \"2024-11-01\", \"name\": \"bytesTotal\", \"type\": \"timeseries\"} *\/"
8478
}
8579
}
8680
}
@@ -96,15 +90,13 @@ or
9690
curl -X POST http://localhost:8080/ \
9791
-H "Content-Type: application/json" \
9892
-d '{
99-
"message": {
100-
"protoPayload": {
101-
"serviceData": {
102-
"jobCompletedEvent": {
103-
"job": {
104-
"jobConfiguration": {
105-
"query": {
106-
"query": "/* {\"dataform_trigger\": \"report_cwv_tech_complete\", \"date\": \"2024-11-01\", \"name\": \"adoption\", \"type\": \"report\"} *\/"
107-
}
93+
"protoPayload": {
94+
"serviceData": {
95+
"jobCompletedEvent": {
96+
"job": {
97+
"jobConfiguration": {
98+
"query": {
99+
"query": "/* {\"dataform_trigger\": \"report_cwv_tech_complete\", \"date\": \"2024-11-01\", \"name\": \"lighthouse\", \"type\": \"report\"} *\/"
108100
}
109101
}
110102
}
@@ -114,6 +106,25 @@ curl -X POST http://localhost:8080/ \
114106
}'
115107
```
116108

109+
## Cloud Run Job for exporting data
110+
111+
[exportData](https://console.cloud.google.com/run/detail/us-central1/export-data?authuser=7&project=httparchive) Cloud Run Job
112+
113+
This job exports data to GCS or Firestore based on the provided configuration.
114+
115+
Input parameters:
116+
117+
- `EXPORT_CONFIG` - JSON string with the export configuration.
118+
119+
Example values:
120+
121+
```plaintext
122+
{"dataform_trigger":"report_cwv_tech_complete","name":"technologies","type":"dict"}
123+
{"dataform_trigger":"report_cwv_tech_complete","date":"2024-11-01","name":"page_weight","type":"report"}
124+
{"dataform_trigger":"report_complete","name":"bytesTotal","type":"timeseries"}
125+
{"dataform_trigger":"report_complete","date":"2024-11-01","name":"bytesTotal","type":"histogram"}
126+
```
127+
117128
## Monitoring
118129

119130
The issues within the pipeline are being tracked using the following alerts:
@@ -134,6 +145,15 @@ npm run start
134145

135146
Then, in a separate terminal, run the command with the test trigger payload.
136147

148+
## Build
149+
150+
Building a container image for the `bigquery-export` Cloud Run Job:
151+
152+
```bash
153+
cd infra/bigquery-export
154+
npm run buildpack
155+
```
156+
137157
## Deployment
138158

139159
From project root directory run:

infra/bigquery-export/Dockerfile

+13
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
# checkov:skip=CKV_DOCKER_3:Ensure that a user for the container has been created
2+
FROM node:20-slim
3+
4+
WORKDIR /usr/src/app
5+
6+
COPY . .
7+
8+
# Clean up the node_modules directory
9+
RUN rm -rf node_modules
10+
11+
RUN npm ci --only=production
12+
13+
CMD ["node", "index.js"]
File renamed without changes.

infra/dataform-export/firestore.js infra/bigquery-export/firestore.js

+3-1
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,9 @@ export class FirestoreBatch {
1515
constructor () {
1616
this.firestore = new Firestore()
1717
this.bigquery = new BigQueryExport()
18-
this.firestore.settings({ databaseId: 'tech-report-apis-prod' })
18+
this.firestore.settings({
19+
databaseId: 'tech-report-apis-prod'
20+
})
1921
this.batchSize = 500
2022
this.maxConcurrentBatches = 200
2123
}

infra/bigquery-export/index.js

+34
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
import { ReportsExporter, TechReportsExporter } from './reports.js'
2+
3+
const exportConfig = process.env.EXPORT_CONFIG && JSON.parse(process.env.EXPORT_CONFIG)
4+
5+
async function main (exportConfig) {
6+
if (!exportConfig) {
7+
throw new Error('No config received')
8+
}
9+
10+
const eventName = exportConfig.dataform_trigger
11+
if (!eventName) {
12+
throw new Error('No trigger name found')
13+
}
14+
15+
if (eventName === 'report_complete') {
16+
console.info('Report export')
17+
console.log(exportConfig)
18+
const reports = new ReportsExporter()
19+
await reports.export(exportConfig)
20+
} else if (eventName === 'report_cwv_tech_complete') {
21+
console.info('Tech Report export')
22+
console.log(exportConfig)
23+
const techReports = new TechReportsExporter()
24+
await techReports.export(exportConfig)
25+
} else {
26+
throw new Error('Bad Request: unknown trigger name')
27+
}
28+
return 'OK'
29+
}
30+
31+
await main(exportConfig).catch((error) => {
32+
console.error(error)
33+
process.exit(1)
34+
})

0 commit comments

Comments
 (0)