The MongoDB to BigQuery template is a batch pipeline that reads documents from MongoDB and writes them to BigQuery as specified by the `userOption` parameter.
📝 This is a Google-provided template! Please check the Provided templates documentation (https://cloud.google.com/dataflow/docs/guides/templates/provided-templates) for how to use it without having to build from source, via Create job from template.
💡 This documentation is generated from Metadata Annotations. Do not change this file directly.
Required parameters:

- `mongoDbUri`: The MongoDB connection URI in the format `mongodb+srv://:@.`.
- `database`: The database in MongoDB to read the collection from. For example, `my-db`.
- `collection`: The name of the collection inside the MongoDB database. For example, `my-collection`.
- `userOption`: `FLATTEN`, `JSON`, or `NONE`. `FLATTEN` flattens the documents to a single level. `JSON` stores the document in BigQuery JSON format. `NONE` stores the whole document as a JSON-formatted STRING. Defaults to: `NONE`.
- `outputTableSpec`: The BigQuery table to write to. For example, `bigquery-project:dataset.output_table`.

Optional parameters:

- `KMSEncryptionKey`: The Cloud KMS encryption key to decrypt the MongoDB URI connection string. If a Cloud KMS key is passed in, the MongoDB URI connection string must be passed in encrypted. For example, `projects/your-project/locations/global/keyRings/your-keyring/cryptoKeys/your-key`.
- `filter`: A BSON filter in JSON format. For example, `{ "val": { $gt: 0, $lt: 9 }}`.
- `useStorageWriteApi`: If `true`, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is `false`. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
- `useStorageWriteApiAtLeastOnce`: When using the Storage Write API, specifies the write semantics. To use at-least-once semantics (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics), set this parameter to `true`. To use exactly-once semantics, set the parameter to `false`. This parameter applies only when `useStorageWriteApi` is `true`. The default value is `false`.
- `bigQuerySchemaPath`: The Cloud Storage path for the BigQuery JSON schema. For example, `gs://your-bucket/your-schema.json`. (A sketch of preparing this file follows this list.)
- `javascriptDocumentTransformGcsPath`: The Cloud Storage URI of the `.js` file that defines the JavaScript user-defined function (UDF) to use. For example, `gs://your-bucket/your-transforms/*.js`. (See the sketch after this list.)
- `javascriptDocumentTransformFunctionName`: The name of the JavaScript user-defined function (UDF) to use. For example, if your JavaScript function code is `myTransform(inJson) { /*...do stuff...*/ }`, then the function name is `myTransform`. For sample JavaScript UDFs, see UDF Examples (https://github.com/GoogleCloudPlatform/DataflowTemplates#udf-examples). For example, `transform`.
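The schema and UDF files referenced by `bigQuerySchemaPath` and `javascriptDocumentTransformGcsPath` must already exist in Cloud Storage before launching the job. Below is a minimal sketch of preparing them; the bucket, file names, field names, and function body are placeholder assumptions, and the exact schema JSON layout expected by the template should be verified against the template documentation:

```shell
# Hypothetical example: write a BigQuery JSON schema and a JavaScript UDF
# locally, then upload both to Cloud Storage. All names are placeholders.

# Schema layout is an assumption (standard BigQuery "fields" format);
# verify the exact layout the template expects.
cat > your-schema.json <<'EOF'
{
  "fields": [
    { "name": "id",  "type": "STRING" },
    { "name": "val", "type": "INTEGER" }
  ]
}
EOF

# The UDF receives each document as a JSON string and must return a JSON
# string, matching the myTransform(inJson) shape shown above.
cat > transform.js <<'EOF'
function myTransform(inJson) {
  var obj = JSON.parse(inJson);
  obj.processed = true; // illustrative change only
  return JSON.stringify(obj);
}
EOF

gsutil cp your-schema.json gs://your-bucket/your-schema.json
gsutil cp transform.js gs://your-bucket/your-transforms/transform.js
```

With these in place, `javascriptDocumentTransformFunctionName` would be set to `myTransform` in this sketch.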
Requirements:

- Java 17
- Maven
- gcloud CLI, and execution of the following commands:

```shell
gcloud auth login
gcloud auth application-default login
```

🌟 Those dependencies are pre-installed if you use Google Cloud Shell!
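If the project has not run Dataflow before, the relevant APIs may also need to be enabled. This step is an assumption about a fresh project, not part of the template's own documentation:

```shell
# Enable the services the pipeline touches (only needed once per project).
gcloud services enable \
  dataflow.googleapis.com \
  compute.googleapis.com \
  bigquery.googleapis.com \
  storage.googleapis.com
```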
This README provides instructions using the Templates Plugin.
This template is a Flex Template, meaning that the pipeline code is containerized and the container is executed on Dataflow. Please check Use Flex Templates and Configure Flex Templates for more information.
If the plan is to just stage the template (i.e., make it available to use) via the `gcloud` command or the Dataflow "Create job from template" UI, the `-PtemplatesStage` profile should be used:
```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>

mvn clean package -PtemplatesStage \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -DstagePrefix="templates" \
  -DtemplateName="MongoDB_to_BigQuery" \
  -f v2/mongodb-to-googlecloud
```
The command should build and save the template to Google Cloud, and then print the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/MongoDB_to_BigQuery
```

The specific path should be copied, as it will be used in the following steps.
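As an optional sanity check (not part of the official flow), you can confirm that the staged spec exists with `gsutil`:

```shell
# Assumes the staging command above succeeded and used the same bucket.
gsutil ls "gs://$BUCKET_NAME/templates/flex/MongoDB_to_BigQuery"
```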
Using the staged template:
You can use the path above to run the template (or share it with others for execution).
To start a job with the template at any time using `gcloud`, you need valid resources for the required parameters. Provided that, the following command line can be used:
```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/MongoDB_to_BigQuery"

### Required
export MONGO_DB_URI=<mongoDbUri>
export DATABASE=<database>
export COLLECTION=<collection>
export USER_OPTION=NONE
export OUTPUT_TABLE_SPEC=<outputTableSpec>

### Optional
export KMSENCRYPTION_KEY=<KMSEncryptionKey>
export FILTER=<filter>
export USE_STORAGE_WRITE_API=false
export USE_STORAGE_WRITE_API_AT_LEAST_ONCE=false
export BIG_QUERY_SCHEMA_PATH=<bigQuerySchemaPath>
export JAVASCRIPT_DOCUMENT_TRANSFORM_GCS_PATH=<javascriptDocumentTransformGcsPath>
export JAVASCRIPT_DOCUMENT_TRANSFORM_FUNCTION_NAME=<javascriptDocumentTransformFunctionName>

gcloud dataflow flex-template run "mongodb-to-bigquery-job" \
  --project "$PROJECT" \
  --region "$REGION" \
  --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
  --parameters "mongoDbUri=$MONGO_DB_URI" \
  --parameters "database=$DATABASE" \
  --parameters "collection=$COLLECTION" \
  --parameters "userOption=$USER_OPTION" \
  --parameters "KMSEncryptionKey=$KMSENCRYPTION_KEY" \
  --parameters "filter=$FILTER" \
  --parameters "useStorageWriteApi=$USE_STORAGE_WRITE_API" \
  --parameters "useStorageWriteApiAtLeastOnce=$USE_STORAGE_WRITE_API_AT_LEAST_ONCE" \
  --parameters "outputTableSpec=$OUTPUT_TABLE_SPEC" \
  --parameters "bigQuerySchemaPath=$BIG_QUERY_SCHEMA_PATH" \
  --parameters "javascriptDocumentTransformGcsPath=$JAVASCRIPT_DOCUMENT_TRANSFORM_GCS_PATH" \
  --parameters "javascriptDocumentTransformFunctionName=$JAVASCRIPT_DOCUMENT_TRANSFORM_FUNCTION_NAME"
```
For more information about the command, please check: https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run
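Once the job is submitted, its progress can also be followed from the CLI. A brief sketch (the job name matches the one used above):

```shell
# List recent Dataflow jobs in the region, then inspect the launched job
# using the job ID taken from the listing.
gcloud dataflow jobs list --project "$PROJECT" --region "$REGION" --limit=5
# gcloud dataflow jobs show <job-id> --project "$PROJECT" --region "$REGION"
```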
Using the plugin:
Instead of just generating the template in the folder, it is possible to stage and run the template in a single command. This can be useful for testing while changing the templates.
```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export MONGO_DB_URI=<mongoDbUri>
export DATABASE=<database>
export COLLECTION=<collection>
export USER_OPTION=NONE
export OUTPUT_TABLE_SPEC=<outputTableSpec>

### Optional
export KMSENCRYPTION_KEY=<KMSEncryptionKey>
export FILTER=<filter>
export USE_STORAGE_WRITE_API=false
export USE_STORAGE_WRITE_API_AT_LEAST_ONCE=false
export BIG_QUERY_SCHEMA_PATH=<bigQuerySchemaPath>
export JAVASCRIPT_DOCUMENT_TRANSFORM_GCS_PATH=<javascriptDocumentTransformGcsPath>
export JAVASCRIPT_DOCUMENT_TRANSFORM_FUNCTION_NAME=<javascriptDocumentTransformFunctionName>

mvn clean package -PtemplatesRun \
  -DskipTests \
  -DprojectId="$PROJECT" \
  -DbucketName="$BUCKET_NAME" \
  -Dregion="$REGION" \
  -DjobName="mongodb-to-bigquery-job" \
  -DtemplateName="MongoDB_to_BigQuery" \
  -Dparameters="mongoDbUri=$MONGO_DB_URI,database=$DATABASE,collection=$COLLECTION,userOption=$USER_OPTION,KMSEncryptionKey=$KMSENCRYPTION_KEY,filter=$FILTER,useStorageWriteApi=$USE_STORAGE_WRITE_API,useStorageWriteApiAtLeastOnce=$USE_STORAGE_WRITE_API_AT_LEAST_ONCE,outputTableSpec=$OUTPUT_TABLE_SPEC,bigQuerySchemaPath=$BIG_QUERY_SCHEMA_PATH,javascriptDocumentTransformGcsPath=$JAVASCRIPT_DOCUMENT_TRANSFORM_GCS_PATH,javascriptDocumentTransformFunctionName=$JAVASCRIPT_DOCUMENT_TRANSFORM_FUNCTION_NAME" \
  -f v2/mongodb-to-googlecloud
```
Dataflow supports using Terraform to manage template jobs; see dataflow_flex_template_job (https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/dataflow_flex_template_job).
Terraform modules have been generated for most templates in this repository, including the parameters specific to each template. If available, they may be used instead of `dataflow_flex_template_job` directly.
To use the autogenerated module, execute the standard Terraform workflow:
```shell
cd v2/mongodb-to-googlecloud/terraform/MongoDB_to_BigQuery
terraform init
terraform apply
```
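The generated module typically exposes the template parameters as input variables. A hypothetical invocation follows; the variable names are assumed to mirror the parameter names above, so check the generated variables.tf for the authoritative list:

```shell
# Hypothetical: pass the required parameters as Terraform variables.
terraform apply \
  -var="project=<my-project>" \
  -var="region=us-central1" \
  -var="mongoDbUri=<mongoDbUri>" \
  -var="database=<database>" \
  -var="collection=<collection>" \
  -var="userOption=NONE" \
  -var="outputTableSpec=<outputTableSpec>"
```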
To use `dataflow_flex_template_job` directly:

```terraform
provider "google-beta" {
  project = var.project
}

variable "project" {
  default = "<my-project>"
}

variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "mongodb_to_bigquery" {
  provider                = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/MongoDB_to_BigQuery"
  name                    = "mongodb-to-bigquery"
  region                  = var.region
  parameters = {
    mongoDbUri      = "<mongoDbUri>"
    database        = "<database>"
    collection      = "<collection>"
    userOption      = "NONE"
    outputTableSpec = "<outputTableSpec>"
    # KMSEncryptionKey = "<KMSEncryptionKey>"
    # filter = "<filter>"
    # useStorageWriteApi = "false"
    # useStorageWriteApiAtLeastOnce = "false"
    # bigQuerySchemaPath = "<bigQuerySchemaPath>"
    # javascriptDocumentTransformGcsPath = "<javascriptDocumentTransformGcsPath>"
    # javascriptDocumentTransformFunctionName = "<javascriptDocumentTransformFunctionName>"
  }
}
```