# Cloud Dataproc API Examples

[![Open in Cloud Shell][shell_img]][shell_link]

[shell_img]: http://gstatic.com/cloudssh/images/open-btn.png
[shell_link]: https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/python-docs-samples&page=editor&open_in_editor=dataproc/README.md

Sample command-line programs for interacting with the Cloud Dataproc API.

See [the tutorial on using the Dataproc API with the Python client
library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
for a walkthrough you can run to try out the Cloud Dataproc API sample code.

Note that while these samples interact with Dataproc via the API, the same functionality could also be accomplished using the Cloud Console or the gcloud CLI.

`list_clusters.py` is a simple command-line program to demonstrate connecting to the Cloud Dataproc API and listing the clusters in a region.

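For orientation, the core of that listing logic looks roughly like the sketch below, assuming the `google-cloud-dataproc` client library; the project and region values are placeholders, and the actual sample may differ in detail.

    from google.cloud import dataproc_v1

    project_id = "your-project-id"  # placeholder
    region = "us-central1"          # placeholder

    # Point the client at the regional Dataproc endpoint.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # List every cluster in the region and print its name and state.
    for cluster in client.list_clusters(
        request={"project_id": project_id, "region": region}
    ):
        print(cluster.cluster_name, cluster.status.state)
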
`submit_job_to_cluster.py` demonstrates how to create a cluster, submit the
`pyspark_sort.py` job, download the output from Google Cloud Storage, and print the result.

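The job-submission step itself boils down to a call like the following sketch, again assuming the `google-cloud-dataproc` client library; the cluster, bucket, and file names are placeholders, and the real script wires these up from its command-line flags.

    from google.cloud import dataproc_v1

    project_id = "your-project-id"      # placeholder
    region = "us-central1"              # placeholder
    cluster_name = "your-cluster-name"  # placeholder

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Describe a PySpark job that runs a script previously uploaded to GCS.
    job = {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": "gs://your-staging-bucket/pyspark_sort.py"
        },
    }

    # Submit the job and block until it finishes.
    operation = job_client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    finished_job = operation.result()
    print(finished_job.driver_output_resource_uri)
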
`single_job_workflow.py` uses the Cloud Dataproc InstantiateInlineWorkflowTemplate API to create an ephemeral cluster, run a job, then delete the cluster with one API request.

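As a rough sketch of what one such request looks like with the `google-cloud-dataproc` client library (the template below, including the job and cluster settings, is illustrative rather than a copy of the sample):

    from google.cloud import dataproc_v1

    project_id = "your-project-id"  # placeholder
    region = "us-central1"          # placeholder

    client = dataproc_v1.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # An inline workflow template: one managed (ephemeral) cluster plus one job.
    template = {
        "jobs": [
            {
                "step_id": "pyspark-sort",
                "pyspark_job": {
                    "main_python_file_uri": "gs://your-bucket/pyspark_sort.py"
                },
            }
        ],
        "placement": {
            "managed_cluster": {
                "cluster_name": "ephemeral-cluster",
                "config": {"gce_cluster_config": {"zone_uri": "us-central1-b"}},
            }
        },
    }

    operation = client.instantiate_inline_workflow_template(
        request={
            "parent": f"projects/{project_id}/regions/{region}",
            "template": template,
        }
    )
    # Cluster creation, the job run, and cluster deletion all happen here.
    operation.result()
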
`pyspark_sort.py_gcs` is the same as `pyspark_sort.py` but demonstrates reading from a GCS bucket.

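For reference, a PySpark job in the same spirit looks roughly like this (the GCS path is a placeholder; the actual sample may differ):

    import pyspark

    sc = pyspark.SparkContext()

    # Read text from a GCS bucket (Dataproc clusters include the GCS connector),
    # then sort and print the lines on the driver.
    rdd = sc.textFile("gs://your-bucket/input/")  # placeholder path
    print(sorted(rdd.collect()))
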
## Prerequisites to run locally:

* [pip](https://pypi.python.org/pypi/pip)

Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.

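Alternatively, if you have the gcloud CLI installed and your project configured, the API can be enabled from the command line:

    gcloud services enable dataproc.googleapis.com
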
## Set Up Your Local Dev Environment

To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
(recommended), run the commands within a virtualenv.

 * pip install -r requirements.txt

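For example, one way to do this inside a virtualenv on a Unix-like shell (the environment name `env` is arbitrary):

    virtualenv env
    source env/bin/activate
    pip install -r requirements.txt
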
## Authentication

Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
The recommended approach for running these samples is to use a service account with a JSON key.

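For example, after downloading a service account key, you can point the client libraries at it via the standard application default credentials environment variable (the path below is a placeholder):

    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-key.json
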
## Environment Variables

Set the following environment variables:

    GOOGLE_CLOUD_PROJECT=your-project-id
    REGION=us-central1  # or your region
    CLUSTER_NAME=your-cluster-name
    ZONE=us-central1-b

## Running the samples

To run `list_clusters.py`:

    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

`submit_job_to_cluster.py` can create the Dataproc cluster or use an existing cluster. To create a cluster before running the code, you can use the [Cloud Console](https://console.cloud.google.com) or run:

    gcloud dataproc clusters create your-cluster-name

To run `submit_job_to_cluster.py`, first create a GCS bucket (used by Cloud Dataproc to stage files) from the Cloud Console or with gsutil:

    gsutil mb gs://<your-staging-bucket-name>

Next, set the following environment variables:

    BUCKET=your-staging-bucket
    CLUSTER=your-cluster-name

Then, if you want to use an existing cluster, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Alternatively, to create a new cluster, which will be deleted at the end of the job, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

The script will set up a cluster, upload the PySpark file, submit the job, print the result, and then, if it created the cluster, delete it.

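If you are curious how the result gets printed: a finished job reports its driver output location via `driver_output_resource_uri`, which points into the staging bucket. A hedged sketch of fetching that output with the `google-cloud-storage` client (the URI handling here is illustrative and may differ from the actual script):

    from google.cloud import storage

    def download_driver_output(gs_uri):
        """Return the text of a job's driver output, given its gs:// URI.

        Dataproc may split the output across several numbered objects,
        so read everything under the URI's prefix.
        """
        bucket_name, prefix = gs_uri[len("gs://"):].split("/", 1)
        client = storage.Client()
        parts = client.list_blobs(bucket_name, prefix=prefix)
        return "".join(blob.download_as_bytes().decode("utf-8") for blob in parts)
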
Optionally, you can add the `--pyspark_file` argument to run a PySpark script of your own instead of the default `pyspark_sort.py` included with this sample.