Here we will go over some common tasks related to utilizing RAPIDS on the GCP AI Platform. Note that strings containing '[YOUR_XXX]' indicate items you will need to supply based on your specific resource names and environment.
Motivation: We would like to create a GCP notebook instance with the RAPIDS 0.18 release.
Workflow: We will create a notebook instance using the RAPIDS 0.18 [Experimental] environment.
- Log into your GCP console.
- Select AI-Platform -> Notebooks
- Select a "New Instance" -> "RAPIDS 0.18 [Experimental]"
- Select 'Install NVIDIA GPU driver automatically for me'
- Create
- Once JupyterLab is running, you will have Jupyter notebooks with RAPIDS installed and RAPIDS notebook examples under tutorials/RapidsAi.
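- Optionally, as a quick sanity check, open a terminal in JupyterLab and confirm that the GPU driver and RAPIDS libraries are visible (generic commands; they assume cuDF is installed in the image's default Python environment):
$ nvidia-smi
$ python -c "import cudf; print(cudf.__version__)"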
To create an instance with A100s:
- Select "New Instance" -> "Customize instance"
- Select us-central1 region
- Select "RAPIDS 0.18 [Experimental]" Environment
- Choose A2 highgpu (for 1, 2, 4, or 8 A100s) or A2 megagpu (for 16 A100s) as the machine type
Motivation: We have an existing GCP notebook that we wish to update to support RAPIDS functionality.
Workflow: We will create a notebook instance and run a few shell commands that install a Jupyter kernel, allowing us to run RAPIDS-based tasks.
- Log into your GCP console.
- Select AI-Platform -> Notebooks
- Select a "New Instance" -> "Python 3 (CUDA Toolkit 11.0)" -> With 1 NVIDIA Tesla T4
- Select 'Install NVIDIA GPU driver automatically for me'
- Create.
- Once JupyterLab is running
- Open a new terminal
- Run
# Download and unpack a conda-packed RAPIDS environment, then register it as a Jupyter kernel
RAPIDS_VER=21.06
CUDA_VER=11.0
wget -q https://data.rapids.ai/conda-pack/rapidsai/rapids${RAPIDS_VER}_cuda${CUDA_VER}_py3.8.tar.gz
mkdir -p /opt/conda/envs/rapids_py38   # the target directory must exist before extracting
tar -xzf rapids${RAPIDS_VER}_cuda${CUDA_VER}_py3.8.tar.gz -C /opt/conda/envs/rapids_py38
conda activate rapids_py38
conda unpack
ipython kernel install --user --name=rapids_py38
- Once completed, you will have a new kernel in your Jupyter notebooks called 'rapids_py38' with RAPIDS installed.
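- To confirm the kernel was registered (an optional, generic check), list the available Jupyter kernels from the terminal:
$ jupyter kernelspec list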
Deploy a custom RAPIDS training container utilizing the 'airline dataset', and initiate a training job with support for HyperParameter Optimization (HPO)
Motivation: We would like to utilize GCP's AI Platform to train a custom model with RAPIDS.
Workflow: Install the required libraries and authentication components for GCP, configure a storage bucket for persistent data, build our custom training container, upload the container, and launch a training job with HPO.
- Install GCP 'gcloud' SDK
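- For example, after installing the SDK you would typically authenticate and point it at your project (assuming '[YOUR_PROJECT_NAME]' is your project ID):
$ gcloud auth login
$ gcloud config set project [YOUR_PROJECT_NAME]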
- Configure gcloud authorization for docker on your build machine
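- For example, registering gcloud as a Docker credential helper lets docker push images to gcr.io:
$ gcloud auth configure-docker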
- Configure a Google Cloud Storage bucket that will provide an input and output location
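- One way to create such a bucket (hypothetical name; pick a region that matches your training jobs, e.g. us-west1 as in the example config below):
$ gsutil mb -l us-west1 gs://[YOUR_GOOGLE_STORAGE_BUCKET]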
- Pull or build training containers and upload to GCR
- Pull
- Find the appropriate RAPIDS training container image, pull it locally, and tag it for your project's Container Registry:
docker tag <image> gcr.io/[YOUR_PROJECT_NAME]/rapids_training_container:latest
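- After tagging, push the image to GCR so AI Platform can access it (same image name as above):
$ docker push gcr.io/[YOUR_PROJECT_NAME]/rapids_training_container:latest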
- Build
$ cd .
$ docker build --tag gcr.io/[YOUR_PROJECT_NAME]/rapids_training_container:latest --file common/docker/Dockerfile.training.unified .
$ docker push gcr.io/[YOUR_PROJECT_NAME]/rapids_training_container:latest
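- Optionally, confirm the image is now available in your project's registry (a generic check):
$ gcloud container images list --repository=gcr.io/[YOUR_PROJECT_NAME]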
- Training via GCP UI
- A quick note regarding GCP's cloudml-Hypertune
- This library interacts with the GCP AI Platform's HPO process by reporting required optimization metrics to the system after each training iteration.
# report the metric for this training iteration; `accuracy` is computed by the training code
import hypertune

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='hpo_accuracy',
    metric_value=accuracy)
- For our purposes, the 'hyperparameter_metric_tag' should always correspond to the 'Metric to optimize' element passed to a job deployment.
- Training Algorithm
- From the GCP console select 'jobs' -> 'new training job' -> custom code training
- Choose 'Select a container image from the container Registry'
- Set 'Master image' to 'gcr.io/[YOUR_PROJECT_NAME]/rapids_training_container:latest'
- Set 'Job directory' to 'gs://[YOUR_GOOGLE_STORAGE_BUCKET]'
- Algorithm Arguments
- Ex:
--train --do-hpo --cloud-type=GCP --data-input-path=gs://[YOUR STORAGE BUCKET] --data-output-path=gs://[YOUR STORAGE BUCKET]/training_output --data-name=airline_20000000.orc
- With Hypertune
- Enter the hypertune parameters. Ex:
- Argument name: hpo-max-depth, Type: Integer, Min: 2, Max: 8
- Argument name: hpo-num-est, Type: Integer, Min: 100, Max: 200
- Argument name: hpo-max-features, Type: Double, Min: 0.2, Max: 0.6
- Enter an optimizing metric whose name matches the 'hyperparameter_metric_tag' reported by the training code. Ex: hpo_accuracy
- Job Settings
- Training via gcloud job submission
- Update your training configuration based on 'example_config.json'
{
  "trainingInput": {
    "args": [
      "--train",
      "--do-hpo",
      "--cloud-type=GCP",
      "--data-input-path=gs://[YOUR STORAGE BUCKET]",
      "--data-output-path=gs://[YOUR STORAGE BUCKET]/training_output",
      "--data-name=airline_20000000.orc"
    ],
    "hyperparameters": {
      "enableTrialEarlyStopping": true,
      "goal": "MAXIMIZE",
      "hyperparameterMetricTag": "hpo_accuracy",
      "maxParallelTrials": 1,
      "maxTrials": 2,
      "params": [
        {
          "maxValue": 200,
          "minValue": 100,
          "parameterName": "hpo-num-est",
          "type": "INTEGER"
        },
        {
          "maxValue": 17,
          "minValue": 9,
          "parameterName": "hpo-max-depth",
          "type": "INTEGER"
        },
        {
          "maxValue": 0.6,
          "minValue": 0.2,
          "parameterName": "hpo-max-features",
          "type": "DOUBLE"
        }
      ]
    },
    "jobDir": "gs://[YOUR PROJECT NAME]/training_output",
    "masterConfig": {
      "imageUri": "gcr.io/[YOUR PROJECT NAME]/rapids_training_container:latest",
      "acceleratorConfig": {
        "count": "1",
        "type": "NVIDIA_TESLA_T4"
      }
    },
    "masterType": "n1-standard-8",
    "region": "us-west1",
    "scaleTier": "CUSTOM"
  }
}
- For more information, see:
-
- Run your training job
$ gcloud ai-platform jobs submit training [YOUR_JOB_NAME] --config ./example_config.json
- Monitor your training job
$ gcloud ai-platform jobs stream-logs [YOUR_JOB_NAME]
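- You can also check the job's current state at any time (standard gcloud command; '[YOUR_JOB_NAME]' as above):
$ gcloud ai-platform jobs describe [YOUR_JOB_NAME]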