
Deploy the Federated learning reference architecture on Google Cloud

This document shows how to deploy the Google Cloud Federated Learning (FL) reference architecture.

To deploy this reference architecture, you need the prerequisites described in the following sections.

This reference architecture builds on the Reference implementation for the Core GKE Accelerated Platform, which this document refers to as the core platform. The deployment procedure described in the Deploy the reference architecture section deploys an instance of the core platform for you.

Service account roles and permissions

You can choose between Project Owner access and granular access with more fine-tuned permissions.

Option 1: Project Owner role

The service account will have full administrative access to the project.

Option 2: Granular Access

The service account will be assigned the following roles to limit access to required resources:

  • roles/artifactregistry.admin: Grants full administrative access to Artifact Registry, allowing management of repositories and artifacts.
  • roles/browser: Provides read-only access to browse resources in a project.
  • roles/cloudkms.admin: Provides full administrative control over Cloud KMS (Key Management Service) resources.
  • roles/compute.networkAdmin: Grants full control over Compute Engine network resources.
  • roles/container.clusterAdmin: Provides full control over Google Kubernetes Engine (GKE) clusters, including creating and managing clusters.
  • roles/gkehub.editor: Grants permission to manage GKE Hub features.
  • roles/iam.serviceAccountAdmin: Grants full control over managing service accounts in the project.
  • roles/resourcemanager.projectIamAdmin: Allows managing IAM policies and roles at the project level.
  • roles/servicenetworking.serviceAgent: Allows managing service networking configurations.
  • roles/serviceusage.serviceUsageAdmin: Grants permission to enable and manage services and APIs for a project.
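If you choose granular access, you can grant each of these roles to the service account with gcloud. The following sketch grants one of the roles; PROJECT_ID and SA_NAME are placeholders for your project ID and service account name:

    gcloud projects add-iam-policy-binding PROJECT_ID \
      --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
      --role="roles/artifactregistry.admin"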

Understand the repository structure

The platforms/gke/base/use-cases/federated-learning use case has the following directories and files:

  • terraform: contains Terraform descriptors and configuration to deploy the reference architecture.
  • deploy.sh: convenience script to deploy the reference architecture.
  • teardown.sh: convenience script to destroy the reference architecture.
  • common.sh: contains common shell variables and functions.
  • assets: contains documentation static assets.
  • README.md: this document.

Architecture

The following diagram describes the architecture that you can create with this reference architecture:

[Architecture diagram: Federated learning reference architecture on Google Cloud]

As shown in the preceding diagram, the reference architecture helps you create and configure the following infrastructure components:

  • A Virtual Private Cloud (VPC) network and subnets.

  • A private GKE cluster that helps you:

    • Isolate cluster nodes from the internet.
    • Limit exposure of your cluster nodes and control plane to the internet.
    • Use shielded GKE nodes.
    • Enable Dataplane V2 for optimized Kubernetes networking.
    • Encrypt cluster secrets at the application layer.
  • Dedicated GKE node pools to isolate workloads from each other in dedicated runtime environments.

  • For each GKE node pool, the reference architecture creates a dedicated Kubernetes namespace. The Kubernetes namespace and its resources are treated as a tenant within the GKE cluster.

  • For each GKE node pool, the reference architecture configures Kubernetes taints to ensure that only the corresponding tenant's workloads are schedulable onto the nodes in that pool.

  • A GKE node pool (system) to host coordination and management workloads that aren't tied to specific tenants.

  • Firewall policies that block ingress and egress traffic to and from GKE node pools, unless explicitly allowed.

  • Cloud NAT to allow explicitly permitted egress traffic to the internet.

  • Cloud DNS records to enable Private Google Access such that workloads within the cluster can access Google APIs without traversing the internet.

  • Cloud Identity and Access Management (IAM) service accounts.

  • An Artifact Registry repository to store container images for your workloads.

  • Config Sync to sync cluster configuration and policies from a Git repository or an OCI-compliant repository. Users and teams managing workloads should not have permissions to change cluster configuration or modify service mesh resources unless explicitly allowed by your policies.

  • An Artifact Registry repository to store Config Sync configurations.

  • Policy Controller to enforce policies on resources in the GKE cluster to help you isolate workloads.

  • Cloud Service Mesh to control and help secure network traffic.

Config Sync applies the following Policy Controller and Cloud Service Mesh controls to each Kubernetes namespace:

  • By default, deny all ingress and egress traffic to and from pods. This rule acts as a baseline 'deny all' rule.
  • Allow egress traffic to required cluster resources, such as the GKE control plane.
  • Allow egress traffic only to known hosts.
  • Allow ingress and egress traffic that originates from within the same namespace.
  • Allow ingress and egress traffic between pods in the same namespace.
  • Allow egress traffic to Google APIs only using Private Google Access.
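For illustration, the baseline 'deny all' rule corresponds to a default-deny Kubernetes NetworkPolicy similar to the following sketch. The fl-1 namespace name is only an example; the actual manifests are managed by Config Sync:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny
      namespace: fl-1  # example tenant namespace
    spec:
      podSelector: {}  # selects all pods in the namespace
      policyTypes:
        - Ingress
        - Egress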

Deploy the reference architecture

To deploy the reference architecture, you do the following:

  1. Open Cloud Shell.

  2. Clone this repository and change the working directory:

    git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
    cd accelerated-platforms
  3. Configure the ID of the Google Cloud project where you want to initialize the provisioning and configuration environment. This project will also contain the remote Terraform backend. Add the following content to platforms/gke/base/_shared_config/terraform.auto.tfvars:

    terraform_project_id = "<CONFIG_PROJECT_ID>"

    Where:

    • <CONFIG_PROJECT_ID> is the Google Cloud project ID.
  4. Configure the ID of the Google Cloud project where you want to deploy the reference architecture by adding the following content to platforms/gke/base/_shared_config/cluster.auto.tfvars:

    cluster_project_id = "<PROJECT_ID>"

    Where:

    • <PROJECT_ID> is the Google Cloud project ID. It can be different from <CONFIG_PROJECT_ID>.
  5. Optionally, configure a unique identifier to append to the names of all the resources in the reference architecture. This identifier distinguishes a particular instance of the reference architecture and lets you deploy multiple instances in the same Google Cloud project. To configure the unique identifier, add the following content to platforms/gke/base/_shared_config/platform.auto.tfvars:

    resource_name_prefix = "<RESOURCE_NAME_PREFIX>"
    platform_name        = "<PLATFORM_NAME>"

    Where:

    • <RESOURCE_NAME_PREFIX> and <PLATFORM_NAME> are strings that compose the unique identifier to append to the name of all the resources in the reference architecture.

    When you set resource_name_prefix and platform_name, we recommend that you avoid long strings, because they might cause resource name validation to fail if the resulting resource names are too long.

  6. Run the script to provision the reference architecture:

    "platforms/gke/base/use-cases/federated-learning/deploy.sh"

It takes about 20 minutes to provision the reference architecture.

Understand the deployment and destroy processes

The platforms/gke/base/use-cases/federated-learning/deploy.sh script is a convenience script to orchestrate the provisioning and configuration of an instance of the reference architecture. platforms/gke/base/use-cases/federated-learning/deploy.sh does the following:

  1. Configures environment variables to reference libraries and other dependencies.
  2. Initializes the core platform configuration files.
  3. Initializes the core platform by running the core platform initialize service.
  4. Provisions and configures Google Cloud resources that the core platform depends on.
  5. Provisions and configures an instance of the core platform.
  6. Provisions and configures Google Cloud resources that the FL reference architecture depends on, augmenting the core platform.

The platforms/gke/base/use-cases/federated-learning/teardown.sh script is a convenience script to orchestrate the destruction of an instance of the reference architecture. platforms/gke/base/use-cases/federated-learning/teardown.sh performs actions that are opposite to platforms/gke/base/use-cases/federated-learning/deploy.sh, in reverse order.

Next steps

After deploying the reference architecture, the GKE cluster is ready to host your federated learning workloads. For example, you can:

Destroy the reference architecture

To destroy an instance of the reference architecture, you do the following:

  1. Open Cloud Shell.

  2. Run the script to destroy the reference architecture:

    "platforms/gke/base/use-cases/federated-learning/teardown.sh"

Configure the Federated learning reference architecture

You can configure the reference architecture by modifying files in the following directories:

  • platforms/gke/base/_shared_config
  • platforms/gke/base/use-cases/federated-learning/terraform/_shared_config

To add files to the package that Config Sync uses to sync cluster configuration:

  1. Copy the additional files into the platforms/gke/base/use-cases/federated-learning/terraform/config_management/files/additional directory.
  2. Run the platforms/gke/base/use-cases/federated-learning/deploy.sh script.
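For example, assuming a manifest file named my-policy.yaml (a hypothetical name), the workflow could look like the following:

    # Copy the manifest into the Config Sync package, then redeploy
    cp my-policy.yaml platforms/gke/base/use-cases/federated-learning/terraform/config_management/files/additional/
    platforms/gke/base/use-cases/federated-learning/deploy.sh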

Configure isolated runtime environments

The reference architecture configures runtime environments that are isolated from each other. Each runtime environment gets:

  • A dedicated Kubernetes Namespace
  • A dedicated GKE node pool

These isolated runtime environments are referred to as tenants.

For more information about the design of these tenants, see Federated Learning reference architecture.

By default, this reference architecture configures one tenant. To configure additional tenants, or change their names, set the value of the federated_learning_tenant_names Terraform variable in platforms/gke/base/use-cases/federated-learning/terraform/_shared_config/uc_federated_learning.auto.tfvars according to how many tenants you need. For example, to create two isolated tenants named fl-1 and fl-2, you set the federated_learning_tenant_names variable as follows:

federated_learning_tenant_names = [
  "fl-1",
  "fl-2",
]

For more information about the federated_learning_tenant_names variable, see its definition in platforms/gke/base/use-cases/federated-learning/terraform/_shared_config/uc_federated_learning_variables.tf.

Enable Confidential GKE Nodes

The reference architecture can optionally configure Confidential GKE Nodes using Terraform. To enable Confidential GKE Nodes, you do the following:

  1. Initialize the following Terraform variables in platforms/gke/base/_shared_config/cluster.auto.tfvars:

    1. Set cluster_confidential_nodes_enabled to true.

    2. Set cluster_system_node_pool_machine_type to a machine type that supports Confidential GKE Nodes. For more information about the machine types that support Confidential GKE Nodes, see Encrypt workload data in-use with Confidential GKE Nodes.

  2. Initialize the following Terraform variables in platforms/gke/base/use-cases/federated-learning/terraform/_shared_config/uc_federated_learning.auto.tfvars:

    1. Set federated_learning_node_pool_machine_type to a machine type that supports Confidential GKE Nodes.
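For example, the resulting entries might look like the following sketch. The n2d-standard-4 machine type is shown only as an illustration of an N2D machine type that supports Confidential GKE Nodes; check the linked documentation for the currently supported types:

    # platforms/gke/base/_shared_config/cluster.auto.tfvars
    cluster_confidential_nodes_enabled    = true
    cluster_system_node_pool_machine_type = "n2d-standard-4"  # example machine type

    # platforms/gke/base/use-cases/federated-learning/terraform/_shared_config/uc_federated_learning.auto.tfvars
    federated_learning_node_pool_machine_type = "n2d-standard-4"  # example machine type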

Allow desired network traffic

  1. Configure Kubernetes network policies to allow traffic. You can use Cloud Logging to see how the current Kubernetes network policies affect traffic in your cluster.

  2. If your workloads need to access hosts that are external to the service mesh, configure a ServiceEntry for each external host, as shown in the sketch after this list.

  3. If your workloads need to send traffic outside the cluster, configure:

    • AuthorizationPolicies to allow traffic from the workload namespace to the istio-egress namespace.
    • VirtualServices to direct traffic from the workload to the egress gateway, and from the egress gateway to the destination.
    • NetworkPolicies to allow egress traffic from the workload's namespace.
  4. If your workloads need to receive traffic from outside the cluster, configure:

    • AuthorizationPolicies to allow traffic from the istio-ingress namespace to the workload namespace.
    • VirtualServices to direct traffic from the external service to the ingress gateway, and from the ingress gateway to the workload.
    • NetworkPolicies to allow ingress traffic to the workload's namespace.
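As a sketch of step 2, a ServiceEntry that registers an external host with the mesh might look like the following; the example.com host and the fl-1 namespace are placeholders:

    apiVersion: networking.istio.io/v1beta1
    kind: ServiceEntry
    metadata:
      name: external-api  # hypothetical name
      namespace: fl-1     # example tenant namespace
    spec:
      hosts:
        - "example.com"   # placeholder external host
      location: MESH_EXTERNAL
      ports:
        - number: 443
          name: tls
          protocol: TLS
      resolution: DNS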

Troubleshooting

This section describes common issues and troubleshooting steps.

Network address assignment errors when running Terraform

If Terraform reports connect: cannot assign requested address errors, try running the command again.

Errors when provisioning the reference architecture

  • Cloud Shell has 5 GB of available disk space. Depending on your Cloud Shell usage, that might not be enough to deploy the reference architecture, unless you enable Terraform plugin caching so that plugins and providers are reused instead of downloaded multiple times (see the sketch after this list). Symptoms of this issue are errors like the following:

    │ Error: Failed to install provider
    │
    │ Error while installing hashicorp/google v6.12.0: write
    │ .terraform/providers/registry.terraform.io/hashicorp/google/6.12.0/linux_amd64/terraform-provider-google_v6.12.0_x5: no space left on device
    ╵
    
  • If Cloud Service Mesh is reported as being in the Pending enablement state in the GKE Enterprise feature dashboard, try disabling and re-enabling Cloud Service Mesh:

    terraform -chdir=platforms/gke/base/core/gke_enterprise/servicemesh init && \
      terraform -chdir=platforms/gke/base/core/gke_enterprise/servicemesh destroy -auto-approve -input=false && \
      terraform -chdir=platforms/gke/base/core/gke_enterprise/servicemesh apply -input=false
  • Client-side tools and Cloud Shell authenticate with Google Cloud using a short-lived token. If the token expires, you might receive errors similar to the following:

    │ Error: Error when reading or editing Project "PROJECT_ID": Get "https://cloudresourcemanager.googleapis.com/v1/projects/PROJECT_ID?alt=json&prettyPrint=false": oauth2/google: invalid token JSON from metadata: EOF
    │
    │   with data.google_project.cluster,
    │   on project.tf line 15, in data "google_project" "cluster":
    │   15: data "google_project" "cluster" {
    

    If this error occurs, try reloading Cloud Shell.
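To enable Terraform plugin caching, as mentioned in the first item of this list, you can set the TF_PLUGIN_CACHE_DIR environment variable before running the deploy script. The cache directory path below is only an example:

    # Create a cache directory and tell Terraform to reuse downloaded providers
    mkdir -p "${HOME}/.terraform.d/plugin-cache"
    export TF_PLUGIN_CACHE_DIR="${HOME}/.terraform.d/plugin-cache"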

Errors when adding the GKE cluster to the Fleet

If Terraform reports errors about the format of the fleet membership configuration, it might mean that the Fleet API initialization hadn't completed when Terraform tried to add the GKE cluster to the fleet. For example:

Error creating FeatureMembership: googleapi: Error 400: InvalidValueError for
field membership_specs["projects/<project number>/locations/global/memberships/<cluster name>"].feature_spec:
does not match a current membership in this project. Keys should be in the form: projects/<project number>/locations/{l}/memberships/{m}

If this error occurs, try running terraform apply again.

Errors when enabling GKE Enterprise features

  • GKE Enterprise features already enabled in the Google Cloud project:

    Error: Error creating Feature: googleapi: Error 409: Resource
    'projects/PROJECT_NAME/locations/global/features/configmanagement' already
    exists
    

    To avoid this error, you can either:

    • Deploy the reference architecture in a new Google Cloud project, where GKE Enterprise features are not already enabled, so that the reference architecture can manage them.

    • Import the gke_hub_feature resources in the Terraform state, so that Terraform is aware of them. In this case, Terraform will also apply any configuration changes that the reference architecture requires. Before you import gke_hub_feature resources in the Terraform state, we recommend that you assess the impact on other GKE clusters in the same project that depend on those resources. For example, when you destroy this reference architecture, these resources will be destroyed too, potentially impacting other GKE clusters in the same project.

      For example, you can run the following command from the root directory of this repository to import the configmanagement feature:

      terraform -chdir=platforms/gke/base/core/gke_enterprise/configmanagement/oci init && \
        terraform -chdir=platforms/gke/base/core/gke_enterprise/configmanagement/oci import \
        google_gke_hub_feature.configmanagement \
        projects/<PROJECT_ID>/locations/global/features/configmanagement

      As another example, you can run the following command from the root directory of this repository to import the policycontroller feature:

      terraform -chdir=platforms/gke/base/core/gke_enterprise/policycontroller init && \
        terraform -chdir=platforms/gke/base/core/gke_enterprise/policycontroller import \
        google_gke_hub_feature.policycontroller \
        projects/<PROJECT_ID>/locations/global/features/policycontroller

      Where:

      • <PROJECT_ID> is the Google Cloud project ID where you deployed the reference architecture.

Errors when pulling container images

If istio-ingress or istio-egress Pods fail to run because GKE cannot download their container images and GKE reports ImagePullBackOff errors, see Troubleshoot gateways for details about the potential root cause. You can inspect the status of these Pods in the GKE Workloads Dashboard.

If this happens:

  1. Wait for the cluster to complete its initialization.
  2. Delete the Deployment that is impacted by this issue, as shown in the example after this list. Config Sync will deploy it again with the correct container image identifiers.
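For example, assuming the impacted Deployment is istio-ingressgateway in the istio-ingress namespace (both names are illustrative and might differ in your cluster), you could run:

    kubectl --namespace istio-ingress delete deployment istio-ingressgateway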

Errors when deleting and cleaning up the environment

When running terraform destroy to remove resources that this reference architecture provisioned and configured, you might get the following errors:

  • Dangling network endpoint groups (NEGs):

    Error waiting for Deleting Network: The network resource
    'projects/PROJECT_NAME/global/networks/NETWORK_NAME' is already being used
    by
    'projects/PROJECT_NAME/zones/ZONE_NAME/networkEndpointGroups/NETWORK_ENDPOINT_GROUP_NAME'.
    

    If this happens, see the note at the end of Uninstall Cloud Service Mesh.

Understanding security controls

For more information about the controls that this reference architecture implements to help you secure your environment, see GKE security controls.

What's next

For a complete overview about how to implement Federated Learning on Google Cloud, see Cross-silo and cross-device federated learning on Google Cloud.