
dft street manager spike 0014 k8s gcp


GCE Self-managed K8s

This ODP describes the design, implementation and learnings from spike STREEWORK-199.

High-Level PoC Design

Internet publishing approach

To expose a service to the public, Kubernetes uses Ingresses. This approach uses a single cloud-provided load balancer and Nginx-as-reverse-proxy pods. In that scenario we're able to use Nginx to its full potential (i.e. TLS termination, header modifications, redirects etc.).
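As an illustration, a minimal Ingress for this approach might look like the sketch below. The hostname and the street-manager-frontend Service are hypothetical; the kubernetes.io/ingress.class annotation is what routes the traffic through the Nginx reverse-proxy pods.

```yaml
# Minimal Ingress sketch for the nginx-reverse-proxy approach
# (hostname and backend Service name are hypothetical)
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: street-manager
  annotations:
    kubernetes.io/ingress.class: nginx      # handled by the nginx ingress controller pods
spec:
  rules:
    - host: street-manager.example.com      # hypothetical public hostname
      http:
        paths:
          - path: /
            backend:
              serviceName: street-manager-frontend   # hypothetical Service exposing the app
              servicePort: 80
```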

Another approach is to publish services directly using cloud-provided load balancers. The downside is the lack of control over Layer 7.

Implementation Tooling

Terraform will be used to provision most of the necessary components. It is a well-proven tool which enables an Infrastructure-as-Code approach. The Google provider for Terraform (and the underlying Go packages) has recently been updated so that Terraform can now create Google Container Registry resources.

Other resources can also be controlled through the Google Cloud SDK.

The Kubernetes cluster will be provisioned and maintained using the KOPS tool. KOPS uses a Google Storage bucket as the backend for storing configuration. According to the documentation it can also export/import the cluster configuration to/from YAML files. Changes are deployed incrementally using a rolling-deployment approach and verified afterwards.

This approach allows us to do proper code reviews, test and promote changes, version configuration using source control (i.e. Git) and deploy small changes quickly without any service downtime.
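As a rough sketch (all names and values below are hypothetical), the cluster configuration kept in the state store and exported with kops get cluster -o yaml looks roughly like this; an edited file can be re-applied with kops replace -f, followed by kops update cluster and kops rolling-update cluster:

```yaml
# Hypothetical kops cluster spec, as exported from / imported into the state store
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: k8s.example.com                              # hypothetical cluster name
spec:
  cloudProvider: gce
  project: our-gcp-project                           # hypothetical GCP project id
  configBase: gs://our-kops-state/k8s.example.com    # Google Storage bucket acting as the state store
  kubernetesVersion: 1.8.6
```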

K8s cluster reconfiguration / scale-up / scale-down approach

As configuration will be stored as code in Git, a WebOps engineer will create a branch, make the necessary changes, push the local branch to the origin server, open a Pull Request and assign it to a senior team member. After code review and passing tests, the senior team member will merge the pull request to the master branch and the CI server will deploy the delta.

Application rolling-update approach

Unless specified otherwise, Kubernetes uses a rolling-update approach by default. Any update to a Deployment resource (i.e. a container image update) will be treated that way.

All configuration will reside as YAML files in Git. This allows us to do proper code reviews, test and promote changes, version configuration and deploy/promote releases quickly without any service downtime.
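For example, a Deployment relying on the default rolling-update strategy might look like the hedged sketch below; the names, image path and tuning values are hypothetical, and bumping the image tag is what triggers the rolling update:

```yaml
# Hypothetical Deployment using the default rolling-update strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: street-manager-api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # take down at most one pod at a time
      maxSurge: 1            # start at most one extra pod during the update
  selector:
    matchLabels:
      app: street-manager-api
  template:
    metadata:
      labels:
        app: street-manager-api
    spec:
      containers:
        - name: api
          image: eu.gcr.io/our-project/street-manager-api:1.4.2   # changing this tag triggers a rolling update
          ports:
            - containerPort: 8080
```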

Application scale-up & scale-down approach

Application scaling can be done using the Horizontal Pod Autoscaler, driven by various metrics. CPU usage is the most common choice, although it is well known to be less effective than latency-based metrics.
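A minimal CPU-based Horizontal Pod Autoscaler, targeting the hypothetical street-manager-api Deployment from the earlier sketch, could look like this:

```yaml
# Hypothetical HPA scaling the Deployment on average CPU utilisation
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: street-manager-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: street-manager-api
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70   # add pods when average CPU crosses 70%
```

Latency-based scaling would require custom metrics to be exposed to the autoscaler, which is more involved than the CPU-based setup above.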

Required CICD tooling, code

CircleCI/Travis is sufficient, as long as we can configure kubectl credentials for interacting with the Kubernetes clusters. Both solutions are cloud-based and well integrated with GitHub.
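A hedged sketch of how the kubectl credentials could be wired into CircleCI is below; KUBECONFIG_DATA (a base64-encoded kubeconfig stored as a CircleCI environment variable) and the image path are assumptions, not an agreed convention:

```yaml
# Hypothetical CircleCI 2.0 config: configure kubectl from an env var and roll out a new image
version: 2
jobs:
  deploy:
    docker:
      - image: google/cloud-sdk:latest          # ships with gcloud and kubectl
    steps:
      - checkout
      - run:
          name: Deploy to Kubernetes
          command: |
            echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig   # KUBECONFIG_DATA is a hypothetical env var
            export KUBECONFIG=/tmp/kubeconfig
            kubectl set image deployment/street-manager-api \
              api=eu.gcr.io/our-project/street-manager-api:${CIRCLE_SHA1}
workflows:
  version: 2
  deploy:
    jobs:
      - deploy
```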

We might need to build and maintain a small Jenkins instance to use Terraform more efficiently.

FAQ

Q: How would we automate the testing & deployment of our apps - ie what are the industry/community-recommended approaches for CICD within this domain?

The easiest way to accomplish this would be to use containers. The Docker suite contains the docker-compose tool, which can be used to run a whole application (with mocks if required) or just part of it.
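As an example, a hedged docker-compose file running the application next to a disposable PostgreSQL container (service names, image and connection string are hypothetical) could look like:

```yaml
# Hypothetical docker-compose.yml for local development and CI testing
version: "3"
services:
  api:
    build: .                       # build the application image from the local Dockerfile
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://postgres:postgres@db:5432/street_manager
    depends_on:
      - db
  db:
    image: postgres:9.6            # throwaway database for local runs and tests
    environment:
      POSTGRES_DB: street_manager
      POSTGRES_PASSWORD: postgres
```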

It is much easier to implement a proper development process and set up pipelines within any CICD tool, because once the app is containerised you can use the same container in every environment.

Q: What might the developer experience and workflow look like? Will developers be able to have full access to investigate and resolve CICD pipeline issues, for example? Will they be able to define and deploy apps to dev environments with zero ops involvement?

Developers will be able to build and run containers locally.

Once their code is pushed to the remote repo / merged to the proper branch, CICD will build the container and publish it (upload it to the private registry).

CICD will take care of development deployments. For any other deployments (qa, demo, stage, production, ...) we will use Helm.
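A hedged sketch of how a Helm-driven promotion could look: each environment gets its own values file overriding the image tag and sizing, and the release is applied with helm upgrade --install street-manager ./chart -f values-stage.yaml. The chart layout, registry path and values below are hypothetical.

```yaml
# Hypothetical per-environment Helm values file (values-stage.yaml)
image:
  repository: eu.gcr.io/our-project/street-manager-api
  tag: "1.4.2"                     # the tag built and published by CICD for this release
replicaCount: 3
ingress:
  enabled: true
  host: stage.street-manager.example.com   # hypothetical stage hostname
```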

Q: How much of the solution can be covered by IaaC tooling? What gaps are there? What tools/approaches are available to plug those gaps?

The Google API is very open, but Terraform coverage is low (e.g. cluster reconfiguration amounts to rebuilding the whole cluster). The community tends not to use any IaaC tool; they rather use the gcloud CLI.

Because of that, we also tried to set up a self-managed cluster using KOPS.

Q: What would the patching process look like? How could this be achieved without impacting service?

The application patching process is no different from any usual release.

Kubernetes-as-a-Service: the provider takes care of infrastructure patches.

KOPS only: server patching is possible via the rolling-update approach. Ad-hoc patching can also be done via SSH.

Q: Are internal services deployed with TLS enabled? If not, is it an option?

Initially no TLS certificates are deployed.

Enabling TLS for internal services might require deploying internal-nginx-rev-proxy containers as well.

TLS SNI functionality is now supported by ingress controllers such as heptio/contour and Envoy. In short, Kubernetes now supports multiple certificates through the use of kubernetes.io/ingress.class, without having to use nginx.

Certificates can be provided via built-in secret storage.

For externally exposed services, TLS can be managed using kube-lego or provided via the built-in secret storage.
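A sketch of the secret-storage route (names, hostname and backend Service are hypothetical): the certificate and key live in a kubernetes.io/tls Secret, and the Ingress references it from its tls section so that nginx terminates TLS for that host.

```yaml
# Hypothetical TLS Secret holding the certificate and private key
apiVersion: v1
kind: Secret
metadata:
  name: street-manager-tls
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded certificate>
  tls.key: <base64-encoded private key>
---
# Hypothetical Ingress referencing the Secret for TLS termination
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: street-manager
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  tls:
    - hosts:
        - street-manager.example.com
      secretName: street-manager-tls
  rules:
    - host: street-manager.example.com
      http:
        paths:
          - path: /
            backend:
              serviceName: street-manager-frontend
              servicePort: 80
```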

Q: Is it possible to use acls/security groups to whitelist trusted IPs/ports at the perimeter? eg locking down cluster/apps in dev environments to Kainos IPs only

Kubernetes API and SSH access can be restricted to certain CIDRs.

ACLs can be set inside Kubernetes.
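For example (labels, namespace and port below are hypothetical), a NetworkPolicy can act as an in-cluster ACL allowing only the frontend pods to reach the API pods:

```yaml
# Hypothetical NetworkPolicy: only frontend pods may reach the API pods on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: street-manager-api
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: street-manager-frontend
      ports:
        - protocol: TCP
          port: 8080
```

Note that NetworkPolicies are only enforced when the cluster runs a network plugin that supports them (e.g. Calico).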

Q: Can the service provide fully automatic auto-scaling (and auto-healing)?

Auto-healing is implemented by design. Additionally, pod health checks will be implemented.
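As a sketch of those health checks (paths, port and timings are hypothetical), the container spec of a Deployment would gain liveness and readiness probes:

```yaml
# Hypothetical probe configuration inside a pod/container spec
containers:
  - name: api
    image: eu.gcr.io/our-project/street-manager-api:1.4.2
    livenessProbe:                 # kubelet restarts the container if this check keeps failing
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
    readinessProbe:                # pod is removed from the Service until this check passes
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```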

Application auto-scaling can be implemented using the Horizontal Pod Autoscaler, as described in the scaling section above.

Cluster auto-scaling can be implemented using the Cluster Autoscaler.
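With KOPS, the node counts the Cluster Autoscaler may move between are bounded by the InstanceGroup definition; a hedged sketch (names, zone and machine type are hypothetical) is below:

```yaml
# Hypothetical kops InstanceGroup giving the Cluster Autoscaler room to add/remove nodes
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
  labels:
    kops.k8s.io/cluster: k8s.example.com   # hypothetical cluster name
spec:
  role: Node
  machineType: n1-standard-2
  zones:
    - europe-west2-a
  minSize: 2       # lower bound for scale-down
  maxSize: 6       # upper bound for scale-up
```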

Q: How does outbound internet access work, eg for third party API calls? Do outbound calls originate from a fixed IP, to allow IP whitelisting on the remote side?

By default everything is public. It might be possible to create a custom, self-managed and maintained VM acting as a NAT gateway, so that outbound calls originate from a fixed IP that can be whitelisted on the remote side.

Q: What technical gaps, limitations or technical challenges were encountered with the technology, tooling or process, during the spike?

Getting KOPS up and running on Google was difficult because the documentation is limited and scattered around the internet. The product has most of the features of its more mature AWS equivalent, but they are harder to implement. Google does have one ace up its sleeve, having driven the development of Kubernetes since 2014: the Cluster Autoscaler works with both GCE and GKE, and the base Kubernetes product is more mature when running within the Google cloud. Some Terraform configuration features, such as the Google Container Registry object, have only recently been implemented.

Some KOPS features have only been implemented very recently, such as security features like RBAC, which is now the default in KOPS 1.8.0. Some issues, like admins gaining access to all resources, have been addressed in the latest KOPS update. An example of the previous security issue can be found here. Though this information is provided for AWS, the same was true for Google.

Please also see https://github.com/KainosSoftwareLtd/dft-street-manager-alpha/wiki/dft-street-manager-spike-0017-k8s-gke.

It is not possible to use a PostgreSQL database on a private network, but you can authorise external network access.
