This document describes a test plan for zero-downtime upgrades of NGF.
Zero-downtime upgrades mean that during an NGF upgrade, clients don't experience any interruptions to the traffic they send to applications exposed via NGF.
- Ensure that upgrading NGF doesn't lead to any loss of traffic flowing through the data plane.
- Ensure that after an upgrade, NGF can process changes to resources.
- Determine whether any special instructions need to be provided to users to perform an upgrade.
During an upgrade, Kubernetes will shut down the existing NGF Pods by sending them a SIGTERM. If a Pod doesn't terminate within 30 seconds (the default grace period), Kubernetes will send a SIGKILL.
When proxying WebSocket or other long-lived connections, NGINX will not terminate until those connections are closed by either the client or the backend. Unless all such connections are closed by clients/backends before or during an upgrade (which is highly unlikely), NGINX will not terminate on its own, and Kubernetes will kill it. As a result, clients will see their connections abruptly closed and thus experience downtime.
As a result, we will not use any long-lived connections in this test, because NGF cannot support zero-downtime upgrades in this case.
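If a test run shows that graceful shutdown needs more than the default 30 seconds, the grace period can be extended. A minimal sketch, assuming the NGF Deployment is named `nginx-gateway` in the `nginx-gateway` namespace:

```shell
# Give NGINX more time to drain in-flight requests before Kubernetes
# sends SIGKILL (the default grace period is 30s).
kubectl -n nginx-gateway patch deployment nginx-gateway \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":60}}}}'
```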
- A Kubernetes cluster with 10 nodes on GKE
- Node: e2-medium (2 vCPU, 4GB memory)
- GKE logging enabled.
- Tester VMs on Google Cloud Platform:
    - Configuration:
        - Debian
        - Installed packages: wrk, curl, gnuplot
        - Location - same zone as the Kubernetes cluster.
    - First VM - for sending HTTP traffic.
    - Second VM - for sending HTTPS traffic.
- Configuration:
    - NGF:
        - Deployment with 2 replicas scheduled on different nodes.
        - Exposed via a Service of type LoadBalancer with a private IP.
    - Gateway with two listeners - HTTP and HTTPS.
    - Two backends:
        - Coffee - 3 replicas.
        - Tea - 3 replicas.
    - Two HTTPRoutes:
        - Coffee (HTTP)
        - Tea (HTTPS)
Notes:
- For sending traffic, we will use both wrk and curl.
- wrk will generate a lot of traffic continuously, and it will have a high chance of catching any (however small) periods of downtime.
- curl will generate 1 request every 0.1s. While it might not catch small periods of downtime, it will give us a timeline of failed requests for larger periods of downtime, which wrk doesn't provide.
- We use Pod anti-affinity to tell Kubernetes to schedule NGF Pods on different nodes. We also use a 10-node cluster so that the chance of Kubernetes scheduling new Pods on the same nodes is minimal. Scheduling new Pods on different nodes helps better catch any interdependencies with an external load balancer (typically, the node of a new Pod is added to the load balancer's pool, and the node of an old one is removed).
1. Create a cluster.

2. Deploy the previous NGF version (the latest stable release) with 2 replicas and the anti-affinity rule added (see Appendix).

3. Expose NGF via a LoadBalancer Service with an internal IP (only accessible within the Google Cloud region) by adding the following annotation to the Service:

    ```yaml
    networking.gke.io/load-balancer-type: "Internal"
    ```
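    For illustration, the annotation could also be added with `kubectl`, assuming the Service is named `nginx-gateway` in the `nginx-gateway` namespace:

    ```shell
    # Hypothetical example: make GKE provision an internal (private IP)
    # load balancer for the NGF Service.
    kubectl -n nginx-gateway annotate service nginx-gateway \
      networking.gke.io/load-balancer-type="Internal"
    ```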
4. Deploy the backend apps:

    ```shell
    kubectl apply -f manifests/cafe.yaml
    ```
5. Configure the Gateway:

    ```shell
    kubectl apply -f manifests/cafe-secret.yaml
    kubectl apply -f manifests/gateway.yaml
    ```
6. Expose the apps via HTTPRoutes:

    ```shell
    kubectl apply -f manifests/cafe-routes.yaml
    ```
7. Check the statuses of the Gateway and HTTPRoutes for errors.
8. In Google Monitoring, check the NGF and NGINX error logs for errors.
9. On the tester VMs, update `/etc/hosts` to include an entry with the external IP of the NGF Service (`10.128.0.10` in this case):

    ```text
    10.128.0.10 cafe.example.com
    ```
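    For example, a quick way to add the entry (run on each tester VM):

    ```shell
    # Append the NGF Service IP / hostname mapping from the step above.
    echo "10.128.0.10 cafe.example.com" | sudo tee -a /etc/hosts
    ```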
10. Follow the upgrade instructions to:

    - Upgrade the Gateway API version to the one supported by the new release.
    - Upgrade the NGF CRDs.
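    For illustration, upgrading the Gateway API CRDs could look like the following; the version here is a placeholder, use the one required by the new NGF release:

    ```shell
    # Install the Gateway API CRDs matching the new NGF release
    # (v1.0.0 below is a placeholder version, not a recommendation).
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
    ```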
11. Start sending traffic using wrk and curl from the tester VMs for 1 minute:

    - Tester VM 1:
        - wrk:

          ```shell
          wrk -t2 -c100 -d60s --latency --timeout 2s http://cafe.example.com/coffee
          ```

        - curl:

          ```shell
          for i in `seq 1 600`; do printf "\nRequest $i\n" && date --rfc-3339=ns && curl -sS --connect-timeout 2 http://cafe.example.com/coffee 2>&1 && sleep 0.1s; done > results.txt
          ```

    - Tester VM 2:
        - wrk:

          ```shell
          wrk -t2 -c100 -d60s --latency --timeout 2s https://cafe.example.com/tea
          ```

        - curl:

          ```shell
          for i in `seq 1 600`; do printf "\nRequest $i\n" && date --rfc-3339=ns && curl -k -sS --connect-timeout 2 https://cafe.example.com/tea 2>&1 && sleep 0.1s; done > results.txt
          ```
12. Immediately upgrade the NGF manifests by following the upgrade instructions. Don't forget to modify the manifests to use 2 replicas and the Pod anti-affinity rule.
13. Ensure the new Pods are running and the old ones terminate.
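    For example, the rollout can be watched with:

    ```shell
    # Old Pods should reach Terminating and disappear while the new ones
    # become Running (namespace assumed from the NGF manifests).
    kubectl -n nginx-gateway get pods -w
    ```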
14. Update the Gateway resource by adding one new listener `http-new`:

    ```shell
    kubectl apply -f manifests/gateway-updated.yaml
    ```
15. Check that NGF has a leader elected among the new Pods:

    ```shell
    kubectl -n nginx-gateway get lease
    ```
16. Ensure the status of the Gateway resource includes the new listener.
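    One way to check (the Gateway name `gateway` is an assumption; use the name from `manifests/gateway.yaml`):

    ```shell
    # Print the Gateway resource and verify the http-new listener appears
    # under status.listeners.
    kubectl get gateway gateway -o yaml
    ```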
- Tester VMs:
    - Analyze the output of the wrk commands for errors and latencies.
    - Create graphs from the curl output (see instructions in the Appendix) and check them for any failures.
- Check the old Pods' logs in Google Monitoring:
    - NGINX access logs - we expect only 200 responses. Google Monitoring query:

      ```text
      severity=INFO "GET" "HTTP/1.1" -"200"
      ```

    - NGINX error logs - we expect no errors or warnings. Google Monitoring query:

      ```text
      severity=ERROR SEARCH("`[warn]`") OR SEARCH("`[error]`")
      ```

    - NGF logs - we expect no errors.
        - Specifically, look at the NGF logs from just before the old Pods exited, to make sure all components shut down correctly.
- Check the new Pods' logs in Google Monitoring:
    - NGINX access logs - we expect only 200 responses.
    - NGINX error logs - we expect no errors or warnings.
    - NGF logs - we expect no errors.
To ensure Kubernetes doesn't schedule NGF Pods on the same nodes, use an anti-affinity rule:
```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: nginx-gateway
```
The output of the curl command is saved in `results.txt`. To convert it into a graph, go through the following steps:
1. Convert the output into a CSV file:

    ```shell
    awk '
    /Request [0-9]+/ {
        getline
        datetime = $0
        getline
        if ($1 == "curl:") {
            print datetime ",0" # Failed
        } else {
            print datetime ",1" # Success
        }
    }' results.txt > results.csv
    ```
2. Plot a graph using the CSV file:

    ```shell
    gnuplot requests-plot.gp
    ```

    As a result, gnuplot will create `graph.png` with the graph.
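    The contents of `requests-plot.gp` aren't included here; a minimal sketch of what such a script might look like, assuming the CSV columns are an RFC 3339 timestamp and a 0/1 success flag:

    ```shell
    # Hypothetical sketch of requests-plot.gp (the real script ships with
    # the test assets); plots each request as success (1) or failure (0).
    cat > requests-plot.gp <<'EOF'
    set datafile separator ","
    set xdata time
    set timefmt "%Y-%m-%d %H:%M:%S"
    set yrange [-0.5:1.5]
    set ytics ("failure" 0, "success" 1)
    set terminal png size 1200,400
    set output "graph.png"
    plot "results.csv" using 1:2 with points pt 7 title "requests"
    EOF
    ```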
3. Download the resulting `graph.png` to your local machine.

4. Also download `results.csv` (one way to copy both files off the VM is sketched below).
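For example, both files can be fetched with the `gcloud` CLI from your local machine; the instance name and zone below are placeholders:

```shell
# Copy the graph and the CSV from a tester VM to the current directory.
gcloud compute scp tester-vm-1:~/graph.png tester-vm-1:~/results.csv . --zone us-central1-a
```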