
Commit e102a16

ci-operator/templates/openshift/installer/cluster-launch-installer-e2e: Gather node console logs on AWS
To help debug things like [1]:

    Dec 2 16:31:41.298: INFO: cluster upgrade is Failing: Cluster operator kube-apiserver is reporting a failure: NodeControllerDegraded: The master node(s) "ip-10-0-136-232.ec2.internal" not ready ... Kubelet stopped posting node status.

where a node goes down but does not come back up far enough to reconnect as a node.

Eventually, we'll address this with machine-health checks, killing the non-responsive machine and automatically replacing it with a new one. That's currently waiting on an etcd operator that can handle reconnecting control-plane machines automatically. But in the short term, and possibly still in the long term, it's nice to collect what we can from the broken machine to understand why it didn't come back up.

This code isn't specific to broken machines, but collecting console logs from all nodes should cover us in the broken-machine case as well.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1778904
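The diff below derives bare EC2 instance IDs from each node's Kubernetes `.spec.providerID`, which on AWS has the form `aws:///<availability-zone>/<instance-id>`. A minimal sketch of that trimming step, using invented sample IDs:

```shell
# Sample providerID strings (invented for illustration).  The sed expression
# from the diff drops everything through the final '/', leaving the bare
# instance IDs that "aws ec2 get-console-output" expects.
printf '%s\n' \
  'aws:///us-east-1a/i-0123456789abcdef0' \
  'aws:///us-east-1b/i-0fedcba9876543210' |
  sed 's|.*/||'
# -> i-0123456789abcdef0
#    i-0fedcba9876543210
```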
1 parent c2932ea commit e102a16

File tree

1 file changed (+20 −0 lines)


ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml (20 additions, 0 deletions)

@@ -785,6 +785,10 @@ objects:
         value: /etc/openshift-installer/gce.json
       - name: KUBECONFIG
         value: /tmp/artifacts/installer/auth/kubeconfig
+      - name: USER
+        value: test
+      - name: HOME
+        value: /tmp
       command:
       - /bin/bash
       - -c
@@ -852,6 +856,7 @@ objects:
         fi

         oc --insecure-skip-tls-verify --request-timeout=5s get nodes -o jsonpath --template '{range .items[*]}{.metadata.name}{"\n"}{end}' > /tmp/nodes
+        oc --insecure-skip-tls-verify --request-timeout=5s get nodes -o jsonpath --template '{range .items[*]}{.spec.providerID}{"\n"}{end}' | sed 's|.*/||' > /tmp/node-provider-IDs
         oc --insecure-skip-tls-verify --request-timeout=5s get pods --all-namespaces --template '{{ range .items }}{{ $name := .metadata.name }}{{ $ns := .metadata.namespace }}{{ range .spec.containers }}-n {{ $ns }} {{ $name }} -c {{ .name }}{{ "\n" }}{{ end }}{{ range .spec.initContainers }}-n {{ $ns }} {{ $name }} -c {{ .name }}{{ "\n" }}{{ end }}{{ end }}' > /tmp/containers
         oc --insecure-skip-tls-verify --request-timeout=5s get pods -l openshift.io/component=api --all-namespaces --template '{{ range .items }}-n {{ .metadata.namespace }} {{ .metadata.name }}{{ "\n" }}{{ end }}' > /tmp/pods-api
@@ -892,6 +897,21 @@ objects:
           queue /tmp/artifacts/nodes/$i/heap oc --insecure-skip-tls-verify get --request-timeout=20s --raw /api/v1/nodes/$i/proxy/debug/pprof/heap
         done < /tmp/nodes

+        if [[ "${CLUSTER_TYPE}" = "aws" ]]; then
+            # FIXME: get epel-release or otherwise add awscli to our teardown image
+            export PATH="${HOME}/.local/bin:${PATH}"
+            easy_install --user pip  # our Python 2.7.5 is even too old for ensurepip
+            pip install --user awscli
+            export AWS_REGION="$(python -c 'import json; data = json.load(open("/tmp/artifacts/installer/metadata.json")); print(data["aws"]["region"])')"
+        fi
+
+        while IFS= read -r i; do
+            mkdir -p "/tmp/artifacts/nodes/${i}"
+            if [[ "${CLUSTER_TYPE}" = "aws" ]]; then
+                queue /tmp/artifacts/nodes/$i/console aws ec2 get-console-output --instance-id "${i}"
+            fi
+        done < /tmp/node-provider-IDs
+
         FILTER=gzip queue /tmp/artifacts/nodes/masters-journal.gz oc --insecure-skip-tls-verify adm node-logs --role=master --unify=false
         FILTER=gzip queue /tmp/artifacts/nodes/workers-journal.gz oc --insecure-skip-tls-verify adm node-logs --role=worker --unify=false
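The gather step added above can be exercised locally without a cluster. The sketch below mirrors its shape under stated assumptions: the metadata file and instance IDs are invented, the real `queue … aws ec2 get-console-output` call is left as a comment because it needs AWS credentials, and `python3` is tried as a fallback since the CI image's `python` (2.7) may not exist elsewhere:

```shell
set -eu
mkdir -p /tmp/sketch
# Invented stand-in for /tmp/artifacts/installer/metadata.json; only the
# aws.region key from the diff matters here.
cat > /tmp/sketch/metadata.json <<'EOF'
{"aws": {"region": "us-east-1"}}
EOF
# The template uses plain "python" (2.7 in the CI image); fall back to
# python3 where that is all that's installed.
py="$(command -v python || command -v python3)"
AWS_REGION="$("$py" -c 'import json; print(json.load(open("/tmp/sketch/metadata.json"))["aws"]["region"])')"
export AWS_REGION

# Invented instance IDs standing in for /tmp/node-provider-IDs.
printf '%s\n' 'i-0123456789abcdef0' 'i-0fedcba9876543210' > /tmp/sketch/node-provider-IDs
while IFS= read -r i; do
  mkdir -p "/tmp/sketch/nodes/${i}"
  # Real call elided; it needs credentials and the queue helper:
  # queue /tmp/sketch/nodes/$i/console aws ec2 get-console-output --instance-id "${i}"
done < /tmp/sketch/node-provider-IDs

echo "${AWS_REGION}"
ls /tmp/sketch/nodes
```

One artifact directory per instance ID keeps a failed `get-console-output` for one node from clobbering the output gathered for another.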
