Skip to content

gather/aws/console: Add a step to gather console logs on AWS #9743

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

wking
Copy link
Member

@wking wking commented Jun 17, 2020

History of this logic:

The aws-instance-ids.txt injection allows install-time steps to register additional instances for later console collection (for a proxy instance, bootstrap instance, etc.).

Approvers are:

  • Myself and @vrutkovs, who have touched this logic in the past.
  • @cgwalters, representing the RHCOS/machine-config space that needs console logs to debug RHCOS issues.
  • @enxebre, representing the machine-API space that needs console logs to debug failed-to-boot issues.

Feel free to recommend additional approvers :).

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2020
@openshift-ci-robot openshift-ci-robot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Jun 17, 2020
Copy link
Contributor

@openshift-ci-robot openshift-ci-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking: 1 invalid OWNERS file

In response to this:

History of this logic:

  • Initially landed for nodes in e102a16 (#6189).
  • Grew --text in 6ec5bf3 (#6536).
  • Grew machine handling in a469f53 (#6906).
  • The node-provider-IDs creation was ported to steps in e2fd5c7 (#6708), but without any consumer for the collected file.

The aws-instance-ids.txt injection allows install-time steps to register additional instances for later console collection (for a proxy instance, bootstrap instance, etc.).

Approvers are:

  • Myself and @vrutkovs, who have touched this logic in the past.
  • @cgwalters, representing the RHCOS/machine-config space that needs console logs to debug RHCOS issues.
  • @enxebre, representing the machine-API space that needs console logs to debug failed-to-boot issues.

Feel free to recommend additional approvers :).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking force-pushed the aws-console-gather-step branch from dba286c to 32006d7 Compare June 17, 2020 22:20
Copy link
Contributor

@openshift-ci-robot openshift-ci-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking: 1 invalid OWNERS file

In response to this:

History of this logic:

  • Initially landed for nodes in e102a16 (#6189).
  • Grew --text in 6ec5bf3 (#6536).
  • Grew machine handling in a469f53 (#6906).
  • The node-provider-IDs creation was ported to steps in e2fd5c7 (#6708), but without any consumer for the collected file.

The aws-instance-ids.txt injection allows install-time steps to register additional instances for later console collection (for a proxy instance, bootstrap instance, etc.).

Approvers are:

  • Myself and @vrutkovs, who have touched this logic in the past.
  • @cgwalters, representing the RHCOS/machine-config space that needs console logs to debug RHCOS issues.
  • @enxebre, representing the machine-API space that needs console logs to debug failed-to-boot issues.

Feel free to recommend additional approvers :).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking force-pushed the aws-console-gather-step branch from 32006d7 to 2a14db3 Compare June 17, 2020 22:27
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Jun 17, 2020
@cgwalters
Copy link
Member

Thanks for doing this, I and many others would have found this useful in the past.

But...I think coreos/ignition#585 would cover a lot of what this does in a much better way - most notably we could start to tie it into the machineAPI so we have a state between provisioning and "has a node" to e.g. "ignition succeeded" as well as "provisioning failed" instead of just a timeout.

@cgwalters
Copy link
Member

(Also that approach would work universally across other clouds, bare metal etc.)

@wking
Copy link
Member Author

wking commented Jun 17, 2020

But...I think coreos/ignition#585 would cover a lot of what this does in a much better way...

Sounds good to me, although there's a window where the machine is not alive enough to call itself in where polling for console logs would still be useful. And that ticket is from 2017? So regardless of whether we feel like my console polling is useful long-term, I think it's definitely useful in the short term. Or do you expect movement on the Ignition front in the next few months?

@cgwalters
Copy link
Member

So regardless of whether we feel like my console polling is useful long-term, I think it's definitely useful in the short term.

Agreed.

Or do you expect movement on the Ignition front in the next few months?

Probably not but I'm more trying to coax/trick/convince you into hacking on Ignition 😄

@wking wking force-pushed the aws-console-gather-step branch from 2a14db3 to 571dcc0 Compare June 17, 2020 23:58
@wking
Copy link
Member Author

wking commented Jun 17, 2020

Pushed 2a14db3a0b -> 571dcc0340 to address /bin/bash: line 12: TMPDIR: unbound variable.

History of this logic:

* Initially landed for nodes in e102a16
  (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e:
  Gather node console logs on AWS, 2019-12-02, openshift#6189).

* Grew --text in 6ec5bf3 (installer artifacts: keep text version of
  instance output, 2020-01-02, openshift#6536).

* Grew machine handling in a469f53
  (ci-operator/templates/openshift/installer/cluster-launch-installer-e2e:
  Gather console logs from Machine providerID too, 2020-01-29, openshift#6906).

* The node-provider-IDs creation was ported to steps in e2fd5c7
  (step-registry: update deprovision step, 2020-01-30, openshift#6708), but
  without any consumer for the collected file.

The aws-instance-ids.txt injection allows install-time steps to
register additional instances for later console collection (for a
proxy instance, bootstrap instance, etc.).

Approvers are:

* Myself and Vadim, who have touched this logic in the past.
* Alberto, representing the machine-API space that needs console logs
  to debug failed-to-boot issues.
* Colin, representing the RHCOS/machine-config space that needs
  console logs to debug RHCOS issues.

TMPDIR documentation is based on POSIX [1].

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03
@wking wking force-pushed the aws-console-gather-step branch from 571dcc0 to 390d6be Compare June 18, 2020 02:19
@wking
Copy link
Member Author

wking commented Jun 18, 2020

etcd-operator/e2e-aws had:

Gathering console logs for i-01b5fe4a9d7eea0d7
You must specify a region. You can also configure your region by running "aws configure".
error: failed to execute wrapped command: exit status 1

I've pushed 571dcc0340 -> 390d6be to fix.

@wking
Copy link
Member Author

wking commented Jun 18, 2020

etcd-operator/e2e-aws looks good now. I think this PR is good to merge.

/assign @cgwalters

Since you've glanced at it already, can you /lgtm when you wake up? I'm already an approver here, so we just need someone to lean in with an /lgtm.

@openshift-ci-robot
Copy link
Contributor

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/rehearse/openshift/installer/master/e2e-azure-shared-vpc 390d6be link /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-aws-shared-vpc 390d6be link /test pj-rehearse
ci/rehearse/open-cluster-management/registration-operator/master/e2e 390d6be link /test pj-rehearse
ci/rehearse/openshift/cluster-network-operator/master/e2e-ovn-hybrid-step-registry 390d6be link /test pj-rehearse
ci/rehearse/openshift/cluster-network-operator/master/e2e-ovn-step-registry 390d6be link /test pj-rehearse
ci/rehearse/openshift/cluster-network-operator/master/e2e-windows-hybrid-network 390d6be link /test pj-rehearse
ci/rehearse/opendatahub-io/odh-manifests/master/odh-manifests-e2e 390d6be link /test pj-rehearse
ci/rehearse/openshift-knative/serverless-operator/master/4.3-upgrade-tests-aws-ocp-43 390d6be link /test pj-rehearse
ci/rehearse/openshift/cluster-network-operator/master/e2e-vsphere 390d6be link /test pj-rehearse
ci/rehearse/openshift-knative/serverless-operator/master/4.3-e2e-aws-ocp-43 390d6be link /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-vsphere 390d6be link /test pj-rehearse
ci/prow/pj-rehearse 390d6be link /test pj-rehearse
ci/rehearse/cri-o/cri-o/release-1.13/e2e-aws 390d6be link /test pj-rehearse

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Copy link
Member

Looks like it worked: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/9743/rehearse-9743-pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1273440493165875200/artifacts/e2e-aws/gather-aws-console/i-01733f89389ffa086

(Does submitting jobs like this really try rehearsing all jobs whch depend on it? That's like ~20 clusters launched by this PR or so?)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Copy link
Member Author

wking commented Jun 18, 2020

Does submitting jobs like this really try rehearsing all jobs whch depend on it? That's like ~20 clusters launched by this PR or so?

I think there's a cap. But yeah, it's a high cap.

@openshift-merge-robot openshift-merge-robot merged commit 9ce598e into openshift:master Jun 18, 2020
@openshift-ci-robot
Copy link
Contributor

@wking: Updated the following 2 configmaps:

  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key OWNERS using file ci-operator/step-registry/gather/aws-console/OWNERS
    • key gather-aws-console-commands.sh using file ci-operator/step-registry/gather/aws-console/gather-aws-console-commands.sh
    • key gather-aws-console-ref.yaml using file ci-operator/step-registry/gather/aws-console/gather-aws-console-ref.yaml
    • key gather-extra-commands.sh using file ci-operator/step-registry/gather/extra/gather-extra-commands.sh
    • key ipi-aws-post-chain.yaml using file ci-operator/step-registry/ipi/aws/post/ipi-aws-post-chain.yaml
    • key ipi-deprovision-chain.yaml using file ci-operator/step-registry/ipi/deprovision/ipi-deprovision-chain.yaml
  • step-registry configmap in namespace ci at cluster api.ci using the following files:
    • key OWNERS using file ci-operator/step-registry/gather/aws-console/OWNERS
    • key gather-aws-console-commands.sh using file ci-operator/step-registry/gather/aws-console/gather-aws-console-commands.sh
    • key gather-aws-console-ref.yaml using file ci-operator/step-registry/gather/aws-console/gather-aws-console-ref.yaml
    • key gather-extra-commands.sh using file ci-operator/step-registry/gather/extra/gather-extra-commands.sh
    • key ipi-aws-post-chain.yaml using file ci-operator/step-registry/ipi/aws/post/ipi-aws-post-chain.yaml
    • key ipi-deprovision-chain.yaml using file ci-operator/step-registry/ipi/deprovision/ipi-deprovision-chain.yaml

In response to this:

History of this logic:

The aws-instance-ids.txt injection allows install-time steps to register additional instances for later console collection (for a proxy instance, bootstrap instance, etc.).

Approvers are:

  • Myself and @vrutkovs, who have touched this logic in the past.
  • @cgwalters, representing the RHCOS/machine-config space that needs console logs to debug RHCOS issues.
  • @enxebre, representing the machine-API space that needs console logs to debug failed-to-boot issues.

Feel free to recommend additional approvers :).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the aws-console-gather-step branch June 18, 2020 13:28
Comment on lines +5 to +8
env:
- name: TMPDIR
default: /tmp
documentation: A pathname of a directory made available for programs that need a place to create temporary files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thought as #9640 (comment)

I think exposing TMPDIR as a user option is mis use of setting variables for the bash script.

@wking
Copy link
Member Author

wking commented Jun 19, 2020

Looks like 4.3- upi-installer lack aws. Will reroll to use a custom image or something.

@wking
Copy link
Member Author

wking commented Jun 19, 2020

Fixup in #9791.

wking added a commit to wking/openshift-release that referenced this pull request Dec 16, 2020
…urced IDs

Fixing a typo we've had since the step was born in 390d6be
(gather/aws/console: Add a step to gather console logs on AWS,
2020-06-17, openshift#9743).
wking added a commit to wking/openshift-release that referenced this pull request Dec 17, 2020
Like the similar aws-console step, but for GCP, following [1].  I'm
not sure why they require a --zone argument when 'instances list'
seems to be able to figure this out, but whatever.  Leaving it off
doesn't seem to be an option:

  wking@cloudshell:~ (openshift-gce-devel-ci)$ gcloud --format json compute instances get-serial-port-output "${INSTANCE_ID}" </dev/null >logs.json
  ERROR: (gcloud.compute.instances.get-serial-port-output) Underspecified resource [ci-ln-2gtd6db-f76d1-fj4cl-master-0]. Specify the [--zone] flag.
  wking@cloudshell:~ (openshift-gce-devel-ci)$ gcloud --version | head -n1
  Google Cloud SDK 315.0.0

OWNERS are like AWS, from 390d6be (gather/aws/console: Add a step
to gather console logs on AWS, 2020-06-17, openshift#9743), and also the
installer's GCP approvers [2].

[1]: https://cloud.google.com/compute/docs/reference/rest/v1/instances/getSerialPortOutput
[2]: https://github.com/openshift/installer/blob/e17697302ff991ae80d6e9c0c10426db510881e2/OWNERS_ALIASES#L60-L62
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants