🐞 Fix: deploying opsman to vSphere 15% boot fail #643

Merged: 1 commit, May 1, 2024

@cunnie (Contributor) commented Apr 29, 2024

When deploying opsman to vSphere, it fails to boot 15% of the time. The failure happens very early in the boot process, apparently even before the kernel is loaded. On the opsman VM's console, the symptom is a flashing cursor in the upper-left corner of the screen.

This commit fixes that failure by waiting 80 seconds for the opsman VM to report its IP address to vCenter; if the VM hasn't reported its IP address by then, a hardware reset is sent to the VM. An opsman VM typically reports its IP address to vCenter 43 seconds after being powered on.

We verified this fix by successfully deploying & booting opsman 146 times in a row.
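As a rough sanity check on that number: if each deploy failed independently with the 15% rate quoted above (independence is an assumption, not something measured here), the chance of 146 consecutive successes without the fix would be 0.85 raised to the 146th power. The helper name `probAllSucceed` is made up for this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// probAllSucceed returns the probability that n independent attempts all
// succeed when each attempt fails with probability failRate.
func probAllSucceed(failRate float64, n int) float64 {
	return math.Pow(1-failRate, float64(n))
}

func main() {
	// 15% failure rate and 146 deploys, both taken from the description.
	fmt.Printf("%.2e\n", probAllSucceed(0.15, 146))
}
```

Under that assumption the result is well below one in a billion, so a 146-deploy streak is very unlikely to be luck.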

More about the boot failure:

  • The boot failure only occurs the very first time an opsman is booted; subsequent boots will always succeed. We tested 100 shutdown/boots to confirm.
  • The failure was seen both on vSphere 7 and vSphere 8.
  • Sending a reset or a Ctrl-Alt-Del to the machine within the first few seconds of it being powered on reduced but did not eliminate the failure.

This fix should have negligible impact on the length of time to deploy opsman.

Typical output when resetting a failed initial boot:

Executing: "govc vm.info -vm.ipath=/dc/vm/pcf_vms/om.tas.nono.io -waitip"
This could take a few moments...
VM hasn't acquired IP, is probably stuck, resetting VM to free it

Executing: "govc vm.power -vm.ipath=/dc/vm/pcf_vms/om.tas.nono.io -reset"
This could take a few moments...
govc[stdout]: Reset VirtualMachine:vm-42616... OK

The added tests are admittedly lackluster, but I couldn't find a way to implement them without making vsphere.go overly complicated.

@cf-gitbot

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this GitHub issue will be updated when the story is started.

@wayneadams (Contributor)

Hi @cunnie, the core changes here look good.

I do have a nitpick about adding a couple of fmt.Println calls to the production code. This may just be a stylistic bias on my part, so let me see what the team says tomorrow.

@cunnie force-pushed the vsphere-stuck-boot branch from 6a9b90c to 99183ce on May 1, 2024 18:28
@cunnie (Contributor, Author) commented May 1, 2024

I notice this PR hasn't been merged yet, so I'm gonna remove the fmt.Println()s as you suggest.

@wayneadams merged commit 206d60e into main on May 1, 2024 (1 check passed)
@wayneadams deleted the vsphere-stuck-boot branch on May 1, 2024 20:59