
Plenty of workers suspended for several days. #331


Closed
mwos-sl opened this issue Apr 11, 2022 · 8 comments
Labels: bug, stale (Issues/PRs with no activity)

Comments


mwos-sl commented Apr 11, 2022

Issue Details

Not sure if it's related to #322, thus creating a separate issue.
Recently we noticed a very long job queue on Jenkins. It turned out all EC2 fleets were maxed out, but plenty of instances were "suspended" and not disconnected from Jenkins. The EC2 instances themselves were fine on the AWS side.
Even when I killed the suspended instances using a script, the new ones did some work, but some of them eventually became suspended after some time too. Idle time was not respected for those (we've got the idle timeout configured to 3 minutes; some instances were connected for several days).
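For reference, the kind of cleanup script mentioned above can be sketched against the Jenkins REST API. This is a minimal sketch, not the script actually used: the Jenkins URL and credentials are placeholders, and the `idle`/`offline` JSON fields only approximate the plugin's "suspended" state.

```python
# Sketch: list agents that are connected but idle, via /computer/api/json.
# The Jenkins URL is a placeholder, and the JSON fields used here may not
# capture the ec2-fleet-plugin "suspended" state exactly.
import json
import urllib.request

JENKINS_URL = "https://jenkins.example.com"  # placeholder


def idle_online_agents(computers):
    """Filter the 'computer' array from /computer/api/json down to
    agents that are connected (not offline) but sitting idle."""
    return [
        c["displayName"]
        for c in computers
        if c.get("idle") and not c.get("offline")
    ]


def fetch_computers(url=JENKINS_URL):
    """Fetch agent state from Jenkins (requires a real URL; add an
    Authorization header with a user/API-token pair as needed)."""
    req = urllib.request.Request(
        f"{url}/computer/api/json?tree=computer[displayName,idle,offline]"
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["computer"]
```

The agent names returned by `idle_online_agents(fetch_computers())` would then still need manual inspection, since an agent can legitimately be idle for a few minutes.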

Describe the bug
Example observation:
A lot of builds waiting for an agent:
[screenshot]

And we can see that the maximum number of agents for this label is already online:
[screenshot]

Still I see a lot of workers empty, with no job assigned.
Some of them are marked "(suspended)"; in the view below all have a red cross (screenshot from https://<jenkins url>/computer/):

[screenshot]

The vast majority of nodes show a red cross, meaning "suspended".
One of those instances:
[screenshot]

So this one had been alive for several hours and hadn't built a single job, yet plenty of jobs were waiting tens of minutes for this label to become available.

At this point we had to turn off the ec2-fleet-plugin and go back to ec2-plugin.
After we switched all the work to workers managed by ec2-plugin, the situation stabilised.
However, we would love to go back to this plugin (ec2-fleet-plugin), because it handles spot workers and multiple instance types better.

To Reproduce
I don't know. It just happened for all of our labels using this plugin.

Environment Details

Plugin Version?
2.5.0

Jenkins Version?
2.336

Spot Fleet or ASG?
ASG

Label based fleet?
No

Linux or Windows?
Linux

EC2Fleet Configuration as Code

  - eC2Fleet:
      addNodeOnlyIfRunning: false
      alwaysReconnect: true
      cloudStatusIntervalSec: 10
      computerConnector:
        sSHConnector:
          credentialsId: "standard-runner-ubuntu-user-private-key"
          launchTimeoutSeconds: 60
          maxNumRetries: 10
          port: 22
          prefixStartSlaveCmd: "source /usr/bin/init-script.sh; "
          retryWaitTime: 15
          sshHostKeyVerificationStrategy:
            manuallyTrustedKeyVerificationStrategy:
              requireInitialManualTrust: false
      disableTaskResubmit: false
      fleet: "build-jenkins-executor-sumo-agr-spot"
      fsRoot: "/mnt/jenkins/workspaces"
      idleMinutes: 3
      initOnlineCheckIntervalSec: 15
      initOnlineTimeoutSec: 600
      labelString: "sumo build agr spot 1executors"
      maxSize: 110
      maxTotalUses: 50
      minSize: 0
      minSpareSize: 0
      name: "build-jenkins-fleet-sumo-agr-spot"
      noDelayProvision: true
      numExecutors: 1
      oldId: "2fda8ebd-f7e6-4211-91c8-219c7ed8ceb6"
      privateIpUsed: true
      region: "us-west-2"
      restrictUsage: true
      scaleExecutorsByWeight: false

Is this config OK?

Anything else unique about your setup?
Not sure if it's important, but our ASGs have:

protect_from_scale_in     = true
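As a sanity check that `protect_from_scale_in` actually took effect, the instances' protection flags can be inspected with boto3. A hedged sketch: the ASG name and region are taken from the config above; running the live check requires AWS credentials.

```python
# Sketch: list ASG instances that are missing scale-in protection.
# ASG name and region mirror the fleet config above; the live check
# requires boto3 and AWS credentials.

def unprotected_instances(asg):
    """Given one AutoScalingGroup dict from describe_auto_scaling_groups,
    return the instance ids that lack ProtectedFromScaleIn."""
    return [
        i["InstanceId"]
        for i in asg.get("Instances", [])
        if not i.get("ProtectedFromScaleIn")
    ]


def check_fleet(asg_name="build-jenkins-executor-sumo-agr-spot",
                region="us-west-2"):
    import boto3  # imported lazily; only needed for the live check
    client = boto3.client("autoscaling", region_name=region)
    resp = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])
    return {
        asg["AutoScalingGroupName"]: unprotected_instances(asg)
        for asg in resp["AutoScalingGroups"]
    }
```

An empty list per group means every instance is protected, which is what the plugin expects.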
@mwos-sl mwos-sl added the bug label Apr 11, 2022

mwos-sl commented Apr 11, 2022

One of the nodes was online and marked "suspended" for several days, even though the idle timeout is configured to 3 minutes and there is no minSize or minSpareSize configured for this particular cluster:
[screenshot]


haugenj commented Apr 12, 2022

config is fine afaik.

Scale-in protection being enabled is correct; the plugin will enable scale-in protection if it isn't set, because it wants to control which instances are terminated. Without scale-in protection, an instance that is not idle might be terminated when the target capacity is adjusted.
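The lifecycle described here can be sketched roughly as follows. To be clear, this is not the plugin's actual code, just the AWS Auto Scaling calls the described behaviour relies on; the names are placeholders.

```python
# Sketch of the behaviour described above: protect instances from
# scale-in, then retire idle ones explicitly with a capacity decrement
# so AWS doesn't pick a busy instance when the target shrinks.
# Not the plugin's actual code; a real run needs boto3 + credentials.

def protect(client, asg_name, instance_ids, protected=True):
    """Toggle scale-in protection on specific instances."""
    client.set_instance_protection(
        AutoScalingGroupName=asg_name,
        InstanceIds=instance_ids,
        ProtectedFromScaleIn=protected,
    )


def retire_idle(client, instance_id):
    """Terminate one idle agent and shrink the group by one slot."""
    client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=True,
    )
```

With `ShouldDecrementDesiredCapacity=True`, terminating an idle instance does not trigger a replacement, which is why the plugin can safely pick exactly which instance goes away.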

The instances being in a suspended state makes me think there is an issue with setting up the Jenkins agents on the instances. I think I have seen this happen in the past when SSH credentials are misconfigured, but it's been a while and my memory is poor.

Can you provide the logs for this event?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label May 12, 2022

mwos-sl commented May 18, 2022

Can you provide the logs for this event?

We disabled the plugin completely, so I don't have logs with me at the moment. I can try to re-enable it for a while in an upcoming sprint to grab some. Just to confirm: are you talking about these logs?
[screenshot]


mwos-sl commented May 18, 2022

BTW, can we remove the stale label?

@github-actions github-actions bot removed the stale Issues / PRs with no activity label May 18, 2022

haugenj commented May 24, 2022

Yeah, logs from the instance and also the system log, preferably at debug level.

[screenshot]

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Jun 23, 2022
@github-actions
Copy link

This issue was closed because it has become stale with no activity.
