
Plenty of workers suspended for several days. #331


Closed
mwos-sl opened this issue Apr 11, 2022 · 8 comments
Labels: bug, stale (Issues/PRs with no activity)

Comments


mwos-sl commented Apr 11, 2022

Issue Details

Not sure if it's related to #322, thus creating a separate issue.
Recently we noticed a very long job queue on Jenkins. It turned out all EC2 fleets were maxed out, but plenty of instances were "suspended" and not disconnected from Jenkins. The EC2 instances themselves were fine on the AWS side.
Even when I killed the suspended instances using a script, the new ones did some work, but some of them eventually became suspended after some time too. Idle time was not respected for those (we've got the idle timeout configured to 3 minutes; some instances were connected for several days).
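For reference, the kind of cleanup script mentioned above can be sketched against the Jenkins REST API. This is a minimal sketch, not the script actually used: the Jenkins URL and credentials are placeholders, and the `idle`/`offline` JSON fields only approximate the plugin's "suspended" state.

```python
# Sketch: list agents that are connected but idle, via /computer/api/json.
# The Jenkins URL is a placeholder, and the JSON fields used here may not
# capture the ec2-fleet-plugin "suspended" state exactly.
import json
import urllib.request

JENKINS_URL = "https://jenkins.example.com"  # placeholder


def idle_online_agents(computers):
    """Filter the 'computer' array from /computer/api/json down to
    agents that are connected (not offline) but sitting idle."""
    return [
        c["displayName"]
        for c in computers
        if c.get("idle") and not c.get("offline")
    ]


def fetch_computers(url=JENKINS_URL):
    """Fetch agent state from Jenkins (requires a real URL; add an
    Authorization header with a user/API-token pair as needed)."""
    req = urllib.request.Request(
        f"{url}/computer/api/json?tree=computer[displayName,idle,offline]"
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["computer"]
```

The agent names returned by `idle_online_agents(fetch_computers())` would then still need manual inspection, since an agent can legitimately be idle for a few minutes.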

Describe the bug
Example observation:
A lot of builds waiting for an agent:
[screenshot]

And we can see that the maximum number of agents for this label is already online:
[screenshot]

Still I see a lot of workers empty, with no job assigned.
Some of them are marked "(suspended)"; in the view below all have a red cross (screenshot from https://<jenkins url>/computer/):

[screenshot]

The vast majority of nodes show a red cross, meaning "suspended".
One of those instances:
[screenshot]

So this one had been alive for several hours and hadn't built a single job, yet plenty of jobs were waiting tens of minutes for this label to become available.

At this point we had to turn off the ec2-fleet-plugin and go back to ec2-plugin.
After we switched all the work to workers managed by ec2-plugin, the situation stabilised.
However, we would love to go back to this plugin (ec2-fleet-plugin), because it handles spot workers and multiple instance types better.

To Reproduce
I don't know. It just happened for all of our labels using this plugin.

Environment Details

Plugin Version?
2.5.0

Jenkins Version?
2.336

Spot Fleet or ASG?
ASG

Label based fleet?
No

Linux or Windows?
Linux

EC2Fleet Configuration as Code

  - eC2Fleet:
      addNodeOnlyIfRunning: false
      alwaysReconnect: true
      cloudStatusIntervalSec: 10
      computerConnector:
        sSHConnector:
          credentialsId: "standard-runner-ubuntu-user-private-key"
          launchTimeoutSeconds: 60
          maxNumRetries: 10
          port: 22
          prefixStartSlaveCmd: "source /usr/bin/init-script.sh; "
          retryWaitTime: 15
          sshHostKeyVerificationStrategy:
            manuallyTrustedKeyVerificationStrategy:
              requireInitialManualTrust: false
      disableTaskResubmit: false
      fleet: "build-jenkins-executor-sumo-agr-spot"
      fsRoot: "/mnt/jenkins/workspaces"
      idleMinutes: 3
      initOnlineCheckIntervalSec: 15
      initOnlineTimeoutSec: 600
      labelString: "sumo build agr spot 1executors"
      maxSize: 110
      maxTotalUses: 50
      minSize: 0
      minSpareSize: 0
      name: "build-jenkins-fleet-sumo-agr-spot"
      noDelayProvision: true
      numExecutors: 1
      oldId: "2fda8ebd-f7e6-4211-91c8-219c7ed8ceb6"
      privateIpUsed: true
      region: "us-west-2"
      restrictUsage: true
      scaleExecutorsByWeight: false

Is this config OK?

Anything else unique about your setup?
Not sure if it's important, but our ASGs have:

protect_from_scale_in     = true
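As a sanity check that `protect_from_scale_in` actually took effect, the instances' protection flags can be inspected with boto3. A hedged sketch: the ASG name and region are taken from the config above; running the live check requires AWS credentials.

```python
# Sketch: list ASG instances that are missing scale-in protection.
# ASG name and region mirror the fleet config above; the live check
# requires boto3 and AWS credentials.

def unprotected_instances(asg):
    """Given one AutoScalingGroup dict from describe_auto_scaling_groups,
    return the instance ids that lack ProtectedFromScaleIn."""
    return [
        i["InstanceId"]
        for i in asg.get("Instances", [])
        if not i.get("ProtectedFromScaleIn")
    ]


def check_fleet(asg_name="build-jenkins-executor-sumo-agr-spot",
                region="us-west-2"):
    import boto3  # imported lazily; only needed for the live check
    client = boto3.client("autoscaling", region_name=region)
    resp = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name])
    return {
        asg["AutoScalingGroupName"]: unprotected_instances(asg)
        for asg in resp["AutoScalingGroups"]
    }
```

An empty list per group means every instance is protected, which is what the plugin expects.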
@mwos-sl mwos-sl added the bug label Apr 11, 2022

mwos-sl commented Apr 11, 2022

One of the nodes was online and marked "suspended" for several days, even though the idle timeout is configured to 3 minutes and there is no minSize or minSpareSize configured for this particular cluster:
[screenshot]


haugenj commented Apr 12, 2022

config is fine afaik.

Scale-in protection being enabled is correct; the plugin will enable scale-in protection if it isn't set, because it wants to control which instances are terminated. Without scale-in protection, an instance that is not idle might be terminated when the target capacity is adjusted.
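The lifecycle described here can be sketched roughly as follows. To be clear, this is not the plugin's actual code, just the AWS Auto Scaling calls the described behaviour relies on; the names are placeholders.

```python
# Sketch of the behaviour described above: protect instances from
# scale-in, then retire idle ones explicitly with a capacity decrement
# so AWS doesn't pick a busy instance when the target shrinks.
# Not the plugin's actual code; a real run needs boto3 + credentials.

def protect(client, asg_name, instance_ids, protected=True):
    """Toggle scale-in protection on specific instances."""
    client.set_instance_protection(
        AutoScalingGroupName=asg_name,
        InstanceIds=instance_ids,
        ProtectedFromScaleIn=protected,
    )


def retire_idle(client, instance_id):
    """Terminate one idle agent and shrink the group by one slot."""
    client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=True,
    )
```

With `ShouldDecrementDesiredCapacity=True`, terminating an idle instance does not trigger a replacement, which is why the plugin can safely pick exactly which instance goes away.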

The instances being in a suspended state makes me think there is an issue with setting up the Jenkins agents on the instances. I think I have seen this happen in the past when SSH credentials are misconfigured, but it's been a while and my memory is poor.

Can you provide the logs for this event?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label May 12, 2022

mwos-sl commented May 18, 2022

Can you provide the logs for this event?

We disabled the plugin completely, so I don't have logs with me at the moment. I can try to re-enable it for a while in an upcoming sprint to grab some. Just to confirm: are you talking about these logs?
[screenshot]


mwos-sl commented May 18, 2022

BTW, can we remove the stale label?

@github-actions github-actions bot removed the stale Issues / PRs with no activity label May 18, 2022

haugenj commented May 24, 2022

Yeah, logs from the instance and also the system log, preferably at debug level.

[screenshot]

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Jun 23, 2022
@github-actions
Copy link

This issue was closed because it has become stale with no activity.
