
k8s.pod.phase not providing correct info if my pod status is CrashLoopBackOff #33797


Open
abhishekmahajan0709222 opened this issue Jun 27, 2024 · 12 comments
Labels
bug (Something isn't working), receiver/k8scluster, Stale

Comments

@abhishekmahajan0709222

Component(s)

receiver/k8scluster

What happened?

Description

My pods are in CrashLoopBackOff, but the metric is still showing the Running status.

Steps to Reproduce

Expected Result

It should give us the correct phase when my pods are in CrashLoopBackOff.

Actual Result

It's giving us the Running phase, which is incorrect.

Collector version

Latest (v0.103.0)

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@abhishekmahajan0709222 added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jun 27, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@denmanveer

This is affecting us as well. @dmitryax @TylerHelmuth @povilasv, can anyone help, please?
Why is a pod in CrashLoopBackOff shown as Running in the metrics?
Thank you

@abhishekmahajan0709222
Author

@dmitryax @TylerHelmuth @povilasv

Is there any update on this issue?

@povilasv
Contributor

Hey, we map the Kubernetes pod.Status.Phase field to the k8s.pod.phase metric, using the following mapping:

func phaseToInt(phase corev1.PodPhase) int32 {
	switch phase {
	case corev1.PodPending:
		return 1
	case corev1.PodRunning:
		return 2
	case corev1.PodSucceeded:
		return 3
	case corev1.PodFailed:
		return 4
	case corev1.PodUnknown:
		return 5
	default:
		return 5
	}
}

If the metric was showing Running, then the pod's phase at that time was Running.

There is no pod status phase for CrashLoopBackOff. For that, see issue #32457.

@abhishekmahajan0709222
Author

@povilasv, that's effectively reporting wrong information.

You can see in the image below that the status is shown as CrashLoopBackOff:
image

But the metric is showing Running:
image

@povilasv
Contributor

Could you paste the output of kubectl get pod x -o yaml?

It should have a "phase" field:

  hostIP: 172.18.0.2
  hostIPs:
  - ip: 172.18.0.2
  phase: Running
  podIP: 172.18.0.2
  podIPs:
  - ip: 172.18.0.2
  qosClass: Burstable
  startTime: "2024-08-20T05:22:28Z"

@povilasv
Contributor

povilasv commented Aug 20, 2024

I think I found the issue. Basically, the K8s docs state this:

// PodStatus represents information about the status of a pod. Status may trail the actual
// state of a system, especially if the node that hosts the pod cannot contact the control
// plane.
type PodStatus struct {
	// The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle.
	// The conditions array, the reason and message fields, and the individual container status
	// arrays contain more detail about the pod's status.
	// There are five possible phase values:
	//
	// Pending: The pod has been accepted by the Kubernetes system, but one or more of the
	// container images has not been created. This includes time before being scheduled as
	// well as time spent downloading images over the network, which could take a while.
	// Running: The pod has been bound to a node, and all of the containers have been created.
	// At least one container is still running, or is in the process of starting or restarting.
	// Succeeded: All containers in the pod have terminated in success, and will not be restarted.
	// Failed: All containers in the pod have terminated, and at least one container has
	// terminated in failure. The container either exited with non-zero status or was terminated
	// by the system.
	// Unknown: For some reason the state of the pod could not be obtained, typically due to an
	// error in communicating with the host of the pod.
	//

I think the CrashLoopBackOff state fits into the K8s "Running" category:

Running: The pod has been bound to a node, and all of the containers have been created.
At least one container is still running, or is in the process of starting or restarting.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@genadipost

Any update or workaround?

@github-actions bot removed the Stale label on Dec 27, 2024
@r-asiebert

This seems to be a case where kubectl's nicely formatted output can be confusing: I believe @povilasv is correct, and CrashLoopBackOff is not a Kubernetes pod phase.
Instead, it is a "reason" in the state/status of each container, which kubectl get pods rolls up into an aggregated "status" column based on information from the containers or the pod itself. (1)

To see the data straight from Kubernetes, use kubectl get pod FAILING_POD -o yaml:

$ k get pods
NAME          READY   STATUS             RESTARTS        AGE
FAILING_POD   0/1     CrashLoopBackOff   6 (4m30s ago)   10m

$ k get pod FAILING_POD -o yaml
apiVersion: v1
kind: Pod
...
  containerStatuses:
  - containerID: containerd://a040a3dcd7794a65b06894738892a1c6b1460b6a366b5cc27b2708e147947309
    image: public.ecr.aws/lts/ubuntu:edge
    imageID: public.ecr.aws/lts/ubuntu@sha256:da20fb875cfefd317c49e7aaf3998d3e5ad42c5b20f34a0eec6dca2fe4fbb8f4
    lastState:
      terminated:
        containerID: containerd://a040a3dcd7794a65b06894738892a1c6b1460b6a366b5cc27b2708e147947309
        exitCode: 128
        finishedAt: "2025-01-21T00:04:51Z"
        message: 'failed to create containerd task: failed to create shim task: OCI
          runtime create failed: runc create failed: unable to start container process:
          exec: "/bin/fail": stat /bin/fail: no such file or directory: unknown'
        reason: StartError
        startedAt: "1970-01-01T00:00:00Z"
    name: ubuntu
    ready: false
    restartCount: 6
    started: false
    state:
      waiting:
        message: back-off 5m0s restarting failed container=ubuntu pod=FAILING_POD-pod_default(f398ecb6-77fe-42b4-8c96-530716daee5f)
        reason: CrashLoopBackOff
  phase: Running
  ...

Note the pod phase: Running, and the container CrashLoopBackOff buried pretty deep.
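
To make the distinction concrete, here is a minimal Go sketch (assuming a populated corev1.Pod, for example one obtained via client-go; describePod is just an illustrative name, not receiver code) that reads both the phase and the per-container waiting reason shown in the YAML above:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// describePod prints the pod-level phase (what k8s.pod.phase reports) and the
// per-container waiting reason, which is where CrashLoopBackOff actually appears.
func describePod(pod *corev1.Pod) {
	// High-level phase: Pending, Running, Succeeded, Failed, or Unknown.
	fmt.Println("phase:", pod.Status.Phase)

	// CrashLoopBackOff is a waiting reason on individual container statuses,
	// not a pod phase.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil {
			fmt.Printf("container %s waiting: %s\n", cs.Name, cs.State.Waiting.Reason)
		}
	}
}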

--

So this issue is likely invalid; k8s.pod.phase is accurate.

On the other hand, OTel's k8sclusterreceiver could expose this useful information in another way.
The similar kube-state-metrics project exposes a kube_pod_container_status_waiting_reason metric, and its reason label may be set to CrashLoopBackOff, ErrImagePull, etc. (Note there's a "waiting_reason" metric, but also "terminated_reason", etc.)
The OTel Collector could have something similar.
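
As a rough sketch of what that could look like, here is a hypothetical waitingReasonToInt helper modeled on the existing phaseToInt (the reason strings and numeric values below are assumptions for illustration, not existing collector code; corev1 is k8s.io/api/core/v1 as in the sketch above):

// Hypothetical: map common container waiting reasons to integer values,
// in the same spirit as phaseToInt. Not part of the current k8sclusterreceiver.
func waitingReasonToInt(reason string) int32 {
	switch reason {
	case "ContainerCreating":
		return 1
	case "CrashLoopBackOff":
		return 2
	case "ErrImagePull":
		return 3
	case "ImagePullBackOff":
		return 4
	case "CreateContainerConfigError":
		return 5
	default:
		return 0
	}
}

// Hypothetical collection loop: one value per container that is currently waiting.
func collectWaitingReasons(pod *corev1.Pod) map[string]int32 {
	reasons := make(map[string]int32)
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil {
			reasons[cs.Name] = waitingReasonToInt(cs.State.Waiting.Reason)
		}
	}
	return reasons
}

An alternative closer to kube-state-metrics would be to emit the reason string as a metric attribute rather than encoding it in the numeric value.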


Notes:

  • (1) I don't know what heuristics are used by kubectl: with two broken containers, one in ErrImagePull and the other with CrashLoopBackOff, and the pod in phase: Pending, k get pods shows a "status" of ImagePullBackOff. Why this one over the other, why not "Pending"? It's convenient, but...

@r-asiebert

Confirmation: wading through the Kubernetes docs on pod phases, I found that they call out this confusion explicitly:

Note:
When a pod is failing to start repeatedly, CrashLoopBackOff may appear in the Status field of some kubectl commands. Similarly, when a pod is being deleted, Terminating may appear in the Status field of some kubectl commands.

Make sure not to confuse Status, a kubectl display field for user intuition, with the pod's phase. Pod phase is an explicit part of the Kubernetes data model and of the Pod API.

NAMESPACE               NAME               READY   STATUS             RESTARTS   AGE
alessandras-namespace   alessandras-pod    0/1     CrashLoopBackOff   200        2d9h

A Pod is granted a term to terminate gracefully, which defaults to 30 seconds. You can use the flag --force to terminate a Pod by force.

IMO this issue can be closed as invalid, or turned into a feature request for an additional signal exposing CrashLoopBackOff and the like.

Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot added the Stale label on Mar 24, 2025