Unable to scrape metrics for kubevirt-hyperconverged-operator #3338

Open
nakkoh opened this issue Mar 12, 2025 · 7 comments

@nakkoh

nakkoh commented Mar 12, 2025

What happened:

Prometheus is unable to scrape metrics for kubevirt-hyperconverged-operator.

(screenshot attached)

The cause of this problem appears to be the lack of authorization.
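For reference, the exact scrape error can be read from the Prometheus targets API inside the pod. A minimal sketch, assuming the pod is prometheus-k8s-0 in openshift-monitoring and Prometheus listens on localhost:9090:

# Print the lastError field of each scrape target (assumed pod name and port)
$ oc -n openshift-monitoring exec prometheus-k8s-0 -- \
    curl -s http://localhost:9090/api/v1/targets | grep -o '"lastError":"[^"]*"'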

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Deploy the KubeVirt HyperConverged Cluster Operator on OKD.

Additional context:

Environment:

  • KubeVirt version (use virtctl version): v1.4.0
  • Kubernetes version (use kubectl version): v1.31.6
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration:
    • control plane nodes: OpenStack
    • compute nodes: Baremetal
  • OS (e.g. from /etc/os-release): CentOS Stream CoreOS 418.9.202503040632-0
  • Kernel (e.g. uname -a): 5.14.0-570.el9.x86_64
  • Install tools: N/A
  • Others:
    • OKD 4.18.0-okd-scos.3
@orenc1
Collaborator

orenc1 commented Mar 12, 2025

Hi @machadovilaca , could you please check?
I suspect it's related to #3303

@machadovilaca
Member

Hello @nakkoh,

Can you share the hco-operator pod logs and the config of the kubevirt-hyperconverged-operator-metrics ServiceMonitor?
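For example, something along these lines (the exact commands are only a suggestion):

$ oc -n kubevirt-hyperconverged logs deploy/hco-operator > hco_operator.log
$ oc -n kubevirt-hyperconverged get servicemonitor kubevirt-hyperconverged-operator-metrics -o yaml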

@nakkoh
Author

nakkoh commented Mar 13, 2025

Thank you @machadovilaca

Please refer to the attached file for the hco-operator logs.
hco_operator.log

The kubevirt-hyperconverged-operator-metrics ServiceMonitor is defined as follows:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: "2025-02-28T06:31:10Z"
  generation: 16
  labels:
    app: kubevirt-hyperconverged
    app.kubernetes.io/component: monitoring
    app.kubernetes.io/managed-by: hco-operator
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 1.14.0
  name: kubevirt-hyperconverged-operator-metrics
  namespace: kubevirt-hyperconverged
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: false
    kind: Deployment
    name: hco-operator
    uid: 19e2af5c-076d-4bbe-9450-f0d978e830be
  resourceVersion: "19149989"
  uid: b6df0cc7-c4e4-4c30-8039-8b22e0a89c12
spec:
  endpoints:
  - authorization:
      credentials:
        key: token
        name: hco-bearer-auth
    port: http-metrics
  namespaceSelector: {}
  selector:
    matchLabels:
      app: kubevirt-hyperconverged
      app.kubernetes.io/component: monitoring
      app.kubernetes.io/managed-by: hco-operator
      app.kubernetes.io/part-of: hyperconverged-cluster
      app.kubernetes.io/version: 1.14.0
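For context, the authorization block above tells Prometheus to read the bearer token from the token key of a Secret in the same namespace. A sketch of what the referenced Secret is expected to look like (the value is a placeholder, not the real token):

apiVersion: v1
kind: Secret
metadata:
  name: hco-bearer-auth
  namespace: kubevirt-hyperconverged
type: Opaque
data:
  token: <base64-encoded bearer token>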

@nakkoh
Author

nakkoh commented Mar 13, 2025

I noticed that the metrics can now be scraped.
(screenshot attached)

I don't know if this is relevant, but I made the following configuration change to Prometheus:
https://docs.okd.io/4.18/observability/monitoring/configuring-core-platform-monitoring/storing-and-recording-data.html#configuring-a-persistent-volume-claim_storing-and-recording-data
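Concretely, the change described on that page amounts to adding a volumeClaimTemplate under prometheusK8s in the cluster-monitoring-config ConfigMap, roughly like this (the storage size is only an example value):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi

As far as I understand, applying this redeploys the prometheus-k8s pods, so it may be related.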

@machadovilaca
Member

The ServiceMonitor looks correct, and the logs are not the original ones, so we might not see the problem there.
I was looking for something that might indicate that the ServiceMonitor was not correctly reconciled and that we failed to create the tokens.

In these logs we don't see anything working incorrectly and we are also able to see:
{"level":"info","ts":"2025-03-12T15:32:57Z","msg":"Starting EventSource","controller":"hyperconverged-controller","source":"kind source: *v1.ServiceMonitor"}

@machadovilaca
Member

Maybe this is somehow related to the timing between the resource update and Prometheus syncing its config with it. Probably unlikely, but worth checking, I think.
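One cheap way to check that timing is Prometheus's own config-reload metrics, read from inside the pod (a sketch; the pod name and local port are assumptions):

# Shows when the running config was last reloaded successfully
$ oc -n openshift-monitoring exec prometheus-k8s-0 -- \
    curl -s http://localhost:9090/metrics | grep prometheus_config_last_reload

If prometheus_config_last_reload_success_timestamp_seconds predates the last update of the ServiceMonitor or the secret, the running config is simply stale.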

@nakkoh
Author

nakkoh commented Mar 14, 2025

I have not made any changes, but I noticed that the metrics can no longer be collected.

I tried to fetch the metrics from inside the prometheus pod, and it worked.

$ oc -n kubevirt-hyperconverged get secret hco-bearer-auth -o jsonpath='{.data.token}' | base64 -d
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.mQMUDvdzAZcBLtY-aAQ0Am5_Qxe3GNISjGxnoqe7aI4
$ oc -n openshift-monitoring exec -it prometheus-k8s-0 -- bash
bash-5.1$ export TOKEN=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.mQMUDvdzAZcBLtY-aAQ0Am5_Qxe3GNISjGxnoqe7aI4
bash-5.1$ curl -H "Authorization: Bearer ${TOKEN}" http://10.130.2.28:8383/metrics
# HELP certwatcher_read_certificate_errors_total Total number of certificate read errors
# TYPE certwatcher_read_certificate_errors_total counter
certwatcher_read_certificate_errors_total 0

... snip ...

Next, I checked the scrape config in the prometheus pod and confirmed that the token value differs from the one defined in the hco-bearer-auth secret.

$ oc -n openshift-monitoring exec prometheus-k8s-0 -- cat /etc/prometheus/config_out/prometheus.env.yaml
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    prometheus: openshift-monitoring/k8s
    prometheus_replica: prometheus-k8s-0
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
scrape_configs:
- job_name: serviceMonitor/kubevirt-hyperconverged/kubevirt-hyperconverged-operator-metrics/0
  honor_labels: false
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - kubevirt-hyperconverged
  authorization:
    type: Bearer
    credentials: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.twlCgfRdTBHjHHVVLZSffx6ZMYGc-rHxQGLkIrGhUg4
  relabel_configs:

... snip ...

So it seems the ServiceMonitor is not reflected in the Prometheus configuration.
Is this a prometheus-operator issue?
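Two checks that might narrow it down (a sketch; the prometheus-k8s secret name, the prometheus.yaml.gz key, and the container name follow common prometheus-operator conventions and are assumptions for this cluster): whether the operator logs show errors around this ServiceMonitor, and whether the config it generates already contains the stale token.

$ oc -n openshift-monitoring logs deploy/prometheus-operator -c prometheus-operator | grep -i error
$ oc -n openshift-monitoring get secret prometheus-k8s \
    -o jsonpath='{.data.prometheus\.yaml\.gz}' | base64 -d | gunzip -c | \
    grep -A 3 kubevirt-hyperconverged-operator-metrics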
