
Unable to make external metrics unnamespaced #690

Open
@Space-Banana-42

Description

What happened?:

Issue 1: the namespaced: false config doesn't work
We tried to

  • expose an external metric to the Kubernetes API as non-namespaced, because the metric itself comes from Kafka running on another cluster and cannot be directly related to any Kubernetes resource, and
  • use it in an HPA.

We applied the setting with the prometheus-adapter Helm chart using this command:
helm upgrade -i prometheus-adapter prometheus-community/prometheus-adapter --version 4.11.0 -n monitoring -f namespaces/monitoring/prometheus-adapter-values.yaml
With this config in prometheus-adapter-values.yaml:

rules:
  default: true
  external:
  - metricsQuery: aws_kafka_sum_offset_lag_sum
    name:
      as: aws_kafka_sum_offset_lag_sum
      matches: aws_kafka_sum_offset_lag_sum
    resources:
      namespaced: false
    seriesQuery: '{__name__=~"aws_kafka_sum_offset_lag_sum"}'

We can confirm that our config was applied by running kubectl get cm -n monitoring prometheus-adapter -o yaml, which shows:

    externalRules:
    - metricsQuery: aws_kafka_sum_offset_lag_sum
      name:
        as: aws_kafka_sum_offset_lag_sum
        matches: aws_kafka_sum_offset_lag_sum
      resources:
        namespaced: false
      seriesQuery: '{__name__=~"aws_kafka_sum_offset_lag_sum"}'

However, when we verify whether the metric appears in the Kubernetes API by running kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq ., we get:

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "aws_kafka_sum_offset_lag_sum",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

You can see that "namespaced" is true here, which contradicts our config.
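
A quicker way to see just this flag (a sketch using jq; the metric name is ours from above):

# print only the scope the adapter advertises for our external metric
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 \
  | jq '.resources[] | select(.name == "aws_kafka_sum_offset_lag_sum") | .namespaced'
# currently prints true, while we expected false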

Issue 2: label filtering doesn't work

We changed the adapter config to this:

rules:
  default: true
  external:
  - metricsQuery: sum(aws_kafka_sum_offset_lag_sum) by (cluster_name, consumer_group, topic)
    name:
      as: aws_kafka_sum_offset_lag_sum
      matches: aws_kafka_sum_offset_lag_sum
    resources:
      overrides:
        cluster_name: { resource: "namespace" }
    seriesQuery: '{__name__=~"aws_kafka_sum_offset_lag_sum"}'
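
To double-check that Prometheus itself has series carrying the labels we filter on in the HPA below, this is the kind of direct query we would run (a sketch; the prometheus-operated service name is an assumption about our kube-prometheus-stack install, and the local port is arbitrary):

# port-forward Prometheus, then run the same aggregation with explicit label matchers
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(aws_kafka_sum_offset_lag_sum{cluster_name="FOO",consumer_group="bar",topic="foobar1"}) by (cluster_name, consumer_group, topic)' \
  | jq '.data.result'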

And the HPA spec:

  - external:
      metric:
        name: aws_kafka_sum_offset_lag_sum
        selector:
          matchLabels:
            cluster_name: FOO
            consumer_group: bar
            topic: foobar1
      target:
        type: Value
        value: "500"
    type: External
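
To see what the HPA actually observes, its status can be inspected like this (a sketch; kafka-lag-hpa is a placeholder for our real HPA name):

# show the current external metric value the HPA sees, plus scaling events
kubectl -n dna-alive-matchmaking-dev describe hpa kafka-lag-hpa
# or dump the raw status, including currentMetrics
kubectl -n dna-alive-matchmaking-dev get hpa kafka-lag-hpa -o yaml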

However, label filtering does not work as expected: the HPA picks up all metric series, including series whose labels do not match.
We can verify this by running

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/dna-alive-matchmaking-dev/aws_kafka_sum_offset_lag_sum?labelSelector=cluster_name%3DFOO%2Cconsumer_group%3Dbar%2Ctopic%3Dfoobar1" | jq .

And indeed, every series comes back, including those whose labels do not match the selector.
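
The only workaround we can think of is to move the label matchers into the metricsQuery template itself, assuming the adapter's <<.Series>> and <<.LabelMatchers>> template variables apply to external rules the same way they do to custom-metric rules (a sketch we have not verified):

rules:
  external:
  - seriesQuery: '{__name__=~"aws_kafka_sum_offset_lag_sum"}'
    name:
      as: aws_kafka_sum_offset_lag_sum
      matches: aws_kafka_sum_offset_lag_sum
    resources:
      overrides:
        cluster_name: { resource: "namespace" }
    # let the adapter substitute the selector from the API request into the query
    metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (cluster_name, consumer_group, topic)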

Issue 3: with the same config, kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 sometimes returns empty results
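
A rough way to reproduce this (a sketch; the loop count and interval are arbitrary):

# poll the discovery endpoint and print how many resources it advertises each time;
# the list intermittently comes back empty
for i in $(seq 1 20); do
  kubectl get --raw /apis/external.metrics.k8s.io/v1beta1 | jq '.resources | length'
  sleep 10
done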

We are not sure whether these three issues are related, so we are reporting all of them here. Kindly help!

Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:

We notice timeouts like the following in the adapter logs all the time, and are unsure whether they are related to the issues above:

E0225 01:36:07.768071       1 writers.go:135] apiserver was unable to write a fallback JSON response: http: Handler timeout
E0225 01:36:07.767982       1 writers.go:135] apiserver was unable to write a fallback JSON response: http2: stream closed
E0225 01:36:07.767987       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0225 01:36:07.767987       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0225 01:36:07.767993       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"http2: stream closed"}: http2: stream closed
E0225 01:36:07.769167       1 writers.go:135] apiserver was unable to write a fallback JSON response: http2: stream closed
E0225 01:36:07.769194       1 timeout.go:142] post-timeout activity - time-elapsed: 2.777574ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0225 01:36:07.769210       1 timeout.go:142] post-timeout activity - time-elapsed: 3.077884ms, GET "/apis/custom.metrics.k8s.io/v1beta1" result: <nil>
E0225 01:36:07.769226       1 writers.go:135] apiserver was unable to write a fallback JSON response: http2: stream closed
E0225 01:36:07.769227       1 writers.go:135] apiserver was unable to write a fallback JSON response: http2: stream closed
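
For completeness, this is roughly how we collect the adapter logs (a sketch; we assume the chart's logLevel value is the right knob for raising verbosity to 6 in chart 4.11.0):

# redeploy the adapter with verbose logging, then tail it
helm upgrade -i prometheus-adapter prometheus-community/prometheus-adapter \
  --version 4.11.0 -n monitoring \
  -f namespaces/monitoring/prometheus-adapter-values.yaml \
  --set logLevel=6
kubectl -n monitoring logs deploy/prometheus-adapter --since=30m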

Anything else we need to know?:

Environment:

  • prometheus-adapter image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.12.0
  • prometheus-adapter chart: 4.11.0
  • prometheus image version: quay.io/prometheus/prometheus:v2.48.1
  • prometheus chart version: prometheus-community/kube-prometheus-stack --version 55.7.1
  • Kubernetes version (use kubectl version): v1.30.9-eks-8cce635
  • Cloud provider or hardware configuration: AWS EKS v1.30.9-eks-8cce635
  • Other info: we assign enough CPU/memory resources and can verify that they are not the bottleneck.
