Expose metrics in protobuf format #1604

Closed
invidian opened this issue Oct 11, 2021 · 20 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@invidian
Member

What would you like to be added:

kube-apiserver and kubelet (and probably other core Kubernetes components) support serving Prometheus metrics in protobuf format when the scraper requests it:

> GET /metrics HTTP/2
> Host: 10.0.0.12:10250
> user-agent: curl/7.78.0
> accept: application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited;q=0.7,text/plain;version=0.0.4;q=0.3
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [1841 bytes data]
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
} [5 bytes data]
< HTTP/2 200
< content-type: application/vnd.google.protobuf; proto=io.prometheus.client.MetricFamily; encoding=delimited
< date: Mon, 11 Oct 2021 08:08:42 GMT
<

KSM, however, only responds with the plain-text format, even when protobuf is requested:

09:24:13.100367 lo    In  IP6 (flowlabel 0x0b53e, hlim 64, next-header TCP (6) payload length: 290) ::1.42542 > ::1.8080: Flags [P.], cksum 0x012a (incorrect -> 0xc0de), seq 1:259, ack 1, win 512, options [nop,nop,TS val 963065885 ecr 963065885], length 258: HTTP, length: 258
        GET /metrics HTTP/1.1
        Host: localhost:8080
        User-Agent: Go-http-client/1.1
        Accept: application/vnd.google.protobuf;proto=io.prometheus.client.MetricFamily;encoding=delimited;q=0.7,text/plain;version=0.0.4;q=0.3
        Accept-Encoding: gzip
        Connection: close

09:24:13.100379 lo    In  IP6 (flowlabel 0x541a0, hlim 64, next-header TCP (6) payload length: 32) ::1.8080 > ::1.42542: Flags [.], cksum 0x0028 (incorrect -> 0x7144), ack 259, win 510, options [nop,nop,TS val 963065885 ecr 963065885], length 0
09:24:13.225339 lo    In  IP6 (flowlabel 0x541a0, hlim 64, next-header TCP (6) payload length: 32800) ::1.8080 > ::1.42542: Flags [P.], cksum 0x8028 (incorrect -> 0x23cb), seq 1:32769, ack 259, win 512, options [nop,nop,TS val 963066010 ecr 963065885], length 32768: HTTP, length: 32768
        HTTP/1.1 200 OK
        Content-Type: text/plain; version=0.0.4
        Date: Mon, 11 Oct 2021 07:24:13 GMT
        Connection: close
        Transfer-Encoding: chunked

Protobuf clearly reduces network traffic: in quick testing, the amount of data transferred was 3 to 7 times smaller than with the plain-text format. I've seen issue #498, but I couldn't find anything related to this topic there. Protobuf encoding may also be more CPU efficient.
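For illustration, here is a minimal, hypothetical sketch of how a handler could honor that Accept header using the Prometheus expfmt package: it parses an already-rendered text exposition into MetricFamily messages and re-encodes them in whichever format expfmt.Negotiate selects. This is not how KSM is structured today, the sample metric is made up, and re-parsing text like this would negate much of the efficiency gain; it only demonstrates the negotiation mechanics.

package main

import (
	"log"
	"net/http"
	"strings"

	"github.com/prometheus/common/expfmt"
)

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	// Stand-in for the text exposition KSM already has pre-rendered.
	text := "# HELP kube_pod_info Information about pod.\n" +
		"# TYPE kube_pod_info gauge\n" +
		"kube_pod_info{namespace=\"default\",pod=\"example\"} 1\n"

	// Parse the text format into protobuf MetricFamily messages.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(text))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Negotiate returns delimited protobuf when the Accept header asks for
	// application/vnd.google.protobuf, and the text format otherwise.
	format := expfmt.Negotiate(r.Header)
	w.Header().Set("Content-Type", string(format))

	enc := expfmt.NewEncoder(w, format)
	for _, mf := range families {
		if err := enc.Encode(mf); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
	}
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}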

Why is this needed:

To make KSM consume fewer resources and to bring it more in line with core Kubernetes components.

Additional context

Discovered as part of work on newrelic/nri-kubernetes#234.

@invidian invidian added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 11, 2021
@Serializator
Contributor

DISCLAIMER: I do not mean to criticize the existing code or the person who wrote it. This is only meant as "what should be done to allow for an easier path to implementing Protocol Buffers", not as criticism.

(metricshandler.MetricsHandler).ServeHTTP is responsible for serving HTTP requests for metrics. However, metricshandler.MetricsHandler is responsible not only for writing metrics to the response (to be a bit more specific than "serving HTTP requests") but also for sharding and applying compression (gzip).

https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/metricshandler/metrics_handler.go

I think this does too much and should be refactored before even trying to implement Protocol Buffers.

  • Sharding should be refactored outside of the HTTP handler, such that the HTTP handler is only aware of the metrics that should be written.

  • GZIP compression should be "decorated" onto the HTTP handler or http.ResponseWriter before or after it is passed into (metricshandler.MetricsHandler).ServeHTTP, taking the responsibility of applying compression away from the HTTP handler (see the sketch below).
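A hypothetical sketch of that decoration, assuming a plain http.Handler serves the metrics (withGzip and the dummy handler below are made-up names, not KSM code):

package main

import (
	"compress/gzip"
	"log"
	"net/http"
	"strings"
)

// gzipResponseWriter forwards writes through a gzip.Writer while keeping the
// original http.ResponseWriter for headers and status codes.
type gzipResponseWriter struct {
	http.ResponseWriter
	gz *gzip.Writer
}

func (w gzipResponseWriter) Write(b []byte) (int, error) {
	return w.gz.Write(b)
}

// withGzip wraps any http.Handler and compresses the response when the client
// advertises gzip support, keeping compression out of the metrics handler.
func withGzip(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next.ServeHTTP(w, r)
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		defer gz.Close()
		next.ServeHTTP(gzipResponseWriter{ResponseWriter: w, gz: gz}, r)
	})
}

func main() {
	// Dummy metrics handler: it only writes bytes and knows nothing about
	// compression or sharding.
	metrics := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("kube_pod_info{namespace=\"default\",pod=\"example\"} 1\n"))
	})
	http.Handle("/metrics", withGzip(metrics))
	log.Fatal(http.ListenAndServe(":8080", nil))
}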

This goes further into (metricsstore.MetricsWriter).WriteAll, which also makes assumptions about the data format by manually writing \n to the response after the "HELP" line of a metric.

https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/metrics_store/metrics_writer.go#L61-L69

In metricsstore.MetricsStore the metrics are already kept as a slice of byte slices ([][]byte), though I don't know whether this is a problem for implementing Protocol Buffers or not.

https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/metrics_store/metrics_store.go#L39

/cc @fpetkovski, what do you think? Both about the specific arguments I made regarding the code that should be refactored, and about the approach of doing the refactor before implementing Protocol Buffers to keep each change small and manageable.

A refactor would not only separate responsibilities but also involve thinking ahead about which abstractions to put in place to allow for an easier implementation of Protocol Buffers (or other formats, for that matter) in the future.

@Serializator
Contributor

I asked this in Slack as well, but will ask it here too so that every question and answer is kept within the issue and doesn't get lost 👍🏼

I was thinking about and prototyping a Protocol Buffers implementation in KSM, and a simple question arose that may have a simple answer, but I don't know it.

Why hasn't KSM used the Prometheus Go client library to implement its metrics? The question arose because the client library already seems to support Protocol Buffers.

https://kubernetes.slack.com/archives/CJJ529RUY/p1640638803044600
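For reference, a minimal sketch of what using the client library could look like (the kube_pod_info gauge below is purely illustrative, not KSM's actual registration code). As far as I can tell, promhttp already negotiates the exposition format, including delimited protobuf, from the scraper's Accept header:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Illustrative metric registered with the default registry via promauto.
var podInfo = promauto.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kube_pod_info",
		Help: "Information about pod.",
	},
	[]string{"namespace", "pod"},
)

func main() {
	podInfo.WithLabelValues("default", "example").Set(1)

	// promhttp.Handler serves the default registry and picks text or
	// protobuf based on the Accept header sent by the scraper.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}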

@invidian
Member Author

invidian commented Jan 3, 2022

Thanks for looking into it @Serializator!

Why hasn't KSM used the Go client library by Prometheus to implement its metrics?

I forgot to mention that in the opening post. I think that would definitely be a preferable solution for this issue!

@Serializator
Contributor

From @fpetkovski

The main reason is that KSM dumps a lot of metrics, especially in large clusters, and using the Go client library has proven to be slow and memory-intensive in the past.

This might be a useful read to get a bit more context prometheus/client_golang#917

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2022
@fpetkovski
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2022
@invidian
Member Author

invidian commented Jul 3, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2022
@invidian
Member Author

invidian commented Oct 1, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2022
@invidian
Member Author

invidian commented Jan 2, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2023
@invidian
Member Author

invidian commented Apr 2, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 2, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2023
@invidian
Member Author

invidian commented Jul 3, 2023

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Mar 23, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
