Skip to content

Remove metrics listener type in favor of http listener with metrics resource #18584

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 21 commits into
base: develop
Choose a base branch
from

Conversation

MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Jun 20, 2025

Remove metrics listener type in favor of http listener with metrics resource (which already exists).

This is spawning from wanting to refactor which registry the metrics are pulled from and noticing two sources of truth. The refactor itself is spawning from running multiple Synapse in the same Python process and wanting separate metrics (see #18592)

This will require homeserver admins to update their homeserver configs in some cases. I've drafted upgrade notes for the release below.

Split out from #18443

Part of #18592

Why did we have a separate metrics listener type?

See discussion below: #18584 (comment)

Draft upgrade notes

Drafted to make it easy to add to docs/upgrade.md for whoever is making the release.

Upgrade notes

Upgrading to vUPCOMING

Dedicated metrics listener removed in favor of metrics resource in the http listener

The dedicated type: metrics listener has been removed. Use the metrics resource in an type: http listener instead.

Example before:

homeserver.yaml

listeners:
  - port: 9000
    type: metrics
    bind_addresses: ['::1', '127.0.0.1']

Example after:

homeserver.yaml

listeners:
  - port: 9000
    type: http
    bind_addresses: ['::1', '127.0.0.1']
    resources:
      - names: [metrics]
        compress: false

Note: The endpoint path changes from /metrics to /_synapse/metrics - update your metrics scraper (e.g., Prometheus) accordingly.

See the Metrics How-to page for more information on how to setup and configure metrics.

Testing strategy

See behavior of previous metrics listener

  1. Add the metrics listener in your homeserver.yaml
    listeners:
      - port: 9323
        type: metrics
        bind_addresses: ['127.0.0.1']
  2. Start the homeserver: poetry run synapse_homeserver --config-path homeserver.yaml
  3. Fetch http://localhost:9323/metrics
  4. Observe response: example from this PR

See behavior of the http metrics resource

  1. Add the metrics resource to a new or existing http listeners in your homeserver.yaml
    listeners:
      - port: 9322
        type: http
        bind_addresses: ['127.0.0.1']
        resources:
          - names: [metrics]
            compress: false
  2. Start the homeserver: poetry run synapse_homeserver --config-path homeserver.yaml
  3. Fetch http://localhost:9322/_synapse/metrics (it's just a GET request so you can even do in the browser)
  4. Observe response: example, example from develop

There is no difference in the response from this PR to develop or from this resource to the metrics listener being removed.

Dev notes

  • metrics listener uses listen_metrics
  • http listener with metrics resource uses MetricsResource

Related PRs:

_set_prometheus_client_use_created_metrics

_set_prometheus_client_use_created_metrics() was added in matrix-org/synapse#13540 (2022-08-16)

But now this functionality has dedicated functions inprometheus_client via disable_created_metrics()/enable_created_metrics() which was added in prometheus/client_python#973 (2023-10-26). Part of [email protected] (2023-10-30).

Our deprecation policy states that we support whatever is provided by Debian oldstable. Which for python-prometheus-client is currently 0.6.0-1 (too old).

Todo

  • Figure out _set_prometheus_client_use_created_metrics

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

@github-actions github-actions bot deployed to PR Documentation Preview June 20, 2025 16:58 Active
@github-actions github-actions bot deployed to PR Documentation Preview June 20, 2025 20:11 Active
@github-actions github-actions bot deployed to PR Documentation Preview June 20, 2025 20:15 Active
@github-actions github-actions bot deployed to PR Documentation Preview June 20, 2025 20:40 Active
@@ -0,0 +1 @@
Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
Copy link
Contributor Author

@MadLittleMods MadLittleMods Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opinions on calling out the upgrade notes from the changelog entry itself? It looks like usually we prefer to call it out at the top of the changelog section for the version. Doesn't seem like it would hurt to have it in both places. The only problem being the chicken and egg problem of the upgrade notes being written later in order to have the link filled in.

Suggested change
Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`). Please check the [relevant section in the upgrade notes](TODO).

@github-actions github-actions bot deployed to PR Documentation Preview June 20, 2025 21:19 Active
@github-actions github-actions bot deployed to PR Documentation Preview June 20, 2025 21:42 Active
@github-actions github-actions bot deployed to PR Documentation Preview June 20, 2025 22:02 Active
Comment on lines +198 to +200
# `prometheus_client.metrics` was added in 0.5.0, so we require that too.
# We chose 0.6.0 as that is the current version in Debian Buster (oldstable).
prometheus-client = ">=0.6.0"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change is here.

The rest is auto-formatting which I can remove from the diff if prompted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted this file in favor of re-organizing the remaining pieces in more relevant file names.

Comment on lines -189 to -201
class PrometheusMetricsHackTestCase(unittest.HomeserverTestCase):
if parse_version(metadata.version("prometheus_client")) < parse_version("0.14.0"):
skip = "prometheus-client too old"

def test_created_metrics_disabled(self) -> None:
"""
Tests that a brittle hack, to disable `_created` metrics, works.
This involves poking at the internals of prometheus-client.
It's not the end of the world if this doesn't work.

This test gives us a way to notice if prometheus-client changes
their internals.
"""
Copy link
Contributor Author

@MadLittleMods MadLittleMods Jun 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this test as it claims to "gives us a way to notice if prometheus-client changes their internals." but we already have that benefit from the hack function _set_prometheus_client_use_created_metrics itself (it will yell if the _use_created attribute is no longer used).

The test provides no further benefit beside asserting that the attribute was set which feels pretty useless. A proper test would ensure that before running the utility, we see the legacy _created metrics and after we don't.

@github-actions github-actions bot deployed to PR Documentation Preview June 23, 2025 16:58 Active
@github-actions github-actions bot deployed to PR Documentation Preview June 23, 2025 17:16 Active
Comment on lines -21 to -23
from prometheus_client import generate_latest

from synapse.metrics import REGISTRY
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stray thing that I noticed. Making it consistent that we pull from synapse.metrics for generate_latest since it's something we're explicitly exporting

__all__ = [
"Collector",
"MetricsResource",
"generate_latest",

@@ -0,0 +1 @@
Remove `metrics` listener (serves metrics at `/metrics`) type in favor of `http` listener with `metrics` resource (serves metrics at `/_synapse/metrics`).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me why metrics was its own listener type before. The only difference I can see is that they are served on different paths: /metrics vs /_synapse/metrics. Feels like we just need to include this in the upgrade notes (drafted in the PR description).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that @erikjohnston pointed out is that the dedicated metrics listener uses it's own thread and doesn't rely on the Twisted reactor so it still works regardless of how well Synapse is working. It's unclear if we really care about this benefit. It does have a theoretical benefit of being able to monitor a really bogged down Synapse.

Looking at the git history, the threading aspect just seems like the way it was originally implemented and not something explicitly called out for the benefit described above.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with the updated conclusion from #18592 (comment) which most likely means we will continue to use the global REGISTRY, I think it's still worth distilling down to a single way to setup metrics; for the config clarity sake and for the codebase single source of truth sake.

(assuming we can't come up with a benefit/compelling reason for why we want both)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having thought about it a bit: I think it is worth keeping the separate metrics listener. I can't find where (and I looked), but I believe we have ported stuff to use the separate metrics listener in the past to get metrics from the process when its under extreme load (to the extent where twisted can't process new requests).

Even if we keep the separate listener, I think we should still make them render from the same source and remove any differences. If needed its easy enough to change the metrics listener to not use the inbuilt prometheus one, but instead replace it by spawning a thread and starting a basic web server that responds to /metrics.

Copy link
Contributor Author

@MadLittleMods MadLittleMods Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separate metrics listener in the past to get metrics from the process when its under extreme load (to the extent where twisted can't process new requests).

How sure are we that we care about this? Feels like a use case that we're just holding onto especially when we can't find any evidence.

From what I'm reading, threading.Thread (which prometheus_client.start_http_server_prometheus(...) uses), shares the same process and memory space so it doesn't even help us in the scenarios where Synapse is under extreme load.

[...] the threading module operates within a single process, meaning that all threads share the same memory space. However, the GIL limits the performance gains of threading when it comes to CPU-bound tasks, as only one thread can execute Python bytecode at a time.

-- https://docs.python.org/3/library/threading.html#gil-and-performance-considerations


Even if we keep the separate listener, I think we should still make them render from the same source and remove any differences.

I don't see an easy way forward as I don't know how to start another Twisted server in another thread in order to share the MetricsResource. It's a pretty simple resource so trying to abstract it any further isn't useful and we might as well keep the status-quo of disparate handling 🤷

I can pull out some of the clean-up from this PR to ship separately.

@github-actions github-actions bot deployed to PR Documentation Preview June 23, 2025 17:52 Active
@MadLittleMods MadLittleMods marked this pull request as ready for review June 23, 2025 19:26
@MadLittleMods MadLittleMods requested a review from a team as a code owner June 23, 2025 19:26
@github-actions github-actions bot deployed to PR Documentation Preview June 26, 2025 22:36 Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants