
add api to list all alertmanager configs and rule groups #3529

Merged
merged 33 commits into cortexproject:master
Apr 21, 2021

Conversation

lzh-lab
Contributor

@lzh-lab lzh-lab commented Nov 23, 2020

What this PR does:

add an API to list all user Alertmanager configs

Which issue(s) this PR fixes:
Fixes #3530

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@lzh-lab lzh-lab changed the title from add alertmanager all configs to add api to list all alertmanager configs Nov 23, 2020
@lzh-lab lzh-lab changed the title from add api to list all alertmanager configs to add api to list all alertmanager configs and rule groups Nov 23, 2020
@pracucci
Contributor

Thanks @AllenzhLi for your PR. @gotjosh could you do the first pass review, please?

@gotjosh
Contributor

gotjosh commented Dec 9, 2020

Hi @AllenzhLi, apologies for the delay on this and thank you for your contribution - I have it on my agenda to review this soon. Before I do, could you please add at least a unit test to this?

Contributor

@pracucci pracucci left a comment


Thanks for your contribution! A couple of quick things I would add:

  • A CHANGELOG entry
  • Documentation of new endpoints to docs/api/_index.md
  • Unit tests

Signed-off-by: allenzhli <[email protected]>
@lzh-lab lzh-lab reopened this Dec 14, 2020
@lzh-lab
Contributor Author

lzh-lab commented Dec 14, 2020

Thanks for your contribution! A couple of quick things I would add:

  • A CHANGELOG entry
  • Documentation of new endpoints to docs/api/_index.md
  • Unit tests

The above is done.

Contributor

@pstibrany pstibrany left a comment


Thank you for working on it, and apologies for the super-slow review cycle.

I've tried to test this code (built Cortex and started the Alertmanager), but unfortunately it doesn't work. The reason is that StreamWriteResponse expects the response writer to implement http.Flusher, but Cortex internally uses a response writer that doesn't. In that case StreamWriteResponse returns an error and never drains the channel (iter), which causes the goroutines calling Put (sending to the channel) to block forever.

Please do the following changes:

  • remove RespIterator
  • change StreamWriteResponse so that it does not require http.Flusher. If the response writer implements http.Flusher, the method can call Flush, but it should not require it.
  • change the functions passed to concurrency.ForEachUser in both ListAllConfigs and ListAllRules as follows: instead of calling iter.Put(data), use the following idiom (assuming iter is now a channel):
select {
case iter <- data:
case <-done:
}

This allows both ListAllConfigs and ListAllRules to stop early if sending the response has already finished.

  • Instead of using done <- struct{}{} to signal that the response has been sent, please use close(done). The difference is that a closed channel can be read repeatedly without blocking (the read returns the zero value, plus an optional bool indicating whether the read succeeded or the channel was closed), while an empty struct sent to the channel can only be read once -- perhaps by the select shown above, but then the additional <-done read at the end of ListAllConfigs / ListAllRules would never finish. A minimal sketch of this pattern follows below.
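To make the suggestion concrete, here is a minimal, self-contained sketch of the select plus close(done) pattern. The names (iter, done, users) and the per-user work are illustrative stand-ins, not the actual Cortex code:

package main

import (
	"fmt"
	"sync"
)

func main() {
	iter := make(chan string)   // replaces the old RespIterator
	done := make(chan struct{}) // closed once the response has been written

	users := []string{"u1", "u2", "u3"}

	// Producers: one goroutine per user (concurrency.ForEachUser would do
	// this in Cortex). Each one gives up early if done has been closed.
	var wg sync.WaitGroup
	for _, user := range users {
		wg.Add(1)
		go func(user string) {
			defer wg.Done()
			data := "config for " + user // stand-in for the per-user config
			select {
			case iter <- data:
			case <-done:
			}
		}(user)
	}

	// Close iter once every producer has finished, so the consumer's range ends.
	go func() {
		wg.Wait()
		close(iter)
	}()

	// Consumer: stream each item out (StreamWriteResponse in the PR). Because
	// done is closed rather than sent to, any producer still blocked in the
	// select above unblocks, even if the consumer stops reading early.
	for data := range iter {
		fmt.Println(data)
	}
	close(done)
}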

@lzh-lab
Contributor Author

lzh-lab commented Apr 10, 2021

  • change StreamWriteResponse so that it does not require http.Flusher. If the response writer implements http.Flusher, the method can call Flush, but it should not require it.

Only if the response writer implements http.Flusher can we fetch the rule groups for a batch of users, send that partial output (w.Write plus Flush), and then repeat.

Without http.Flusher, my test found that the http.ResponseWriter only accepted the first w.Write(partial_content), so we would have to buffer all the content and do a single w.Write(all_content).

Do you have any other suggestions?

@pstibrany
Contributor

Without http.Flusher, my test found that the http.ResponseWriter only accepted the first w.Write(partial_content), so we would have to buffer all the content and do a single w.Write(all_content).

This sounds very suspicious. It's perfectly valid to call response.Write multiple times, and all those writes will be sent to the HTTP client. I would expect that simply removing the Flush calls should not affect the test, if it is written correctly (which I haven't checked).
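To illustrate that point, here is a small sketch of a stream-writing helper that treats http.Flusher as optional. This is an assumption-laden stand-in (the streamItems name and signature are made up for the example), not the PR's actual StreamWriteResponse:

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamItems writes each item to the response writer as it arrives.
// Flushing is best-effort: if w also implements http.Flusher we flush
// after each write; otherwise we simply keep writing -- multiple calls
// to Write are perfectly valid either way.
func streamItems(w http.ResponseWriter, items <-chan []byte) error {
	flusher, canFlush := w.(http.Flusher)
	for item := range items {
		if _, err := w.Write(item); err != nil {
			return err
		}
		if canFlush {
			flusher.Flush()
		}
	}
	return nil
}

func main() {
	items := make(chan []byte, 2)
	items <- []byte("partial one\n")
	items <- []byte("partial two\n")
	close(items)

	rec := httptest.NewRecorder() // happens to implement http.Flusher too
	if err := streamItems(rec, items); err != nil {
		panic(err)
	}
	fmt.Print(rec.Body.String())
}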

@lzh-lab
Contributor Author

lzh-lab commented Apr 11, 2021

Without http.Flusher, my test found that the http.ResponseWriter only accepted the first w.Write(partial_content), so we would have to buffer all the content and do a single w.Write(all_content).

This sounds very suspicious. It's perfectly valid to call response.Write multiple times, and all those writes will be sent to the HTTP client. I would expect that simply removing the Flush calls should not affect the test, if it is written correctly (which I haven't checked).

You are right, I made a mistake.

The new commit implements streaming the YAML to the http.ResponseWriter.

I also updated the Content-Type to application/yaml, which makes the response download as a file in Chrome. This keeps the behaviour consistent with the GetUserConfig API.
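For context, serving the payload with that content type only takes a header write before the body. A toy handler (not the Cortex implementation; the body is a fake placeholder) would look like:

package main

import "net/http"

func main() {
	http.HandleFunc("/multitenant_alertmanager/configs", func(w http.ResponseWriter, r *http.Request) {
		// Serving the YAML with this content type makes browsers such as
		// Chrome offer the response as a file download.
		w.Header().Set("Content-Type", "application/yaml")
		w.Write([]byte("u1:\n  template_files: {}\n"))
	})
	http.ListenAndServe(":8080", nil)
}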

I ran a test with three Alertmanager tenants and accessed /multitenant_alertmanager/configs in Chrome; the downloaded file has the following content:

u3:
  template_files: {}
  alertmanager_config: |
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: '[email protected]'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'

    # The directory from which notification templates are read.
    templates:
    - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster', 'service']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h

      # A default receiver
      receiver: team-X-mails

      # All the above attributes are inherited by all child routes and can
      # overwritten on each.

      # The child route trees.
      routes:
      # This routes performs a regular expression match on alert labels to
      # catch alerts that are related to a list of services.
      - match_re:
          service: ^(foo1|foo2|baz)$
        receiver: team-X-mails
        # The service has a sub-route for critical alerts, any alerts
        # that do not match, i.e. severity != critical, fall-back to the
        # parent node and are sent to 'team-X-mails'
        routes:
        - match:
            severity: critical
          receiver: team-X-pager
      - match:
          service: files
        receiver: team-Y-mails

        routes:
        - match:
            severity: critical
          receiver: team-Y-pager

      # This route handles all alerts coming from a database service. If there's
      # no team to handle it, it defaults to the DB team.
      - match:
          service: database
        receiver: team-DB-pager
        # Also group alerts by affected database.
        group_by: [alertname, cluster, database]
        routes:
        - match:
            owner: team-X
          receiver: team-X-pager
          continue: true
        - match:
            owner: team-Y
          receiver: team-Y-pager


    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      # Apply inhibition if the alertname is the same.
      # CAUTION:
      #   If all label names listed in `equal` are missing
      #   from both the source and target alerts,
      #   the inhibition rule will apply!
      equal: ['alertname', 'cluster', 'service']


    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-X-pager'
      email_configs:
      - to: '[email protected]'
      pagerduty_configs:
      - service_key: <team-X-key>

    - name: 'team-Y-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-Y-pager'
      pagerduty_configs:
      - service_key: <team-Y-key>

    - name: 'team-DB-pager'
      pagerduty_configs:
      - service_key: <team-DB-key>
u1:
  template_files: {}
  alertmanager_config: |
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: '[email protected]'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'

    # The directory from which notification templates are read.
    templates:
    - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster', 'service']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h

      # A default receiver
      receiver: team-X-mails

      # All the above attributes are inherited by all child routes and can
      # overwritten on each.

      # The child route trees.
      routes:
      # This routes performs a regular expression match on alert labels to
      # catch alerts that are related to a list of services.
      - match_re:
          service: ^(foo1|foo2|baz)$
        receiver: team-X-mails
        # The service has a sub-route for critical alerts, any alerts
        # that do not match, i.e. severity != critical, fall-back to the
        # parent node and are sent to 'team-X-mails'
        routes:
        - match:
            severity: critical
          receiver: team-X-pager
      - match:
          service: files
        receiver: team-Y-mails

        routes:
        - match:
            severity: critical
          receiver: team-Y-pager

      # This route handles all alerts coming from a database service. If there's
      # no team to handle it, it defaults to the DB team.
      - match:
          service: database
        receiver: team-DB-pager
        # Also group alerts by affected database.
        group_by: [alertname, cluster, database]
        routes:
        - match:
            owner: team-X
          receiver: team-X-pager
          continue: true
        - match:
            owner: team-Y
          receiver: team-Y-pager


    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      # Apply inhibition if the alertname is the same.
      # CAUTION:
      #   If all label names listed in `equal` are missing
      #   from both the source and target alerts,
      #   the inhibition rule will apply!
      equal: ['alertname', 'cluster', 'service']


    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-X-pager'
      email_configs:
      - to: '[email protected]'
      pagerduty_configs:
      - service_key: <team-X-key>

    - name: 'team-Y-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-Y-pager'
      pagerduty_configs:
      - service_key: <team-Y-key>

    - name: 'team-DB-pager'
      pagerduty_configs:
      - service_key: <team-DB-key>
u2:
  template_files: {}
  alertmanager_config: |
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: '[email protected]'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'

    # The directory from which notification templates are read.
    templates:
    - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster', 'service']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h

      # A default receiver
      receiver: team-X-mails

      # All the above attributes are inherited by all child routes and can
      # overwritten on each.

      # The child route trees.
      routes:
      # This routes performs a regular expression match on alert labels to
      # catch alerts that are related to a list of services.
      - match_re:
          service: ^(foo1|foo2|baz)$
        receiver: team-X-mails
        # The service has a sub-route for critical alerts, any alerts
        # that do not match, i.e. severity != critical, fall-back to the
        # parent node and are sent to 'team-X-mails'
        routes:
        - match:
            severity: critical
          receiver: team-X-pager
      - match:
          service: files
        receiver: team-Y-mails

        routes:
        - match:
            severity: critical
          receiver: team-Y-pager

      # This route handles all alerts coming from a database service. If there's
      # no team to handle it, it defaults to the DB team.
      - match:
          service: database
        receiver: team-DB-pager
        # Also group alerts by affected database.
        group_by: [alertname, cluster, database]
        routes:
        - match:
            owner: team-X
          receiver: team-X-pager
          continue: true
        - match:
            owner: team-Y
          receiver: team-Y-pager


    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      # Apply inhibition if the alertname is the same.
      # CAUTION:
      #   If all label names listed in `equal` are missing
      #   from both the source and target alerts,
      #   the inhibition rule will apply!
      equal: ['alertname', 'cluster', 'service']


    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-X-pager'
      email_configs:
      - to: '[email protected]'
      pagerduty_configs:
      - service_key: <team-X-key>

    - name: 'team-Y-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-Y-pager'
      pagerduty_configs:
      - service_key: <team-Y-key>

    - name: 'team-DB-pager'
      pagerduty_configs:
      - service_key: <team-DB-key>

@lzh-lab
Contributor Author

lzh-lab commented Apr 12, 2021

I have just tested an Alertmanager instance with one hundred tenants, and it works well.

When I increase the number of tenants to three hundred, curl fails with curl: (18) transfer closed with outstanding read data remaining and Cortex logs context canceled.

My test environment is 8 cores and 16 GB of memory.

The problem seems to be the context cancellation. I will do more testing.

@lzh-lab
Contributor Author

lzh-lab commented Apr 12, 2021

When I increase http_server_write_timeout from the default 30s to 5m, everything works well.

count   time
<=300   <=30s
400     34s
545     1m04s
1100    4m20s
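For reference, that timeout sits in the server block of the Cortex configuration; the 5m value below is just the one used in this test, not a recommendation:

server:
  # Default is 30s; streaming the configs/rules for many tenants needs longer.
  http_server_write_timeout: 5m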

@lzh-lab lzh-lab requested a review from pstibrany April 13, 2021 02:33
Contributor

@pstibrany pstibrany left a comment


Thank you for addressing my feedback. I have a few more comments, but this PR is almost ready now.

Signed-off-by: allenzhli <[email protected]>
@pstibrany
Contributor

The latest changes added two extra files:

  • .vscode/launch.json
  • docs/configuration/single-process-config-alertmanager.yaml

Contributor

@pstibrany pstibrany left a comment


I think this is in a good state now (after fixing extra files and typos).

Thank you very much for working on this!

@lzh-lab
Contributor Author

lzh-lab commented Apr 15, 2021

I have already fixed the extra files and typos.

Contributor

@pracucci pracucci left a comment


Thanks for working on this! LGTM modulo a few last comments. Once those are addressed we can merge. Thanks for your patience! 🙏 🙇

Co-authored-by: Marco Pracucci <[email protected]>

Update CHANGELOG.md

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/ruler/ruler_test.go

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/util/http_test.go

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api_test.go

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api_test.go

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api.go

Co-authored-by: Marco Pracucci <[email protected]>

fix review comments

Signed-off-by: allenzhli <[email protected]>
@lzh-lab
Contributor Author

lzh-lab commented Apr 17, 2021

Thanks for working on this! LGTM modulo a few last comments. Once those are addressed we can merge. Thanks for your patience! 🙏 🙇

Thank you very much for your detailed review.

@lzh-lab lzh-lab requested a review from pracucci April 18, 2021 10:18
Contributor

@pracucci pracucci left a comment


Thanks for patiently addressing our feedback! ❤️ 🙏

@pracucci pracucci merged commit 206c9cd into cortexproject:master Apr 21, 2021