
add api to list all alertmanager configs and rule groups #3529

Merged
merged 33 commits into cortexproject:master
Apr 21, 2021

Conversation

lzh-lab
Contributor

@lzh-lab lzh-lab commented Nov 23, 2020

What this PR does:

add an API to list all user Alertmanager configs

Which issue(s) this PR fixes:
Fixes #3530

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@lzh-lab lzh-lab changed the title from add alertmanager all configs to add api to list all alertmanager configs Nov 23, 2020
@lzh-lab lzh-lab changed the title from add api to list all alertmanager configs to add api to list all alertmanager configs and rule groups Nov 23, 2020
@pracucci
Contributor

Thanks @AllenzhLi for your PR. @gotjosh could you do the first pass review, please?

@gotjosh
Contributor

gotjosh commented Dec 9, 2020

Hi @AllenzhLi, apologies for the delay on this and thank you for your contribution - I have it on my agenda to review this soon. Before I do, could you please add at least a unit test to this?

Contributor

@pracucci pracucci left a comment


Thanks for your contribution! A couple of quick things I would add:

  • A CHANGELOG entry
  • Documentation of new endpoints to docs/api/_index.md
  • Unit tests

Signed-off-by: allenzhli <[email protected]>
@lzh-lab lzh-lab reopened this Dec 14, 2020
@lzh-lab
Contributor Author

lzh-lab commented Dec 14, 2020

Thanks for your contribution! A couple of quick things I would add:

  • A CHANGELOG entry
  • Documentation of new endpoints to docs/api/_index.md
  • Unit tests

The above is done.

Contributor

@pstibrany pstibrany left a comment


Thank you for working on it, and apologies for the super-slow review cycle.

I've tried to test this code (built Cortex and started the Alertmanager), but unfortunately it doesn't work. The reason is that StreamWriteResponse expects the response writer to implement http.Flusher, but Cortex internally uses a response writer that doesn't. In that case StreamWriteResponse returns an error and never drains the channel (iter), which causes the goroutines calling Put (sending to the channel) to block forever.

Please do the following changes:

  • remove RespIterator
  • change StreamWriteResponse so that it does not require http.Flusher. If the response writer implements http.Flusher, the method can call Flush, but it should not require it.
  • change the functions passed to concurrency.ForEachUser in both ListAllConfigs and ListAllRules as follows: instead of calling iter.Put(data), use the following idiom (assuming iter is now a channel):
select {
case iter <- data:
case <-done:
}

This allows both ListAllConfigs and ListAllRules to stop early if sending the response has already finished.

  • Instead of using done <- struct{}{} to signal that the response has been sent, please use close(done). The difference is that a closed channel can be read repeatedly without blocking (the read returns the zero value, plus an optional bool indicating whether the read succeeded or the channel was closed), while an empty struct sent to the channel can only be read once -- perhaps by the select shown above, but then the additional <-done read at the end of ListAllConfigs / ListAllRules would never finish. A minimal sketch of this pattern follows below.
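To make the suggestion concrete, here is a minimal, self-contained sketch of the select plus close(done) pattern. The names (iter, done, users) and the per-user work are illustrative stand-ins, not the actual Cortex code:

package main

import (
	"fmt"
	"sync"
)

func main() {
	iter := make(chan string)   // replaces the old RespIterator
	done := make(chan struct{}) // closed once the response has been written

	users := []string{"u1", "u2", "u3"}

	// Producers: one goroutine per user (concurrency.ForEachUser would do
	// this in Cortex). Each one gives up early if done has been closed.
	var wg sync.WaitGroup
	for _, user := range users {
		wg.Add(1)
		go func(user string) {
			defer wg.Done()
			data := "config for " + user // stand-in for the per-user config
			select {
			case iter <- data:
			case <-done:
			}
		}(user)
	}

	// Close iter once every producer has finished, so the consumer's range ends.
	go func() {
		wg.Wait()
		close(iter)
	}()

	// Consumer: stream each item out (StreamWriteResponse in the PR). Because
	// done is closed rather than sent to, any producer still blocked in the
	// select above unblocks, even if the consumer stops reading early.
	for data := range iter {
		fmt.Println(data)
	}
	close(done)
}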

@lzh-lab
Contributor Author

lzh-lab commented Apr 10, 2021

  • change StreamWriteResponse so that it does not require http.Flusher. If the response writer implements http.Flusher, the method can call Flush, but it should not require it.

Only if the response writer implements http.Flusher can we fetch the rule groups for a batch of users, send that partial output (w.Write plus Flush), and then repeat.

Without http.Flusher, my test found that the http.ResponseWriter only accepted the first w.Write(partial_content), so we would have to buffer all the content and do a single w.Write(all_content).

Do you have any other suggestions?

@pstibrany
Contributor

Without http.Flusher, my test found that the http.ResponseWriter only accepted the first w.Write(partial_content), so we would have to buffer all the content and do a single w.Write(all_content).

This sounds very suspicious. It's perfectly valid to call response.Write multiple times, and all those writes will be sent to the HTTP client. I would expect that simply removing the Flush calls should not affect the test, if it is written correctly (which I haven't checked).
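To illustrate that point, here is a small sketch of a stream-writing helper that treats http.Flusher as optional. This is an assumption-laden stand-in (the streamItems name and signature are made up for the example), not the PR's actual StreamWriteResponse:

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamItems writes each item to the response writer as it arrives.
// Flushing is best-effort: if w also implements http.Flusher we flush
// after each write; otherwise we simply keep writing -- multiple calls
// to Write are perfectly valid either way.
func streamItems(w http.ResponseWriter, items <-chan []byte) error {
	flusher, canFlush := w.(http.Flusher)
	for item := range items {
		if _, err := w.Write(item); err != nil {
			return err
		}
		if canFlush {
			flusher.Flush()
		}
	}
	return nil
}

func main() {
	items := make(chan []byte, 2)
	items <- []byte("partial one\n")
	items <- []byte("partial two\n")
	close(items)

	rec := httptest.NewRecorder() // happens to implement http.Flusher too
	if err := streamItems(rec, items); err != nil {
		panic(err)
	}
	fmt.Print(rec.Body.String())
}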

@lzh-lab
Contributor Author

lzh-lab commented Apr 11, 2021

Without http.Flusher, my test found that the http.ResponseWriter only accepted the first w.Write(partial_content), so we would have to buffer all the content and do a single w.Write(all_content).

This sounds very suspicious. It's perfectly valid to call response.Write multiple times, and all those writes will be sent to the HTTP client. I would expect that simply removing the Flush calls should not affect the test, if it is written correctly (which I haven't checked).

You are right, I made a mistake.

The new commit implements streaming the YAML to the http.ResponseWriter.

I also updated the Content-Type to application/yaml, which makes the response download as a file in Chrome. This keeps the behaviour consistent with the GetUserConfig API.
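For context, serving the payload with that content type only takes a header write before the body. A toy handler (not the Cortex implementation; the body is a fake placeholder) would look like:

package main

import "net/http"

func main() {
	http.HandleFunc("/multitenant_alertmanager/configs", func(w http.ResponseWriter, r *http.Request) {
		// Serving the YAML with this content type makes browsers such as
		// Chrome offer the response as a file download.
		w.Header().Set("Content-Type", "application/yaml")
		w.Write([]byte("u1:\n  template_files: {}\n"))
	})
	http.ListenAndServe(":8080", nil)
}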

I ran a test with three Alertmanager tenants and accessed /multitenant_alertmanager/configs in Chrome; the downloaded file has the following content:

u3:
  template_files: {}
  alertmanager_config: |
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: '[email protected]'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'

    # The directory from which notification templates are read.
    templates:
    - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster', 'service']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h

      # A default receiver
      receiver: team-X-mails

      # All the above attributes are inherited by all child routes and can
      # overwritten on each.

      # The child route trees.
      routes:
      # This routes performs a regular expression match on alert labels to
      # catch alerts that are related to a list of services.
      - match_re:
          service: ^(foo1|foo2|baz)$
        receiver: team-X-mails
        # The service has a sub-route for critical alerts, any alerts
        # that do not match, i.e. severity != critical, fall-back to the
        # parent node and are sent to 'team-X-mails'
        routes:
        - match:
            severity: critical
          receiver: team-X-pager
      - match:
          service: files
        receiver: team-Y-mails

        routes:
        - match:
            severity: critical
          receiver: team-Y-pager

      # This route handles all alerts coming from a database service. If there's
      # no team to handle it, it defaults to the DB team.
      - match:
          service: database
        receiver: team-DB-pager
        # Also group alerts by affected database.
        group_by: [alertname, cluster, database]
        routes:
        - match:
            owner: team-X
          receiver: team-X-pager
          continue: true
        - match:
            owner: team-Y
          receiver: team-Y-pager


    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      # Apply inhibition if the alertname is the same.
      # CAUTION:
      #   If all label names listed in `equal` are missing
      #   from both the source and target alerts,
      #   the inhibition rule will apply!
      equal: ['alertname', 'cluster', 'service']


    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-X-pager'
      email_configs:
      - to: '[email protected]'
      pagerduty_configs:
      - service_key: <team-X-key>

    - name: 'team-Y-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-Y-pager'
      pagerduty_configs:
      - service_key: <team-Y-key>

    - name: 'team-DB-pager'
      pagerduty_configs:
      - service_key: <team-DB-key>
u1:
  template_files: {}
  alertmanager_config: |
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: '[email protected]'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'

    # The directory from which notification templates are read.
    templates:
    - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster', 'service']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h

      # A default receiver
      receiver: team-X-mails

      # All the above attributes are inherited by all child routes and can
      # overwritten on each.

      # The child route trees.
      routes:
      # This routes performs a regular expression match on alert labels to
      # catch alerts that are related to a list of services.
      - match_re:
          service: ^(foo1|foo2|baz)$
        receiver: team-X-mails
        # The service has a sub-route for critical alerts, any alerts
        # that do not match, i.e. severity != critical, fall-back to the
        # parent node and are sent to 'team-X-mails'
        routes:
        - match:
            severity: critical
          receiver: team-X-pager
      - match:
          service: files
        receiver: team-Y-mails

        routes:
        - match:
            severity: critical
          receiver: team-Y-pager

      # This route handles all alerts coming from a database service. If there's
      # no team to handle it, it defaults to the DB team.
      - match:
          service: database
        receiver: team-DB-pager
        # Also group alerts by affected database.
        group_by: [alertname, cluster, database]
        routes:
        - match:
            owner: team-X
          receiver: team-X-pager
          continue: true
        - match:
            owner: team-Y
          receiver: team-Y-pager


    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      # Apply inhibition if the alertname is the same.
      # CAUTION:
      #   If all label names listed in `equal` are missing
      #   from both the source and target alerts,
      #   the inhibition rule will apply!
      equal: ['alertname', 'cluster', 'service']


    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-X-pager'
      email_configs:
      - to: '[email protected]'
      pagerduty_configs:
      - service_key: <team-X-key>

    - name: 'team-Y-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-Y-pager'
      pagerduty_configs:
      - service_key: <team-Y-key>

    - name: 'team-DB-pager'
      pagerduty_configs:
      - service_key: <team-DB-key>
u2:
  template_files: {}
  alertmanager_config: |
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: '[email protected]'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'

    # The directory from which notification templates are read.
    templates:
    - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      #
      # To aggregate by all possible labels use '...' as the sole label name.
      # This effectively disables aggregation entirely, passing through all
      # alerts as-is. This is unlikely to be what you want, unless you have
      # a very low alert volume or your upstream notification system performs
      # its own grouping. Example: group_by: [...]
      group_by: ['alertname', 'cluster', 'service']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h

      # A default receiver
      receiver: team-X-mails

      # All the above attributes are inherited by all child routes and can
      # overwritten on each.

      # The child route trees.
      routes:
      # This routes performs a regular expression match on alert labels to
      # catch alerts that are related to a list of services.
      - match_re:
          service: ^(foo1|foo2|baz)$
        receiver: team-X-mails
        # The service has a sub-route for critical alerts, any alerts
        # that do not match, i.e. severity != critical, fall-back to the
        # parent node and are sent to 'team-X-mails'
        routes:
        - match:
            severity: critical
          receiver: team-X-pager
      - match:
          service: files
        receiver: team-Y-mails

        routes:
        - match:
            severity: critical
          receiver: team-Y-pager

      # This route handles all alerts coming from a database service. If there's
      # no team to handle it, it defaults to the DB team.
      - match:
          service: database
        receiver: team-DB-pager
        # Also group alerts by affected database.
        group_by: [alertname, cluster, database]
        routes:
        - match:
            owner: team-X
          receiver: team-X-pager
          continue: true
        - match:
            owner: team-Y
          receiver: team-Y-pager


    # Inhibition rules allow to mute a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      # Apply inhibition if the alertname is the same.
      # CAUTION:
      #   If all label names listed in `equal` are missing
      #   from both the source and target alerts,
      #   the inhibition rule will apply!
      equal: ['alertname', 'cluster', 'service']


    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-X-pager'
      email_configs:
      - to: '[email protected]'
      pagerduty_configs:
      - service_key: <team-X-key>

    - name: 'team-Y-mails'
      email_configs:
      - to: '[email protected]'

    - name: 'team-Y-pager'
      pagerduty_configs:
      - service_key: <team-Y-key>

    - name: 'team-DB-pager'
      pagerduty_configs:
      - service_key: <team-DB-key>

@lzh-lab
Contributor Author

lzh-lab commented Apr 12, 2021

I have just tested an Alertmanager instance with one hundred tenants, and it works well.

When I increase the number of tenants to three hundred, curl fails with curl: (18) transfer closed with outstanding read data remaining and Cortex logs context canceled.

My test environment is 8 cores and 16 GB of memory.

The problem seems to be the context cancellation. I will do more testing.

@lzh-lab
Contributor Author

lzh-lab commented Apr 12, 2021

When I increase http_server_write_timeout from the default 30s to 5m, everything works well.

count   time
<=300   <=30s
400     34s
545     1m04s
1100    4m20s
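For reference, that timeout sits in the server block of the Cortex configuration; the 5m value below is just the one used in this test, not a recommendation:

server:
  # Default is 30s; streaming the configs/rules for many tenants needs longer.
  http_server_write_timeout: 5m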

@lzh-lab lzh-lab requested a review from pstibrany April 13, 2021 02:33
Contributor

@pstibrany pstibrany left a comment


Thank you for addressing my feedback. I have a few more comments, but this PR is almost ready now.

Signed-off-by: allenzhli <[email protected]>
@pstibrany
Contributor

The latest changes added two extra files:

  • .vscode/launch.json
  • docs/configuration/single-process-config-alertmanager.yaml

Contributor

@pstibrany pstibrany left a comment


I think this is in a good state now (after fixing extra files and typos).

Thank you very much for working on this!

@lzh-lab
Contributor Author

lzh-lab commented Apr 15, 2021

I have already fixed the extra files and typos.

Contributor

@pracucci pracucci left a comment


Thanks for working on this! LGTM modulo a few last comments. Once those are addressed we can merge. Thanks for your patience! 🙏 🙇

Co-authored-by: Marco Pracucci <[email protected]>

Update CHANGELOG.md

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/ruler/ruler_test.go

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/util/http_test.go

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api_test.go

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api_test.go

Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api.go

Co-authored-by: Marco Pracucci <[email protected]>

fix review comments

Signed-off-by: allenzhli <[email protected]>
@lzh-lab
Contributor Author

lzh-lab commented Apr 17, 2021

Thanks for working on this! LGTM modulo a few last comments. Once those are addressed we can merge. Thanks for your patience! 🙏 🙇

Thank you very much for your detailed review.

@lzh-lab lzh-lab requested a review from pracucci April 18, 2021 10:18
Contributor

@pracucci pracucci left a comment


Thanks for patiently addressing our feedback! ❤️ 🙏

@pracucci pracucci merged commit 206c9cd into cortexproject:master Apr 21, 2021