add api to list all alertmanager configs and rule groups #3529

Merged
merged 33 commits into from
Apr 21, 2021
Changes from 30 commits
Commits
ea7efcd
add test
Dec 14, 2020
87536d8
Merge branch 'master' of https://github.com/cortexproject/cortex
Dec 22, 2020
697cef9
Merge branch 'master' of https://github.com/cortexproject/cortex
Dec 28, 2020
bc55d2c
Merge branch 'master' of https://github.com/cortexproject/cortex
Dec 30, 2020
2def9ee
Merge branch 'master' of https://github.com/cortexproject/cortex
Jan 7, 2021
f460fae
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Jan 13, 2021
4bb4a5e
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Jan 16, 2021
2dcc36e
update change log
Jan 22, 2021
dc2b382
fix api_test
Jan 22, 2021
38059c9
merge and update
Feb 24, 2021
be23008
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Mar 8, 2021
affdc3e
fix
Mar 8, 2021
13489c4
make clean-white-noise
Mar 8, 2021
f412546
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Mar 8, 2021
17f213a
remove merge conflict marks
Mar 9, 2021
5c1ce17
update reviews
Mar 9, 2021
eed31b0
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Mar 9, 2021
1ee92b8
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Mar 16, 2021
bdbc27d
fix test error
Mar 16, 2021
9ef542a
improve list all user configs
Mar 17, 2021
65476ae
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Mar 17, 2021
a3b0adf
fix
Mar 17, 2021
f212c25
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Mar 18, 2021
5c1796b
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Mar 30, 2021
8b70952
remove RespIterator and update to use chan idiom
Apr 10, 2021
e7a2973
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Apr 10, 2021
dc1aabe
del http.Flusher
Apr 11, 2021
f9853ef
update to StreamWriteYAMLResponse
Apr 11, 2021
a435901
update review comments
Apr 14, 2021
d21ee9d
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Apr 14, 2021
bf47d11
del .vscode and fix typos
Apr 15, 2021
cf87955
Merge branch 'master' of ssh://github.com/cortexproject/cortex
Apr 15, 2021
308e038
Update CHANGELOG.md
Apr 17, 2021
18 changes: 18 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,18 @@
{
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "alertmanager",
"type": "go",
"request": "launch",
"mode": "debug",
"program": "${workspaceFolder}/cmd/cortex",
"args": [
"--config.file=${workspaceFolder}/docs/configuration/single-process-config-alertmanager.yaml"
]
}
]
}
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -27,6 +27,8 @@
* [ENHANCEMENT] Ruler: added the following metrics when ruler sharding is enabled: #3916
* `cortex_ruler_clients`
* `cortex_ruler_client_request_duration_seconds`
* [ENHANCEMENT] Alertmanager: Add API endpoint to list all tenant alertmanager configs: `GET /multitenant_alertmanager/configs`. #3529
* [ENHANCEMENT] Ruler: Add API endpoint to list all tenant ruler rule groups: `GET /ruler/rule_groups`. #3529
* [ENHANCEMENT] Query-frontend/scheduler: added querier forget delay (`-query-frontend.querier-forget-delay` and `-query-scheduler.querier-forget-delay`) to mitigate the blast radius in the event queriers crash because of a repeatedly sent "query of death" when shuffle-sharding is enabled. #3901
* [ENHANCEMENT] Query-frontend: reduced memory allocations when serializing query response. #3964
* [ENHANCEMENT] Querier / ruler: some optimizations to PromQL query engine. #3934 #3989
18 changes: 18 additions & 0 deletions docs/api/_index.md
@@ -41,6 +41,7 @@ For the sake of clarity, in this document we have grouped API endpoints by servi
| [Get tenant ingestion stats](#get-tenant-ingestion-stats) | Querier | `GET /api/v1/user_stats` |
| [Get tenant chunks](#get-tenant-chunks) | Querier | `GET /api/v1/chunks` |
| [Ruler ring status](#ruler-ring-status) | Ruler | `GET /ruler/ring` |
| [Ruler rules](#ruler-rules) | Ruler | `GET /ruler/rule_groups` |
| [List rules](#list-rules) | Ruler | `GET <prometheus-http-prefix>/api/v1/rules` |
| [List alerts](#list-alerts) | Ruler | `GET <prometheus-http-prefix>/api/v1/alerts` |
| [List rule groups](#list-rule-groups) | Ruler | `GET /api/v1/rules` |
@@ -51,6 +52,7 @@ For the sake of clarity, in this document we have grouped API endpoints by servi
| [Delete namespace](#delete-namespace) | Ruler | `DELETE /api/v1/rules/{namespace}` |
| [Delete tenant configuration](#delete-tenant-configuration) | Ruler | `POST /ruler/delete_tenant_config` |
| [Alertmanager status](#alertmanager-status) | Alertmanager | `GET /multitenant_alertmanager/status` |
| [Alertmanager configs](#alertmanager-configs) | Alertmanager | `GET /multitenant_alertmanager/configs` |
| [Alertmanager ring status](#alertmanager-ring-status) | Alertmanager | `GET /multitenant_alertmanager/ring` |
| [Alertmanager UI](#alertmanager-ui) | Alertmanager | `GET /<alertmanager-http-prefix>` |
| [Alertmanager Delete Tenant Configuration](#alertmanager-delete-tenant-configuration) | Alertmanager | `POST /multitenant_alertmanager/delete_tenant_config` |
@@ -431,6 +433,14 @@ GET /ruler_ring

Displays a web page with the ruler hash ring status, including the state, health, and last heartbeat time of each ruler.

### Ruler rules

```
GET /ruler/rule_groups
```

Lists the rule groups of all tenants. This endpoint is not part of the ruler API and is always available, regardless of whether the ruler API is enabled. It should not be exposed to end users. On success, it returns a `200` status code and a YAML dictionary with all the rule groups for each tenant.
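
As a rough illustration of how an operator-side tool might call this endpoint, the following Go sketch fetches the YAML body and checks the status code. The `fetchRuleGroups` and `newStubRuler` helpers, the stub payload, and the base URL are all assumptions made for this example; the stub server only stands in for a real ruler so that the snippet runs on its own.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchRuleGroups issues GET /ruler/rule_groups against baseURL and returns
// the raw YAML body. baseURL is an assumption; point it at your ruler.
func fetchRuleGroups(baseURL string) (string, error) {
	resp, err := http.Get(baseURL + "/ruler/rule_groups")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d: %s", resp.StatusCode, body)
	}
	return string(body), nil
}

// newStubRuler starts a local test server that stands in for a ruler.
// The payload is illustrative only and is not real Cortex output.
func newStubRuler() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet || r.URL.Path != "/ruler/rule_groups" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/yaml")
		fmt.Fprint(w, "tenant-a: {}\n")
	}))
}

func main() {
	srv := newStubRuler()
	defer srv.Close()

	body, err := fetchRuleGroups(srv.URL)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(body)
}
```

Because the endpoint should not be exposed to end users, a client like this would normally run only inside the trusted network.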

### List rules

```
@@ -663,6 +673,14 @@ GET /status

Displays a web page with the current status of the Alertmanager, including the Alertmanager cluster members.

### Alertmanager configs

```
GET /multitenant_alertmanager/configs
```

Lists the Alertmanager configurations of all tenants. This endpoint is not part of the Alertmanager API and is always available, regardless of whether the Alertmanager API is enabled. It should not be exposed to end users. On success, it returns a `200` status code and a YAML dictionary with all the Alertmanager configurations.

### Alertmanager ring status

```
28 changes: 28 additions & 0 deletions docs/configuration/single-process-config-alertmanager.yaml
@@ -0,0 +1,28 @@

# Configuration for running Cortex in single-process mode.
# This configuration should not be used in production.
# It is only for getting started and development.

# Require every request to Cortex to carry an
# X-Scope-OrgID header identifying the tenant.
auth_enabled: true

target: alertmanager

server:
  http_listen_port: 9009
  log_level: debug

  # Configure the server to allow messages up to 100MB.
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  grpc_server_max_concurrent_streams: 1000


alertmanager:
  external_url: http://127.0.0.1/
  storage:
    type: local
    local:
      path: /data/cortex/data/amconfig/

54 changes: 54 additions & 0 deletions pkg/alertmanager/api.go
@@ -1,14 +1,20 @@
package alertmanager

import (
"context"
"fmt"
"io/ioutil"
"net/http"
"os"
"path/filepath"

"github.com/pkg/errors"

"github.com/cortexproject/cortex/pkg/alertmanager/alertspb"
"github.com/cortexproject/cortex/pkg/chunk"
"github.com/cortexproject/cortex/pkg/tenant"
"github.com/cortexproject/cortex/pkg/util"
"github.com/cortexproject/cortex/pkg/util/concurrency"
util_log "github.com/cortexproject/cortex/pkg/util/log"

"github.com/go-kit/kit/log"
@@ -25,6 +31,9 @@ const (
errStoringConfiguration = "unable to store the Alertmanager config"
errDeletingConfiguration = "unable to delete the Alertmanager config"
errNoOrgID = "unable to determine the OrgID"
errListAllUser = "unable to list the Alertmanager users"

fetchConcurrency = 16
)

// UserConfig is used to communicate a users alertmanager configs
@@ -184,3 +193,48 @@ func validateUserConfig(logger log.Logger, cfg alertspb.AlertConfigDesc) error {

return nil
}

// ListAllConfigs writes the Alertmanager configuration of every tenant to the
// response as a YAML dictionary keyed by tenant ID.
func (am *MultitenantAlertmanager) ListAllConfigs(w http.ResponseWriter, r *http.Request) {
	logger := util_log.WithContext(r.Context(), am.logger)
	userIDs, err := am.store.ListAllUsers(r.Context())
	if err != nil {
		level.Error(logger).Log("msg", "failed to list users of alertmanager", "err", err)
		http.Error(w, fmt.Sprintf("%s: %s", errListAllUser, err.Error()), http.StatusInternalServerError)
		return
	}

	done := make(chan struct{})
	iter := make(chan interface{})

	// Write the response from a separate goroutine; done is closed once the
	// writer has returned.
	go func() {
		util.StreamWriteYAMLResponse(w, iter, logger)
		close(done)
	}()

	err = concurrency.ForEachUser(r.Context(), userIDs, fetchConcurrency, func(ctx context.Context, userID string) error {
		cfg, err := am.store.GetAlertConfig(ctx, userID)
		if errors.Is(err, chunk.ErrStorageObjectNotFound) {
			// Skip tenants whose config is not found in the store.
			return nil
		} else if err != nil {
			return errors.Wrapf(err, "failed to fetch alertmanager config for user %s", userID)
		}
		data := map[string]*UserConfig{
			userID: {
				TemplateFiles:      alertspb.ParseTemplates(cfg),
				AlertmanagerConfig: cfg.RawConfig,
			},
		}

		select {
		case iter <- data:
		case <-done: // stop early, if sending response has already finished
		}

		return nil
	})
	if err != nil {
		level.Error(logger).Log("msg", "failed to list all alertmanager configs", "err", err)
	}
	// Signal the writer that no more items will arrive, then wait for it.
	close(iter)
	<-done
}
88 changes: 88 additions & 0 deletions pkg/alertmanager/api_test.go
@@ -10,11 +10,18 @@ import (
"testing"

"github.com/go-kit/kit/log"
"github.com/gorilla/mux"
"github.com/prometheus/client_golang/prometheus"
"gopkg.in/yaml.v2"

"github.com/thanos-io/thanos/pkg/objstore"

"github.com/cortexproject/cortex/pkg/util/flagext"

"github.com/cortexproject/cortex/pkg/alertmanager/alertspb"
"github.com/cortexproject/cortex/pkg/alertmanager/alertstore/bucketclient"
util_log "github.com/cortexproject/cortex/pkg/util/log"
"github.com/cortexproject/cortex/pkg/util/services"

"github.com/stretchr/testify/require"
"github.com/weaveworks/common/user"
@@ -199,3 +206,84 @@ func TestMultitenantAlertmanager_DeleteUserConfig(t *testing.T) {
require.Equal(t, 0, len(storage.Objects()))
}
}

func TestAMConfigListUserConfig(t *testing.T) {
testCases := map[string]*UserConfig{
"user1": {
AlertmanagerConfig: `
global:
  resolve_timeout: 5m
route:
  receiver: route1
  group_by:
  - '...'
  continue: false
receivers:
- name: route1
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://alertmanager/api/notifications?orgId=1&rrid=7
    max_alerts: 0
`,
},
"user2": {
AlertmanagerConfig: `
global:
  resolve_timeout: 5m
route:
  receiver: route1
  group_by:
  - '...'
  continue: false
receivers:
- name: route1
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://alertmanager/api/notifications?orgId=2&rrid=7
    max_alerts: 0
`,
},
}

storage := objstore.NewInMemBucket()
alertStore := bucketclient.NewBucketAlertStore(storage, nil, log.NewNopLogger())

for u, cfg := range testCases {
err := alertStore.SetAlertConfig(context.Background(), alertspb.AlertConfigDesc{
User: u,
RawConfig: cfg.AlertmanagerConfig,
})
require.NoError(t, err)
}

externalURL := flagext.URLValue{}
err := externalURL.Set("http://localhost:8080/alertmanager")
require.NoError(t, err)

// Create the Multitenant Alertmanager.
reg := prometheus.NewPedanticRegistry()
cfg := mockAlertmanagerConfig(t)
am, err := createMultitenantAlertmanager(cfg, nil, nil, alertStore, nil, log.NewNopLogger(), reg)
require.NoError(t, err)
require.NoError(t, services.StartAndAwaitRunning(context.Background(), am))
defer services.StopAndAwaitTerminated(context.Background(), am) //nolint:errcheck

err = am.loadAndSyncConfigs(context.Background(), reasonPeriodic)
require.NoError(t, err)
require.Len(t, am.alertmanagers, 2)

router := mux.NewRouter()
router.Path("/multitenant_alertmanager/configs").Methods(http.MethodGet).HandlerFunc(am.ListAllConfigs)
req := httptest.NewRequest("GET", "https://localhost:8080/multitenant_alertmanager/configs", nil)
w := httptest.NewRecorder()
router.ServeHTTP(w, req)

resp := w.Result()
require.Equal(t, http.StatusOK, resp.StatusCode)
require.Equal(t, "application/yaml", resp.Header.Get("Content-Type"))
body, err := ioutil.ReadAll(resp.Body)
require.NoError(t, err)
old, err := yaml.Marshal(testCases)
require.NoError(t, err)
require.YAMLEq(t, string(old), string(body))
}
4 changes: 4 additions & 0 deletions pkg/api/api.go
@@ -168,6 +168,7 @@ func (a *API) RegisterAlertmanager(am *alertmanager.MultitenantAlertmanager, tar
a.indexPage.AddLink(SectionAdminEndpoints, "/multitenant_alertmanager/ring", "Alertmanager Ring Status")
// Ensure this route is registered before the prefixed AM route
a.RegisterRoute("/multitenant_alertmanager/status", am.GetStatusHandler(), false, "GET")
a.RegisterRoute("/multitenant_alertmanager/configs", http.HandlerFunc(am.ListAllConfigs), false, "GET")
a.RegisterRoute("/multitenant_alertmanager/ring", http.HandlerFunc(am.RingHandler), false, "GET", "POST")
a.RegisterRoute("/multitenant_alertmanager/delete_tenant_config", http.HandlerFunc(am.DeleteUserConfig), true, "POST")

@@ -285,6 +286,9 @@ func (a *API) RegisterRuler(r *ruler.Ruler) {
// Legacy Ring Route
a.RegisterRoute("/ruler_ring", r, false, "GET", "POST")

// List all user rule groups
a.RegisterRoute("/ruler/rule_groups", http.HandlerFunc(r.ListAllRules), false, "GET")

ruler.RegisterRulerServer(a.server.GRPC, r)
}

48 changes: 47 additions & 1 deletion pkg/ruler/ruler.go
@@ -24,6 +24,7 @@ import (
"github.com/weaveworks/common/user"
"golang.org/x/sync/errgroup"

"github.com/cortexproject/cortex/pkg/chunk"
"github.com/cortexproject/cortex/pkg/cortexpb"
"github.com/cortexproject/cortex/pkg/ring"
ring_client "github.com/cortexproject/cortex/pkg/ring/client"
@@ -50,7 +51,8 @@ var (

const (
// Number of concurrent group list and group loads operations.
loadRulesConcurrency = 10
loadRulesConcurrency = 10
fetchRulesConcurrency = 16

rulerSyncReasonInitial = "initial"
rulerSyncReasonPeriodic = "periodic"
@@ -59,6 +61,9 @@ const (
// Limit errors
errMaxRuleGroupsPerUserLimitExceeded = "per-user rule groups limit (limit: %d actual: %d) exceeded"
errMaxRulesPerRuleGroupPerUserLimitExceeded = "per-user rules per rule group limit (limit: %d actual: %d) exceeded"

// errors
errListAllUser = "unable to list the ruler users"
)

// Config is the configuration for the recording rules server.
@@ -840,3 +845,44 @@ func (r *Ruler) DeleteTenantConfiguration(w http.ResponseWriter, req *http.Reque
level.Info(logger).Log("msg", "deleted all tenant rule groups", "user", userID)
w.WriteHeader(http.StatusOK)
}

// ListAllRules writes the rule groups of every tenant to the response as a
// YAML dictionary keyed by tenant ID.
func (r *Ruler) ListAllRules(w http.ResponseWriter, req *http.Request) {
	logger := util_log.WithContext(req.Context(), r.logger)

	userIDs, err := r.store.ListAllUsers(req.Context())
	if err != nil {
		level.Error(logger).Log("msg", errListAllUser, "err", err)
		http.Error(w, fmt.Sprintf("%s: %s", errListAllUser, err.Error()), http.StatusInternalServerError)
		return
	}

	done := make(chan struct{})
	iter := make(chan interface{})

	// Write the response from a separate goroutine; done is closed once the
	// writer has returned.
	go func() {
		util.StreamWriteYAMLResponse(w, iter, logger)
		close(done)
	}()

	err = concurrency.ForEachUser(req.Context(), userIDs, fetchRulesConcurrency, func(ctx context.Context, userID string) error {
		// An empty namespace lists the rule groups of all namespaces.
		rg, err := r.store.ListRuleGroupsForUserAndNamespace(ctx, userID, "")
		if errors.Is(err, chunk.ErrStorageObjectNotFound) {
			// Skip tenants whose rule groups are not found in the store.
			return nil
		} else if err != nil {
			return errors.Wrapf(err, "failed to fetch ruler config for user %s", userID)
		}
		data := map[string]map[string][]rulefmt.RuleGroup{userID: rg.Formatted()}

		select {
		case iter <- data:
		case <-done: // stop early, if sending response has already finished
		}

		return nil
	})
	if err != nil {
		level.Error(logger).Log("msg", "failed to list all ruler configs", "err", err)
	}
	// Signal the writer that no more items will arrive, then wait for it.
	close(iter)
	<-done
}