add api to list all alertmanager configs and rule groups #3529

Merged
merged 33 commits into from
Apr 21, 2021
Commits
ea7efcd  add test  (Dec 14, 2020)
87536d8  Merge branch 'master' of https://github.com/cortexproject/cortex  (Dec 22, 2020)
697cef9  Merge branch 'master' of https://github.com/cortexproject/cortex  (Dec 28, 2020)
bc55d2c  Merge branch 'master' of https://github.com/cortexproject/cortex  (Dec 30, 2020)
2def9ee  Merge branch 'master' of https://github.com/cortexproject/cortex  (Jan 7, 2021)
f460fae  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Jan 13, 2021)
4bb4a5e  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Jan 16, 2021)
2dcc36e  update change log  (Jan 22, 2021)
dc2b382  fix api_test  (Jan 22, 2021)
38059c9  merge and update  (Feb 24, 2021)
be23008  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Mar 8, 2021)
affdc3e  fix  (Mar 8, 2021)
13489c4  make clean-white-noise  (Mar 8, 2021)
f412546  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Mar 8, 2021)
17f213a  remove merge conflict marks  (Mar 9, 2021)
5c1ce17  update reviews  (Mar 9, 2021)
eed31b0  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Mar 9, 2021)
1ee92b8  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Mar 16, 2021)
bdbc27d  fix test error  (Mar 16, 2021)
9ef542a  improve list all user configs  (Mar 17, 2021)
65476ae  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Mar 17, 2021)
a3b0adf  fix  (Mar 17, 2021)
f212c25  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Mar 18, 2021)
5c1796b  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Mar 30, 2021)
8b70952  remove RespIterator and update to use chan idiom  (Apr 10, 2021)
e7a2973  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Apr 10, 2021)
dc1aabe  del http.Flusher  (Apr 11, 2021)
f9853ef  update to StreamWriteYAMLResponse  (Apr 11, 2021)
a435901  update review comments  (Apr 14, 2021)
d21ee9d  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Apr 14, 2021)
bf47d11  del .vscode and fix typos  (Apr 15, 2021)
cf87955  Merge branch 'master' of ssh://github.com/cortexproject/cortex  (Apr 15, 2021)
308e038  Update CHANGELOG.md  (Apr 17, 2021)
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -27,6 +27,8 @@
* [ENHANCEMENT] Ruler: added the following metrics when ruler sharding is enabled: #3916
  * `cortex_ruler_clients`
  * `cortex_ruler_client_request_duration_seconds`
* [ENHANCEMENT] Alertmanager: Add API endpoint to list all tenant alertmanager configs: `GET /multitenant_alertmanager/configs`. #3529
* [ENHANCEMENT] Ruler: Add API endpoint to list all tenant ruler rule groups: `GET /ruler/rule_groups`. #3529
* [ENHANCEMENT] Query-frontend/scheduler: added querier forget delay (`-query-frontend.querier-forget-delay` and `-query-scheduler.querier-forget-delay`) to mitigate the blast radius in the event queriers crash because of a repeatedly sent "query of death" when shuffle-sharding is enabled. #3901
* [ENHANCEMENT] Query-frontend: reduced memory allocations when serializing query response. #3964
* [ENHANCEMENT] Querier / ruler: some optimizations to PromQL query engine. #3934 #3989
18 changes: 18 additions & 0 deletions docs/api/_index.md
@@ -41,6 +41,7 @@ For the sake of clarity, in this document we have grouped API endpoints by servi
| [Get tenant ingestion stats](#get-tenant-ingestion-stats) | Querier | `GET /api/v1/user_stats` |
| [Get tenant chunks](#get-tenant-chunks) | Querier | `GET /api/v1/chunks` |
| [Ruler ring status](#ruler-ring-status) | Ruler | `GET /ruler/ring` |
| [Ruler rules](#ruler-rules) | Ruler | `GET /ruler/rule_groups` |
| [List rules](#list-rules) | Ruler | `GET <prometheus-http-prefix>/api/v1/rules` |
| [List alerts](#list-alerts) | Ruler | `GET <prometheus-http-prefix>/api/v1/alerts` |
| [List rule groups](#list-rule-groups) | Ruler | `GET /api/v1/rules` |
@@ -51,6 +52,7 @@ For the sake of clarity, in this document we have grouped API endpoints by servi
| [Delete namespace](#delete-namespace) | Ruler | `DELETE /api/v1/rules/{namespace}` |
| [Delete tenant configuration](#delete-tenant-configuration) | Ruler | `POST /ruler/delete_tenant_config` |
| [Alertmanager status](#alertmanager-status) | Alertmanager | `GET /multitenant_alertmanager/status` |
| [Alertmanager configs](#alertmanager-configs) | Alertmanager | `GET /multitenant_alertmanager/configs` |
| [Alertmanager ring status](#alertmanager-ring-status) | Alertmanager | `GET /multitenant_alertmanager/ring` |
| [Alertmanager UI](#alertmanager-ui) | Alertmanager | `GET /<alertmanager-http-prefix>` |
| [Alertmanager Delete Tenant Configuration](#alertmanager-delete-tenant-configuration) | Alertmanager | `POST /multitenant_alertmanager/delete_tenant_config` |
@@ -431,6 +433,14 @@ GET /ruler_ring

Displays a web page with the ruler hash ring status, including the state, health, and last heartbeat time of each ruler.

### Ruler rules

```
GET /ruler/rule_groups
```

Lists the rule groups of all tenants. This endpoint is not part of the ruler API and is always available, regardless of whether the ruler API is enabled. It should not be exposed to end users. On success, it returns a YAML dictionary with all the rule groups for each tenant and a `200` status code.
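
As a rough illustration of how an operator-side tool might call this endpoint, the sketch below issues the `GET` and returns the raw YAML body. The helper `fetchYAML` and the stand-in server `newFakeRuler` are hypothetical names introduced only for this example, and the `tenant-a: {}` body is a placeholder rather than the real response shape.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchYAML issues a GET against baseURL+path and returns the raw YAML body.
// It is a hypothetical helper for this example, not part of Cortex.
func fetchYAML(client *http.Client, baseURL, path string) (string, error) {
	resp, err := client.Get(baseURL + path)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d: %s", resp.StatusCode, body)
	}
	return string(body), nil
}

// newFakeRuler starts a local stand-in for a ruler (an assumption made so the
// example runs without a Cortex cluster). The body is a placeholder.
func newFakeRuler() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/ruler/rule_groups", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/yaml")
		fmt.Fprint(w, "tenant-a: {}\n")
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeRuler()
	defer srv.Close()

	out, err := fetchYAML(srv.Client(), srv.URL, "/ruler/rule_groups")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Print(out)
}
```

Against a real deployment, `baseURL` would point at the ruler's internal HTTP address, which should stay reachable only by operators.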

### List rules

```
@@ -663,6 +673,14 @@ GET /status

Displays a web page with the current status of the Alertmanager, including the Alertmanager cluster members.

### Alertmanager configs

```
GET /multitenant_alertmanager/configs
```

Lists the Alertmanager configurations of all tenants. This endpoint is not part of the Alertmanager API and is always available, regardless of whether the Alertmanager API is enabled. It should not be exposed to end users. On success, it returns a YAML dictionary with all the Alertmanager configurations and a `200` status code.

### Alertmanager ring status

```
53 changes: 53 additions & 0 deletions pkg/alertmanager/api.go
@@ -1,14 +1,19 @@
package alertmanager

import (
	"context"
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"path/filepath"

	"github.com/pkg/errors"

	"github.com/cortexproject/cortex/pkg/alertmanager/alertspb"
	"github.com/cortexproject/cortex/pkg/tenant"
	"github.com/cortexproject/cortex/pkg/util"
	"github.com/cortexproject/cortex/pkg/util/concurrency"
	util_log "github.com/cortexproject/cortex/pkg/util/log"

	"github.com/go-kit/kit/log"
@@ -25,6 +30,9 @@ const (
	errStoringConfiguration  = "unable to store the Alertmanager config"
	errDeletingConfiguration = "unable to delete the Alertmanager config"
	errNoOrgID               = "unable to determine the OrgID"
	errListAllUser           = "unable to list the Alertmanager users"

	fetchConcurrency = 16
)

// UserConfig is used to communicate a users alertmanager configs
@@ -184,3 +192,48 @@ func validateUserConfig(logger log.Logger, cfg alertspb.AlertConfigDesc) error {

	return nil
}

// ListAllConfigs streams the Alertmanager configuration of every tenant
// as a single YAML response.
func (am *MultitenantAlertmanager) ListAllConfigs(w http.ResponseWriter, r *http.Request) {
	logger := util_log.WithContext(r.Context(), am.logger)
	userIDs, err := am.store.ListAllUsers(r.Context())
	if err != nil {
		level.Error(logger).Log("msg", "failed to list users of alertmanager", "err", err)
		http.Error(w, fmt.Sprintf("%s: %s", errListAllUser, err.Error()), http.StatusInternalServerError)
		return
	}

	done := make(chan struct{})
	iter := make(chan interface{})

	// The writer goroutine drains iter and closes done once it has finished.
	go func() {
		util.StreamWriteYAMLResponse(w, iter, logger)
		close(done)
	}()

	err = concurrency.ForEachUser(r.Context(), userIDs, fetchConcurrency, func(ctx context.Context, userID string) error {
		cfg, err := am.store.GetAlertConfig(ctx, userID)
		if errors.Is(err, alertspb.ErrNotFound) {
			// The tenant has no config (for example, it was deleted after listing); skip it.
			return nil
		} else if err != nil {
			return errors.Wrapf(err, "failed to fetch alertmanager config for user %s", userID)
		}
		data := map[string]*UserConfig{
			userID: {
				TemplateFiles:      alertspb.ParseTemplates(cfg),
				AlertmanagerConfig: cfg.RawConfig,
			},
		}

		select {
		case iter <- data:
		case <-done: // stop early, if sending response has already finished
		}

		return nil
	})
	if err != nil {
		// Part of the response may already have been streamed, so the error is only logged.
		level.Error(logger).Log("msg", "failed to list all alertmanager configs", "err", err)
	}
	close(iter)
	<-done
}
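
The handler above pairs a single writer goroutine with concurrent producers, using `done` so that producers never block once the writer has stopped. The following is a minimal, standard-library-only sketch of that idiom. `streamLines` is a hypothetical stand-in for `util.StreamWriteYAMLResponse` (whose implementation is not part of this diff), the worker pool is a simplified substitute for `concurrency.ForEachUser`, and `listAll` and `collect` are names invented for the example.

```go
package main

import (
	"fmt"
	"io"
	"sort"
	"strings"
	"sync"
)

// streamLines is a hypothetical stand-in for the response writer: it writes
// each item received on iter to w and stops at the first write error.
func streamLines(w io.Writer, iter <-chan string) {
	for item := range iter {
		if _, err := fmt.Fprintln(w, item); err != nil {
			return
		}
	}
}

// listAll fetches every ID with up to `concurrency` workers (must be >= 1)
// and streams the results to w through a single writer goroutine.
func listAll(w io.Writer, ids []string, concurrency int, fetch func(string) string) {
	done := make(chan struct{})
	iter := make(chan string)

	go func() {
		streamLines(w, iter)
		close(done) // signals producers that nothing reads iter any more
	}()

	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				select {
				case iter <- fetch(id):
				case <-done: // writer already finished; drop the result
				}
			}
		}()
	}
	for _, id := range ids {
		jobs <- id
	}
	close(jobs)
	wg.Wait()

	close(iter) // no more producers: let the writer drain and exit
	<-done      // wait until the writer has finished before returning
}

// collect runs listAll against an in-memory buffer and returns sorted lines.
func collect(ids []string, concurrency int, fetch func(string) string) []string {
	var sb strings.Builder
	listAll(&sb, ids, concurrency, fetch)
	lines := strings.Fields(sb.String())
	sort.Strings(lines)
	return lines
}

func main() {
	got := collect([]string{"user2", "user1", "user3"}, 2, func(id string) string {
		return id + "=ok"
	})
	fmt.Println(got)
}
```

Because the `select` also listens on `done`, a writer that exits early (for instance after a failed write) cannot leave a producer blocked on `iter`; results arrive in whatever order the workers finish, which is why the sketch sorts them before printing.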
90 changes: 90 additions & 0 deletions pkg/alertmanager/api_test.go
@@ -10,11 +10,18 @@ import (
	"testing"

	"github.com/go-kit/kit/log"
	"github.com/gorilla/mux"
	"github.com/prometheus/client_golang/prometheus"
	"gopkg.in/yaml.v2"

	"github.com/thanos-io/thanos/pkg/objstore"

	"github.com/cortexproject/cortex/pkg/util/flagext"

	"github.com/cortexproject/cortex/pkg/alertmanager/alertspb"
	"github.com/cortexproject/cortex/pkg/alertmanager/alertstore/bucketclient"
	util_log "github.com/cortexproject/cortex/pkg/util/log"
	"github.com/cortexproject/cortex/pkg/util/services"

	"github.com/stretchr/testify/require"
	"github.com/weaveworks/common/user"
@@ -199,3 +206,86 @@ func TestMultitenantAlertmanager_DeleteUserConfig(t *testing.T) {
		require.Equal(t, 0, len(storage.Objects()))
	}
}

func TestAMConfigListUserConfig(t *testing.T) {
	testCases := map[string]*UserConfig{
		"user1": {
			AlertmanagerConfig: `
global:
  resolve_timeout: 5m
route:
  receiver: route1
  group_by:
  - '...'
  continue: false
receivers:
- name: route1
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://alertmanager/api/notifications?orgId=1&rrid=7
    max_alerts: 0
`,
		},
		"user2": {
			AlertmanagerConfig: `
global:
  resolve_timeout: 5m
route:
  receiver: route1
  group_by:
  - '...'
  continue: false
receivers:
- name: route1
  webhook_configs:
  - send_resolved: true
    http_config: {}
    url: http://alertmanager/api/notifications?orgId=2&rrid=7
    max_alerts: 0
`,
		},
	}

	storage := objstore.NewInMemBucket()
	alertStore := bucketclient.NewBucketAlertStore(storage, nil, log.NewNopLogger())

	for u, cfg := range testCases {
		err := alertStore.SetAlertConfig(context.Background(), alertspb.AlertConfigDesc{
			User:      u,
			RawConfig: cfg.AlertmanagerConfig,
		})
		require.NoError(t, err)
	}

	externalURL := flagext.URLValue{}
	err := externalURL.Set("http://localhost:8080/alertmanager")
	require.NoError(t, err)

	// Create the Multitenant Alertmanager.
	reg := prometheus.NewPedanticRegistry()
	cfg := mockAlertmanagerConfig(t)
	am, err := createMultitenantAlertmanager(cfg, nil, nil, alertStore, nil, log.NewNopLogger(), reg)
	require.NoError(t, err)
	require.NoError(t, services.StartAndAwaitRunning(context.Background(), am))
	defer services.StopAndAwaitTerminated(context.Background(), am) //nolint:errcheck

	err = am.loadAndSyncConfigs(context.Background(), reasonPeriodic)
	require.NoError(t, err)
	require.Len(t, am.alertmanagers, 2)

	router := mux.NewRouter()
	router.Path("/multitenant_alertmanager/configs").Methods(http.MethodGet).HandlerFunc(am.ListAllConfigs)
	req := httptest.NewRequest("GET", "https://localhost:8080/multitenant_alertmanager/configs", nil)
	w := httptest.NewRecorder()
	router.ServeHTTP(w, req)

	resp := w.Result()
	require.Equal(t, http.StatusOK, resp.StatusCode)
	require.Equal(t, "application/yaml", resp.Header.Get("Content-Type"))
	body, err := ioutil.ReadAll(resp.Body)
	require.NoError(t, err)
	old, err := yaml.Marshal(testCases)
	require.NoError(t, err)
	require.YAMLEq(t, string(old), string(body))
}
4 changes: 4 additions & 0 deletions pkg/api/api.go
@@ -168,6 +168,7 @@ func (a *API) RegisterAlertmanager(am *alertmanager.MultitenantAlertmanager, tar
	a.indexPage.AddLink(SectionAdminEndpoints, "/multitenant_alertmanager/ring", "Alertmanager Ring Status")
	// Ensure this route is registered before the prefixed AM route
	a.RegisterRoute("/multitenant_alertmanager/status", am.GetStatusHandler(), false, "GET")
	a.RegisterRoute("/multitenant_alertmanager/configs", http.HandlerFunc(am.ListAllConfigs), false, "GET")
	a.RegisterRoute("/multitenant_alertmanager/ring", http.HandlerFunc(am.RingHandler), false, "GET", "POST")
	a.RegisterRoute("/multitenant_alertmanager/delete_tenant_config", http.HandlerFunc(am.DeleteUserConfig), true, "POST")

@@ -285,6 +286,9 @@ func (a *API) RegisterRuler(r *ruler.Ruler) {
	// Legacy Ring Route
	a.RegisterRoute("/ruler_ring", r, false, "GET", "POST")

	// List all user rule groups
	a.RegisterRoute("/ruler/rule_groups", http.HandlerFunc(r.ListAllRules), false, "GET")

	ruler.RegisterRulerServer(a.server.GRPC, r)
}

45 changes: 44 additions & 1 deletion pkg/ruler/ruler.go
@@ -50,7 +50,8 @@ var (

const (
	// Number of concurrent group list and group loads operations.
	loadRulesConcurrency  = 10
	fetchRulesConcurrency = 16

	rulerSyncReasonInitial  = "initial"
	rulerSyncReasonPeriodic = "periodic"
@@ -59,6 +60,9 @@ const (
	// Limit errors
	errMaxRuleGroupsPerUserLimitExceeded        = "per-user rule groups limit (limit: %d actual: %d) exceeded"
	errMaxRulesPerRuleGroupPerUserLimitExceeded = "per-user rules per rule group limit (limit: %d actual: %d) exceeded"

	// errors
	errListAllUser = "unable to list the ruler users"
)

// Config is the configuration for the recording rules server.
@@ -840,3 +844,42 @@ func (r *Ruler) DeleteTenantConfiguration(w http.ResponseWriter, req *http.Reque
	level.Info(logger).Log("msg", "deleted all tenant rule groups", "user", userID)
	w.WriteHeader(http.StatusOK)
}

// ListAllRules streams the rule groups of every tenant as a single YAML response.
func (r *Ruler) ListAllRules(w http.ResponseWriter, req *http.Request) {
	logger := util_log.WithContext(req.Context(), r.logger)

	userIDs, err := r.store.ListAllUsers(req.Context())
	if err != nil {
		level.Error(logger).Log("msg", errListAllUser, "err", err)
		http.Error(w, fmt.Sprintf("%s: %s", errListAllUser, err.Error()), http.StatusInternalServerError)
		return
	}

	done := make(chan struct{})
	iter := make(chan interface{})

	// The writer goroutine drains iter and closes done once it has finished.
	go func() {
		util.StreamWriteYAMLResponse(w, iter, logger)
		close(done)
	}()

	err = concurrency.ForEachUser(req.Context(), userIDs, fetchRulesConcurrency, func(ctx context.Context, userID string) error {
		rg, err := r.store.ListRuleGroupsForUserAndNamespace(ctx, userID, "")
		if err != nil {
			return errors.Wrapf(err, "failed to fetch ruler config for user %s", userID)
		}
		data := map[string]map[string][]rulefmt.RuleGroup{userID: rg.Formatted()}

		select {
		case iter <- data:
		case <-done: // stop early, if sending response has already finished
		}

		return nil
	})
	if err != nil {
		// Part of the response may already have been streamed, so the error is only logged.
		level.Error(logger).Log("msg", "failed to list all ruler configs", "err", err)
	}
	close(iter)
	<-done
}
33 changes: 33 additions & 0 deletions pkg/ruler/ruler_test.go
@@ -17,6 +17,7 @@ import (

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
	"github.com/gorilla/mux"
	"github.com/prometheus/client_golang/prometheus"
	prom_testutil "github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/prometheus/prometheus/notifier"
@@ -28,6 +29,7 @@ import (
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"github.com/weaveworks/common/user"
	"gopkg.in/yaml.v2"

	"github.com/cortexproject/cortex/pkg/chunk"
	"github.com/cortexproject/cortex/pkg/cortexpb"
@@ -843,6 +845,37 @@ type ruleGroupKey struct {
	user, namespace, group string
}

func TestRuler_ListAllRules(t *testing.T) {
	cfg, cleanup := defaultRulerConfig(newMockRuleStore(mockRules))
	defer cleanup()

	r, rcleanup := newTestRuler(t, cfg)
	defer rcleanup()
	defer services.StopAndAwaitTerminated(context.Background(), r) //nolint:errcheck

	router := mux.NewRouter()
	router.Path("/ruler/rule_groups").Methods(http.MethodGet).HandlerFunc(r.ListAllRules)

	req := requestFor(t, http.MethodGet, "https://localhost:8080/ruler/rule_groups", nil, "")
	w := httptest.NewRecorder()
	router.ServeHTTP(w, req)

	resp := w.Result()
	body, err := ioutil.ReadAll(resp.Body)
	require.NoError(t, err)

	// Check status code and header
	require.Equal(t, http.StatusOK, resp.StatusCode)
	require.Equal(t, "application/yaml", resp.Header.Get("Content-Type"))

	gs := make(map[string]map[string][]rulefmt.RuleGroup) // user:namespace:[]rulefmt.RuleGroup
	for userID := range mockRules {
		gs[userID] = mockRules[userID].Formatted()
	}
	expectedResponse, err := yaml.Marshal(gs)
	require.NoError(t, err)
	require.YAMLEq(t, string(expectedResponse), string(body))
}

type senderFunc func(alerts ...*notifier.Alert)

func (s senderFunc) Send(alerts ...*notifier.Alert) {