
Commit 206c9cd

Author: Allenzhli
add api to list all alertmanager configs and rule groups (#3529)
* add test
  Signed-off-by: allenzhli <[email protected]>
* fix api_test
  Signed-off-by: allenzhli <[email protected]>
* fix
  Signed-off-by: allenzhli <[email protected]>
* make clean-white-noise
  Signed-off-by: allenzhli <[email protected]>
* remove merge conflict marks
  Signed-off-by: allenzhli <[email protected]>
* update reviews
  Signed-off-by: allenzhli <[email protected]>
* fix test error
  Signed-off-by: allenzhli <[email protected]>
* improve list all user configs
  Signed-off-by: allenzhli <[email protected]>
* fix
  Signed-off-by: allenzhli <[email protected]>
* remove RespIterator and update to use chan idiom
  Signed-off-by: allenzhli <[email protected]>
* del http.Flusher
  Signed-off-by: allenzhli <[email protected]>
* update to StreamWriteYAMLResponse
  Signed-off-by: allenzhli <[email protected]>
* update review comments
  Signed-off-by: allenzhli <[email protected]>
* del .vscode and fix typos
  Signed-off-by: allenzhli <[email protected]>
* Update CHANGELOG.md
  Co-authored-by: Marco Pracucci <[email protected]>

Update CHANGELOG.md
Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/ruler/ruler_test.go
Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/util/http_test.go
Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api_test.go
Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api_test.go
Co-authored-by: Marco Pracucci <[email protected]>

Update pkg/alertmanager/api.go
Co-authored-by: Marco Pracucci <[email protected]>

fix review comments
Signed-off-by: allenzhli <[email protected]>
1 parent 6cade75 commit 206c9cd

9 files changed, +315 -1 lines changed

CHANGELOG.md

+2
@@ -27,6 +27,8 @@
 * [ENHANCEMENT] Ruler: added the following metrics when ruler sharding is enabled: #3916
   * `cortex_ruler_clients`
   * `cortex_ruler_client_request_duration_seconds`
+* [ENHANCEMENT] Alertmanager: Add API endpoint to list all tenant alertmanager configs: `GET /multitenant_alertmanager/configs`. #3529
+* [ENHANCEMENT] Ruler: Add API endpoint to list all tenant ruler rule groups: `GET /ruler/rule_groups`. #3529
 * [ENHANCEMENT] Query-frontend/scheduler: added querier forget delay (`-query-frontend.querier-forget-delay` and `-query-scheduler.querier-forget-delay`) to mitigate the blast radius in the event queriers crash because of a repeatedly sent "query of death" when shuffle-sharding is enabled. #3901
 * [ENHANCEMENT] Query-frontend: reduced memory allocations when serializing query response. #3964
 * [ENHANCEMENT] Querier / ruler: some optimizations to PromQL query engine. #3934 #3989

docs/api/_index.md

+18
@@ -41,6 +41,7 @@ For the sake of clarity, in this document we have grouped API endpoints by servi
 | [Get tenant ingestion stats](#get-tenant-ingestion-stats) | Querier | `GET /api/v1/user_stats` |
 | [Get tenant chunks](#get-tenant-chunks) | Querier | `GET /api/v1/chunks` |
 | [Ruler ring status](#ruler-ring-status) | Ruler | `GET /ruler/ring` |
+| [Ruler rules ](#ruler-rule-groups) | Ruler | `GET /ruler/rule_groups` |
 | [List rules](#list-rules) | Ruler | `GET <prometheus-http-prefix>/api/v1/rules` |
 | [List alerts](#list-alerts) | Ruler | `GET <prometheus-http-prefix>/api/v1/alerts` |
 | [List rule groups](#list-rule-groups) | Ruler | `GET /api/v1/rules` |
@@ -51,6 +52,7 @@ For the sake of clarity, in this document we have grouped API endpoints by servi
 | [Delete namespace](#delete-namespace) | Ruler | `DELETE /api/v1/rules/{namespace}` |
 | [Delete tenant configuration](#delete-tenant-configuration) | Ruler | `POST /ruler/delete_tenant_config` |
 | [Alertmanager status](#alertmanager-status) | Alertmanager | `GET /multitenant_alertmanager/status` |
+| [Alertmanager configs](#alertmanager-configs) | Alertmanager | `GET /multitenant_alertmanager/configs` |
 | [Alertmanager ring status](#alertmanager-ring-status) | Alertmanager | `GET /multitenant_alertmanager/ring` |
 | [Alertmanager UI](#alertmanager-ui) | Alertmanager | `GET /<alertmanager-http-prefix>` |
 | [Alertmanager Delete Tenant Configuration](#alertmanager-delete-tenant-configuration) | Alertmanager | `POST /multitenant_alertmanager/delete_tenant_config` |
@@ -435,6 +437,14 @@ GET /ruler_ring
 
 Displays a web page with the ruler hash ring status, including the state, healthy and last heartbeat time of each ruler.
 
+### Ruler rules
+
+```
+GET /ruler/rule_groups
+```
+
+List all tenant rules. This endpoint is not part of ruler-API and is always available regardless of whether ruler-API is enabled or not. It should not be exposed to end users. This endpoint returns a YAML dictionary with all the rule groups for each tenant and `200` status code on success.
+
 ### List rules
 
 ```
@@ -667,6 +677,14 @@ GET /status
 
 Displays a web page with the current status of the Alertmanager, including the Alertmanager cluster members.
 
+### Alertmanager configs
+
+```
+GET /multitenant_alertmanager/configs
+```
+
+List all Alertmanager configurations. This endpoint is not part of alertmanager-API and is always available regardless of whether alertmanager-API is enabled or not. It should not be exposed to end users. This endpoint returns a YAML dictionary with all the Alertmanager configurations and `200` status code on success.
+
 ### Alertmanager ring status
 
 ```

pkg/alertmanager/api.go

+53
@@ -1,14 +1,19 @@
 package alertmanager
 
 import (
+	"context"
 	"fmt"
 	"io/ioutil"
 	"net/http"
 	"os"
 	"path/filepath"
 
+	"github.com/pkg/errors"
+
 	"github.com/cortexproject/cortex/pkg/alertmanager/alertspb"
 	"github.com/cortexproject/cortex/pkg/tenant"
+	"github.com/cortexproject/cortex/pkg/util"
+	"github.com/cortexproject/cortex/pkg/util/concurrency"
 	util_log "github.com/cortexproject/cortex/pkg/util/log"
 
 	"github.com/go-kit/kit/log"
@@ -25,6 +30,9 @@ const (
 	errStoringConfiguration  = "unable to store the Alertmanager config"
 	errDeletingConfiguration = "unable to delete the Alertmanager config"
 	errNoOrgID               = "unable to determine the OrgID"
+	errListAllUser           = "unable to list the Alertmanager users"
+
+	fetchConcurrency = 16
 )
 
 // UserConfig is used to communicate a users alertmanager configs
@@ -184,3 +192,48 @@ func validateUserConfig(logger log.Logger, cfg alertspb.AlertConfigDesc) error {
 
 	return nil
 }
+
+func (am *MultitenantAlertmanager) ListAllConfigs(w http.ResponseWriter, r *http.Request) {
+	logger := util_log.WithContext(r.Context(), am.logger)
+	userIDs, err := am.store.ListAllUsers(r.Context())
+	if err != nil {
+		level.Error(logger).Log("msg", "failed to list users of alertmanager", "err", err)
+		http.Error(w, fmt.Sprintf("%s: %s", errListAllUser, err.Error()), http.StatusInternalServerError)
+		return
+	}
+
+	done := make(chan struct{})
+	iter := make(chan interface{})
+
+	go func() {
+		util.StreamWriteYAMLResponse(w, iter, logger)
+		close(done)
+	}()
+
+	err = concurrency.ForEachUser(r.Context(), userIDs, fetchConcurrency, func(ctx context.Context, userID string) error {
+		cfg, err := am.store.GetAlertConfig(ctx, userID)
+		if errors.Is(err, alertspb.ErrNotFound) {
+			return nil
+		} else if err != nil {
+			return errors.Wrapf(err, "failed to fetch alertmanager config for user %s", userID)
+		}
+		data := map[string]*UserConfig{
+			userID: {
+				TemplateFiles:      alertspb.ParseTemplates(cfg),
+				AlertmanagerConfig: cfg.RawConfig,
+			},
+		}
+
+		select {
+		case iter <- data:
+		case <-done: // stop early, if sending response has already finished
+		}
+
+		return nil
+	})
+	if err != nil {
+		level.Error(logger).Log("msg", "failed to list all alertmanager configs", "err", err)
+	}
+	close(iter)
+	<-done
+}

pkg/alertmanager/api_test.go

+90
@@ -10,11 +10,18 @@ import (
 	"testing"
 
 	"github.com/go-kit/kit/log"
+	"github.com/gorilla/mux"
+	"github.com/prometheus/client_golang/prometheus"
+	"gopkg.in/yaml.v2"
+
 	"github.com/thanos-io/thanos/pkg/objstore"
 
+	"github.com/cortexproject/cortex/pkg/util/flagext"
+
 	"github.com/cortexproject/cortex/pkg/alertmanager/alertspb"
 	"github.com/cortexproject/cortex/pkg/alertmanager/alertstore/bucketclient"
 	util_log "github.com/cortexproject/cortex/pkg/util/log"
+	"github.com/cortexproject/cortex/pkg/util/services"
 
 	"github.com/stretchr/testify/require"
 	"github.com/weaveworks/common/user"
@@ -199,3 +206,86 @@ func TestMultitenantAlertmanager_DeleteUserConfig(t *testing.T) {
 		require.Equal(t, 0, len(storage.Objects()))
 	}
 }
+
+func TestAMConfigListUserConfig(t *testing.T) {
+	testCases := map[string]*UserConfig{
+		"user1": {
+			AlertmanagerConfig: `
+global:
+  resolve_timeout: 5m
+route:
+  receiver: route1
+  group_by:
+  - '...'
+  continue: false
+receivers:
+- name: route1
+  webhook_configs:
+  - send_resolved: true
+    http_config: {}
+    url: http://alertmanager/api/notifications?orgId=1&rrid=7
+    max_alerts: 0
+`,
+		},
+		"user2": {
+			AlertmanagerConfig: `
+global:
+  resolve_timeout: 5m
+route:
+  receiver: route1
+  group_by:
+  - '...'
+  continue: false
+receivers:
+- name: route1
+  webhook_configs:
+  - send_resolved: true
+    http_config: {}
+    url: http://alertmanager/api/notifications?orgId=2&rrid=7
+    max_alerts: 0
+`,
+		},
+	}
+
+	storage := objstore.NewInMemBucket()
+	alertStore := bucketclient.NewBucketAlertStore(storage, nil, log.NewNopLogger())
+
+	for u, cfg := range testCases {
+		err := alertStore.SetAlertConfig(context.Background(), alertspb.AlertConfigDesc{
+			User:      u,
+			RawConfig: cfg.AlertmanagerConfig,
+		})
+		require.NoError(t, err)
+	}
+
+	externalURL := flagext.URLValue{}
+	err := externalURL.Set("http://localhost:8080/alertmanager")
+	require.NoError(t, err)
+
+	// Create the Multitenant Alertmanager.
+	reg := prometheus.NewPedanticRegistry()
+	cfg := mockAlertmanagerConfig(t)
+	am, err := createMultitenantAlertmanager(cfg, nil, nil, alertStore, nil, log.NewNopLogger(), reg)
+	require.NoError(t, err)
+	require.NoError(t, services.StartAndAwaitRunning(context.Background(), am))
+	defer services.StopAndAwaitTerminated(context.Background(), am) //nolint:errcheck
+
+	err = am.loadAndSyncConfigs(context.Background(), reasonPeriodic)
+	require.NoError(t, err)
+	require.Len(t, am.alertmanagers, 2)
+
+	router := mux.NewRouter()
+	router.Path("/multitenant_alertmanager/configs").Methods(http.MethodGet).HandlerFunc(am.ListAllConfigs)
+	req := httptest.NewRequest("GET", "https://localhost:8080/multitenant_alertmanager/configs", nil)
+	w := httptest.NewRecorder()
+	router.ServeHTTP(w, req)
+
+	resp := w.Result()
+	require.Equal(t, http.StatusOK, resp.StatusCode)
+	require.Equal(t, "application/yaml", resp.Header.Get("Content-Type"))
+	body, err := ioutil.ReadAll(resp.Body)
+	require.NoError(t, err)
+	old, err := yaml.Marshal(testCases)
+	require.NoError(t, err)
+	require.YAMLEq(t, string(old), string(body))
+}

pkg/api/api.go

+4
@@ -168,6 +168,7 @@ func (a *API) RegisterAlertmanager(am *alertmanager.MultitenantAlertmanager, tar
 	a.indexPage.AddLink(SectionAdminEndpoints, "/multitenant_alertmanager/ring", "Alertmanager Ring Status")
 	// Ensure this route is registered before the prefixed AM route
 	a.RegisterRoute("/multitenant_alertmanager/status", am.GetStatusHandler(), false, "GET")
+	a.RegisterRoute("/multitenant_alertmanager/configs", http.HandlerFunc(am.ListAllConfigs), false, "GET")
 	a.RegisterRoute("/multitenant_alertmanager/ring", http.HandlerFunc(am.RingHandler), false, "GET", "POST")
 	a.RegisterRoute("/multitenant_alertmanager/delete_tenant_config", http.HandlerFunc(am.DeleteUserConfig), true, "POST")
 
@@ -285,6 +286,9 @@ func (a *API) RegisterRuler(r *ruler.Ruler) {
 	// Legacy Ring Route
 	a.RegisterRoute("/ruler_ring", r, false, "GET", "POST")
 
+	// List all user rule groups
+	a.RegisterRoute("/ruler/rule_groups", http.HandlerFunc(r.ListAllRules), false, "GET")
+
 	ruler.RegisterRulerServer(a.server.GRPC, r)
 }
 
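
Both routes registered above answer plain `GET` requests and, per the accompanying docs, return a YAML dictionary with status `200`. As a rough sketch of how an operator-side client might call them, the example below fetches a path and returns the raw body. `fetchYAML` and `newFakeServer` are hypothetical names, and the fake server with its placeholder payload only stands in for a running Cortex instance; the real response shape is defined by the handlers in this commit, not by this sketch.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchYAML is a hypothetical helper: it issues a GET request and returns the
// raw response body, or an error if the server did not answer with 200.
func fetchYAML(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %d: %s", resp.StatusCode, body)
	}
	return string(body), nil
}

// newFakeServer is a stand-in for Cortex (assumption: placeholder payload).
// It serves the two paths added in this commit with a dummy YAML document.
func newFakeServer() *httptest.Server {
	mux := http.NewServeMux()
	handler := func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/yaml")
		fmt.Fprint(w, "user1: {}\n")
	}
	mux.HandleFunc("/multitenant_alertmanager/configs", handler)
	mux.HandleFunc("/ruler/rule_groups", handler)
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeServer()
	defer srv.Close()

	for _, path := range []string{"/multitenant_alertmanager/configs", "/ruler/rule_groups"} {
		body, err := fetchYAML(srv.Client(), srv.URL+path)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s ->\n%s", path, body)
	}
}
```

Since the docs state these endpoints should not be exposed to end users, a client like this would normally run from inside the operator's network.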

pkg/ruler/ruler.go

+44 -1
@@ -50,7 +50,8 @@ var (
 
 const (
 	// Number of concurrent group list and group loads operations.
-	loadRulesConcurrency = 10
+	loadRulesConcurrency  = 10
+	fetchRulesConcurrency = 16
 
 	rulerSyncReasonInitial  = "initial"
 	rulerSyncReasonPeriodic = "periodic"
@@ -59,6 +60,9 @@ const (
 	// Limit errors
 	errMaxRuleGroupsPerUserLimitExceeded        = "per-user rule groups limit (limit: %d actual: %d) exceeded"
 	errMaxRulesPerRuleGroupPerUserLimitExceeded = "per-user rules per rule group limit (limit: %d actual: %d) exceeded"
+
+	// errors
+	errListAllUser = "unable to list the ruler users"
 )
 
 // Config is the configuration for the recording rules server.
@@ -840,3 +844,42 @@ func (r *Ruler) DeleteTenantConfiguration(w http.ResponseWriter, req *http.Reque
 	level.Info(logger).Log("msg", "deleted all tenant rule groups", "user", userID)
 	w.WriteHeader(http.StatusOK)
 }
+
+func (r *Ruler) ListAllRules(w http.ResponseWriter, req *http.Request) {
+	logger := util_log.WithContext(req.Context(), r.logger)
+
+	userIDs, err := r.store.ListAllUsers(req.Context())
+	if err != nil {
+		level.Error(logger).Log("msg", errListAllUser, "err", err)
+		http.Error(w, fmt.Sprintf("%s: %s", errListAllUser, err.Error()), http.StatusInternalServerError)
+		return
+	}
+
+	done := make(chan struct{})
+	iter := make(chan interface{})
+
+	go func() {
+		util.StreamWriteYAMLResponse(w, iter, logger)
+		close(done)
+	}()
+
+	err = concurrency.ForEachUser(req.Context(), userIDs, fetchRulesConcurrency, func(ctx context.Context, userID string) error {
+		rg, err := r.store.ListRuleGroupsForUserAndNamespace(ctx, userID, "")
+		if err != nil {
+			return errors.Wrapf(err, "failed to fetch ruler config for user %s", userID)
+		}
+		data := map[string]map[string][]rulefmt.RuleGroup{userID: rg.Formatted()}
+
+		select {
+		case iter <- data:
+		case <-done: // stop early, if sending response has already finished
+		}
+
+		return nil
+	})
+	if err != nil {
+		level.Error(logger).Log("msg", "failed to list all ruler configs", "err", err)
+	}
+	close(iter)
+	<-done
+}

pkg/ruler/ruler_test.go

+33
@@ -17,6 +17,7 @@ import (
 
 	"github.com/go-kit/kit/log"
 	"github.com/go-kit/kit/log/level"
+	"github.com/gorilla/mux"
 	"github.com/prometheus/client_golang/prometheus"
 	prom_testutil "github.com/prometheus/client_golang/prometheus/testutil"
 	"github.com/prometheus/prometheus/notifier"
@@ -28,6 +29,7 @@ import (
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 	"github.com/weaveworks/common/user"
+	"gopkg.in/yaml.v2"
 
 	"github.com/cortexproject/cortex/pkg/chunk"
 	"github.com/cortexproject/cortex/pkg/cortexpb"
@@ -843,6 +845,37 @@ type ruleGroupKey struct {
 	user, namespace, group string
 }
 
+func TestRuler_ListAllRules(t *testing.T) {
+	cfg, cleanup := defaultRulerConfig(newMockRuleStore(mockRules))
+	defer cleanup()
+
+	r, rcleanup := newTestRuler(t, cfg)
+	defer rcleanup()
+	defer services.StopAndAwaitTerminated(context.Background(), r) //nolint:errcheck
+
+	router := mux.NewRouter()
+	router.Path("/ruler/rule_groups").Methods(http.MethodGet).HandlerFunc(r.ListAllRules)
+
+	req := requestFor(t, http.MethodGet, "https://localhost:8080/ruler/rule_groups", nil, "")
+	w := httptest.NewRecorder()
+	router.ServeHTTP(w, req)
+
+	resp := w.Result()
+	body, _ := ioutil.ReadAll(resp.Body)
+
+	// Check status code and header
+	require.Equal(t, http.StatusOK, resp.StatusCode)
+	require.Equal(t, "application/yaml", resp.Header.Get("Content-Type"))
+
+	gs := make(map[string]map[string][]rulefmt.RuleGroup) // user:namespace:[]rulefmt.RuleGroup
+	for userID := range mockRules {
+		gs[userID] = mockRules[userID].Formatted()
+	}
+	expectedResponse, err := yaml.Marshal(gs)
+	require.NoError(t, err)
+	require.YAMLEq(t, string(expectedResponse), string(body))
+}
+
 type senderFunc func(alerts ...*notifier.Alert)
 
 func (s senderFunc) Send(alerts ...*notifier.Alert) {
