Commit a0eb041

feat: added thanos check rules command
Signed-off-by: Martin Chodur <[email protected]>
1 parent 44efe38 commit a0eb041

File tree

9 files changed

+297
-23
lines changed

CHANGELOG.md

+25-22
@@ -11,6 +11,9 @@ We use *breaking* word for marking changes that are not backward compatible (rel
1111

1212
## Unreleased
1313

14+
### Added
15+
- [#1097](https://github.com/improbable-eng/thanos/pull/1097) Added `thanos check rules` linter for Thanos Ruler rule files.
16+
1417
## [v0.4.0](https://github.com/improbable-eng/thanos/releases/tag/v0.4.0) - 2019.05.3
1518

1619
:warning: **IMPORTANT** :warning: This is the last release that supports gossip. From Thanos v0.5.0, gossip will be completely removed.
@@ -33,28 +36,28 @@ See [this](docs/proposals/approved/201809_gossip-removal.md) for more details.
3336
New options:
3437

3538
New Store flags:
36-
39+
3740
* `--store.grpc.series-sample-limit` limits the amount of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
3841
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Consider lowering or raising it depending on the scale of your deployment.
3942

4043
New Store metrics:
41-
44+
4245
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
4346
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
4447
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
4548
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.
46-
49+
4750
New Store tracing span:
4851
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.
49-
50-
- [#1016](https://github.com/improbable-eng/thanos/pull/1016) Added option for another DNS resolver (miekg/dns client).
52+
53+
- [#1016](https://github.com/improbable-eng/thanos/pull/1016) Added option for another DNS resolver (miekg/dns client).
5154
Note that this is required to have SRV resolution working on [Golang 1.11+ with KubeDNS below v1.14](https://github.com/golang/go/issues/27546)
5255

5356
New Querier and Ruler flag: `--store.sd-dns-resolver`, which specifies the resolver to use. Either `golang` or `miekgdns`.
54-
57+
5558
- [#986](https://github.com/improbable-eng/thanos/pull/986) Allow to save some startup & sync time in store gateway as it is no longer needed to compute index-cache from block index on its own for larger blocks.
56-
The Store Gateway can still do it, but it first checks the bucket for an already uploaded index cache.
57-
At the same time, the compactor precomputes the index cache file on every compaction.
59+
The Store Gateway can still do it, but it first checks the bucket for an already uploaded index cache.
60+
At the same time, the compactor precomputes the index cache file on every compaction.
5861

5962
New Compactor flag: `--index.generate-missing-cache-file` was added to allow quicker addition of index cache files. If enabled it precomputes missing files on compactor startup. Note that it will take time and it's only one-off step per bucket.
6063

@@ -71,31 +74,31 @@ Note that this is required to have SRV resolution working on [Golang 1.11+ with
7174
- [#1021](https://github.com/improbable-eng/thanos/pull/1021) Query API `series` now supports POST method.
7275
- [#939](https://github.com/improbable-eng/thanos/pull/939) Query API `query_range` now supports POST method.
7376

74-
### Changed
77+
### Changed
7578

7679
- [#970](https://github.com/improbable-eng/thanos/pull/970) Deprecated `partial_response_disabled` proto field. Added `partial_response_strategy` instead. Both in gRPC and Query API.
7780
No `PartialResponseStrategy` field for `RuleGroups` by default means `abort` strategy (old PartialResponse disabled) as this is recommended option for Rules and alerts.
7881

7982
Metrics:
80-
83+
8184
* Added `thanos_rule_evaluation_with_warnings_total` to Ruler.
8285
* DNS `thanos_ruler_query_apis*` are now `thanos_ruler_query_apis_*` for consistency.
8386
* DNS `thanos_querier_store_apis*` are now `thanos_querier_store_apis_*` for consistency.
8487
* Query Gate `thanos_bucket_store_series*` are now `thanos_bucket_store_series_*` for consistency.
8588
* Most Thanos Ruler metrics related to the rule manager now have a `strategy` label.
86-
89+
8790
Ruler tracing spans:
88-
91+
8992
* `/rule_instant_query HTTP[client]` is now `/rule_instant_query_part_resp_abort HTTP[client]` if the request is for the abort strategy.
90-
93+
9194
- [#1009](https://github.com/improbable-eng/thanos/pull/1009): Upgraded Prometheus (~v2.7.0-rc.0 to v2.8.1) and TSDB (`v0.4.0` to `v0.6.1`) deps.
92-
95+
9396
Changes that affect Thanos:
94-
* query:
95-
* [ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
96-
* [ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
97+
* query:
98+
* [ENHANCEMENT] In histogram_quantile merge buckets with equivalent le values. #5158.
99+
* [ENHANCEMENT] Show list of offending labels in the error message in many-to-many scenarios. #5189
97100
* [BUGFIX] Fix panic when aggregator param is not a literal. #5290
98-
* ruler:
101+
* ruler:
99102
* [ENHANCEMENT] Reduce time that Alertmanagers are in flux when reloaded. #5126
100103
* [BUGFIX] prometheus_rule_group_last_evaluation_timestamp_seconds is now a unix timestamp. #5186
101104
* [BUGFIX] prometheus_rule_group_last_duration_seconds now reports seconds instead of nanoseconds. Fixes our [issue #1027](https://github.com/improbable-eng/thanos/issues/1027)
@@ -107,26 +110,26 @@ Note that this is required to have SRV resolution working on [Golang 1.11+ with
107110
* [CHANGE] Renamed flag `--sync-delay` to `--consistency-delay` [#1053](https://github.com/improbable-eng/thanos/pull/1053)
108111

109112
For Ruler, essentially the whole TSDB CHANGELOG between v0.4.0 and v0.6.1 applies: https://github.com/prometheus/tsdb/blob/master/CHANGELOG.md
110-
113+
111114
Note that this was added on TSDB and Prometheus: [FEATURE] Time-overlapping blocks are now allowed. #370
112115
However, due to the distributed nature of Thanos compaction, this is disabled in the Thanos compactor for now as a safety measure.
113116

114117
- [#868](https://github.com/improbable-eng/thanos/pull/868) Go has been updated to 1.12.
115-
- [#1055](https://github.com/improbable-eng/thanos/pull/1055) Gossip flags are now disabled by default and deprecated.
118+
- [#1055](https://github.com/improbable-eng/thanos/pull/1055) Gossip flags are now disabled by default and deprecated.
116119
- [#964](https://github.com/improbable-eng/thanos/pull/964) repair: Repair process now sorts the series and labels within block.
117120
- [#1073](https://github.com/improbable-eng/thanos/pull/1073) Store: index cache for requests. It now calculates the size properly (includes slice header), has anti-deadlock safeguard and reports more metrics.
118121

119122
### Fixed
120123

121124
- [#921](https://github.com/improbable-eng/thanos/pull/921) `thanos_objstore_bucket_last_successful_upload_time` now does not appear when no blocks have been uploaded so far.
122-
- [#966](https://github.com/improbable-eng/thanos/pull/966) Bucket: verify no longer warns about overlapping blocks, that overlap `0s`
125+
- [#966](https://github.com/improbable-eng/thanos/pull/966) Bucket: verify no longer warns about overlapping blocks, that overlap `0s`
123126
- [#848](https://github.com/improbable-eng/thanos/pull/848) Compact: now correctly works with time series with duplicate labels.
124127
- [#894](https://github.com/improbable-eng/thanos/pull/894) Thanos Rule: UI now correctly shows evaluation time.
125128
- [#865](https://github.com/improbable-eng/thanos/pull/865) Query: now properly parses DNS SRV Service Discovery.
126129
- [#889](https://github.com/improbable-eng/thanos/pull/889) Store: added safeguard against merging posting groups segfault
127130
- [#941](https://github.com/improbable-eng/thanos/pull/941) Sidecar: added better handling of intermediate restarts.
128131
- [#933](https://github.com/improbable-eng/thanos/pull/933) Query: Fixed 30 seconds lag of adding new store to query.
129-
- [#962](https://github.com/improbable-eng/thanos/pull/962) Sidecar: Make config reloader file writes atomic.
132+
- [#962](https://github.com/improbable-eng/thanos/pull/962) Sidecar: Make config reloader file writes atomic.
130133
- [#982](https://github.com/improbable-eng/thanos/pull/982) Query: now advertises Min & Max Time accordingly to the nodes.
131134
- [#1041](https://github.com/improbable-eng/thanos/issues/1038) Ruler is now able to return long time range queries.
132135
- [#904](https://github.com/improbable-eng/thanos/pull/904) Compact: Skip compaction for blocks with no samples.

cmd/thanos/check.go

+119
@@ -0,0 +1,119 @@
package main

import (
	"io/ioutil"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
	thanosrule "github.com/improbable-eng/thanos/pkg/rule"
	"github.com/oklog/run"
	"github.com/opentracing/opentracing-go"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/prometheus/pkg/rulefmt"
	"github.com/prometheus/tsdb"
	"gopkg.in/alecthomas/kingpin.v2"
	"gopkg.in/yaml.v2"
)

func registerChecks(m map[string]setupFunc, app *kingpin.Application, name string) {
	cmd := app.Command(name, "Linting tools for Thanos")
	registerCheckRules(m, cmd, name)
}

func registerCheckRules(m map[string]setupFunc, root *kingpin.CmdClause, name string) {
	checkRulesCmd := root.Command("rules", "Check if the rule files are valid or not.")
	ruleFiles := checkRulesCmd.Arg(
		"rule-files",
		"The rule files to check.",
	).Required().ExistingFiles()

	m[name+" rules"] = func(g *run.Group, logger log.Logger, reg *prometheus.Registry, _ opentracing.Tracer, _ bool) error {
		// Dummy actor to immediately kill the group after the run function returns.
		g.Add(func() error { return nil }, func(error) {})
		return checkRulesFiles(logger, ruleFiles)
	}
}

func checkRulesFiles(logger log.Logger, files *[]string) error {
	failed := tsdb.MultiError{}

	for _, f := range *files {
		n, errs := checkRules(logger, f)
		if errs.Err() != nil {
			level.Error(logger).Log("result", "FAILED")
			for _, e := range errs {
				level.Error(logger).Log("error", e.Error())
				failed.Add(e)
			}
			level.Info(logger).Log()
			continue
		}
		level.Info(logger).Log("result", "SUCCESS", "rules found", n)
	}
	if failed.Err() != nil {
		return failed
	}
	return nil
}

func checkRules(logger log.Logger, filename string) (int, tsdb.MultiError) {
	level.Info(logger).Log("msg", "checking", "filename", filename)
	checkErrors := tsdb.MultiError{}

	b, err := ioutil.ReadFile(filename)
	if err != nil {
		checkErrors.Add(err)
		return 0, checkErrors
	}

	var rgs thanosrule.RuleGroups
	if err := yaml.Unmarshal(b, &rgs); err != nil {
		checkErrors.Add(err)
		return 0, checkErrors
	}

	// We need to convert Thanos rules to Prometheus rules so we can use their validation.
	promRgs := thanosRuleGroupsToPromRuleGroups(rgs)
	if errs := promRgs.Validate(); errs != nil {
		for _, e := range errs {
			checkErrors.Add(e)
		}
		return 0, checkErrors
	}

	numRules := 0
	for _, rg := range rgs.Groups {
		numRules += len(rg.Rules)
	}

	return numRules, checkErrors
}

func thanosRuleGroupsToPromRuleGroups(ruleGroups thanosrule.RuleGroups) rulefmt.RuleGroups {
	promRuleGroups := rulefmt.RuleGroups{Groups: []rulefmt.RuleGroup{}}
	for _, g := range ruleGroups.Groups {
		group := rulefmt.RuleGroup{
			Name:     g.Name,
			Interval: g.Interval,
			Rules:    []rulefmt.Rule{},
		}
		for _, r := range g.Rules {
			group.Rules = append(
				group.Rules,
				rulefmt.Rule{
					Record:      r.Record,
					Alert:       r.Alert,
					Expr:        r.Expr,
					For:         r.For,
					Labels:      r.Labels,
					Annotations: r.Annotations,
				},
			)
		}
		promRuleGroups.Groups = append(
			promRuleGroups.Groups,
			group,
		)
	}
	return promRuleGroups
}

cmd/thanos/check_test.go

+29
@@ -0,0 +1,29 @@
package main

import (
	"testing"

	"github.com/go-kit/kit/log"
	"github.com/improbable-eng/thanos/pkg/testutil"
)

func Test_checkRules(t *testing.T) {

	validFiles := []string{
		"./testdata/rules-files/valid.yaml",
	}

	invalidFiles := [][]string{
		[]string{"./testdata/rules-files/non-existing-file.yaml"},
		[]string{"./testdata/rules-files/invalid-yaml-format.yaml"},
		[]string{"./testdata/rules-files/invalid-rules-data.yaml"},
	}

	logger := log.NewNopLogger()

	testutil.Ok(t, checkRulesFiles(logger, &validFiles))

	for _, fn := range invalidFiles {
		testutil.NotOk(t, checkRulesFiles(logger, &fn))
	}
}

cmd/thanos/main.go

+1
@@ -79,6 +79,7 @@ func main() {
7979
registerBucket(cmds, app, "bucket")
8080
registerDownsample(cmds, app, "downsample")
8181
registerReceive(cmds, app, "receive")
82+
registerChecks(cmds, app, "check")
8283

8384
cmd, err := app.Parse(os.Args[1:])
8485
if err != nil {
@@ -0,0 +1,12 @@
groups:
- name: null
  partial_response_strategy: "warn"
  interval: 2m
  rules:
  - alert: TestAlert
    partial_response_strategy: "warn"
    expr: 1
    labels:
      key: value
    annotations:
      key: value
@@ -0,0 +1,3 @@
groups:
- name: test
  invalid_yaml_reason
+20
@@ -0,0 +1,20 @@
groups:
- name: test-alert-group
  partial_response_strategy: "warn"
  interval: 2m
  rules:
  - alert: TestAlert
    partial_response_strategy: "warn"
    expr: 1
    labels:
      key: value
    annotations:
      key: value

- name: test-rule-group
  partial_response_strategy: "warn"
  interval: 2m
  rules:
  - record: test_metric
    expr: 1
    partial_response_strategy: "warn"

docs/components/check.md

+82
@@ -0,0 +1,82 @@
---
title: Check
type: docs
menu: components
---

# Check

The check component contains tools for validation of Prometheus rules.

## Deployment
## Flags

[embedmd]:# (flags/check.txt $)
```$
usage: thanos check <command> [<args> ...]

Linting tools for Thanos

Flags:
  -h, --help               Show context-sensitive help (also try --help-long and
                           --help-man).
      --version            Show application version.
      --log.level=info     Log filtering level.
      --log.format=logfmt  Log format to use.
      --gcloudtrace.project=GCLOUDTRACE.PROJECT
                           GCP project to send Google Cloud Trace tracings to.
                           If empty, tracing will be disabled.
      --gcloudtrace.sample-factor=1
                           How often we send traces (1/<sample-factor>). If 0 no
                           trace will be sent periodically, unless forced by
                           baggage item. See `pkg/tracing/tracing.go` for
                           details.

Subcommands:
  check rules <rule-files>...
    Check if the rule files are valid or not.


```


### Verify

`check rules` checks whether the Prometheus rules used by the Thanos rule node are valid.
The check should be equivalent to `promtool check rules`, but `promtool` cannot be used because
Thanos rule extends the rules file syntax with the `partial_response_strategy` field,
which `promtool` does not allow.

If the check fails, the command exits with code `1`; otherwise it exits with `0`.

Example:

```
$ ./thanos check rules cmd/thanos/testdata/rules-files/*.yaml
```

[embedmd]:# (flags/check_rules.txt)
```txt
usage: thanos check rules <rule-files>...

Check if the rule files are valid or not.

Flags:
  -h, --help               Show context-sensitive help (also try --help-long and
                           --help-man).
      --version            Show application version.
      --log.level=info     Log filtering level.
      --log.format=logfmt  Log format to use.
      --gcloudtrace.project=GCLOUDTRACE.PROJECT
                           GCP project to send Google Cloud Trace tracings to.
                           If empty, tracing will be disabled.
      --gcloudtrace.sample-factor=1
                           How often we send traces (1/<sample-factor>). If 0 no
                           trace will be sent periodically, unless forced by
                           baggage item. See `pkg/tracing/tracing.go` for
                           details.

Args:
  <rule-files>  The rule files to check.

```
