Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing Stale alert manager templates files #4495

Merged
merged 5 commits into from
Oct 12, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@
* [BUGFIX] Querier: fixed panic when querying exemplars and using `-distributor.shard-by-all-labels=false`. #4473
* [BUGFIX] Querier: honor querier minT,maxT if `nil` SelectHints are passed to Select(). #4413
* [BUGFIX] Compactor: fixed panic while collecting Prometheus metrics. #4483
* [BUGFIX] AlertManager: remove stale template files. #4495


## 1.10.0 / 2021-08-03
Expand Down
23 changes: 21 additions & 2 deletions pkg/alertmanager/multitenant.go
Original file line number Diff line number Diff line change
Expand Up @@ -825,14 +825,25 @@ func (am *MultitenantAlertmanager) setConfig(cfg alertspb.AlertConfigDesc) error
var userAmConfig *amconfig.Config
var err error
var hasTemplateChanges bool
var userTemplateDir = filepath.Join(am.getTenantDirectory(cfg.User), templatesDir)
var pathsToRemove = make(map[string]struct{})

// List existing files to keep track the ones to be removed
if oldTemplateFiles, err := ioutil.ReadDir(userTemplateDir); err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves a comment like "first we list all the files, then we will remove those from the list that are in use, then we will remove the files that are never used".

Copy link
Member Author

@alanprot alanprot Sep 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any danger that there could be a file there for some other reason?

I dont think so. Cortex expose an api to configure alert manager for a tenant and this API receives the template files and the prometheus AlertManager configuration on the same request. On a regular prometheus alert manager, the config file are stored in the filesystem and referenced by the configuration. Cortex implementation receives both on the same request, drops the templates files on the filesystem and reference them on the alert manager config. The folder where cortex drops the file is exclusive for a given tenant/alert manager.

Could someone remove a template, then decide to use it again but you deleted it?

As we configure the templates and the alert manager on the same request, we should never have a alert manager configuration that uses a template that is not defined in the same request.

I think this deserves a comment like "first we list all the files, then we will remove those from the list that are in use, then we will remove the files that are never used".

Make sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the comments

for _, file := range oldTemplateFiles {
pathsToRemove[filepath.Join(userTemplateDir, file.Name())] = struct{}{}
}
}

for _, tmpl := range cfg.Templates {
templateFilepath, err := safeTemplateFilepath(filepath.Join(am.getTenantDirectory(cfg.User), templatesDir), tmpl.Filename)
templateFilePath, err := safeTemplateFilepath(userTemplateDir, tmpl.Filename)
if err != nil {
return err
}

hasChanged, err := storeTemplateFile(templateFilepath, tmpl.Body)
// Removing from pathsToRemove map the files that still exists in the config
delete(pathsToRemove, templateFilePath)
hasChanged, err := storeTemplateFile(templateFilePath, tmpl.Body)
if err != nil {
return err
}
Expand All @@ -842,6 +853,14 @@ func (am *MultitenantAlertmanager) setConfig(cfg alertspb.AlertConfigDesc) error
}
}

for pathToRemove := range pathsToRemove {
err := os.Remove(pathToRemove)
if err != nil {
level.Warn(am.logger).Log("msg", "failed to remove file", "file", pathToRemove, "err", err)
}
hasTemplateChanges = true
}

level.Debug(am.logger).Log("msg", "setting config", "user", cfg.User)

am.alertmanagersMtx.Lock()
Expand Down
17 changes: 17 additions & 0 deletions pkg/alertmanager/multitenant_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -319,6 +319,23 @@ templates:
cortex_alertmanager_config_last_reload_successful{user="user2"} 1
cortex_alertmanager_config_last_reload_successful{user="user3"} 1
`), "cortex_alertmanager_config_last_reload_successful"))

// Removed template files should be cleaned up
user3Cfg.Templates = []*alertspb.TemplateDesc{
{
Filename: "first.tpl",
Body: `{{ define "t1" }}Template 1 ... {{end}}`,
},
}

require.NoError(t, store.SetAlertConfig(ctx, user3Cfg))

err = am.loadAndSyncConfigs(context.Background(), reasonPeriodic)
require.NoError(t, err)

require.True(t, dirExists(t, user3Dir))
require.True(t, fileExists(t, filepath.Join(user3Dir, templatesDir, "first.tpl")))
require.False(t, fileExists(t, filepath.Join(user3Dir, templatesDir, "second.tpl")))
}

func TestMultitenantAlertmanager_FirewallShouldBlockHTTPBasedReceiversWhenEnabled(t *testing.T) {
Expand Down