-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Improve filtering configuration with globs and make match_type optional #1081
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I was doing some refactoring around the filterset and found the current config hard to use as well. I do believe that it is an important optimization to know when a pattern is regexp or strict string comparison. So my proposed alternative is:
|
Only if it starts and ends with
How do you match You can see more examples here that we use in the SignalFx agent: https://docs.signalfx.com/en/latest/integrations/agent/filtering.html
You do know based on the syntax defined above of |
In the Smart Agent, we settled on the way Jay is proposing because it was easier to write and understand from the end user perspective, with less syntax and structure to remember. Performance-wise it can be written in such a way that it is no different. I'm not sure how necessary the negation is if you are going to keep the |
Yeah that's true, if you always have include and exclude sections it's probably not useful. I was thinking I saw some kind of filtering in OT that was just a single list without separate sections where I thought it might be useful but I forget where. Can always be added later if needed. I think we can drop it from the proposal for now. |
Is any standard that defines |
You mean what if you have a filter that's
@keitwb best I can tell you'd just have to use the regex in SA, am I missing anything? |
But for your initial example why did you use |
For demonstration purposes only :) Yes a better example would probably have been The vast majority of the times globs are sufficient and easier though: |
@jrcamp I am confused:
Then why do you need the first and third to be "different" I mean the are both glob matching. |
I don't know if there's an explicit standard that defines regex's to be enclosed in slashes, but its a very common convention (as Jay mentioned, this is the case for JavaScript, Perl, sed, etc) - see: https://en.wikipedia.org/wiki/Regular_expression#Delimiters
Glob is different from regex: https://en.wikipedia.org/wiki/Glob_(programming)#Compared_to_regular_expressions I quite like Jay's proposal for the ease of configuration, but a couple of questions/concerns:
include:
# matches "foo" OR "foobar"
names: [ "foo", "foobar" ]
include:
# matches "foo*" OR "!foobar" ==> matches "bar"
names: [ "foo*", "!foobar" ] In the code we could treat negation differently by taking the union of all regular entries, and then subtract the negation entries from that, but I feel that is more confusing than just keeping the include/exclude filters.
|
👍
Either regex or we allow * to be escaped (this means the escape has to be escaped as well). e.g.:
Ben made a good point, there's really no need for negation when you have the include/exclude concept. However, negation is an alternative to having include/exclude if we wanted. See our docs in case you haven't yet for more details and examples on the negation. However I'm ok keeping include/exclude as it makes backwards compat easier.
I think they would still fit under the |
Actually, using |
What I am hearing is that there are corner cases when it comes to unify the filter values in order to determine a strict match vs regexp match, so probably having them separated is not that bad (also for regexp we do have the caching option) should we keep them separated? Also when it comes to the standard that we use for regexp probably following the golang is the easiest for us. |
There are corner cases but do we choose to optimize for them or for the 99% case? I believe we should optimize for the 99% cases. |
I am 100% for user first, but having So because of that explicitly ask the user what to use is in my opinion more user friendly than having undefined behaviors or having our custom standard. |
Talked to Bogdan offline and the concern is having corner cases with values like My proposal building on top of @bogdandrutu's suggestion. I'm proposing separate regex config into its own metrics:
# include and/or exclude can be specified. However, the include properties
# are always checked before the exclude properties.
exclude:
# The metric name must match at least one of the items.
# This is an optional field.
meric_names:
regexp_config:
cache_enabled: true
regexp:
- host/cpu/.*
strict:
- host/cpu/usage
glob:
- host/* Backwards compatibility would be done by checking whether |
The UX of setting the match type explicitly here is nice. I agree with @bogdandrutu that a new standard, even if it is more intuitive, may not necessarily be more friendly to users. I do want to raise the concern here that allowing multiple filter types in the same filter will increase processing on the backend. We'd have to run all the incoming metrics/traces through each specified filter type. It'd be more performant for the user to write their |
I'm not I understand the performance concern. At startup time we "parse" the config and collect a set of rules that are to be evaluated whether they be strict, glob, or regex. I'm not sure what the extra overhead is when processing metrics? |
The new proposal looks good to me. A couple of minor thoughts:
i.e. consider include:
metric_names: vs metric_names:
include: |
Sorry, I should have clarified, I was mostly thinking about the case with caching matches enabled. For example, to support a mixed regexp & strict & glob filterprocessor, I could see some processing logic like:
The regexp filterset has an option to use a cache to prevent re-running through all the regexp filters right now, but then you have the additional overhead of running through strict and glob-type filters again (though they could also cache separately). In this case, it's probably best to have the caching option on the top-level of the filter config instead of just regexp_config, so you could do something like this:
|
What is the status of this? Do we still want to make changes to the filter config or are we okay with how it currently is? Is it possible to make these packages not internal so that they can be used in Contrib? |
Another Proposal: I looked at what a few other systems do. I didn't find anything super useful to be gleaned as most systems use filter configuration that is specific to the task at hand, and it seems relatively rare to offer strict, glob, and regex options. Anyway. here is an alternative to what Jay suggested where we keep the "type" property and change how include/exclude works to reduce the amount of nesting needed. Goal 1: Streamline the Config At the moment, the config can be a little verbose. Some simplifications we can make:
Proposal: include: <string or array>
exclude: <string or array>
type: <strict | glob | regexp> # strict by default
cache_enabled: <bool> # optional
cache_maxentries: <int> # only applies if cache_enabled = true Two limitations of this approach:
Simple case: Current: filter:
metrics:
include:
metric_names: "hello world"
match_type: strict With Proposal: filter:
metrics:
include: "hello world" Multiple filters: Current: hostmetrics:
scrapers:
filesystem:
include_devices:
devices: [ "^[A-Z]:$" ]
match_type: regexp
regexp_config:
cache_enabled: true
exclude_devices:
devices: [ "^D:$", "^E:$" ]
match_type: regexp
regexp_config:
cache_enabled: true
include_fs_types:
fs_types: "Ext2"
match_type: strict With Proposal: hostmetrics:
scrapers:
filesystem:
devices:
include: [ "^[A-Z]:$" ]
exclude: [ "^D:$", "^E:$" ]
match_type: regexp
cache_enabled: true
fs_types:
include: "Ext2" Goal 2: Simplify the Code to use the filter The current code requires quite a bit of additional scaffolding to use. The suggested new config should simplify this significantly:
Current: type Config struct {
MyFieldIncludeFilter MyFieldMatchConfig `mapstructure:",include_my_field"`
MyFieldExcludeFilter MyFieldMatchConfig `mapstructure:",exclude_my_field"`
}
type MyFieldMatchConfig struct {
filterset.Config `mapstructure:",squash"`
MyFields []string `mapstructure:"my_fields"`
}
...
# initialization code
includeFilter, err := filterset.CreateFilterSet(items, cfg.MyFieldIncludeFilter)
if err != nil {
return nil, err
}
excludeFilter, err := filterset.CreateFilterSet(items, cfg.MyFieldExcludeFilter)
if err != nil {
return nil, err
} With Proposal: type Config struct {
MyFieldFilter MatchConfig `mapstructure:",my_field"`
}
...
# initialization code
filter, err := filterset.CreateFilterSet(items, cfg.MyFieldFilter)
if err != nil {
return nil, err
} The Goal 3: Expose the filters for use externally Self explanatory. Blocks open-telemetry/opentelemetry-collector-contrib#1088 (comment) |
I think it's an improvement and I think the downsides are fine. Will we be able to keep backwards compatibility in a reasonable way or would it require breaking configs? |
It would require breaking configs 😦 But I don't think that's avoidable if we want to simplify the config |
@jrcamp @james-bebbington can Goal 3 of this proposal be done independently of the others in order to unblock issues open-telemetry/opentelemetry-collector-contrib#1088 and #1892? or do Goals 1 and 2 stand as prerequisites? edit: This was answered in #1892 |
…lemetry#1081) * Bump github.com/itchyny/gojq from 0.11.0 to 0.11.1 in /tools Bumps [github.com/itchyny/gojq](https://github.com/itchyny/gojq) from 0.11.0 to 0.11.1. - [Release notes](https://github.com/itchyny/gojq/releases) - [Changelog](https://github.com/itchyny/gojq/blob/master/CHANGELOG.md) - [Commits](itchyny/gojq@v0.11.0...v0.11.1) Signed-off-by: dependabot[bot] <[email protected]> * Auto-fix go.sum changes in dependent modules Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Tyler Yahn <[email protected]>
…etry#1081) Bumps [requests](https://github.com/psf/requests) from 2.27.0 to 2.27.1. - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](psf/requests@v2.27.0...v2.27.1) --- updated-dependencies: - dependency-name: requests dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
This was originally brought up in #560 (comment) but was out of scope for the metric filtering PR.
Background
Today the type of matching pattern to use is done by explicitly setting match_type. For example:
This has some downsides in that you must have a different include/exclude section for each match_type when the metric names you are matching against may logically fit into one section. Additionally the current model does not support globs which are in many cases sufficient when regex are overkill.
Proposal
The proposal is to specify
match_type
as part of the match pattern. It uses preexisting conventions that are likely familiar to most users. Additionally I'm proposing adding negating support to make cases easy where users want to match a wide pattern (as it would be too many to enumerate manually) but then negate matches later.Globs are used extensively across many platforms, databases, shells, etc. The regex syntax of
/cpu/
is in many languages like JavaScript, Perl, sed, etc. I'm not sure how extensive!
is used but it is used by gitignore for pattern matching.Backwards Compatability
To keep existing configs backwards-compatible I'm proposing making
match_type
optional (it is currently require). If a user specifies amatch_type
value the current behavior will be maintained. Ifmatch_type
is not set then the new syntax described here will be used.The text was updated successfully, but these errors were encountered: