Improve filtering configuration with globs and make match_type optional #1081

jrcamp · 2020-06-04T15:36:35Z

This was originally brought up in #560 (comment) but was out of scope for the metric filtering PR.

Background

Today the type of matching pattern to use is done by explicitly setting match_type. For example:

        # metrics is the configuration for filtering metrics
        metrics:
            # include and/or exclude can be specified. However, the include properties
            # are always checked before the exclude properties.
            exclude:
                # match_type controls how items in "metric_names" arrays are
                # interpreted. Possible values are "regexp" or "strict".
                # This is a required field.
                match_type: regex

                # The metric name must match at least one of the items.
                # This is an optional field.
                meric_names: [metric1, metric2]

This has some downsides in that you must have a different include/exclude section for each match_type when the metric names you are matching against may logically fit into one section. Additionally the current model does not support globs which are in many cases sufficient when regex are overkill.

Proposal

The proposal is to specify match_type as part of the match pattern. It uses preexisting conventions that are likely familiar to most users. Additionally I'm proposing adding negating support to make cases easy where users want to match a wide pattern (as it would be too many to enumerate manually) but then negate matches later.

metrics:
    # include and/or exclude can be specified. However, the include properties
    # are always checked before the exclude properties.
    exclude:
        # The metric name must match at least one of the items.
        # This is an optional field.
        meric_names: [
            "host/cpu/*",       # glob matching
            "host/cpu/usage",   # exact matching
            "/cpu/",            # regex matching
            "!host/cpu/total"   # inverse match so that host/cpu/total isn't dropped
        ]

Globs are used extensively across many platforms, databases, shells, etc. The regex syntax of /cpu/ is in many languages like JavaScript, Perl, sed, etc. I'm not sure how extensive ! is used but it is used by gitignore for pattern matching.

Backwards Compatability

To keep existing configs backwards-compatible I'm proposing making match_type optional (it is currently require). If a user specifies a match_type value the current behavior will be maintained. If match_type is not set then the new syntax described here will be used.

The text was updated successfully, but these errors were encountered:

bogdandrutu · 2020-06-04T18:28:19Z

@jrcamp:

Not sure I understand how the "/cpu/" is different from "host/cpu/usage". Can you explain how can you know when to apply fix match versus regexp match in this case?
"!host/cpu/total" - this is just for convenience because you can still achieve this with a regexp, or am I missing something?

I was doing some refactoring around the filterset and found the current config hard to use as well. I do believe that it is an important optimization to know when a pattern is regexp or strict string comparison.

So my proposed alternative is:

metrics:
    # include and/or exclude can be specified. However, the include properties
    # are always checked before the exclude properties.
    exclude:
        # The metric name must match at least one of the items.
        # This is an optional field.
        meric_names:
            regexp:
                # other properties like cache
                filter_values: [
                    "host/cpu/*",       # glob matching
                ]
            strict:
                filter_values: [
                    "host/cpu/usage",   # exact matching
                ]

jrcamp · 2020-06-04T18:47:02Z

Not sure I understand how the "/cpu/" is different from "host/cpu/usage". Can you explain how can you know when to apply fix match versus regexp match in this case?

Only if it starts and ends with / would it be regex. host/cpu/usage doesn't start and end with / so it's just an exact match.

!host/cpu/total" - this is just for convenience because you can still achieve this with a regexp, or am I missing something?

How do you match host/cpu/.* with regex but not host/cpu/total. Is it possible without negative lookahead (which I don't think Go supports?)

You can see more examples here that we use in the SignalFx agent: https://docs.signalfx.com/en/latest/integrations/agent/filtering.html

I do believe that it is an important optimization to know when a pattern is regexp or strict string comparison.

You do know based on the syntax defined above of /cpu/ for regex and anything else is a glob. (Exact matching is a subset of glob matching) because host/cpu/usage evaluated like a glob is the same thing as doing exact matching, though maybe you can make it more efficient by bypassing glob evaluation.

keitwb · 2020-06-04T19:02:41Z

In the Smart Agent, we settled on the way Jay is proposing because it was easier to write and understand from the end user perspective, with less syntax and structure to remember. Performance-wise it can be written in such a way that it is no different.

I'm not sure how necessary the negation is if you are going to keep the exclude/include sections. Is there any functionality it provides that cannot already be done with the existing config or is it more of a convenience? It isn't clear what it means when it says "the include properties are always checked before the exclude properties". Does that mean that if something matches in includes then it doesn't even both checking excludes?

jrcamp · 2020-06-04T19:20:17Z

I'm not sure how necessary the negation is if you are going to keep the exclude/include sections.

Yeah that's true, if you always have include and exclude sections it's probably not useful. I was thinking I saw some kind of filtering in OT that was just a single list without separate sections where I thought it might be useful but I forget where.

Can always be added later if needed. I think we can drop it from the proposal for now.

bogdandrutu · 2020-06-04T20:23:54Z

Is any standard that defines /text/ to be a regexp vs strict check?

jrcamp · 2020-06-04T23:24:45Z

Is any standard that defines /text/ to be a regexp vs strict check?

You mean what if you have a filter that's /text/ that you want to match as an exact? Some possibilities:

You have to use the regex /^/text/$/ It's suboptimal but it's not a common case.
Could fall back to setting match_type but then we couldn't deprecate it.
Maybe could have some way to escape it but I couldn't think of anything very clean.

@keitwb best I can tell you'd just have to use the regex in SA, am I missing anything?

bogdandrutu · 2020-06-04T23:37:09Z

But for your initial example why did you use /cpu/ instead of .*cpu.*?

jrcamp · 2020-06-04T23:45:42Z

For demonstration purposes only :) Yes a better example would probably have been /.*cpu.*/

The vast majority of the times globs are sufficient and easier though: *cpu*

bogdandrutu · 2020-06-04T23:49:36Z

@jrcamp I am confused:

            "host/cpu/*",       # glob matching
            "host/cpu/usage",   # exact matching
            "/cpu/",            # regex matching

Then why do you need the first and third to be "different" I mean the are both glob matching.

james-bebbington · 2020-06-05T05:35:40Z

Is any standard that defines /text/ to be a regexp vs strict check?

I don't know if there's an explicit standard that defines regex's to be enclosed in slashes, but its a very common convention (as Jay mentioned, this is the case for JavaScript, Perl, sed, etc) - see: https://en.wikipedia.org/wiki/Regular_expression#Delimiters

why do you need the first and third to be "different" I mean the are both glob matching.

Glob is different from regex: https://en.wikipedia.org/wiki/Glob_(programming)#Compared_to_regular_expressions

I quite like Jay's proposal for the ease of configuration, but a couple of questions/concerns:

What if a user wanted to do a strict match on Foo*? I guess they could use /^Foo\*$/, but it makes a simple case quite complex.
It's not immediately obvious how negation matches would be derived, i.e. consider:

include:
  # matches "foo" OR "foobar"
  names: [ "foo", "foobar" ]

include:
  # matches "foo*" OR "!foobar" ==> matches "bar"
  names: [ "foo*", "!foobar" ]

In the code we could treat negation differently by taking the union of all regular entries, and then subtract the negation entries from that, but I feel that is more confusing than just keeping the include/exclude filters.

Where do the existing cacheenabled & cachemaxentries properties fit into this proposal if anywhere?

jrcamp · 2020-06-05T14:51:47Z

why do you need the first and third to be "different" I mean the are both glob matching.

Glob is different from regex: https://en.wikipedia.org/wiki/Glob_(programming)#Compared_to_regular_expressions

👍

What if a user wanted to do a strict match on Foo*? I guess they could use /^Foo\*$/, but it makes a simple case quite complex.

Either regex or we allow * to be escaped (this means the escape has to be escaped as well). e.g.:

Foo\* for matching text Foo*
Foo\\ for matching text Foo\

It's not immediately obvious how negation matches would be derived, i.e. consider:

Ben made a good point, there's really no need for negation when you have the include/exclude concept. However, negation is an alternative to having include/exclude if we wanted. See our docs in case you haven't yet for more details and examples on the negation. However I'm ok keeping include/exclude as it makes backwards compat easier.

Where do the existing cacheenabled & cachemaxentries properties fit into this proposal if anywhere?

I think they would still fit under the regex key as optional, same as they do today.

jrcamp · 2020-06-05T15:00:23Z

Either regex or we allow * to be escaped (this means the escape has to be escaped as well). e.g.:

Foo\* for matching text Foo*
Foo\\ for matching text Foo\

Actually, using \ for escaping may not be very nice due to yaml trying to interpret the backslash as a regular string escape.

bogdandrutu · 2020-06-05T15:53:55Z

What I am hearing is that there are corner cases when it comes to unify the filter values in order to determine a strict match vs regexp match, so probably having them separated is not that bad (also for regexp we do have the caching option) should we keep them separated? Also when it comes to the standard that we use for regexp probably following the golang is the easiest for us.

jrcamp · 2020-06-05T16:07:40Z

There are corner cases but do we choose to optimize for them or for the 99% case? I believe we should optimize for the 99% cases.

bogdandrutu · 2020-06-05T18:22:38Z

I am 100% for user first, but having cpu* being able to be interpreted as regexp or as static comparison is not user friendly, and us inventing a new standard for this is also not "user friendly".

So because of that explicitly ask the user what to use is in my opinion more user friendly than having undefined behaviors or having our custom standard.

jrcamp · 2020-06-05T20:10:28Z

Talked to Bogdan offline and the concern is having corner cases with values like * and ? may not be an issue for metric names but could be more of an issue when filtering other kinds of values since the filtering mechanism is generic and not just for metrics. I'm unable to think of an easy way to do escaping in those cases either given the constraints of YAML. (Please let me know if there are ways.)

My proposal building on top of @bogdandrutu's suggestion. I'm proposing separate regex config into its own regexp_config section because for strict and glob they do not need any settings (seems unlikely they would ever need any) so this simplifies things a bit not having to have a filter_values for the most common cases.

metrics:
    # include and/or exclude can be specified. However, the include properties
    # are always checked before the exclude properties.
    exclude:
    # The metric name must match at least one of the items.
    # This is an optional field.
    meric_names:
        regexp_config:
            cache_enabled: true
        
        regexp:
        - host/cpu/.*

        strict:
        - host/cpu/usage

        glob:
        - host/*

Backwards compatibility would be done by checking whether match_type is set or not. If not set it would use the new configuration.

anniefu · 2020-06-05T23:29:04Z

metrics:
    # include and/or exclude can be specified. However, the include properties
    # are always checked before the exclude properties.
    exclude:
    # The metric name must match at least one of the items.
    # This is an optional field.
    meric_names:
        regexp_config:
            cache_enabled: true
        
        regexp:
        - host/cpu/.*

        strict:
        - host/cpu/usage

        glob:
        - host/*

The UX of setting the match type explicitly here is nice. I agree with @bogdandrutu that a new standard, even if it is more intuitive, may not necessarily be more friendly to users.

I do want to raise the concern here that allowing multiple filter types in the same filter will increase processing on the backend. We'd have to run all the incoming metrics/traces through each specified filter type. It'd be more performant for the user to write their strict matches in regexp if they also need regexp power.

jrcamp · 2020-06-09T19:05:40Z

I do want to raise the concern here that allowing multiple filter types in the same filter will increase processing on the backend. We'd have to run all the incoming metrics/traces through each specified filter type. It'd be more performant for the user to write their strict matches in regexp if they also need regexp power.

I'm not I understand the performance concern. At startup time we "parse" the config and collect a set of rules that are to be evaluated whether they be strict, glob, or regex. I'm not sure what the extra overhead is when processing metrics?

james-bebbington · 2020-06-09T22:44:02Z

The new proposal looks good to me.

A couple of minor thoughts:

regexp_config could probably be inlined (squashed)
Does it make sense to have the include filter or field name at the top level?

i.e. consider

include:
  metric_names:

vs

metric_names:
  include:

anniefu · 2020-06-13T01:21:16Z

I'm not I understand the performance concern. At startup time we "parse" the config and collect a set of rules that are to be evaluated whether they be strict, glob, or regex. I'm not sure what the extra overhead is when processing metrics?

Sorry, I should have clarified, I was mostly thinking about the case with caching matches enabled.

For example, to support a mixed regexp & strict & glob filterprocessor, I could see some processing logic like:

if regexpFilters.Match(metric) ...
else if strict.Match(metric)...
else if glob.Match(metric)...

The regexp filterset has an option to use a cache to prevent re-running through all the regexp filters right now, but then you have the additional overhead of running through strict and glob-type filters again (though they could also cache separately).

In this case, it's probably best to have the caching option on the top-level of the filter config instead of just regexp_config, so you could do something like this:

if prevResult, ok := cache[metricName] {
return prevResult
}
if regexpFilters.Match(metricName) ...
else if strict.Match(metricName)...
else if glob.Match(metricName)...

james-bebbington · 2020-10-12T20:58:22Z

What is the status of this? Do we still want to make changes to the filter config or are we okay with how it currently is?

Is it possible to make these packages not internal so that they can be used in Contrib?

@jrcamp @bogdandrutu

james-bebbington · 2020-10-13T02:49:40Z

Another Proposal:

I looked at what a few other systems do. I didn't find anything super useful to be gleaned as most systems use filter configuration that is specific to the task at hand, and it seems relatively rare to offer strict, glob, and regex options.

Anyway. here is an alternative to what Jay suggested where we keep the "type" property and change how include/exclude works to reduce the amount of nesting needed.

Goal 1: Streamline the Config

At the moment, the config can be a little verbose. Some simplifications we can make:

Remove a level of nesting by making include/exclude leaf level instead of root level
Remove a level of nesting by squashing regex options (which will be relevant for glob as well)
Make "strict" the default so "type" can be omitted for simple cases

Proposal:

include: <string or array>
exclude: <string or array>
type: <strict | glob | regexp> # strict by default
cache_enabled: <bool> # optional
cache_maxentries: <int> # only applies if cache_enabled = true

Two limitations of this approach:

You can't have a different "type" for include & exclude filters, but that doesn't limit the functionality as anything that could be expressed via strict or glob can be expressed via regexp if needed. This optimizes for the 99% case.
You can't give a custom name to the filter field. I don't think this matters as if a user wants to filter on two different fields, they can just introduce a level of nesting (see the second example below)

Simple case:

Current:

filter:
  metrics:
    include:
      metric_names: "hello world"
      match_type: strict

With Proposal:

filter:
  metrics:
    include: "hello world"

Multiple filters:

Current:

hostmetrics:
  scrapers:
    filesystem:
      include_devices:
        devices: [ "^[A-Z]:$" ]
        match_type: regexp
        regexp_config:
          cache_enabled: true
      exclude_devices:
        devices: [ "^D:$", "^E:$" ]
        match_type: regexp
        regexp_config:
          cache_enabled: true
      include_fs_types:
        fs_types: "Ext2"
        match_type: strict

With Proposal:

hostmetrics:
  scrapers:
    filesystem:
      devices:
        include: [ "^[A-Z]:$" ]
        exclude: [ "^D:$", "^E:$" ]
        match_type: regexp
        cache_enabled: true
      fs_types:
        include: "Ext2"

Goal 2: Simplify the Code to use the filter

The current code requires quite a bit of additional scaffolding to use. The suggested new config should simplify this significantly:

Embeds include & exclude within one filter object, reducing the number of filtersets that need to be used by half
Removes the need to create a Filter config struct to set the name of the field being filtered on

Current:

type Config struct {
    MyFieldIncludeFilter MyFieldMatchConfig `mapstructure:",include_my_field"`
    MyFieldExcludeFilter MyFieldMatchConfig `mapstructure:",exclude_my_field"`
}

type MyFieldMatchConfig struct {
    filterset.Config `mapstructure:",squash"`
    MyFields []string `mapstructure:"my_fields"`
}

...

# initialization code
includeFilter, err := filterset.CreateFilterSet(items, cfg.MyFieldIncludeFilter)
if err != nil {
    return nil, err
}

excludeFilter, err := filterset.CreateFilterSet(items, cfg.MyFieldExcludeFilter)
if err != nil {
    return nil, err
}

With Proposal:

type Config struct {
    MyFieldFilter MatchConfig `mapstructure:",my_field"`
}

...

# initialization code
filter, err := filterset.CreateFilterSet(items, cfg.MyFieldFilter)
if err != nil {
    return nil, err
}

The filtermetric & filterspan packages will become almost redundant, save for one helper function.

Goal 3: Expose the filters for use externally

Self explanatory. Blocks open-telemetry/opentelemetry-collector-contrib#1088 (comment)

jrcamp · 2020-10-13T18:01:45Z

I think it's an improvement and I think the downsides are fine. Will we be able to keep backwards compatibility in a reasonable way or would it require breaking configs?

james-bebbington · 2020-10-13T21:40:24Z

It would require breaking configs 😦

But I don't think that's avoidable if we want to simplify the config

amanbrar1999 · 2020-10-20T22:21:06Z

@jrcamp @james-bebbington can Goal 3 of this proposal be done independently of the others in order to unblock issues open-telemetry/opentelemetry-collector-contrib#1088 and #1892? or do Goals 1 and 2 stand as prerequisites?

edit: This was answered in #1892

…lemetry#1081) * Bump github.com/itchyny/gojq from 0.11.0 to 0.11.1 in /tools Bumps [github.com/itchyny/gojq](https://github.com/itchyny/gojq) from 0.11.0 to 0.11.1. - [Release notes](https://github.com/itchyny/gojq/releases) - [Changelog](https://github.com/itchyny/gojq/blob/master/CHANGELOG.md) - [Commits](itchyny/gojq@v0.11.0...v0.11.1) Signed-off-by: dependabot[bot] <[email protected]> * Auto-fix go.sum changes in dependent modules Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: dependabot[bot] <dependabot[bot]@users.noreply.github.com> Co-authored-by: Tyler Yahn <[email protected]>

…etry#1081) Bumps [requests](https://github.com/psf/requests) from 2.27.0 to 2.27.1. - [Release notes](https://github.com/psf/requests/releases) - [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md) - [Commits](psf/requests@v2.27.0...v2.27.1) --- updated-dependencies: - dependency-name: requests dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

jrcamp added the feature request label Jun 4, 2020

jrcamp mentioned this issue Jun 4, 2020

Filter Processor Proposal #560

Closed

jrcamp mentioned this issue Jun 4, 2020

Host Metrics process scraper #1047

Merged

james-bebbington mentioned this issue Jun 10, 2020

aggregation processor rename functionality JingboWangGoogle/opentelemetry-collector#1

Closed

jrcamp mentioned this issue Jun 12, 2020

Support regex matching against values in attributesprocessor #979

Closed

jrcamp added this to the Backlog milestone Jul 30, 2020

keitwb mentioned this issue Sep 24, 2020

Add exprfilterprocessor skeleton #1844

Closed

james-bebbington mentioned this issue Oct 20, 2020

Proposal: Metrics Label Processing Solutions #1892

Closed

james-bebbington mentioned this issue Nov 2, 2020

Add metric filter option to metrics transform processor open-telemetry/opentelemetry-collector-contrib#1447

Merged

andrewhsu added enhancement New feature or request and removed feature request labels Jan 6, 2021

github-actions bot added the Stale label Jan 11, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 13, 2023

Troels51 pushed a commit to Troels51/opentelemetry-collector that referenced this issue Jul 5, 2024

Fix open-telemetry#1081 (open-telemetry#1082)

caade51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve filtering configuration with globs and make match_type optional #1081

Improve filtering configuration with globs and make match_type optional #1081

jrcamp commented Jun 4, 2020

bogdandrutu commented Jun 4, 2020

jrcamp commented Jun 4, 2020

keitwb commented Jun 4, 2020 •

edited

Loading

jrcamp commented Jun 4, 2020

bogdandrutu commented Jun 4, 2020

jrcamp commented Jun 4, 2020

bogdandrutu commented Jun 4, 2020

jrcamp commented Jun 4, 2020

bogdandrutu commented Jun 4, 2020

james-bebbington commented Jun 5, 2020 •

edited

Loading

jrcamp commented Jun 5, 2020

jrcamp commented Jun 5, 2020

bogdandrutu commented Jun 5, 2020

jrcamp commented Jun 5, 2020

bogdandrutu commented Jun 5, 2020

jrcamp commented Jun 5, 2020

anniefu commented Jun 5, 2020 •

edited

Loading

jrcamp commented Jun 9, 2020

james-bebbington commented Jun 9, 2020 •

edited

Loading

anniefu commented Jun 13, 2020

james-bebbington commented Oct 12, 2020

james-bebbington commented Oct 13, 2020 •

edited

Loading

jrcamp commented Oct 13, 2020

james-bebbington commented Oct 13, 2020

amanbrar1999 commented Oct 20, 2020 •

edited

Loading

Improve filtering configuration with globs and make match_type optional #1081

Improve filtering configuration with globs and make match_type optional #1081

Comments

jrcamp commented Jun 4, 2020

Background

Proposal

Backwards Compatability

bogdandrutu commented Jun 4, 2020

jrcamp commented Jun 4, 2020

keitwb commented Jun 4, 2020 • edited Loading

jrcamp commented Jun 4, 2020

bogdandrutu commented Jun 4, 2020

jrcamp commented Jun 4, 2020

bogdandrutu commented Jun 4, 2020

jrcamp commented Jun 4, 2020

bogdandrutu commented Jun 4, 2020

james-bebbington commented Jun 5, 2020 • edited Loading

jrcamp commented Jun 5, 2020

jrcamp commented Jun 5, 2020

bogdandrutu commented Jun 5, 2020

jrcamp commented Jun 5, 2020

bogdandrutu commented Jun 5, 2020

jrcamp commented Jun 5, 2020

anniefu commented Jun 5, 2020 • edited Loading

jrcamp commented Jun 9, 2020

james-bebbington commented Jun 9, 2020 • edited Loading

anniefu commented Jun 13, 2020

james-bebbington commented Oct 12, 2020

james-bebbington commented Oct 13, 2020 • edited Loading

jrcamp commented Oct 13, 2020

james-bebbington commented Oct 13, 2020

amanbrar1999 commented Oct 20, 2020 • edited Loading

keitwb commented Jun 4, 2020 •

edited

Loading

james-bebbington commented Jun 5, 2020 •

edited

Loading

anniefu commented Jun 5, 2020 •

edited

Loading

james-bebbington commented Jun 9, 2020 •

edited

Loading

james-bebbington commented Oct 13, 2020 •

edited

Loading

amanbrar1999 commented Oct 20, 2020 •

edited

Loading