Skip to content

GelfEncoder.convertToNumberIfNeeded Performance and exception as control-flow #108

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
deathy opened this issue Aug 22, 2024 · 7 comments
Closed

Comments

@deathy
Copy link

deathy commented Aug 22, 2024

Describe the bug

GelfEncoder.convertToNumberIfNeeded attempts conversion to a number using new BigDecimal constructor and control-flow is basically handled via exception.

As far as I can see this is attempted on all fields including standard fields like 'logger_name', 'thread_name', 'level_name' so basically guaranteeing at least 3 exceptions for any kind of log message even without custom/additional fields.

This can add up to a lot given that stack trace population for the exception can be quite expensive.

Below an async-profiler partial flame graph, environment:

  • logback FILE appender and GELF UDP appender configured for root appender
  • logging multiple INFO messages
  • flame graph only partial, rooted at one instance of Logger.info

convertToNumberIfNeeded just overwhelms everything compared to file appender or actual compression and UDP sending.

gelf_encoder_number_conversion_stacktrace

Thoughts:

  • should not call BigInteger for conversion if field value is definitely not a number
    • simple character iteration and fail on anything outside +-0123456789.eE (and char array BigInteger constructor could be used then if iterating)
  • maybe some obvious fields like logger_name could be excluded?
  • any history on numbersAsString configuration option and its default false setting?
    • I guess it may be hard to change the default now but maybe possible performance impact could be documented? Not it's not completely clear what it does (that it tries for all fields)

I can probably attempt a PR for the first numeric check sometime in the next couple of days.

Additional context

  • Java distribution and version in use (output of java -version)
    • OpenJDK Runtime Environment Temurin-21.0.2+13
  • SLF4J and Logback version in use
    • slf4j-api 2.0.13, logback 1.5.6
  • Graylog version in use
    • Graylog 6.0.3
@osiegmar
Copy link
Owner

Thanks for bringing this up!

About the history of this:

  • Number conversion was added in version 2.1.0 with log configured values as integer #30. To maintain backwards compatibility it was disabled by default.
  • To fix scientific notation (which is not supported by the GELF protocol) the change 18d3fa2 was made in version 3.0.0.
  • Also in version 3.0.0 the default for numbersAsString was changed from true to false as it seemed always desirable to have numbers as numbers in the search index to allow proper search capabilities.

As you pointed out, the performance penalty unfortunately is quite heavy. Maybe you want to analyze the performance impact of the following?

private Object convertToNumberIfNeeded(final Object value) {
    if (numbersAsString || !(value instanceof String)) {
        return value;
    }

    // Simple check if the string could be a number to avoid the performance overhead of exception handling
    final char[] ca = ((String) value).toCharArray();
    for (char c : ca) {
        if (!isBigDecimalChar(c)) {
            return value;
        }
    }

    try {
        return new BigDecimal(ca, 0, ca.length);
    } catch (final NumberFormatException e) {
        return value;
    }
}

private static boolean isBigDecimalChar(final char c) {
    return (c >= '0' && c <= '9')
           || c == '.'
           || c == '+' || c == '-'
           || c == 'E' || c == 'e';
}

Excluding fields like logger_name probably required a bit more refactoring. I'm not sure if it's worth the effort. Let's wait for the results of the performance analysis.

@deathy
Copy link
Author

deathy commented Aug 23, 2024

Very rough initial test:

Added a GelfEncoderTest test based on simple one:

@Test
void millionEncodeLoop() {
    encoder.start();
    final LoggerContext lc = (LoggerContext) LoggerFactory.getILoggerFactory();
    final Logger logger = lc.getLogger(LOGGER_NAME);

    long startTime = System.nanoTime();
    for (int i = 0; i < 1_000_000; i++) {
        assertThatJson(encodeToStr(simpleLoggingEvent(logger, null)))
                .node("full_message").isEqualTo("message 1\\n");
    }
    System.out.printf("Time for 1M encodes: %,d ns.\n", System.nanoTime() - startTime);
}

Added both version with/without numbersAsString=true, ran about 8 times on JDK 21, minimal variation between runs.

Before changes:

Time for 1M encodes: 20,478,114,200 ns.
Time for 1M encodes with numbersAsString=true: 8,702,857,900 ns.

Time for 1M encodes: 20,071,830,100 ns.
Time for 1M encodes with numbersAsString=true: 8,328,012,600 ns.

after changes exactly as you described:

Time for 1M encodes: 9,502,125,300 ns.
Time for 1M encodes with numbersAsString=true: 8,516,555,000 ns.

Time for 1M encodes: 9,480,593,100 ns.
Time for 1M encodes with numbersAsString=true: 8,577,866,600 ns.

So roughly from more than 2X slower to only 10-13% slower.

Note: Test time includes the JSON assert time on output but that should be constant between code versions. (And avoids JVM optimizing away any operations since output is fully consumed)

Then I thought about other factors: stack depth and number of fields.

Stack depth in test runner due to Gradle and JUnit was already 86 elements deep so I don't think there's a need to simulate a deeper stack.

Number of fields is significant since all are processed, so I made another test where I just event.addKeyValuePair(new KeyValuePair("field1","value1")); adding a bunch of key value fields.

A bit later I'll run and post results also on JDK11 with all combinations:

  • before/after
  • with/without numbersAsString
  • with/without additional field pairs

@osiegmar
Copy link
Owner

osiegmar commented Aug 23, 2024

It just came to my mind, that maybe it's sufficient (and more efficient) to only check if the first character is a digit.

private Object convertToNumberIfNeeded(final Object value) {
    if (numbersAsString || !(value instanceof String)
        || !isBigDecimalChar(((String) value).charAt(0))) {
        return value;
    }

    try {
        return new BigDecimal((String) value);
    } catch (final NumberFormatException e) {
        return value;
    }
}

private static boolean isBigDecimalChar(final char c) {
    return (c >= '0' && c <= '9')
           || c == '.'
           || c == '+' || c == '-';
}

I also did a quick test, and it seems to further improve the performance. But that, of course, highly depends on the values passed to the method...

@deathy
Copy link
Author

deathy commented Aug 23, 2024

Regarding just checking first character:

Maybe more efficient but also more unstable? As you said clearly depends on the data, see value1 vs 1stvalue

I can also imagine very common log values that will trick this and go to the very slow path of exception:

  • IP addresses 10.0.0.1
    • well this will be slow even in the first version, but I don't think reinventing a number parser would be wanted here
  • IP address and port pair 127.0.0.1:8080
    • will not trigger exception in first code version, would trigger it with only checking first char
  • IPv6 address such as 2001:db8::8a2e:370:7334
    • no exception in first case because of ":", but first character is a digit so very slow exception path with second code variant
  • certain kinds of UUIDs
    • these will be all over the place depending on first char
    • actually for any kind of well-distributed hash in hex you'd have 10/16 chance of first character being a digit but then going the very slow path through exception

Plus the first version of the code seems more clear since it's checking the full value.

@deathy
Copy link
Author

deathy commented Aug 23, 2024

Finished test runs and attached results below including .patch with tests I used for benchmarking

Sample measurements:

Before:

Time for 1M encodes: 19,000,709,000 ns.
Time for 1M encodes with numbersAsString=true: 8,218,149,800 ns.
Time for 1M encodes with 5 key-value pairs: 46,252,780,700 ns.
Time for 1M encodes with 5 key-value pairs and numbersAsString=true: 9,615,291,900 ns.
Time for 1M encodes with 10 key-value pairs: 72,548,223,100 ns.
Time for 1M encodes with 10 key-value pairs and numbersAsString=true: 12,112,090,100 ns.


Time for 1M encodes: 18,903,051,500 ns.
Time for 1M encodes with numbersAsString=true: 8,555,006,800 ns.
Time for 1M encodes with 5 key-value pairs: 46,891,360,500 ns.
Time for 1M encodes with 5 key-value pairs and numbersAsString=true: 9,971,852,000 ns.
Time for 1M encodes with 10 key-value pairs: 72,963,828,400 ns.
Time for 1M encodes with 10 key-value pairs and numbersAsString=true: 12,080,028,700 ns.

After (with your original code proposal):

Time for 1M encodes: 8,306,923,000 ns.
Time for 1M encodes with numbersAsString=true: 8,230,593,600 ns.
Time for 1M encodes with 5 key-value pairs: 9,995,377,600 ns.
Time for 1M encodes with 5 key-value pairs and numbersAsString=true: 9,614,381,200 ns.
Time for 1M encodes with 10 key-value pairs: 11,114,532,800 ns.
Time for 1M encodes with 10 key-value pairs and numbersAsString=true: 12,274,562,900 ns.


Time for 1M encodes: 8,281,705,100 ns.
Time for 1M encodes with numbersAsString=true: 8,373,099,100 ns.
Time for 1M encodes with 5 key-value pairs: 10,040,063,400 ns.
Time for 1M encodes with 5 key-value pairs and numbersAsString=true: 9,837,812,200 ns.
Time for 1M encodes with 10 key-value pairs: 11,176,946,700 ns.
Time for 1M encodes with 10 key-value pairs and numbersAsString=true: 12,162,474,000 ns.

Summary of approximate times in seconds (there's some variation so picked random samples, full times attached):

Test Before After
1M encodes 19 s 8.2 s
1M encodes with numbersAsString=true 8.2 s 8.2 s
1M encodes with 5 key-value pairs 46.3 s 9.9 s
1M encodes with 5 key-value pairs and numbersAsString=true 9.5 s 9.5 s
1M encodes with 10 key-value pairs 72.6 s 11.1 s
1M encodes with 10 key-value pairs and numbersAsString=true 12.1 s 12.1 s

All cases are faster.
You can also see impact of number of fields.

What's funny is '1M encodes with 10 key-value pairs' is faster without numbersAsString=true now. But it's a very minor variation and maybe it's just better pre-fetching since we're always reading the string characters fully....

GelfEncoderTest_additional.patch
GelfEncoderTest_measurements_after.txt
GelfEncoderTest_measurements_before.txt

@osiegmar
Copy link
Owner

Thanks for analyzing! I will release 6.0.2 shortly.

@deathy
Copy link
Author

deathy commented Aug 24, 2024

Thanks!

Last note, minor amusement, I profiled also my test and it turns out ~30% of time is spent in JsonUnit assert because it also does check for number via exception using Double.parseDouble ... (meanwhile convertToNumberIfNeeded is almost invisible, 0.16% )

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants