Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in 2.28.6: huge increase in exceptions thrown but caught #5933

Open
1 task done
Hexcles opened this issue Mar 5, 2025 · 16 comments
Open
1 task done

Regression in 2.28.6: huge increase in exceptions thrown but caught #5933

Hexcles opened this issue Mar 5, 2025 · 16 comments
Labels
blocked Work is blocked on this issue for this codebase. Other labels or comments may indicate why. bug This issue is a bug. p2 This is a standard priority issue potential-regression Marking this issue as a potential regression to be checked by team member

Comments

@Hexcles
Copy link

Hexcles commented Mar 5, 2025

/### Describe the bug

After upgrading to 2.30.x (we tried both .0 and .25), we see a huge increase in the number of NumberFormatException thrown by the SDK from e.g. async DynamoDB request (and potentially others, too). Even though they are caught by the SDK, the sheer number of exceptions caused noticeable increased pressure on GC.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

No NumberFormatException, which is the case in 2.29.52 2.28.5.

Current Behavior

Lots of NumberFormatException, which is the case in 2.30.x 2.28.6.

Reproduction Steps

Send any async DynamoDB request and either set a breakpoint or use a profiler to observe the caught exceptions.

Possible Solution

No response

Additional Information/Context

A typical stack trace:

NumberFormatException.<init>(String)
NumberFormatException.forCharSequence(CharSequence, int, int, int)
Integer.parseInt(CharSequence, int, int, int)
URI$Parser.scanByte(int, int)
URI$Parser.scanIPv4Address(int, int, boolean)
URI$Parser.parseIPv4Address(int, int)
URI$Parser.parseServer(int, int, boolean)
URI$Parser.parseAuthority(int, int)
URI$Parser.parseHierarchical(int, int)
URI$Parser.parse(boolean)
URI.<init>(String)
URI.create(String)
SdkHttpRequest.getUri()
V4CanonicalRequest.getCanonicalUri(SdkHttpRequest, V4CanonicalRequest$Options)
V4CanonicalRequest.canonicalUri()
V4CanonicalRequest.getCanonicalRequestString()
DefaultV4RequestSigner.sign(SdkHttpRequest$Builder)
V4RequestSigner.lambda$header$0(V4Properties, SdkHttpRequest$Builder)
DefaultAwsV4HttpSigner.doSign(SignRequest, Checksummer, V4RequestSigner, V4PayloadSigner)
DefaultAwsV4HttpSigner.sign(SignRequest)
AsyncSigningStage.doSraSign(SdkHttpFullRequest, RequestExecutionContext, SelectedAuthScheme, Identity)
AsyncSigningStage.lambda$sraSignRequest$1(SdkHttpFullRequest, RequestExecutionContext, SelectedAuthScheme, Identity)
MetricUtils.reportDuration(Supplier, MetricCollector, SdkMetric)
AsyncSigningStage.lambda$sraSignRequest$3(SdkHttpFullRequest, RequestExecutionContext, SelectedAuthScheme, Identity)
CompletableFuture.uniComposeStage(Executor, Function)
CompletableFuture.thenCompose(Function)
AsyncSigningStage.sraSignRequest(SdkHttpFullRequest, RequestExecutionContext, SelectedAuthScheme)
AsyncSigningStage.execute(SdkHttpFullRequest, RequestExecutionContext)
AsyncSigningStage.execute(Object, RequestExecutionContext)
RequestPipelineBuilder$ComposingRequestPipelineStage.execute(Object, RequestExecutionContext)
AsyncApiCallAttemptMetricCollectionStage.execute(SdkHttpFullRequest, RequestExecutionContext)
AsyncApiCallAttemptMetricCollectionStage.execute(Object, RequestExecutionContext)
AsyncRetryableStage$RetryingExecutor.attemptExecute(CompletableFuture)
AsyncRetryableStage$RetryingExecutor.attemptFirstExecute(CompletableFuture)
AsyncRetryableStage$RetryingExecutor.execute()
AsyncRetryableStage.execute(SdkHttpFullRequest, RequestExecutionContext)
AsyncRetryableStage.execute(Object, RequestExecutionContext)
RequestPipelineBuilder$ComposingRequestPipelineStage.execute(Object, RequestExecutionContext)
AsyncExecutionFailureExceptionReportingStage.execute(SdkHttpFullRequest, RequestExecutionContext)
AsyncExecutionFailureExceptionReportingStage.execute(Object, RequestExecutionContext)
AsyncApiCallTimeoutTrackingStage.execute(SdkHttpFullRequest, RequestExecutionContext)
AsyncApiCallTimeoutTrackingStage.execute(Object, RequestExecutionContext)
AsyncApiCallMetricCollectionStage.execute(SdkHttpFullRequest, RequestExecutionContext)
AsyncApiCallMetricCollectionStage.execute(Object, RequestExecutionContext)
RequestPipelineBuilder$ComposingRequestPipelineStage.execute(Object, RequestExecutionContext)
AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(TransformingAsyncResponseHandler)
BaseAsyncClientHandler.invoke(SdkClientConfiguration, SdkHttpFullRequest, AsyncRequestBody, SdkRequest, ExecutionContext, TransformingAsyncResponseHandler)
BaseAsyncClientHandler.doExecute(ClientExecutionParams, ExecutionContext, TransformingAsyncResponseHandler)
BaseAsyncClientHandler.lambda$execute$1(ClientExecutionParams)
BaseAsyncClientHandler.measureApiCallSuccess(ClientExecutionParams, Supplier)
BaseAsyncClientHandler.execute(ClientExecutionParams)
AwsAsyncClientHandler.execute(ClientExecutionParams)
DefaultDynamoDbAsyncClient.lambda$query$203(HttpResponseHandler, HttpResponseHandler, SdkClientConfiguration, MetricCollector, QueryRequest, URI)
CompletableFuture.uniComposeStage(Executor, Function)
CompletableFuture.thenCompose(Function)
DefaultDynamoDbAsyncClient.query(QueryRequest)

AWS Java SDK version used

2.28.6 (also reproduced on 2.29.52, 2.30.0 and 2.30.25)

JDK version used

Temurin OpenJDK 21

Operating System and version

Ubuntu Jammy

@Hexcles Hexcles added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 5, 2025
@debora-ito
Copy link
Member

debora-ito commented Mar 6, 2025

@Hexcles can you share a SSCCE code example we can use to reproduce?

@bhoradc bhoradc added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. p1 This is a high priority issue potential-regression Marking this issue as a potential regression to be checked by team member and removed needs-triage This issue or PR still needs to be triaged. labels Mar 6, 2025
@Hexcles
Copy link
Author

Hexcles commented Mar 7, 2025

Unfortunately I don't have the bandwidth to do that this month. I used all my spare cycles on bisecting to narrow down the regression to a single release (2.29.52..2.30.0) and it's no longer urgent for us as we pin the version to 2.29. Hopefully with the stack trace you can find somewhere to start? Apologies and thank you.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 7, 2025
@debora-ito
Copy link
Member

debora-ito commented Mar 7, 2025

The hint we get from the stacktrace is that the exception is coming from parsing the request canonical URI:

NumberFormatException.<init>(String)
NumberFormatException.forCharSequence(CharSequence, int, int, int)
Integer.parseInt(CharSequence, int, int, int)
URI$Parser.scanByte(int, int)
URI$Parser.scanIPv4Address(int, int, boolean)
URI$Parser.parseIPv4Address(int, int)
URI$Parser.parseServer(int, int, boolean)
URI$Parser.parseAuthority(int, int)
URI$Parser.parseHierarchical(int, int)
URI$Parser.parse(boolean)
URI.<init>(String)
URI.create(String)
SdkHttpRequest.getUri()
V4CanonicalRequest.getCanonicalUri(SdkHttpRequest, V4CanonicalRequest$Options)
V4CanonicalRequest.canonicalUri()
V4CanonicalRequest.getCanonicalRequestString()
...

Are you overriding the request endpoint, or using the default configuration?
Can you share a snippet of how you're creating the DynamoDB client?

@debora-ito debora-ito added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 7, 2025
@Hexcles
Copy link
Author

Hexcles commented Mar 7, 2025

First of all, apologies that I apparently got the bisecting wrong -- we are down to (2.27.4 .. 2.29.52) instead. I'll see if I can get a better/smaller reproduction to help with my own bisecting, too.

And here's another stack trace with line numbers captured from a debugger in 2.29.52

NumberFormatException.<init>(String):L#54
NumberFormatException.forCharSequence(CharSequence, int, int, int):L#83
Integer.parseInt(CharSequence, int, int, int):L#743
URI$Parser.scanByte(int, int):L#3429
URI$Parser.scanIPv4Address(int, int, boolean):L#3459
URI$Parser.parseIPv4Address(int, int):L#3493
URI$Parser.parseServer(int, int, boolean):L#3392
URI$Parser.parseAuthority(int, int):L#3304
URI$Parser.parseHierarchical(int, int):L#3240
URI$Parser.parse(boolean):L#3196
URI.<init>(String):L#645
URI.create(String):L#930
SdkHttpRequest.getUri():L#157
RequestAdapter.addHeadersToRequest(DefaultHttpRequest, SdkHttpRequest):L#83
RequestAdapter.adapt(SdkHttpRequest):L#56
NettyRequestExecutor.makeRequest():L#236
NettyRequestExecutor.lambda$makeRequestListener$10():L#181
NettyUtils.doInEventLoop(EventExecutor, Runnable):L#248
NettyRequestExecutor.makeRequestListener(Future):L#177
DefaultPromise.notifyListener0(Future, GenericFutureListener):L#590
DefaultPromise.notifyListeners0(DefaultFutureListeners):L#583
DefaultPromise.notifyListenersNow():L#559
DefaultPromise.access$200(DefaultPromise):L#35
DefaultPromise$1.run():L#503
PromiseTask.runTask():L#98
PromiseTask.run():L#106
AbstractEventExecutor.runTask(Runnable):L#173
AbstractEventExecutor.safeExecute(Runnable):L#166
SingleThreadEventExecutor.runAllTasks(long):L#472
EpollEventLoop.run():L#405
SingleThreadEventExecutor$4.run():L#998
ThreadExecutorMap$2.run():L#74
Thread.runWith(Object, Runnable):L#1596
Thread.run():L#1583

Also, I got the error message: Error at index 9 in: "<redcated>" where redacted is actually our account ID. We don't override our dynamodb endpoint and based on our tracing it's dynamodb.us-east-1.amazonaws.com.

@Hexcles
Copy link
Author

Hexcles commented Mar 7, 2025

Also note that these exceptions are caught by SDK and the requests do succeed in the end

@Hexcles
Copy link
Author

Hexcles commented Mar 7, 2025

Oh I think it's https://github.com/aws/aws-sdk-java-v2/releases/tag/2.28.6

Generate account endpoint for DynamoDB requests when the account ID is available

And internally within the SDK we got <account-id>.ddb.us-east-1.amazonaws.com. Then Uri.create in JDK does this exception-as-control-flow thing as it first tries to parse the domain as an IP: https://github.com/frohoff/jdk8u-jdk/blob/da0da73ab82ed714dc5be94acd2f0d00fbdfe2e9/src/share/classes/java/net/URI.java#L3335

That's really unfortunate. Is there a way for us to turn off account-based endpoint for DynamoDB?

@Hexcles Hexcles changed the title Regression in 2.30.0: huge increase in exceptions thrown but caught Regression in 2.28.6: huge increase in exceptions thrown but caught Mar 7, 2025
@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 7, 2025
@bhoradc
Copy link

bhoradc commented Mar 10, 2025

Hi @Hexcles,

Thanks for sharing the insight. For turning off this feature, you may refer to below section from the announcement - #5628

You may instead disable account-based endpoints for DynamoDB by setting the config flag accountIdEndpointMode to disabled, as a system property or environment variable. We recommend against this because the DynamoDB service uses account-based endpoints to ensure performance and scalability.

Regards,
Chaitanya

@bhoradc bhoradc added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 10, 2025
@debora-ito
Copy link
Member

debora-ito commented Mar 11, 2025

@Hexcles the Java SDK 2.28.6 should work fine with DDB account-based endpoints, so I'm curious about what could be causing the NumberFormatException.

Are you using aws endpoints directly, or using a third-party solution endpoint?
Have you had the chance to work on a repro code?

@debora-ito debora-ito added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. and removed response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Mar 11, 2025
@Hexcles
Copy link
Author

Hexcles commented Mar 11, 2025

@debora-ito Let me try to clarify. Everything "works fine"; there's no uncaught exception or any impact on functionalities.

However, there is a non-zero impact on performance due to the sharply increased number of exceptions thrown (and caught) from within java.net.URI due to how the account-based endpoint is structured (it starts with a long sequence of digits) and how the standard JDK implements the parsing (exception as control flow).

To repro, you don't even need the SDK; just run the following code in a debugger and set a breakpoint on any exception thrown (including caught):

import java.net.URI;

public class Main {
    public static void main(String[] args) {
        var uri = URI.create("98765432101.ddb.us-east-1.amazonaws.com");
        System.out.println(uri);
    }
}

There's no exception thrown for e.g. dynamodb.us-east-1.amazonaws.com.

@debora-ito
Copy link
Member

debora-ito commented Mar 11, 2025

Oh I see now, thank you for the clarification.

Can you measure the performance impact somehow? Depending on the impact we may want to take a closer look.

Update: please disregard my strikethrough comment, we'll take a look. Thank you for raising this.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 11, 2025
@RanVaknin
Copy link
Contributor

RanVaknin commented Mar 25, 2025

Hi @Hexcles,

Thanks for reporting this issue. I have spent a good amount of time benchmarking (using JMH) the actual impact caused by the account ID endpoint and the results were interesting.

When testing URI.create() directly, I can indeed see a degradation in overall performance in terms of %, but the overall impact is still negligible when we are looking at the actual time. The benchmark results show that creating URIs with account IDs is significantly slower than standard URIs, with account ID URIs taking about 3-4 times longer on average (3.218 μs vs 0.830 μs) however translating these units to milliseconds you will have to make 1000 requests per/second to experience an addition of 1ms to the overall execution.

Benchmark                                 Mode      Cnt     Score   Error   Units
uriCreateAccountIdEndpoint           thrpt       10     0.311 ± 0.008  ops/us
uriCreateStandardEndpoint            thrpt       10     1.222 ± 0.004  ops/us
uriCreateAccountIdEndpoint            avgt       10     3.218 ± 0.115   us/op
uriCreateStandardEndpoint             avgt       10     0.830 ± 0.001   us/op
uriCreateAccountIdEndpoint          sample  1906430     3.315 ± 0.009   us/op
uriCreateAccountIdEndpoint:p0.00    sample              3.124           us/op
uriCreateAccountIdEndpoint:p0.50    sample              3.268           us/op
uriCreateAccountIdEndpoint:p0.90    sample              3.328           us/op
uriCreateAccountIdEndpoint:p0.95    sample              3.360           us/op
uriCreateAccountIdEndpoint:p0.99    sample              3.572           us/op
uriCreateAccountIdEndpoint:p0.999   sample             14.656           us/op
uriCreateAccountIdEndpoint:p0.9999  sample             21.579           us/op
uriCreateAccountIdEndpoint:p1.00    sample           1865.728           us/op
uriCreateStandardEndpoint           sample  3772995     0.882 ± 0.003   us/op
uriCreateStandardEndpoint:p0.00     sample              0.764           us/op
uriCreateStandardEndpoint:p0.50     sample              0.869           us/op
uriCreateStandardEndpoint:p0.90     sample              0.918           us/op
uriCreateStandardEndpoint:p0.95     sample              0.933           us/op
uriCreateStandardEndpoint:p0.99     sample              0.960           us/op
uriCreateStandardEndpoint:p0.999    sample              2.120           us/op
uriCreateStandardEndpoint:p0.9999   sample             16.160           us/op
uriCreateStandardEndpoint:p1.00     sample           2158.592           us/op
uriCreateAccountIdEndpoint              ss       10    27.228 ± 1.305   us/op
uriCreateStandardEndpoint               ss       10    17.525 ± 3.704   us/op

(Using nano seconds here please forgive the change in unit of measurment)
When testing the a stubbed SDK client (not measuring roundtripping time, and only focusing on building the request and resolving the endpoint) the results are negligible in terms of % ( p90 is 0.1% slower and p99 is 0.3% slower).

Benchmark                              Mode   Cnt         Score         Error   Units
Foo.withAccountEndpoints              thrpt   100        ≈ 10⁻⁸                ops/ns
Foo.withoutAccountEndpoints           thrpt   100        ≈ 10⁻⁸                ops/ns
Foo.withAccountEndpoints               avgt   100  55598535.201 ±  208678.153   ns/op
Foo.withoutAccountEndpoints            avgt   100  56713598.175 ±  537362.920   ns/op
Foo.withAccountEndpoints             sample  1770  58772499.670 ±  248637.587   ns/op
Foo.withAccountEndpoints:p0.00       sample        51707904.000                 ns/op
Foo.withAccountEndpoints:p0.50       sample        59179008.000                 ns/op
Foo.withAccountEndpoints:p0.90       sample        62193664.000                 ns/op
Foo.withAccountEndpoints:p0.95       sample        62783488.000                 ns/op
Foo.withAccountEndpoints:p0.99       sample        65077248.000                 ns/op
Foo.withAccountEndpoints:p0.999      sample        74617978.880                 ns/op
Foo.withAccountEndpoints:p0.9999     sample        75628544.000                 ns/op
Foo.withAccountEndpoints:p1.00       sample        75628544.000                 ns/op
Foo.withoutAccountEndpoints          sample  1776  58508075.820 ±  255582.120   ns/op
Foo.withoutAccountEndpoints:p0.00    sample        51052544.000                 ns/op
Foo.withoutAccountEndpoints:p0.50    sample        59146240.000                 ns/op
Foo.withoutAccountEndpoints:p0.90    sample        62128128.000                 ns/op
Foo.withoutAccountEndpoints:p0.95    sample        62717952.000                 ns/op
Foo.withoutAccountEndpoints:p0.99    sample        64880640.000                 ns/op
Foo.withoutAccountEndpoints:p0.999   sample        70123520.000                 ns/op
Foo.withoutAccountEndpoints:p0.9999  sample        70123520.000                 ns/op
Foo.withoutAccountEndpoints:p1.00    sample        70123520.000                 ns/op
Foo.withAccountEndpoints                 ss   100  60984126.670 ± 1522649.960   ns/op
Foo.withoutAccountEndpoints              ss   100  60812634.540 ± 1114374.923   ns/op

With regards to GC pressure, I used both JMH profiler and Java Flight Recorder to try and profile GC and I'm not seeing any difference in GC between the two scenarios.

Are you able to provide us with the profiling setup you have to measure the excessive GC? I'm having a hard time reproducing it on my end.

Thanks,
Ran~

@RanVaknin RanVaknin added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 25, 2025
@Hexcles
Copy link
Author

Hexcles commented Mar 25, 2025

@RanVaknin thanks for your detailed analysis! I agree with your results.

I did some digging, too, and couldn't come up with a standalone repro that can show meaningful performance difference, either. My current best guess is it has something to do with our APM setup. We have a relatively high sample rate for exceptions (including caught ones) in our java agent, which would capture the exception, attach the full stacktrace and error message to the current span (it's a big JSON), and create a profiling event. We usually have <10 exceptions per second, but this pushed our exception rate to >>100 per second (aside: this made many of our internal exception metrics useless/noisy, too -- we otherwise have a good exception hygiene and only use them for truly exceptional cases, which is also why we can get away with a high exception sample rate). And we can't easily exclude this exception as the skip list is by type, and NumberFormatException is a built-in exception that our business code can also throw.

I guess one thing that's not clear to me is the advantage of using this account-specific endpoint.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Mar 26, 2025
@pengxiaolong
Copy link

pengxiaolong commented Mar 26, 2025

The JDK API URI.create can be expensive for some specific cases, this one is caused by the NumberFormatException, I fixed another one caused by URISyntaxException recently. Even w/o the exception URI.create is expensive(We found this in a project helping internal team to optimize service performance) because by default DefaultV4RequestSigner parses URI using URI.create for every single request to normalize URI path, I think URI path normalization can be disabled for DDB client to avoid the expensive URI parsing, just like this PR did for S3.

@RanVaknin
Copy link
Contributor

RanVaknin commented Mar 31, 2025

Hi @Hexcles,

Thanks for the follow up. I definitely agree that the increase in exception caught and thrown is an issue. I have contacted the Corretto team on Amazon to see if they can help with driving an improvement in this JDK's implementation of URI.create().

I guess one thing that's not clear to me is the advantage of using this account-specific endpoint.

Unfortunately I was not able to find any public facing announcement from the Dynamodb team, and do not want to share incorrect information so I can't really speak to the why right now. I can only refer you to the official documentation but it doesn't provide the reasoning. Once there is an official public wording about it I will definitely share it here.

@pengxiaolong
Thanks for the additional context. I'll make sure to share it with the Corretto team. I see there's an existing PR. I hope the JDK team can also prioritize this work to help out.

Since we rely on the JDK's URI functionality, this is not actionable by the Java SDK team at the moment. Additionally, because this feature is opt-in, I'm going to downgrade the severity. I'll make sure to keep you posted with any new developments.

Thanks,
Ran~

@RanVaknin RanVaknin added blocked Work is blocked on this issue for this codebase. Other labels or comments may indicate why. p2 This is a standard priority issue and removed blocked Work is blocked on this issue for this codebase. Other labels or comments may indicate why. p1 This is a high priority issue labels Mar 31, 2025
@Hexcles
Copy link
Author

Hexcles commented Mar 31, 2025 via email

@RanVaknin
Copy link
Contributor

@Hexcles ,

You are right. I meant to say opt-out.

Thanks,
Ran~

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Work is blocked on this issue for this codebase. Other labels or comments may indicate why. bug This issue is a bug. p2 This is a standard priority issue potential-regression Marking this issue as a potential regression to be checked by team member
Projects
None yet
Development

No branches or pull requests

5 participants