
Do not start redis dev services in augmentation (plus supporting framework) #47610


Open
wants to merge 10 commits into main

Conversation

holly-cummins
Contributor

@holly-cummins holly-cummins commented Apr 29, 2025

Resolves #47602

See also #45785. This is the first step in an incremental rollout of a new model, to resolve a problem created by #34681. Doing the change incrementally keeps PRs small and allows feedback on the approach. It also has the advantage of validating that existing ecosystem extensions will continue to work. “Work” needs qualification, since they only work if ports are distinct, which brings us back to #45785. In the existing dev services, the assumption that the order in which processors run correlates to the order in which dev services are used (for example, via ‘first one wins’ static variables) is problematic.

The design

There are two elements to the design:

  • Do not start dev services in the build step; instead, start them at application start
  • Use a runtime config source to lazily share information such as ephemeral ports (a sketch of that idea follows below)
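As a rough sketch of the second element (the class name and registration method below are hypothetical, not the PR's actual API), the idea is a MicroProfile ConfigSource that resolves values through suppliers: a dev service can declare a property before its container exists, and the value is only computed when something reads it.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    import org.eclipse.microprofile.config.spi.ConfigSource;

    // Rough sketch only: dev services register suppliers for values they will only
    // know once their container is running (e.g. the mapped port), and the source
    // resolves them lazily when the application actually reads the property.
    public class LazyDevServicesConfigSource implements ConfigSource {

        private static final Map<String, Supplier<String>> SUPPLIERS = new ConcurrentHashMap<>();

        // A dev service declares the config it will be able to provide, lazily.
        public static void register(String propertyName, Supplier<String> valueSupplier) {
            SUPPLIERS.put(propertyName, valueSupplier);
        }

        @Override
        public Map<String, String> getProperties() {
            // Only forces resolution if something enumerates the whole source.
            Map<String, String> resolved = new HashMap<>();
            SUPPLIERS.forEach((name, supplier) -> resolved.put(name, supplier.get()));
            return resolved;
        }

        @Override
        public Set<String> getPropertyNames() {
            return SUPPLIERS.keySet();
        }

        @Override
        public String getValue(String propertyName) {
            Supplier<String> supplier = SUPPLIERS.get(propertyName);
            return supplier == null ? null : supplier.get();
        }

        @Override
        public String getName() {
            return "lazy-dev-services";
        }
    }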

Most dev service processors check whether a service is already running, but they do so using static variables, so the check only finds services within the same profile. This isn’t enough post-#34681, because another profile might have started dev services on the same port. Those dev services would ideally be re-used, and if they’re not re-used, they need to be closed. This means we need to track dev services between augmentation classloaders. To do this, I’ve created a tracker build item which tracks across classloaders. I would have liked to inject unique instances into processors, but doing that would need a KeyedBuildItem or FeatureBuildItem or some other basic type that gets created with state about what it is being injected into.

Re-use semantics

The logic for starting dev services is a bit more involved than it used to be, but it’s centralised. It's also more capable: it is better able to shut down dev services from other profiles without relying on tracking owners, and better able (though ultimately this is controlled by the dev service processor) to reuse dev services.

When start gets called on a dev services build item, it first checks to see if there’s a running dev service with the right properties. (This could be from a previous continuous testing restart, or from another profile.) If there’s a match, no further action is needed. If there’s not a match, we need to stop other dev services of the same kind. The tracker helps us find them, because we can’t assume that dev services are started in the same order that augmentation was run. Once all the other containers are stopped, the new container is started and submitted for tracking.
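In rough pseudocode (the tracker and service types below are simplified stand-ins for illustration, not the classes introduced by the PR), the start logic looks something like this:

    import java.util.Map;
    import java.util.function.Supplier;

    // Rough sketch of the centralised start logic; simplified stand-in types.
    class RunnableDevServiceSketch {

        interface RunningService extends AutoCloseable {
        }

        interface Tracker {
            RunningService findMatching(String feature, Map<String, String> identifyingConfig);
            Iterable<RunningService> findAll(String feature);
            void add(String feature, Map<String, String> identifyingConfig, RunningService service);
            void remove(RunningService service);
        }

        private final String feature;
        private final Map<String, String> identifyingConfig;
        private final Tracker tracker;
        private final Supplier<RunningService> containerStarter;
        private RunningService delegate;

        RunnableDevServiceSketch(String feature, Map<String, String> identifyingConfig,
                Tracker tracker, Supplier<RunningService> containerStarter) {
            this.feature = feature;
            this.identifyingConfig = identifyingConfig;
            this.tracker = tracker;
            this.containerStarter = containerStarter;
        }

        void start() throws Exception {
            // 1. Re-use a compatible service, wherever it came from (another profile,
            //    or a previous continuous-testing restart).
            RunningService match = tracker.findMatching(feature, identifyingConfig);
            if (match != null) {
                delegate = match;
                return;
            }
            // 2. No match: stop any other services of the same kind so their ports are freed,
            //    without assuming they were started in the order augmentation ran.
            for (RunningService stale : tracker.findAll(feature)) {
                stale.close();
                tracker.remove(stale);
            }
            // 3. Start the new container and register it so later profiles can find it.
            delegate = containerStarter.get();
            tracker.add(feature, identifyingConfig, delegate);
        }
    }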

Diagram: https://excalidraw.com/#json=VvSzzPO1nVD_XdPElaHsF,EHrqqwDzb5H5g5lXFYUk6A

In the old model, the Redis processor would store state in a static variable and attempt to re-use compatible RunningDevServices, with a boolean restartRequired = !currentDevServicesConfiguration.equals(capturedDevServicesConfiguration); check. In the new model it does a similar check, but also checks across profile boundaries. This check is now non-optional because any extant containers/RDSes that don't get re-used need to get stopped. The ContainerLocator logic isn't changed.

A nice benefit of this approach is that the code in the processors is shorter, which gets us a bit closer to the dream of low-boilerplate dev services. The one icky boilerplate that’s new is the key-value pairs for the uniqueness/re-use criteria for the dev services. This is maybe something to think about for a follow-up PR. I think this approach might also avoid a few restarts we used to see between profiles, which could speed up tests.

Could be better (areas for discussion)

  • The processors identify uniqueness/re-use criteria, based on config. For Redis, the port number is automatically considered, and I added code to also look at the image name. I didn’t look at things like timeout, but maybe all config should be included in the key (see the sketch after this list).
  • I made RunnableDevService extend RunningDevService. That seems most logical in terms of the methods (runnable dev services have more classes and methods), but it’s least logical in terms of the names. A third common superclass would solve the naming problem, but make for a lot of classes.
  • @ozangunalp points out that starting dev services sequentially is a performance disaster (although he used nicer words). This is true, but I'd like to fix it in a follow-on item, since at the moment only one dev service uses the new model, and parallelising a queue of one is a no-op.
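For the uniqueness/re-use criteria mentioned in the first bullet, the shape is roughly a small map of identifying properties; the key names below are made up for illustration, not the PR's exact ones:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.OptionalInt;

    // Indicative only: the kind of key-value pairs a processor might contribute
    // as its re-use criteria.
    class ReuseCriteriaSketch {

        static Map<String, String> identifyingConfig(String imageName, OptionalInt fixedPort) {
            Map<String, String> criteria = new HashMap<>();
            criteria.put("image-name", imageName);
            // A hardcoded port is part of the identity; a randomly assigned one is not.
            fixedPort.ifPresent(port -> criteria.put("port", Integer.toString(port)));
            return criteria;
        }
    }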

Other approaches considered

I was originally hoping to do a less invasive approach where I just filtered out dev service build items during augmentation, and then ran a custom build with a new build chain including those items. This sort of worked, but getting the right config to the right place wouldn’t have been straightforward, and the second augmentation would have created classloader problems. The current approach means every dev service needs updating, but it does improve the processor code for the adjusted dev services, I think.

@quarkus-bot quarkus-bot bot added area/core area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/redis area/testing labels Apr 29, 2025
@holly-cummins holly-cummins changed the title Framework for not starting dev services in augmentation, plus application to redis Do not start redis dev services in augmentation (plus supporting framework) Apr 29, 2025


Member

@cescoffier cescoffier left a comment


We need to discuss this. We cannot pre-allocate ports like this. We had the same issue with the configuration—this is not reliable.

We need to find a way to allocate the ports randomly when the containers are started. We may need to use a lazy config placeholder or something. What if we start dev services at startup time but introduce a config source before starting the application with the values defined by the dev service, including the mapped port?

@ozangunalp
Contributor

Isn't the proper fix for #45786 not to share dev services between different test augmentation runs?

In that case, both application instances would use their own instance of dev service, without the port clashing issue.

@ozangunalp
Contributor

For the port reserving, I was tempted by this for other stuff. It seemed interesting to have something like $quarkus.uuid but for an available port, but I backed away because of the issue mentioned above.

I'd prefer if we kept the container runtime port assignment.

@ozangunalp
Contributor

I think a mechanism like the dev services tracker build item is the way to go for reducing the boilerplate on that part.

I was waiting for the unified dev service initiative to discuss this.

But I'm not sure if we can do this on the resulting dev service output config.

Dev service input configs are (usually) flat objects that hold the relevant config and are assigned to a static volatile in the dev service processor class to keep them between augmentation runs.

Maybe we can create and hold that state higher up instead of in a build item class, which needs to be loaded from a common classloader.

@cescoffier
Member

I'm going to kick off the dev service working group today, and associate this with it.

Also, @dmlloyd, @radcortez and I discussed random port allocation as part of some config-related work.

@holly-cummins
Contributor Author

Isn't the proper fix for #45786 not to share dev services between different test augmentation runs?

In that case, both application instances would use their own instance of dev service, without the port clashing issue.

Yes, and no, and no. It depends what you mean by "share dev services". "Dev services" is overloaded to mean the external container, the RunningDevService object, and the DevServicesResultBuildItem. Sharing instances of the Java objects is orthogonal to what's being solved here, and in any case sharing those doesn't cause problems. The problem happens if a user project uses more than one profile, but hardcodes ports and does not use more than one port. In the multi-profile case, there will always be more than one dev service (the Java object). If there's more than one dev service, and they all start at the beginning of the test run (the current situation), multiple Java objects will attempt to start multiple containers on the same port, and things fail.

There's a user workaround to 'not share dev services' by ensuring every profile includes configuration which hardcodes a distinct port, but that's not something we want people to have to do.
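For illustration, assuming the standard Redis dev service property name (quarkus.redis.devservices.port), the clashing configuration and the workaround look roughly like this:

    # application.properties — illustrative only.
    # Both profiles inherit the same fixed port, so when all augmentations run up
    # front, two containers try to bind the same port and the second one fails.
    quarkus.redis.devservices.port=6399

    # The workaround described above: give each profile its own hardcoded port.
    %profile-a.quarkus.redis.devservices.port=6399
    %profile-b.quarkus.redis.devservices.port=6400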

@holly-cummins
Contributor Author

I think a mechanism like the dev services tracker build item is the way to go for reducing the boilerplate on that part.

I was waiting for the unified dev service initiative to discuss this.

I've had my first message on zulip saying "I think I found a problem, dev services are behaving weirdly in 3.22," so I think a fix for this is rather urgent (hence me not waiting for the WG).

But I'm not sure if we can do this on the resulting dev service output config.

Dev service input configs are (usually) flat objects that hold the relevant config and are assigned to a static volatile in the dev service processor class to keep them between augmentation runs.

A static volatile isn't sufficient, because it won't keep them between augmentation classloaders.

Maybe we can create and hold that state higher up instead of in a build item class, which needs to be loaded from a common classloader.

That's more or less the approach I took with the tracker. It holds state on the app classloader.
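Roughly, the pattern is the one visible in the tracker code further down this thread: resolve the tracker class through a classloader that every augmentation shares, and keep the singleton there. A simplified sketch follows (the real tracker does more, it assumes the class is visible to that common classloader, and later in the thread this moved to loading the class parent-first via the QuarkusClassLoader, which removes most of the reflection):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified sketch: the singleton lives in the copy of this class loaded by the
    // system classloader, and copies loaded by augmentation classloaders reach it
    // reflectively, exchanging only JDK types (Strings, Maps) so nothing leaks across
    // classloader boundaries. Names are illustrative.
    public class DevServicesTrackerSketch {

        private static final DevServicesTrackerSketch INSTANCE = new DevServicesTrackerSketch();

        private final Map<String, Map<String, String>> runningServices = new ConcurrentHashMap<>();

        // Static entry point, invoked reflectively on the system-classloader copy.
        public static DevServicesTrackerSketch getInstance() {
            return INSTANCE;
        }

        public void add(String feature, Map<String, String> identifyingConfig) {
            runningServices.put(feature, identifyingConfig);
        }

        public Map<String, String> find(String feature) {
            return runningServices.get(feature);
        }

        // Called from an augmentation classloader: resolve the shared copy and register there.
        public static void register(String feature, Map<String, String> identifyingConfig) throws Exception {
            Class<?> shared = ClassLoader.getSystemClassLoader()
                    .loadClass(DevServicesTrackerSketch.class.getName());
            Object tracker = shared.getMethod("getInstance").invoke(null);
            shared.getMethod("add", String.class, Map.class).invoke(tracker, feature, identifyingConfig);
        }
    }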

@ozangunalp
Contributor

The problem happens if a user project uses more than one profile, but hardcodes ports and does not use more than one port

I was referring to the external container, but I didn't think of the hardcoded port case. So this wasn't an issue before because different augmentations were on the same classloader?

@ozangunalp
Contributor

I've had my first message on zulip saying "I think I found a problem, dev services are behaving weirdly in 3.22," so I think a fix for this is rather urgent (hence me not waiting for the WG).

That is perfectly fine, I wasn't questioning the timing of this PR.

@holly-cummins
Contributor Author

The problem happens if a user project uses more than one profile, but hardcodes ports and does not use more than one port

I was referring to the external container, but I didn't think of the hardcoded port case. So this wasn't an issue before because different augmentations were on the same classloader?

The change is the timing of the augmentations. Before, we'd do the augmentation, run the test, then if we switched profile, we'd run the augmentation, run the next test ... Now we do all the augmentations up front. That has a few disadvantages (like the dev service issue), but it means we can get access to the right classloader for the test at the time we need it.

In the old model, as long as a dev service closed itself on application shutdown, the port would be free for the next application, but now it needs to not occupy the port until it needs it.

@quarkus-bot quarkus-bot bot added area/dependencies Pull requests that update a dependency file area/vertx labels Apr 30, 2025
@holly-cummins
Contributor Author

I can report that @cescoffier's suggestion of using a config source works perfectly, and was actually pretty easy. I've removed all dodgy pre-creation of sockets, and the random port test is still passing nicely. I've pushed changes in case people want to look at them, but the build will be red, because I added a devservices runtime module, and that breaks everything. (See discussion in zulip.)
The PR is messy, and I also want to look through and see if I can align the approach with what DevServicesLauncherConfigResultBuildItem is already doing. But it's a promising start, and a much more elegant approach.



github-actions bot commented Apr 30, 2025

🎊 PR Preview f346fe6 has been successfully built and deployed to https://quarkus-pr-main-47610-preview.surge.sh/version/main/guides/


@holly-cummins
Contributor Author

I've abandoned the idea of a devservices/runtime module and gone for the Mule Module Model instead (I persuade another runtime module to carry around my classes for me). I've also tidied up the code to the point where the code itself is ok. There's probably still scope to refactor and align other parts of the codebase, like DevServicesLauncherConfigResultBuildItem, but the new config source is doing its job.

I've also added a diagram to the main description.


@holly-cummins holly-cummins marked this pull request as draft April 30, 2025 21:18
@holly-cummins holly-cummins marked this pull request as ready for review April 30, 2025 21:29


@@ -276,10 +286,19 @@ public void overrideConfig(Map<String, String> config) {
        RuntimeOverrideConfigSource.setConfig(runtimeClassLoader, config);
    }

    private static void startDevServices(List<DevServicesResultBuildItem> devServicesResultBuildItems) {
        for (DevServicesResultBuildItem devServicesResultBuildItem : devServicesResultBuildItems) {
            devServicesResultBuildItem.start();
Contributor


This will start services one after the other and take too much time when multiple dev services are involved.
We need to start them in parallel and wait until they are ready.

When services were started at augmentation time, we didn't have this problem, because the build steps are called using the build executor service.

Contributor Author


Good point! As an opposite problem, @cescoffier pointed out that some dev services might have a desired ordering, which would be honoured when they're started as part of a build chain, and not honoured when I just iterate through a collection. But we agreed that (a) people probably shouldn't expect that level of orchestration and (b) if we needed to, we could add a prioritisation mechanism later. Just replicating a priority from the Processor order might be enough.

So I'll switch to doing some kind of parallelisation, and if we need an ordered parallelisation later we can cross that bridge when we come to it.
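A possible shape for that follow-up, as a hedged sketch (the start() call is the one in the diff above; the executor choice, sizing, and error handling are illustrative and would need thought):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of a parallel version of the loop above: the same method from the diff,
    // reworked to start every dev service concurrently and wait for all of them.
    private static void startDevServices(List<DevServicesResultBuildItem> devServicesResultBuildItems) {
        if (devServicesResultBuildItems.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(devServicesResultBuildItems.size(), 4));
        try {
            CompletableFuture<?>[] starts = devServicesResultBuildItems.stream()
                    .map(item -> CompletableFuture.runAsync(item::start, pool))
                    .toArray(CompletableFuture[]::new);
            // Block until every dev service is ready before the application starts.
            CompletableFuture.allOf(starts).join();
        } finally {
            pool.shutdown();
        }
    }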

Contributor Author


Actually, if we wanted to, we could be incremental here too and do that as an (immediate) follow-on work item. For now only one dev service (Redis) would actually start in the start step, and for all the others, it's a fast no-op. So we'll only see this (definitely important) problem once we start scaling this out to other dev services.

Member


But we agreed that (a) people probably shouldn't expect that level of orchestration and (b) if we needed to, we could add a prioritisation mechanism later.

I would say that if they need that, they need to use the Compose Dev Services feature.

Contributor Author


Yeah, that was sort of where Clement and I got to.


    if (systemInstance == null) {
        try {
            Class s = ClassLoader.getSystemClassLoader().loadClass(DevServicesConfigTracker.class.getName());
Contributor


I had this idea to add the state holder class to the bootstrap module and configure the QuarkusClassloaders to load it parent first.

Contributor Author


Oh, that would certainly make the tracker code itself cleaner. It has the disadvantage that to fully understand the behaviour you'd need to look in two places (the class and also the quarkus classloader). But that slight spreading of information is probably outweighed by improved robustness, less time spent on reflection, and less ugly code. I'll switch to that approach, thanks.


// This config map is what we use for deciding if a container from another profile can be re-used
// TODO ideally the container properties would get put into it in a centralised way, but the RunnableDevService object doesn't get passed detailed information about the container
Map config = new HashMap();
Contributor


Instead of a Map, can we use the input dev service config object (DevServiceConfiguration) to register the service to the tracker?

Contributor Author

@holly-cummins holly-cummins May 1, 2025


Good question. I/we need to think about this one. There are a few parts to this.

Classloading

I don't love the map, but I was worried about classes coming across from other classloaders, so I stuck to system constructs. I may have been over-cautious, so I'll spell out the scenario and see if it makes sense.

If profile A runs first, it will create RunnableDevService A, and when start() is called, it will start container A on port 4200. Then profile B runs, Processor B will create RunnableDevService B, and RunnableDevService B will look in the tracker and find RunnableDevService A. So RDS B will just leave RDS A in the tracker. Then, when we ask for config, we'll be asking RDS A, and RunnableDevService A's config object would be from augmentation classloader A. At the moment I use a Supplier and return a Map so it's cross-classloader safe.

What gets included and config chaos

We have at least six kinds of config that we use in the processor. That's too many, clearly, so reducing it would be good, but we'd need to be careful of side effects if we overload one kind to be another kind. Something definitely needs to be done, even if that is only naming all the kinds to make it less confusing what's going on.

Walking through it for Redis, we have:

  • DevServicesConfig - passed in to the processor, generalised config saying if dev services are enabled at all, and whether to use shared networks
  • RedisBuildTimeConfig - aggregator, allows multiple dev services; not sure how widespread this pattern is
  • RedisClientBuildTimeConfig - multiples are aggregated in RedisBuildTimeConfig
  • DevServicesConfig - group within RedisClientBuildTimeConfig, inspected and used to call RDS constructor
  • currentDevServicesConfiguration - local map used for RDS creation; it's a manual hybrid of elements from RedisBuildTimeConfig and RedisClientBuildTimeConfig
  • Uniqueness-identifying config - config (line 173), new in this PR. This is a subset of properties, like image name and hardcoded port, that we use to decide if dev services are compatible. The old model did something similar but on a different line. This is also what's used in the RDS constructor.
  • DynamicConfig. This is new. It's config that we want to declare that we can provide, but we want to provide it lazily. So the left hand side is a string, and the right hand side is a function.

So I think the upshot of all that, for your question, is: is there anything extra we'd want to put into the DevServicesConfig that we wouldn't want to use as a uniqueness identifier? And the opposite: is there anything we'd want in the uniqueness identifier that we wouldn't want in config? For example, we could force unique instances every time by putting a uuid into the map, although that's a bit clumsy so we might not want to design for it.
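For the DynamicConfig item, the shape is roughly a map from property name to a function that is only evaluated once the container is running (illustrative only, with Testcontainers-style accessors; `container` stands in for the Redis container, and this isn't the PR's exact type):

    import java.util.Map;
    import java.util.function.Supplier;

    // Illustrative shape for "DynamicConfig": property name on the left, a function
    // on the right that is only evaluated once the container is running and its
    // mapped port is known.
    Map<String, Supplier<String>> dynamicConfig = Map.of(
            "quarkus.redis.hosts",
            () -> "redis://" + container.getHost() + ":" + container.getMappedPort(6379));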

Equals workarounds

In an earlier version I did put the typed interfaces on the tracker, but I needed to have a reflective equality checker to work around the classloader issue. I decided map was nicer.


@holly-cummins
Contributor Author

@ozangunalp I've implemented your suggestion of using the QuarkusClassLoader to force stuff onto the system classloader, and it makes the code much nicer, thank you! I haven't looked at parallel execution yet because it's a no-op in the current implementation, where only the Redis dev service could be in the queue. So I'd prefer to merge this and then do parallelisation immediately after, or maybe once I've converted one more service, so there are things to test with.

@holly-cummins holly-cummins requested a review from ozangunalp May 2, 2025 13:14

quarkus-bot bot commented May 2, 2025

Status for workflow Quarkus Documentation CI

This is the status report for running Quarkus Documentation CI on commit ac1a817.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

Warning

There are other workflow runs running, you probably need to wait for their status before merging.


quarkus-bot bot commented May 2, 2025

Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit ac1a817.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

You can consult the Develocity build scans.


Flaky tests - Develocity

⚙️ JVM Tests - JDK 17

📦 extensions/infinispan-cache/deployment

io.quarkus.cache.infinispan.InfinispanCacheTest.testGetAsyncWithParallelCalls - History

  • expected: "thread1" but was: "thread2" - org.opentest4j.AssertionFailedError
org.opentest4j.AssertionFailedError: 

expected: "thread1"
 but was: "thread2"
	at io.quarkus.cache.infinispan.InfinispanCacheTest.testGetAsyncWithParallelCalls(InfinispanCacheTest.java:283)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at io.quarkus.test.QuarkusUnitTest.runExtensionMethod(QuarkusUnitTest.java:521)
	at io.quarkus.test.QuarkusUnitTest.interceptTestMethod(QuarkusUnitTest.java:435)

@ozangunalp
Contributor

@holly-cummins So I wrote a long comment yesterday, but I now realise it wasn't posted. Sorry for that...

I verified that the parent-first classloader change works as expected. I thought we could configure different classloaders via addParentFirstElement, but this is more foolproof :)

However, there are still some things not working as expected:

  • Config changes are not picked up to refresh the service when needed. It always falls back to the shared service lookup.
  • Closing running services does not look at which application created the service. For example, when continuous testing runs, it tears down the dev mode service.
  • I think overall we are sharing services in more "uncontrolled" places, and closing down services we should not.

I've a branch where I experimented with some stuff, I'll share it as soon as I clean it up.

@holly-cummins
Contributor Author

@holly-cummins So I wrote a long comment yesterday, but I now realise it wasn't posted. Sorry for that...

Oddly, I think GitHub emailed me, but didn't put anything on the issue itself.

I verified that the parent-first classloader change works as expected. I thought we could configure different classloaders via addParentFirstElement, but this is more foolproof :)

Yeah, it's kind of ugly :), but it does what we want more reliably than the existing mechanisms based on classpath elements.

However, there are still some things not working as expected:

* Config changes are not picked up to refresh the service when needed. It always falls back to the shared service lookup.

Oh dear. Is this continuous testing? And the badness is that instead of going "I will shut down and restart because I've changed," it goes "oh no, nothing matches, I can't start one, I'll try and find a shared one?"
In the old paths, I guess this was the 'current' and previous checks, but I don't think we need that static state for it to work. It should be sufficient to say 'nothing matches my current config, that's bad, I will start a service' ... which is kind of what you'd hope it was already doing in the non-continuous case anyway, so I'm puzzled about what could be happening.

* Closing running services does not look at which application created the service. For example, when continuous testing runs, it tears down the dev mode service.

Doh! Good spot.

* I think overall we are sharing services in more "uncontrolled" places, and closing down services we should not.

That seems plausible. The rule for 'can I close down this service?' has to change so that it's no longer about 'did I create this service?', because we can't rely on chronology. I'm tempted to deprecate the isOwner method entirely once this rolls out to a few more services, because the answers will almost never actually be useful, and they could mislead extension authors into bad decisions. I think there are (maybe) three decision points/levels of re-use.

  • One is owned by extensions, and it's "does the config of this match my config?" The redis service did that decision already, although I might have changed the definition of 'match' slightly. The reason it's owned by the extension is that the definition of 'match' might vary a lot by extension ... some services might be stricter than others. But having the extensions do the heavy lifting is a banana skin, so it might be better to centralise.
  • One is what I think you're pointing out I forgot, which is 'is this service in the same mode/application as me?' This wouldn't vary by extension so should be centralised.
  • And the final one is unchanged, which is 'was I configured to allow re-use and is there an external service with the right name?' That's also one where the extension does the heavy lifting.

I think your other comment mentioned continuous testing, too. I added tests, but not enough, because I didn't add any for continuous testing. :(

Your other comment also mentioned that things can work by luck because the container locator finds the service. (I think that's what it meant.) That seems like a nasty gotcha that could trick good tests into giving bad answers. Although that's kind of a complicated one, since the Redis processor invokes the container locator service first, I think.

I've a branch where I experimented with some stuff, I'll share it as soon as I clean it up.

Cool! My first instinct is to run off and debug, but I won't touch anything until then, so we don't duplicate work.

Labels
area/core area/dependencies Pull requests that update a dependency file area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/redis area/testing area/vertx triage/flaky-test
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Redis services should not be started during augmentation phase
5 participants