
Do not start redis dev services in augmentation (plus supporting framework) #47610


Open
wants to merge 10 commits into main

Conversation

holly-cummins
Contributor

@holly-cummins holly-cummins commented Apr 29, 2025

Resolves #47602

See also #45785. This is the first step in an incremental rollout of a new model, to resolve a problem created by #34681. Doing the change incrementally keeps PRs small and allows feedback on the approach. It also has the advantage of validating that existing ecosystem extensions will continue to work. “Work” needs qualification, since they only work if ports are distinct, which brings us back to #45785. In the existing dev services, the assumption that the order in which processors run correlates to the order in which dev services are used (for example, via ‘first one wins’ static variables) is problematic.

The design

There are two elements to the design:

  • Do not start dev services in the build step; instead, start them at application start
  • Use a runtime config source to lazily share information such as ephemeral ports (a sketch of that idea follows below)
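As a rough sketch of the second element (the class name and registration method below are hypothetical, not the PR's actual API), the idea is a MicroProfile ConfigSource that resolves values through suppliers: a dev service can declare a property before its container exists, and the value is only computed when something reads it.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    import org.eclipse.microprofile.config.spi.ConfigSource;

    // Rough sketch only: dev services register suppliers for values they will only
    // know once their container is running (e.g. the mapped port), and the source
    // resolves them lazily when the application actually reads the property.
    public class LazyDevServicesConfigSource implements ConfigSource {

        private static final Map<String, Supplier<String>> SUPPLIERS = new ConcurrentHashMap<>();

        // A dev service declares the config it will be able to provide, lazily.
        public static void register(String propertyName, Supplier<String> valueSupplier) {
            SUPPLIERS.put(propertyName, valueSupplier);
        }

        @Override
        public Map<String, String> getProperties() {
            // Only forces resolution if something enumerates the whole source.
            Map<String, String> resolved = new HashMap<>();
            SUPPLIERS.forEach((name, supplier) -> resolved.put(name, supplier.get()));
            return resolved;
        }

        @Override
        public Set<String> getPropertyNames() {
            return SUPPLIERS.keySet();
        }

        @Override
        public String getValue(String propertyName) {
            Supplier<String> supplier = SUPPLIERS.get(propertyName);
            return supplier == null ? null : supplier.get();
        }

        @Override
        public String getName() {
            return "lazy-dev-services";
        }
    }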

Most dev service processors check whether a service is already running, but they do so using static variables, so the check only finds services within the same profile. This isn’t enough post-#34681, because another profile might have started dev services on the same port. Those dev services would ideally be re-used, and if they’re not re-used, they need to be closed. This means we need to track dev services between augmentation classloaders. To do this, I’ve created a tracker build item which tracks across classloaders. I would have liked to inject unique instances into processors, but doing that would need a KeyedBuildItem or FeatureBuildItem or some other basic type that gets created with state about what it is being injected into.

Re-use semantics

The logic for starting dev services is a bit more involved than it used to be, but it’s centralised. It's also more capable: it is better able to shut down dev services from other profiles without relying on tracking owners, and better able (though ultimately this is controlled by the dev service processor) to reuse dev services.

When start gets called on a dev services build item, it first checks to see if there’s a running dev service with the right properties. (This could be from a previous continuous testing restart, or from another profile.) If there’s a match, no further action is needed. If there’s not a match, we need to stop other dev services of the same kind. The tracker helps us find them, because we can’t assume that dev services are started in the same order that augmentation was run. Once all the other containers are stopped, the new container is started and submitted for tracking.
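In rough pseudocode (the tracker and service types below are simplified stand-ins for illustration, not the classes introduced by the PR), the start logic looks something like this:

    import java.util.Map;
    import java.util.function.Supplier;

    // Rough sketch of the centralised start logic; simplified stand-in types.
    class RunnableDevServiceSketch {

        interface RunningService extends AutoCloseable {
        }

        interface Tracker {
            RunningService findMatching(String feature, Map<String, String> identifyingConfig);
            Iterable<RunningService> findAll(String feature);
            void add(String feature, Map<String, String> identifyingConfig, RunningService service);
            void remove(RunningService service);
        }

        private final String feature;
        private final Map<String, String> identifyingConfig;
        private final Tracker tracker;
        private final Supplier<RunningService> containerStarter;
        private RunningService delegate;

        RunnableDevServiceSketch(String feature, Map<String, String> identifyingConfig,
                Tracker tracker, Supplier<RunningService> containerStarter) {
            this.feature = feature;
            this.identifyingConfig = identifyingConfig;
            this.tracker = tracker;
            this.containerStarter = containerStarter;
        }

        void start() throws Exception {
            // 1. Re-use a compatible service, wherever it came from (another profile,
            //    or a previous continuous-testing restart).
            RunningService match = tracker.findMatching(feature, identifyingConfig);
            if (match != null) {
                delegate = match;
                return;
            }
            // 2. No match: stop any other services of the same kind so their ports are freed,
            //    without assuming they were started in the order augmentation ran.
            for (RunningService stale : tracker.findAll(feature)) {
                stale.close();
                tracker.remove(stale);
            }
            // 3. Start the new container and register it so later profiles can find it.
            delegate = containerStarter.get();
            tracker.add(feature, identifyingConfig, delegate);
        }
    }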

Diagram: https://excalidraw.com/#json=VvSzzPO1nVD_XdPElaHsF,EHrqqwDzb5H5g5lXFYUk6A

In the old model, the Redis processor would store state in a static variable and attempt to re-use compatible RunningDevServices, with a boolean restartRequired = !currentDevServicesConfiguration.equals(capturedDevServicesConfiguration); check. In the new model it does a similar check, but also checks across profile boundaries. This check is now non-optional because any extant containers/RDSes that don't get re-used need to get stopped. The ContainerLocator logic isn't changed.

A nice benefit of this approach is that the code in the processors is shorter, which gets us a bit closer to the dream of low-boilerplate dev services. The one icky boilerplate that’s new is the key-value pairs for the uniqueness/re-use criteria for the dev services. This is maybe something to think about for a follow-up PR. I think this approach might also avoid a few restarts we used to see between profiles, which could speed up tests.

Could be better (areas for discussion)

  • The processors identify uniqueness/re-use criteria, based on config. For Redis, the port number is automatically considered, and I added code to also look at the image name. I didn’t look at things like timeout, but maybe all config should be included in the key (see the sketch after this list).
  • I made RunnableDevService extend RunningDevService. That seems most logical in terms of the methods (runnable dev services have more classes and methods), but it’s least logical in terms of the names. A third common superclass would solve the naming problem, but make for a lot of classes.
  • @ozangunalp points out that starting dev services sequentially is a performance disaster (although he used nicer words). This is true, but I'd like to fix it in a follow-on item, since at the moment only one dev service uses the new model, and parallelising a queue of one is a no-op.
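For the uniqueness/re-use criteria mentioned in the first bullet, the shape is roughly a small map of identifying properties; the key names below are made up for illustration, not the PR's exact ones:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.OptionalInt;

    // Indicative only: the kind of key-value pairs a processor might contribute
    // as its re-use criteria.
    class ReuseCriteriaSketch {

        static Map<String, String> identifyingConfig(String imageName, OptionalInt fixedPort) {
            Map<String, String> criteria = new HashMap<>();
            criteria.put("image-name", imageName);
            // A hardcoded port is part of the identity; a randomly assigned one is not.
            fixedPort.ifPresent(port -> criteria.put("port", Integer.toString(port)));
            return criteria;
        }
    }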

Other approaches considered

I was originally hoping to do a less invasive approach where I just filtered out dev service build items during augmentation, and then ran a custom build with a new build chain including those items. This sort of worked, but getting the right config to the right place wouldn’t have been straightforward, and the second augmentation would have created classloader problems. The current approach means every dev service needs updating, but it does improve the processor code for the adjusted dev services, I think.

@quarkus-bot quarkus-bot bot added area/core area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/redis area/testing labels Apr 29, 2025
@holly-cummins holly-cummins changed the title Framework for not starting dev services in augmentation, plus application to redis Do not start redis dev services in augmentation (plus supporting framework) Apr 29, 2025


Member

@cescoffier cescoffier left a comment


We need to discuss this. We cannot pre-allocate ports like this. We had the same issue with the configuration—this is not reliable.

We need to find a way to allocate the ports randomly when the containers are started. We may need to use a lazy config placeholder or something. What if we start dev services at startup time but introduce a config source before starting the application with the values defined by the dev service, including the mapped port?

@ozangunalp
Contributor

Isn't the proper fix for #45786 not to share dev services between different test augmentation runs?

In that case, both application instances would use their own instance of dev service, without the port clashing issue.

@ozangunalp
Contributor

For the port reserving, I was tempted by this for other stuff. It seemed interesting to have something like $quarkus.uuid but for an available port, but I backed away because of the issue mentioned above.

I'd prefer if we kept the container runtime port assignment.

@ozangunalp
Contributor

I think a mechanism like the dev services tracker build item is the way to go for reducing the boilerplate on that part.

I was waiting for the unified dev service initiative to discuss this.

But I'm not sure if we can do this on the resulting dev service output config.

Dev service input configs are (usually) flat objects that hold the relevant config and are assigned to a static volatile in the dev service processor class to keep them between augmentation runs.

Maybe we can create and hold that state higher up instead of in a build item class, which needs to be loaded from a common classloader.

@cescoffier
Member

I'm going to kick off the dev service working group today, and associate this with it.

Also, @dmlloyd, @radcortez and I discussed random port allocation as part of some config-related work.

@holly-cummins
Contributor Author

Isn't the proper fix for #45786 not to share dev services between different test augmentation runs?

In that case, both application instances would use their own instance of dev service, without the port clashing issue.

Yes, and no, and no. It depends what you mean by "share dev services". "Dev services" is overloaded to mean the external container, the RunningDevService object, and the DevServicesResultBuildItem. Sharing instances of the Java objects is orthogonal to what's being solved here, and in any case sharing those doesn't cause problems. The problem happens if a user project uses more than one profile, but hardcodes ports and does not use more than one port. In the multi-profile case, there will always be more than one dev service (the Java object). If there's more than one dev service, and they all start at the beginning of the test run (the current situation), multiple Java objects will attempt to start multiple containers on the same port, and things fail.

There's a user workaround to 'not share dev services' by ensuring every profile includes configuration which hardcodes a distinct port, but that's not something we want people to have to do.
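For illustration, assuming the standard Redis dev service property name (quarkus.redis.devservices.port), the clashing configuration and the workaround look roughly like this:

    # application.properties — illustrative only.
    # Both profiles inherit the same fixed port, so when all augmentations run up
    # front, two containers try to bind the same port and the second one fails.
    quarkus.redis.devservices.port=6399

    # The workaround described above: give each profile its own hardcoded port.
    %profile-a.quarkus.redis.devservices.port=6399
    %profile-b.quarkus.redis.devservices.port=6400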

@holly-cummins
Contributor Author

I think a mechanism like the dev services tracker build item is the way to go for reducing the boilerplate on that part.

I was waiting for the unified dev service initiative to discuss this.

I've had my first message on zulip saying "I think I found a problem, dev services are behaving weirdly in 3.22," so I think a fix for this is rather urgent (hence me not waiting for the WG).

But I'm not sure if we can do this on the resulting dev service output config.

Dev service input configs are (usually) flat objects that hold the relevant config and are assigned to a static volatile in the dev service processor class to keep them between augmentation runs.

A static volatile isn't sufficient, because it won't keep them between augmentation classloaders.

Maybe we can create and hold that state higher up instead of in a build item class, which needs to be loaded from a common classloader.

That's more or less the approach I took with the tracker. It holds state on the app classloader.
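Roughly, the pattern is the one visible in the tracker code further down this thread: resolve the tracker class through a classloader that every augmentation shares, and keep the singleton there. A simplified sketch follows (the real tracker does more, it assumes the class is visible to that common classloader, and later in the thread this moved to loading the class parent-first via the QuarkusClassLoader, which removes most of the reflection):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified sketch: the singleton lives in the copy of this class loaded by the
    // system classloader, and copies loaded by augmentation classloaders reach it
    // reflectively, exchanging only JDK types (Strings, Maps) so nothing leaks across
    // classloader boundaries. Names are illustrative.
    public class DevServicesTrackerSketch {

        private static final DevServicesTrackerSketch INSTANCE = new DevServicesTrackerSketch();

        private final Map<String, Map<String, String>> runningServices = new ConcurrentHashMap<>();

        // Static entry point, invoked reflectively on the system-classloader copy.
        public static DevServicesTrackerSketch getInstance() {
            return INSTANCE;
        }

        public void add(String feature, Map<String, String> identifyingConfig) {
            runningServices.put(feature, identifyingConfig);
        }

        public Map<String, String> find(String feature) {
            return runningServices.get(feature);
        }

        // Called from an augmentation classloader: resolve the shared copy and register there.
        public static void register(String feature, Map<String, String> identifyingConfig) throws Exception {
            Class<?> shared = ClassLoader.getSystemClassLoader()
                    .loadClass(DevServicesTrackerSketch.class.getName());
            Object tracker = shared.getMethod("getInstance").invoke(null);
            shared.getMethod("add", String.class, Map.class).invoke(tracker, feature, identifyingConfig);
        }
    }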

@ozangunalp
Contributor

The problem happens if a user project uses more than one profile, but hardcodes ports and does not use more than one port

I was referring to the external container, but I didn't think of the hardcoded port case. So this wasn't an issue before because different augmentations were on the same classloader?

@ozangunalp
Contributor

I've had my first message on zulip saying "I think I found a problem, dev services are behaving weirdly in 3.22," so I think a fix for this is rather urgent (hence me not waiting for the WG).

That is perfectly fine, I wasn't questioning the timing of this PR.

@holly-cummins
Contributor Author

The problem happens if a user project uses more than one profile, but hardcodes ports and does not use more than one port

I was referring to the external container, but I didn't think of the hardcoded port case. So this wasn't an issue before because different augmentations were on the same classloader?

The change is the timing of the augmentations. Before, we'd do the augmentation, run the test, then if we switched profile, we'd run the augmentation, run the next test ... Now we do all the augmentations up front. That has a few disadvantages (like the dev service issue), but it means we can get access to the right classloader for the test at the time we need it.

In the old model, as long as a dev service closed itself on application shutdown, the port would be free for the next application, but now it needs to not occupy the port until it needs it.

@quarkus-bot quarkus-bot bot added area/dependencies Pull requests that update a dependency file area/vertx labels Apr 30, 2025
@holly-cummins
Contributor Author

I can report that @cescoffier's suggestion of using a config source works perfectly, and was actually pretty easy. I've removed all dodgy pre-creation of sockets, and the random port test is still passing nicely. I've pushed changes in case people want to look at them, but the build will be red, because I added a devservices runtime module, and that breaks everything. (See discussion in zulip.)
The PR is messy, and I also want to look through and see if I can align the approach with what DevServicesLauncherConfigResultBuildItem is already doing. But it's a promising start, and a much more elegant approach.



github-actions bot commented Apr 30, 2025

🎊 PR Preview f346fe6 has been successfully built and deployed to https://quarkus-pr-main-47610-preview.surge.sh/version/main/guides/


@holly-cummins
Contributor Author

I've abandoned the idea of a devservices/runtime module and gone for the Mule Module Model instead (I persuade another runtime module to carry around my classes for me). I've also tidied up the code to the point where the code itself is ok. There's probably still scope to refactor and align other parts of the codebase, like DevServicesLauncherConfigResultBuildItem, but the new config source is doing its job.

I've also added a diagram to the main description.


@holly-cummins holly-cummins marked this pull request as draft April 30, 2025 21:18
@holly-cummins holly-cummins marked this pull request as ready for review April 30, 2025 21:29


@@ -276,10 +286,19 @@ public void overrideConfig(Map<String, String> config) {
        RuntimeOverrideConfigSource.setConfig(runtimeClassLoader, config);
    }

    private static void startDevServices(List<DevServicesResultBuildItem> devServicesResultBuildItems) {
        for (DevServicesResultBuildItem devServicesResultBuildItem : devServicesResultBuildItems) {
            devServicesResultBuildItem.start();
Contributor


This will start services one after the other and take too much time when multiple dev services are involved.
We need to start them in parallel and wait until they are ready.

When services were started at augmentation time, we didn't have this problem, because the build steps are called using the build executor service.

Contributor Author


Good point! As an opposite problem, @cescoffier pointed out that some dev services might have a desired ordering, which would be honoured when they're started as part of a build chain, and not honoured when I just iterate through a collection. But we agreed that (a) people probably shouldn't expect that level of orchestration and (b) if we needed to, we could add a prioritisation mechanism later. Just replicating a priority from the Processor order might be enough.

So I'll switch to doing some kind of parallelisation, and if we need an ordered parallelisation later we can cross that bridge when we come to it.
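A possible shape for that follow-up, as a hedged sketch (the start() call is the one in the diff above; the executor choice, sizing, and error handling are illustrative and would need thought):

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of a parallel version of the loop above: the same method from the diff,
    // reworked to start every dev service concurrently and wait for all of them.
    private static void startDevServices(List<DevServicesResultBuildItem> devServicesResultBuildItems) {
        if (devServicesResultBuildItems.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(Math.min(devServicesResultBuildItems.size(), 4));
        try {
            CompletableFuture<?>[] starts = devServicesResultBuildItems.stream()
                    .map(item -> CompletableFuture.runAsync(item::start, pool))
                    .toArray(CompletableFuture[]::new);
            // Block until every dev service is ready before the application starts.
            CompletableFuture.allOf(starts).join();
        } finally {
            pool.shutdown();
        }
    }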

Contributor Author


Actually, if we wanted to, we could be incremental here too and do that as an (immediate) follow-on work item. For now only one dev service (Redis) would actually start in the start step, and for all the others, it's a fast no-op. So we'll only see this (definitely important) problem once we start scaling this out to other dev services.

Member


But we agreed that (a) people probably shouldn't expect that level of orchestration and (b) if we needed to, we could add a prioritisation mechanism later.

I would say that if they need that, they need to use the Compose Dev Services feature.

Contributor Author


Yeah, that was sort of where Clement and I got to.


    if (systemInstance == null) {
        try {
            Class s = ClassLoader.getSystemClassLoader().loadClass(DevServicesConfigTracker.class.getName());
Contributor


I had this idea to add the state holder class to the bootstrap module and configure the QuarkusClassloaders to load it parent first.

Contributor Author


Oh, that would certainly make the tracker code itself cleaner. It has the disadvantage that to fully understand the behaviour you'd need to look in two places (the class and also the quarkus classloader). But that slight spreading of information is probably outweighed by improved robustness, less time spent on reflection, and less ugly code. I'll switch to that approach, thanks.


// This config map is what we use for deciding if a container from another profile can be re-used
// TODO ideally the container properties would get put into it in a centralised way, but the RunnableDevService object doesn't get passed detailed information about the container
Map config = new HashMap();
Contributor


Instead of a Map, can we use the input dev service config object (DevServiceConfiguration) to register the service to the tracker?

Contributor Author

@holly-cummins holly-cummins May 1, 2025


Good question. I/we need to think about this one. There are a few parts to this.

Classloading

I don't love the map, but I was worried about classes coming across from other classloaders, so I stuck to system constructs. I may have been over-cautious, so I'll spell out the scenario and see if it makes sense.

If profile A runs first, it will create RunnableDevService A, and when start() is called, it will start container A on port 4200. Then profile B runs, Processor B will create RunnableDevService B, and RunnableDevService B will look in the tracker and find RunnableDevService A. So RDS B will just leave RDS A in the tracker. Then, when we ask for config, we'll be asking RDS A, and RunnableDevService A's config object would be from augmentation classloader A. At the moment I use a Supplier and return a Map so it's cross-classloader safe.

What gets included and config chaos

We have at least six kinds of config that we use in the processor. That's too many, clearly, so reducing it would be good, but we'd need to be careful of side effects if we overload one kind to be another kind. Something definitely needs to be done, even if that is only naming all the kinds to make it less confusing what's going on.

Walking through it for Redis, we have:

  • DevServicesConfig - passed in to the processor, generalised config saying if dev services are enabled at all, and whether to use shared networks
  • RedisBuildTimeConfig - aggregator, allows multiple dev services; not sure how widespread this pattern is
  • RedisClientBuildTimeConfig - multiples are aggregated in RedisBuildTimeConfig
  • DevServicesConfig - group within RedisClientBuildTimeConfig, inspected and used to call RDS constructor
  • currentDevServicesConfiguration - local map used for RDS creation; it's a manual hybrid of elements from RedisBuildTimeConfig and RedisClientBuildTimeConfig
  • Uniqueness-identifying config - config (line 173), new in this PR. This is a subset of properties, like image name and hardcoded port, that we use to decide if dev services are compatible. The old model did something similar but on a different line. This is also what's used in the RDS constructor.
  • DynamicConfig. This is new. It's config that we want to declare that we can provide, but we want to provide it lazily. So the left hand side is a string, and the right hand side is a function.

So I think the upshot of all that, for your question, is: is there anything extra we'd want to put into the DevServicesConfig that we wouldn't want to use as a uniqueness identifier? And the opposite: is there anything we'd want in the uniqueness identifier that we wouldn't want in config? For example, we could force unique instances every time by putting a uuid into the map, although that's a bit clumsy so we might not want to design for it.
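For the DynamicConfig item, the shape is roughly a map from property name to a function that is only evaluated once the container is running (illustrative only, with Testcontainers-style accessors; `container` stands in for the Redis container, and this isn't the PR's exact type):

    import java.util.Map;
    import java.util.function.Supplier;

    // Illustrative shape for "DynamicConfig": property name on the left, a function
    // on the right that is only evaluated once the container is running and its
    // mapped port is known.
    Map<String, Supplier<String>> dynamicConfig = Map.of(
            "quarkus.redis.hosts",
            () -> "redis://" + container.getHost() + ":" + container.getMappedPort(6379));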

Equals workarounds

In an earlier version I did put the typed interfaces on the tracker, but I needed to have a reflective equality checker to work around the classloader issue. I decided map was nicer.


@holly-cummins
Contributor Author

@ozangunalp I've implemented your suggestion of using the QuarkusClassLoader to force stuff onto the system classloader, and it makes the code much nicer, thank you! I haven't looked at parallel execution yet because it's a no-op in the current implementation, where only the Redis dev service could be in the queue. So I'd prefer to merge this and then do parallelisation immediately after, or maybe once I've converted one more service, so there are things to test with.

@holly-cummins holly-cummins requested a review from ozangunalp May 2, 2025 13:14

quarkus-bot bot commented May 2, 2025

Status for workflow Quarkus Documentation CI

This is the status report for running Quarkus Documentation CI on commit ac1a817.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

Warning

There are other workflow runs running, you probably need to wait for their status before merging.


quarkus-bot bot commented May 2, 2025

Status for workflow Quarkus CI

This is the status report for running Quarkus CI on commit ac1a817.

✅ The latest workflow run for the pull request has completed successfully.

It should be safe to merge provided you have a look at the other checks in the summary.

You can consult the Develocity build scans.


Flaky tests - Develocity

⚙️ JVM Tests - JDK 17

📦 extensions/infinispan-cache/deployment

io.quarkus.cache.infinispan.InfinispanCacheTest.testGetAsyncWithParallelCalls - History

  • expected: "thread1" but was: "thread2" - org.opentest4j.AssertionFailedError
org.opentest4j.AssertionFailedError: 

expected: "thread1"
 but was: "thread2"
	at io.quarkus.cache.infinispan.InfinispanCacheTest.testGetAsyncWithParallelCalls(InfinispanCacheTest.java:283)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at io.quarkus.test.QuarkusUnitTest.runExtensionMethod(QuarkusUnitTest.java:521)
	at io.quarkus.test.QuarkusUnitTest.interceptTestMethod(QuarkusUnitTest.java:435)

@ozangunalp
Contributor

@holly-cummins So I wrote a long comment yesterday, but I now realise it wasn't posted. Sorry for that...

I verified that the parent-first classloader change works as expected. I thought we could configure different classloaders via addParentFirstElement, but this is more foolproof :)

However, there are still some things not working as expected:

  • Config changes are not picked up to refresh the service when needed. It always falls back to the shared service lookup.
  • Closing running services does not look at which application created the service. For example, when continuous testing runs, it tears down the dev mode service.
  • I think overall we are sharing services in more "uncontrolled" places, and closing down services we should not.

I've a branch where I experimented with some stuff, I'll share it as soon as I clean it up.

@holly-cummins
Contributor Author

@holly-cummins So I wrote a long comment yesterday, but I now realise it wasn't posted. Sorry for that...

Oddly, I think GitHub emailed me, but didn't put anything on the issue itself.

I verified that the parent-first classloader change works as expected. I thought we could configure different classloaders via addParentFirstElement, but this is more foolproof :)

Yeah, it's kind of ugly :), but it does what we want more reliably than the existing mechanisms based on classpath elements.

However, there are still some things not working as expected:

* Config changes are not picked up to refresh the service when needed. It always falls back to the shared service lookup.

Oh dear. Is this continuous testing? And the badness is that instead of going "I will shut down and restart because I've changed," it goes "oh no, nothing matches, I can't start one, I'll try and find a shared one?"
In the old paths, I guess this was the 'current' and previous checks, but I don't think we need that static state for it to work. It should be sufficient to say 'nothing matches my current config, that's bad, I will start a service' ... which is kind of what you'd hope it was already doing in the non-continuous case anyway, so I'm puzzled about what could be happening.

* Closing running services does not look at which application created the service. For example, when continuous testing runs, it tears down the dev mode service.

Doh! Good spot.

* I think overall we are sharing services in more "uncontrolled" places, and closing down services we should not.

That seems plausible. The rule for 'can I close down this service?' has to change so that it's no longer about 'did I create this service?', because we can't rely on chronology. I'm tempted to deprecate the isOwner method entirely once this rolls out to a few more services, because the answers will almost never actually be useful, and they could mislead extension authors into bad decisions. I think there are (maybe) three decision points/levels of re-use.

  • One is owned by extensions, and it's "does the config of this match my config?" The redis service did that decision already, although I might have changed the definition of 'match' slightly. The reason it's owned by the extension is that the definition of 'match' might vary a lot by extension ... some services might be stricter than others. But having the extensions do the heavy lifting is a banana skin, so it might be better to centralise.
  • One is what I think you're pointing out I forgot, which is 'is this service in the same mode/application as me?' This wouldn't vary by extension so should be centralised.
  • And the final one is unchanged, which is 'was I configured to allow re-use and is there an external service with the right name?' That's also one where the extension does the heavy lifting.

I think your other comment mentioned continuous testing, too. I added tests, but not enough, because I didn't add any for continuous testing. :(

Your other comment also mentioned that things can work by luck because the container locator finds the service. (I think that's what it meant.) That seems like a nasty gotcha that could trick good tests into giving bad answers. Although that's kind of a complicated one, since the Redis processor invokes the container locator service first, I think.

I've a branch where I experimented with some stuff, I'll share it as soon as I clean it up.

Cool! My first instinct is to run off and debug, but I won't touch anything until then, so we don't duplicate work.

Labels
area/core area/dependencies Pull requests that update a dependency file area/devtools Issues/PR related to maven, gradle, platform and cli tooling/plugins area/redis area/testing area/vertx triage/flaky-test
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Redis services should not be started during augmentation phase
5 participants