Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump the Cruise Control version to 2.5.142 #11288

Merged
merged 5 commits into from
Apr 8, 2025
Merged

Conversation

ShubhamRwt
Copy link
Contributor

@ShubhamRwt ShubhamRwt commented Mar 25, 2025

Type of change

Select the type of your PR

  • Task

Description

This PR fixes #11199. This PR bumps Cruise Control to 2.5.142 which is compatible with Kafka 4.0.0

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@im-konge im-konge added this to the 0.46.0 milestone Mar 25, 2025
Copy link
Member

@kyguy kyguy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @ShubhamRwt! Could you add the link of the related issue in the PR description too?

@ppatierno
Copy link
Member

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ShubhamRwt
Copy link
Contributor Author

ShubhamRwt commented Mar 26, 2025

@ppatierno @im-konge Hi, Can we re-run the failed regression. I am not sure if its related to this PR

@im-konge
Copy link
Member

@ShubhamRwt the test that is failing is the one that you enabled again for the CC. Could you please check it? Thanks.

@ShubhamRwt
Copy link
Contributor Author

ShubhamRwt commented Mar 27, 2025

Hi, the failure on this tests seems to be unrelated to Cruise Control. The CC is pod is not deployed during this whole test. It seems like some logging changes in kafka 4.0.0 is triggering this failure.

I spent a lot of time today to understand the failure. It looks like the pods are suppose to roll when we change the specification of logging to from no logging related configuration to new external logging but with Kafka 4.0.0 there is no rolling of the pods happening and I don't see any error in the logs.

I also noticed that in some places we have comments like Kafka 3.9 -> rolling update is expected only with Kafka 3.9 for eg in test testDynamicallySetUnknownKafkaLogger().

I have multiple things to ask with respect to the observations:

  1. It would be great to know if thats the case here also(i.e no roll of pods expected)

  2. Do we need the logging configuration changes we have for kafka 4.0.0 i.e Creating an if-else to use the logging configuration based upon the kafka version 3.9.0 or 4.0.0.

One other question is should we have a separate issue to track this problem and we can merge this PR since its not associated with failure.

@ppatierno @im-konge @kyguy

@im-konge
Copy link
Member

As I mentioned offline (and as you said in the comment as well), if it's not related to the CC and you don't want to fix this now, you can create an issue so someone else will have a look at it.
Maybe it was disabled because of some other issue and it shouldn't be run with Kafka 4.0.0 - I didn't have time to have a look at the test and what is the issue.

So yeah, in order to move this PR forward, I would just create a new issue, write there what you observed, add it to the assumption comment and leave it disabled for now. For example me or someone else will have a look at it later.

@ShubhamRwt
Copy link
Contributor Author

@im-konge I have created the issue #11312 for tracking the failing test and disabled the test here with correct issue and comment

@ShubhamRwt
Copy link
Contributor Author

ShubhamRwt commented Apr 1, 2025

@im-konge Hi, Can you run the regressions again on this, whenever you get some time? Thanks

@ppatierno
Copy link
Member

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ShubhamRwt
Copy link
Contributor Author

ShubhamRwt commented Apr 2, 2025

The failure seems to be related to CC sending request to old pod rather then sending the request to new Pod. This is something which is coming up a lot lately even in the community. Maybe we should have some discussion around it on how we can tackle this. But again I guess this is something not intended to fix along with this PR

@kyguy
Copy link
Member

kyguy commented Apr 2, 2025

Is it a test that we have had issues with before? Is it something we can reproduce locally?

The failure seems to be related to CC sending request to old pod rather then sending the request to new Pod. This is something which is coming up a lot lately even in the community.

Looking at the error, it appears to be related to the issue here [1], I assume this is the issue from the community you are mentioning, correct?

2025-04-01T16:29:08.2744375Z 2025-04-01 16:29:08 [main] INFO  [DeploymentUtils:183] Deployment: multiple-co-cluster-test/cluster-548f3edf-cruise-control is ready
2025-04-01T16:29:08.2745885Z 2025-04-01 16:29:08 [main] INFO  [DeploymentUtils:135] Deployment: multiple-co-cluster-test/cluster-548f3edf-cruise-control rolling update finished
2025-04-01T16:29:08.2766751Z 2025-04-01 16:29:08 [main] INFO  [KafkaUtils:159] Waiting for cluster stability
2025-04-01T16:30:11.1372669Z 2025-04-01 16:30:10 [main] INFO  [KafkaUtils:190] Kafka cluster is stable after: 61 polls
2025-04-01T16:30:11.1377649Z 2025-04-01 16:30:10 [main] INFO  [KafkaRebalanceUtils:83] ============================================================================
2025-04-01T16:30:11.1378296Z 2025-04-01 16:30:10 [main] INFO  [KafkaRebalanceUtils:84] NotReady
2025-04-01T16:30:11.1379151Z 2025-04-01 16:30:10 [main] INFO  [KafkaRebalanceUtils:85] ============================================================================
2025-04-01T16:30:11.1379664Z 2025-04-01 16:30:10 [main] INFO  [KafkaRebalanceUtils:90] Verifying that KafkaRebalance resource is in ProposalReady state
2025-04-01T16:30:11.1380268Z 2025-04-01 16:30:10 [main] INFO  [ResourceManager:493] Waiting for KafkaRebalance: multiple-co-cluster-test/cluster-548f3edf will have desired state: ProposalReady
2025-04-01T16:44:10.7412373Z 2025-04-01 16:44:10 [main] INFO  [ResourceManager:467] KafkaRebalance status:
2025-04-01T16:44:10.7413146Z 
2025-04-01T16:44:10.7413365Z Conditions:
2025-04-01T16:44:10.7413560Z 	Type: NotReady
2025-04-01T16:44:10.7414425Z 	Message: Error for request: cluster-548f3edf-cruise-control.multiple-co-cluster-test.svc:9090/kafkacruisecontrol/rebalance?json=true&dryrun=true&verbose=true&skip_hard_goal_check=true&goals=DiskCapacityGoal%2CCpuCapacityGoal%2CNetworkInboundCapacityGoal%2CMinTopicLeadersPerBrokerGoal%2CNetworkOutboundCapacityGoal%2CReplicaCapacityGoal&rebalance_disk=false. Server returned: Error processing POST request '/rebalance' due to: 'com.linkedin.kafka.cruisecontrol.exception.KafkaCruiseControlException: java.lang.NullPointerException: Cannot invoke "com.linkedin.kafka.cruisecontrol.config.BrokerCapacityInfo.capacity()" because the return value of "java.util.Map.get(Object)" is null'.

If we've determined that this issue is unrelated to this PR, could we consider one of the following options?

(A) Keep this test disabled, as we are already re-enabling it as part of this PR. We can address this specific issue in another PR.

(B) Fix the system test - perhaps by adding a line to refresh the KafkaRebalance resource after the new Cruise Control pod starts up.

[1] #11195

@ShubhamRwt
Copy link
Contributor Author

@kyguy Hi, I have tried the 2nd solution you mentioned last week and it works fine. But I guess some users are having this issue even with autorebalance etc [1]. I think going with option 2 can be good for now but we should definitely revisit this problem on some discussion/call.

[1] #11296

@kyguy
Copy link
Member

kyguy commented Apr 3, 2025

I think going with option 2 can be good for now but we should definitely revisit this problem on some discussion/call.

Did you already make the update to this PR? Or did you want to discuss it further before making the changes?

@ShubhamRwt
Copy link
Contributor Author

@kyguy not yet, I was hoping to get views from @im-konge @ppatierno on this

@im-konge
Copy link
Member

im-konge commented Apr 3, 2025

@ShubhamRwt it failed for the first time I guess or? Would adding the annotation fix the issue? I was referring in the issue you and @kyguy mentioned to little bit different problem, which seems to be not related.

TBH I don't really like the idea of just disabling each of the test and creating a lot of issues when you can have a look and fix it by one line (if it's the case). Same for the comment of the logging test -> you can just change the comment there that when Kafka 3.9.x will be dropped, we can remove the test -> and you don't need to leave it on me or someone else.

The issue there about failing test is there for quite a long time, so you can try the refresh annotation if it really fix something, but anyway if it fails on the CC rolling update (as I mentioned in the issue comment), someone will actually have a look at it.

@ShubhamRwt
Copy link
Contributor Author

I agree with you @im-konge, Lets for now add the refresh annotation and see how effective it is. Thanks

Comment on lines 281 to 304
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-server</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-server-common</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-metadata</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-raft</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-group-coordinator</artifactId>
</exclusion>
<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-storage</artifactId>
</exclusion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you come up with these exclusions? They do not seem to be a dependcies in the first place, so why exclude them?

Copy link
Contributor Author

@ShubhamRwt ShubhamRwt Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we build the operator and push the images, I was getting a lot collisions w.r.t these dependencies/jars

Copy link
Contributor Author

@ShubhamRwt ShubhamRwt Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see some of the dependencies added in the build.gradle file for CC ->

    api "org.apache.kafka:kafka-server:$kafkaVersion"
    api "org.apache.kafka:kafka-server-common:$kafkaVersion"
    api "org.apache.kafka:kafka-storage:$kafkaVersion"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should likely check the pom file(s) and not the Gradle build files which mean many things. Also, if you see only "some" of them, it confirms that this is wrong and should be fixed.

Copy link
Contributor Author

@ShubhamRwt ShubhamRwt Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review/suggestion Jakub, I was able to remove extra exclusions since I saw kafka-server is the only exclusion which was required. Itt was the one bringing in all the other stuff. Thanks again

@im-konge
Copy link
Member

im-konge commented Apr 4, 2025

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ppatierno
Copy link
Member

So the issue we are facing in the commnunity needs more investigation but I think that applying "refresh" here should be enough. Anyway, I see the testKafkaCCAndRebalanceWithMultipleCOs test failing but for a different reasons AFAICS, which is not related to the PR itself. The test seems to "approve" by applying the annotation but, when using kubectl underneath, it's not overwriting the already existing annotation. Tbh I didn't expect this to fail because it works all times elsewhere. Maybe another reasons here @im-konge ?

@im-konge
Copy link
Member

im-konge commented Apr 7, 2025

As I'm discussing with Shubham. Should we annotate it with refresh annotation only if the KafkaRebalance is NotReady state? Or should we apply the approve annotation with force? That would mean changing the behavior a bit for other tests as well.
Or operator if it's in different state then NotReady and Ready will not refresh the rebalance?

@ppatierno
Copy link
Member

so as agreed offline, we can add the check of adding then "refresh" annotation only when the KafkaRebalance is in NotReady state (which is what we are facing here due to the old CC pod where the request goes). We'll try to address the main issue later on.

Signed-off-by: ShubhamRwt <[email protected]>
@ppatierno
Copy link
Member

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

<exclusion>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-raft</artifactId>
</exclusion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the above two kafka-metadata and kafka-raft being brought as dependencies of the others? Just asking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, its just kafka-server which was a new addition to the CC pom

@ppatierno ppatierno merged commit 6019073 into strimzi:main Apr 8, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cruise Control is currently not supported on Kafka 4.0
5 participants