We’re experiencing periodic spikes in latency from the PubSubHealthIndicator in Spring Cloud GCP whenever there’s a large backlog in Google Cloud Pub/Sub. Although the backlog itself isn’t being pulled by the health check, the overall Pub/Sub system (or network) slows down enough that the “quick pull” call hangs or times out. This marks our service as DOWN in /actuator/health, which can trigger restarts in Kubernetes, creating a self-reinforcing restart loop.
Logs:
Health contributor org.springframework.cloud.gcp.autoconfigure.pubsub.health.PubSubHealthIndicator (pubSub) took 10936ms to respond
Health contributor org.springframework.cloud.gcp.autoconfigure.pubsub.health.PubSubHealthIndicator (pubSub) took 89844ms to respond
I'm not sure whether updating the health-check settings (e.g., timeouts) would resolve this issue, or whether we should exclude the PubSubHealthIndicator from the core /actuator/health group.
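For reference, the exclusion option could look roughly like the fragment below. This is a sketch, not a confirmed fix: it assumes Spring Boot 2.3 or later (health groups and probe groups) and that the Kubernetes probes can be pointed at /actuator/health/liveness and /actuator/health/readiness instead of the aggregate endpoint. The group name "messaging" is hypothetical; "pubSub" is the contributor name shown in the logs.

```properties
# Expose dedicated probe groups that do not depend on Pub/Sub.
management.endpoint.health.probes.enabled=true
management.endpoint.health.group.liveness.include=livenessState
management.endpoint.health.group.readiness.include=readinessState

# Keep Pub/Sub observable under its own group, e.g. /actuator/health/messaging.
# "messaging" is a hypothetical group name chosen for this sketch.
management.endpoint.health.group.messaging.include=pubSub

# Alternative: disable the Pub/Sub indicator entirely.
# management.health.pubsub.enabled=false
```

With this layout a slow Pub/Sub health check would still be visible to monitoring, but it would no longer flip the probes that Kubernetes uses to decide on restarts.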
Since Pub/Sub is designed to handle backlogs to protect our service from being overwhelmed during high traffic periods, relying on a “quick pull” to measure health may not be best practice in production.
Also, I’m not entirely sure why the “quick pull” call hangs or times out when other topics start to build up backlogs.
Any guidance or recommended patterns on handling these scenarios while still monitoring Pub/Sub health would be greatly appreciated.
Hi @tangcent, thanks for reporting this issue.
Could you give us more details about the nature of the backlog and your project setup? A minimal reproducer would help us determine whether this is a configuration issue or something else.
Hi @diegomarquezp, thank you for looking into this!
I've created a minimal reproducer for the PubSubHealthIndicator issue under large GCP Pub/Sub backlogs. Below is the code for a PubsubTestService class that simulates publishing and subscribing to Pub/Sub events using Spring Cloud GCP's PubSubTemplate.
@Service
class PubsubTestService(
    private val pubSubTemplate: PubSubTemplate
) {

    companion object : KLogging() {
        const val TOPIC_NAME = "xxx-test"
        const val SUBSCRIPTION_NAME = "xxx-test-sub"

        //const val HEALTH_CHECK_EVENT_COUNT = 0
        const val HEALTH_CHECK_EVENT_COUNT = 100

        //5s
        //const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 5000L
        //10s
        //const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 10000L
        //30s
        const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 30000L
    }

    @PostConstruct
    fun init() {
        subscribeHealthCheckEvent()
        publishHealthCheckEvent(HEALTH_CHECK_EVENT_COUNT)
    }

    fun subscribeHealthCheckEvent() {
        pubSubTemplate.subscribe(SUBSCRIPTION_NAME) { message ->
            val data = message.getPubsubMessage().getData().toStringUtf8()
            val event = Gson().fromJson(data, HealthCheckEvent::class.java)
            println("Received health check event: $event")
            Thread.sleep(HEALTH_CHECK_EVENT_PROCESSING_TIME_MS)
            println("Health check event processed")
        }
    }

    fun publishHealthCheckEvent(eventCount: Int) {
        repeat(eventCount) {
            val event = HealthCheckEvent(System.currentTimeMillis())
            println("Publishing health check event: $event")
            val data = ByteString.copyFromUtf8(Gson().toJson(event))
            val pubsubMessage = PubsubMessage.newBuilder().setData(data).build()
            pubSubTemplate.publish(TOPIC_NAME, pubsubMessage).addCallback(PubsubListenableFutureCallback)
        }
    }
}

data class HealthCheckEvent(
    val time: Long
)

object PubsubListenableFutureCallback : ListenableFutureCallback<String>, KLogging() {
    override fun onSuccess(result: String?) {
        logger.debug { "PubsubListenableFutureCallback success: $result" }
    }

    override fun onFailure(ex: Throwable) {
        logger.error(ex) { "PubsubListenableFutureCallback error: ${ex.message}" }
    }
}
I've noticed that the PubSubHealthIndicator's response time is directly affected by the subscribe method. It seems that the PubSubTemplate#pull method used within PubSubHealthIndicator shares the same thread pool with PubSubTemplate#subscribe, which causes delays when processing large backlogs.
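The suspected mechanism can be illustrated without Pub/Sub at all. The sketch below uses only JDK executors and is not the Spring Cloud GCP implementation: `quickTaskDelayMs` is a hypothetical helper, the blocking tasks stand in for subscriber callbacks that sleep, and the trivial task stands in for the health check's quick pull. It only shows that a short task queued on a saturated fixed-size pool waits for the slow tasks ahead of it.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Hypothetical helper: measures how long a trivial task waits before it starts
// on a fixed pool of `poolSize` threads already loaded with `slowTasks` blocking tasks.
fun quickTaskDelayMs(poolSize: Int, slowTasks: Int, slowTaskMs: Long): Long {
    val pool = Executors.newFixedThreadPool(poolSize)
    try {
        // Stand-in for subscriber callbacks that block their worker thread.
        repeat(slowTasks) {
            pool.execute {
                try {
                    Thread.sleep(slowTaskMs)
                } catch (e: InterruptedException) {
                    // The pool is shutting down; nothing left to do.
                }
            }
        }
        val submittedAt = System.nanoTime()
        // Stand-in for the quick pull: it queues behind the slow tasks.
        val quick = pool.submit(Callable { System.nanoTime() - submittedAt })
        return TimeUnit.NANOSECONDS.toMillis(quick.get())
    } finally {
        pool.shutdownNow()
    }
}

fun main() {
    val saturated = quickTaskDelayMs(poolSize = 2, slowTasks = 4, slowTaskMs = 200)
    val idle = quickTaskDelayMs(poolSize = 2, slowTasks = 0, slowTaskMs = 200)
    println("quick task delay on saturated pool: $saturated ms")
    println("quick task delay on idle pool: $idle ms")
}
```

If the health check and the subscriber callbacks really do share an executor, the same queueing effect would explain why response times grow with the callback processing time and the number of pending messages.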
Let me know if you need further clarifications or modifications.