[spring-cloud-gcp-autoconfigure] PubSubHealthIndicator Fails Under Large GCP Pub/Sub Backlogs, Triggering Negative Feedback Loop #3438

tangcent · 2025-01-03T08:10:58Z

Describe the bug

We’re experiencing periodic spikes in latency from the PubSubHealthIndicator in Spring Cloud GCP whenever there’s a large backlog in Google Cloud Pub/Sub. Although the backlog itself isn’t being pulled by the health check, the overall Pub/Sub system (or network) slows down enough that the “quick pull” call hangs or times out. This marks our service as DOWN in /actuator/health, which can trigger restarts in Kubernetes, creating a negative feedback loop.

Logs:

Health contributor org.springframework.cloud.gcp.autoconfigure.pubsub.health.PubSubHealthIndicator (pubSub) took 10936ms to respond
Health contributor org.springframework.cloud.gcp.autoconfigure.pubsub.health.PubSubHealthIndicator (pubSub) took 89844ms to respond

I'm not sure whether updating the health-check settings (e.g., timeouts) would resolve this issue, or whether we should exclude the PubSubHealthIndicator from the core /actuator/health group.
Since Pub/Sub is designed to absorb backlogs so that our service is not overwhelmed during high-traffic periods, relying on a “quick pull” to measure health may not be best practice in production.
I'm also not entirely sure why the “quick pull” call hangs or times out when other topics start building up backlogs.
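
For reference, the two options weighed above (tuning the timeout, or keeping the indicator out of the health group that Kubernetes probes) might look roughly like this in application.properties. This is a sketch, not a verified fix: the spring.cloud.gcp.pubsub.health.* keys are documented for newer Spring Cloud GCP releases and may not exist in the older org.springframework.cloud.gcp artifacts that the logs above point to, and the group name "core" is a hypothetical name chosen for illustration. Whether a timeout helps when the pull itself is delayed has not been established in this thread.

```properties
# Option 1 (assumes a Spring Cloud GCP version that supports this key):
# bound how long the health check's pull may wait for a response.
spring.cloud.gcp.pubsub.health.timeout-millis=2000

# Option 2: keep the pubSub indicator enabled, but expose a separate
# health group without it. "core" is a hypothetical group name.
management.endpoint.health.group.core.include=*
management.endpoint.health.group.core.exclude=pubSub

# Option 3: switch the indicator off entirely.
# management.health.pubsub.enabled=false
```

With Option 2, the Kubernetes probes would target /actuator/health/core instead of /actuator/health, so a slow Pub/Sub check would still be visible on the full endpoint without triggering restarts.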

Any guidance or recommended patterns on handling these scenarios while still monitoring Pub/Sub health would be greatly appreciated.


diegomarquezp · 2025-01-15T16:26:31Z

Hi @tangcent, thanks for reporting this issue.
Could you give us more details about the nature of the backlog and your project setup? A minimal reproducer would help us determine whether this is a configuration issue or something else.

tangcent · 2025-01-16T06:52:57Z

Hi @diegomarquezp, thank you for looking into this!
I've created a minimal reproducer for the PubSubHealthIndicator issue under large GCP Pub/Sub backlogs. Below is the code for a PubsubTestService class that simulates publishing and subscribing to Pub/Sub events using Spring Cloud GCP's PubSubTemplate.

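// Imports assume the older org.springframework.cloud.gcp artifacts
// (the package shown in the logs above) and the kotlin-logging library.
import com.google.gson.Gson
import com.google.protobuf.ByteString
import com.google.pubsub.v1.PubsubMessage
import javax.annotation.PostConstruct
import mu.KLogging
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate
import org.springframework.stereotype.Service
import org.springframework.util.concurrent.ListenableFutureCallback

@Service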
class PubsubTestService(
    private val pubSubTemplate: PubSubTemplate
) {
    companion object : KLogging() {
        const val TOPIC_NAME = "xxx-test"
        const val SUBSCRIPTION_NAME = "xxx-test-sub"

        //const val HEALTH_CHECK_EVENT_COUNT = 0
        const val HEALTH_CHECK_EVENT_COUNT = 100

        //5s
        //const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 5000L

        //10s
        //const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 10000L

        //30s
        const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 30000L
    }

    @PostConstruct
    fun init() {
        subscribeHealthCheckEvent()
        publishHealthCheckEvent(HEALTH_CHECK_EVENT_COUNT)
    }

    fun subscribeHealthCheckEvent() {
        pubSubTemplate.subscribe(SUBSCRIPTION_NAME) { message ->
            val data = message.getPubsubMessage().getData().toStringUtf8()
            val event = Gson().fromJson(data, HealthCheckEvent::class.java)
            println("Received health check event: $event")
            Thread.sleep(HEALTH_CHECK_EVENT_PROCESSING_TIME_MS)
            println("Health check event processed")
        }
    }

    fun publishHealthCheckEvent(eventCount: Int) {
        repeat(eventCount) {
            val event = HealthCheckEvent(System.currentTimeMillis())
            println("Publishing health check event: $event")
            val data = ByteString.copyFromUtf8(Gson().toJson(event))
            val pubsubMessage = PubsubMessage.newBuilder().setData(data).build()
            pubSubTemplate.publish(TOPIC_NAME, pubsubMessage).addCallback(PubsubListenableFutureCallback)
        }
    }
}

data class HealthCheckEvent(
    val time: Long
)

object PubsubListenableFutureCallback : ListenableFutureCallback<String>, KLogging() {

    override fun onSuccess(result: String?) {
        logger.debug { "PubsubListenableFutureCallback success: $result" }
    }

    override fun onFailure(ex: Throwable) {
        logger.error(ex) { "PubsubListenableFutureCallback error: ${ex.message}" }
    }
}

I've noticed that the PubSubHealthIndicator's response time is directly affected by the subscribe method. It seems that the PubSubTemplate#pull method used within PubSubHealthIndicator shares the same thread pool with PubSubTemplate#subscribe, which causes delays when processing large backlogs.
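
The shared-thread-pool explanation above is a hypothesis, and the sketch below does not confirm it for Spring Cloud GCP. It only illustrates the general effect with plain JDK executors: a quick task submitted to a fixed pool whose threads are busy with slow work has to wait its turn. The function name quickTaskLatencyMs and all of the sizes and durations are made up for illustration, and no Pub/Sub calls are involved.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Returns how long a no-op "quick" task takes to complete when it is queued
// behind `slowTasks` blocking tasks on a fixed pool of `poolSize` threads.
fun quickTaskLatencyMs(poolSize: Int, slowTasks: Int, slowTaskMs: Long): Long {
    val pool = Executors.newFixedThreadPool(poolSize)
    try {
        // Stand-ins for subscriber callbacks that block, like the
        // Thread.sleep call in the reproducer above.
        repeat(slowTasks) {
            pool.submit(Runnable { Thread.sleep(slowTaskMs) })
        }
        val start = System.nanoTime()
        // Stand-in for the health check's quick pull: it does no work,
        // so any measured latency is time spent waiting for a free thread.
        pool.submit(Runnable { }).get()
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
    } finally {
        pool.shutdownNow()
    }
}

fun main() {
    val busy = quickTaskLatencyMs(poolSize = 2, slowTasks = 4, slowTaskMs = 200)
    val idle = quickTaskLatencyMs(poolSize = 2, slowTasks = 0, slowTaskMs = 200)
    println("quick task behind slow tasks: $busy ms")
    println("quick task on an idle pool:   $idle ms")
}
```

If the hypothesis holds, the same queuing would explain why the health check's response time grows with the subscriber's processing time in the reproducer, and running the health check's pull on an executor that the subscribers do not use would be one direction to investigate.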

Let me know if you need further clarifications or modifications.

diegomarquezp added the "type: question" and "priority: p2" labels on Jan 15, 2025

mpeddada1 added the "type: bug" label and removed the "type: question" label on Jan 16, 2025
