[spring-cloud-gcp-autoconfigure] PubSubHealthIndicator Fails Under Large GCP Pub/Sub Backlogs, Triggering Negative Feedback Loop #3438

Open
tangcent opened this issue Jan 3, 2025 · 2 comments
Labels
priority: p2 type: bug Something isn't working

Comments

@tangcent

tangcent commented Jan 3, 2025

Describe the bug

We’re experiencing periodic spikes in latency from the PubSubHealthIndicator in Spring Cloud GCP whenever there’s a large backlog in Google Cloud Pub/Sub. Although the backlog itself isn’t being pulled by the health check, the overall Pub/Sub system (or network) slows down enough that the “quick pull” call hangs or times out. This marks our service as DOWN in /actuator/health, which can trigger restarts in Kubernetes, creating a negative feedback loop.

Logs:

Health contributor org.springframework.cloud.gcp.autoconfigure.pubsub.health.PubSubHealthIndicator (pubSub) took 10936ms to respond
Health contributor org.springframework.cloud.gcp.autoconfigure.pubsub.health.PubSubHealthIndicator (pubSub) took 89844ms to respond

I'm not sure whether adjusting the health-check settings (e.g., timeouts) would resolve this, or whether we should exclude the PubSubHealthIndicator from the core /actuator/health group.
Since Pub/Sub is designed to absorb backlogs so that our service isn't overwhelmed during high-traffic periods, relying on a “quick pull” to measure health may not be best practice in production.
I'm also not sure why the “quick pull” call hangs or times out when other topics start to build up backlogs.

Any guidance or recommended patterns on handling these scenarios while still monitoring Pub/Sub health would be greatly appreciated.
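
To make the timeout option above concrete, here is a minimal sketch of a probe that is bounded by a deadline and reports a timeout as UNKNOWN instead of DOWN. It deliberately does not use Spring Boot or Spring Cloud GCP: `ProbeStatus` and `boundedProbe` are hypothetical names for illustration, and the lambda stands in for whatever call the real health check makes.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeoutException

// Hypothetical status type for this sketch; not Spring Boot's Health/Status.
enum class ProbeStatus { UP, DOWN, UNKNOWN }

// Runs `probe` on `executor` and waits at most `timeoutMs` for the result.
// A timeout maps to UNKNOWN rather than DOWN, so a slow dependency does not
// by itself fail the overall health check.
fun boundedProbe(executor: ExecutorService, timeoutMs: Long, probe: () -> Boolean): ProbeStatus {
    val future = executor.submit(Callable { probe() })
    return try {
        if (future.get(timeoutMs, TimeUnit.MILLISECONDS)) ProbeStatus.UP else ProbeStatus.DOWN
    } catch (e: TimeoutException) {
        future.cancel(true) // interrupt the probe that is still running
        ProbeStatus.UNKNOWN
    } catch (e: Exception) {
        // The probe threw, or the waiting thread was interrupted.
        ProbeStatus.DOWN
    }
}

fun main() {
    val executor = Executors.newSingleThreadExecutor()
    try {
        // A probe that answers immediately.
        println(boundedProbe(executor, 500) { true })
        // A probe that hangs well past the deadline.
        println(boundedProbe(executor, 100) { Thread.sleep(5_000); true })
    } finally {
        executor.shutdownNow()
    }
}
```

Whether UNKNOWN should count as healthy for a Kubernetes probe is a separate decision; the point is only that the wait is capped and a slow answer is distinguishable from a failed one.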

@diegomarquezp diegomarquezp added type: question Further information is requested priority: p2 labels Jan 15, 2025
@diegomarquezp
Contributor

Hi @tangcent, thanks for reporting this issue.
Could you give us more details about the nature of the backlog and your project setup? A minimal reproducer would help us determine whether this is a configuration issue or something else.

@tangcent
Author

Hi @diegomarquezp, thank you for looking into this!
I've created a minimal reproducer for the PubSubHealthIndicator issue under large GCP Pub/Sub backlogs. Below is the code for a PubsubTestService class that simulates publishing and subscribing to Pub/Sub events using Spring Cloud GCP's PubSubTemplate.

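// Imports are assumed from the legacy org.springframework.cloud.gcp packages
// seen in the log above and Spring Boot 2.x (javax.annotation); adjust for newer versions.
import com.google.gson.Gson
import com.google.protobuf.ByteString
import com.google.pubsub.v1.PubsubMessage
import javax.annotation.PostConstruct
import mu.KLogging
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate
import org.springframework.stereotype.Service
import org.springframework.util.concurrent.ListenableFutureCallback

@Service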
class PubsubTestService(
    private val pubSubTemplate: PubSubTemplate
) {
    companion object : KLogging() {
        const val TOPIC_NAME = "xxx-test"
        const val SUBSCRIPTION_NAME = "xxx-test-sub"

        //const val HEALTH_CHECK_EVENT_COUNT = 0
        const val HEALTH_CHECK_EVENT_COUNT = 100

        //5s
        //const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 5000L

        //10s
        //const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 10000L

        //30s
        const val HEALTH_CHECK_EVENT_PROCESSING_TIME_MS = 30000L
    }

    @PostConstruct
    fun init() {
        subscribeHealthCheckEvent()
        publishHealthCheckEvent(HEALTH_CHECK_EVENT_COUNT)
    }

    fun subscribeHealthCheckEvent() {
        pubSubTemplate.subscribe(SUBSCRIPTION_NAME) { message ->
            val data = message.getPubsubMessage().getData().toStringUtf8()
            val event = Gson().fromJson(data, HealthCheckEvent::class.java)
            println("Received health check event: $event")
            Thread.sleep(HEALTH_CHECK_EVENT_PROCESSING_TIME_MS)
            println("Health check event processed")
        }
    }

    fun publishHealthCheckEvent(eventCount: Int) {
        repeat(eventCount) {
            val event = HealthCheckEvent(System.currentTimeMillis())
            println("Publishing health check event: $event")
            val data = ByteString.copyFromUtf8(Gson().toJson(event))
            val pubsubMessage = PubsubMessage.newBuilder().setData(data).build()
            pubSubTemplate.publish(TOPIC_NAME, pubsubMessage).addCallback(PubsubListenableFutureCallback)
        }
    }
}

data class HealthCheckEvent(
    val time: Long
)

object PubsubListenableFutureCallback : ListenableFutureCallback<String>, KLogging() {

    override fun onSuccess(result: String?) {
        logger.debug { "PubsubListenableFutureCallback success: $result" }
    }

    override fun onFailure(ex: Throwable) {
        logger.error(ex) { "PubsubListenableFutureCallback error: ${ex.message}" }
    }
}

I've noticed that the PubSubHealthIndicator's response time is directly affected by the subscribe method. It seems that the PubSubTemplate#pull call used by PubSubHealthIndicator shares a thread pool with PubSubTemplate#subscribe, so the health check is delayed while slow subscriber callbacks work through a large backlog.
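
The effect I suspect can be illustrated without Pub/Sub at all. The sketch below is not the Spring Cloud GCP internals and does not confirm that pull and subscribe actually share an executor; it only shows that a quick task queued behind slow handlers on a shared fixed-size pool waits for them, while the same task on a dedicated executor returns almost immediately. All names (`timeQuickTask`, `runDemo`, the pool sizes and sleep time) are made up for the illustration.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Submits a trivial task (standing in for the health check's quick pull)
// and returns the time from submission to completion in milliseconds.
fun timeQuickTask(pool: ExecutorService): Long {
    val start = System.nanoTime()
    pool.submit(Callable { "ok" }).get()
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start)
}

// Returns (latency on the shared pool, latency on a dedicated pool).
fun runDemo(): Pair<Long, Long> {
    val workers = 4
    val slowHandlerMs = 500L
    val sharedPool = Executors.newFixedThreadPool(workers)
    val dedicatedPool = Executors.newSingleThreadExecutor()
    try {
        // Two rounds of slow "subscriber callbacks" occupy every worker thread.
        repeat(workers * 2) {
            sharedPool.execute(Runnable { Thread.sleep(slowHandlerMs) })
        }
        // The quick task on the shared pool queues behind the slow handlers.
        val sharedLatency = timeQuickTask(sharedPool)
        // The same task on its own executor has nothing ahead of it.
        val dedicatedLatency = timeQuickTask(dedicatedPool)
        return Pair(sharedLatency, dedicatedLatency)
    } finally {
        sharedPool.shutdown()
        dedicatedPool.shutdown()
    }
}

fun main() {
    val (shared, dedicated) = runDemo()
    println("quick task on shared pool:    ${shared} ms")
    println("quick task on dedicated pool: ${dedicated} ms")
}
```

With the 30 s processing time and 100 queued events in my reproducer, the same queuing effect would explain health-check latencies in the tens of seconds, if the pools are indeed shared.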

Let me know if you need further clarifications or modifications.

@mpeddada1 mpeddada1 added type: bug Something isn't working and removed type: question Further information is requested labels Jan 16, 2025