Flush sinks during shutdown #11405
Hi @jszwedko, are there any other ways for us to handle this case in the agents themselves? It would be nice if Vector could flush the logs upon termination and also support size-based uploads like fluentbit/fluentd.
Hi @manivannan-g! Vector can do size-based uploads (see the sink's `batch` settings). One workaround is configuring a disk buffer on the sink so that the data is preserved across Vector restarts.
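For anyone looking for the concrete shape of that workaround, here is a minimal sketch of an `aws_s3` sink with size-based batching and a disk buffer, assuming a YAML config file; the inputs, bucket, region, and thresholds are placeholders rather than recommendations:

```yaml
sinks:
  s3_archive:
    type: aws_s3
    inputs: ["app_logs"]          # hypothetical upstream component
    bucket: "example-log-bucket"  # placeholder bucket name
    region: "us-east-1"           # placeholder region
    encoding:
      codec: json
    batch:
      max_bytes: 10000000         # upload once roughly 10 MB has accumulated (size-based)
      timeout_secs: 300           # or once the batch is 5 minutes old
    buffer:
      type: disk                  # keep queued events on disk so restarts don't drop them
      max_size: 268435488         # disk buffer capacity in bytes
      when_full: block
```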
Thanks, @jszwedko! That would definitely be helpful in the case of pod (Vector agent) restarts. May I know what happens during node termination? Does Vector upload whatever is in the buffer to S3 and then exit?
Ah, no, you would run into the same issue where Vector would fail to flush the data before shutting down. Also, in the event of a hard crash, the data would be lost. A setup I might suggest for you is:
Really, though, we should just resolve this issue, which would reduce data loss for you except in the event of an abrupt node failure.
I just wanted to ask as it may be somewhat related. We're thinking of using Vector to process logs from AWS S3 using an SQS queue. The idea was to run Vector in ECS Fargate and scale up based on CPU load and/or queue back pressure. The concern we have is whether Vector handles termination gracefully during scale-down. Our worry is that if we have a log file "in flight" and Vector gets the SIGTERM from ECS telling it to terminate the container, will it:
From reading this thread, it seems like any job you have in flight currently has 60 seconds to finish or it is lost. The worry then is that half the data gets sent on to the destination, but the SQS event doesn't get deleted and so is made visible again in the queue, where it could be picked up by another node and processed again, leading to duplication. Is anyone able to offer any insight into whether this would be the case?
Hi @NeilJed! Apologies for the delayed response. I think what you are describing would currently be the case: if Vector isn't able to finish processing within 60 seconds, it terminates without deleting the SQS event.
Hi! Are there any updates on the prioritization of this feature? Handling the termination case is a critical requirement for our workflow.
+1 for implementing this
Does anyone know of any hacks we can use in the meantime to ensure chunks are flushed to S3?
I set the batch timeout to lower than 60 seconds:

```yaml
batch:
  timeout_secs: 45
```
This has an edge case: the source reads some events just before the SIGTERM and they make it into a batch whose timer starts 15 seconds after the SIGTERM. Those logs won't get flushed, since they would only be flushed 30 seconds after the forced shutdown occurs. The same applies if, for some reason, it takes more than 15 seconds to successfully send a batch to its destination.

My idea here is questionable, but it adds some more assurance that the batch `timeout_secs` will apply as desired. It can't work with all Vector sources, though, because it requires a way to "drain" them externally, and there's no drain-source API. The gist is: if you can write a custom shutdown handler that catches the SIGTERM and cuts off the sources when the shutdown starts, you can then wait until they're "flushed" (either by monitoring …) before stopping Vector itself.

In k8s you can do this with a preStop hook (rough sketch below), but again it's limited to sources that you can cut off outside of Vector. In my case I'm draining and shutting down the proxy that other systems send to Vector through, then waiting a bit before shutting down Vector itself. Possibly a wrapper bash script that starts Vector and catches the SIGTERM before it hits Vector could be used outside of k8s, but I'm not too familiar with how that would work.

This is mostly helpful for k8s with HPA scale-downs and when pods are moved off a node (such as for cluster upgrades or just resource management). With spot or preemptible instances, this would need to be combined with a small …

Which is all very hacky and only covers a few types of sources. Being able to increase the 60 seconds to some timeout value of our choosing would be a decent interim solution, until source+transform+sink draining and flushing is implemented.
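A rough sketch of the k8s piece referenced above, assuming a preStop hook; the drain endpoint, image tag, and timings are invented for illustration and would need to match whatever proxy actually sits in front of Vector:

```yaml
# Hypothetical pod spec fragment: drain the upstream proxy, give in-flight
# batches time to hit their (reduced) timeout, then let Kubernetes send
# SIGTERM to the Vector container.
spec:
  terminationGracePeriodSeconds: 120       # must exceed the preStop wait plus Vector's own shutdown window
  containers:
    - name: vector
      image: timberio/vector:0.32.1-debian # placeholder image tag
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              # Placeholder drain call for the proxy in front of Vector,
              # followed by a pause so pending batches can flush.
              - "curl -sf -X POST http://localhost:9901/drain || true; sleep 50"
```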
Some notes from a discussion:

- It's possible that the new-style sink architecture makes this easier. Assuming that's true, we could resolve this for the new-style sinks and continue efforts to port the remaining sinks.
- This issue is also only really important for sinks with long batch timeouts, like the blob store sinks, since the rest would flush within the grace period anyway.
Any plans for this to be prioritized?
Adding a regression test for the related issue: #11405
Closing this, as it appears to work correctly on the latest version of Vector. #17667 adds a test for it.
I'm still seeing an S3 sink time out on SIGTERM on Vector 0.32.1. Here is the schema of my sink configuration (with some fake values):
...and here are the logs that I am seeing:
All of my other sinks (an assortment of Loki / Prometheus Remote Write / etc.) flush fine. Is there some requirement that logs must stop being emitted from the source in order for the S3 sink to shut down properly? (I don't think this is the case for the other sinks?)
Hi @shomilj! One guess I have is that your disk buffer still has data in it. When shutting down, I believe the sinks will try to drain the buffers before stopping. Are you able to check whether the disk buffer had data in it? You can look at this internal metric to check: https://vector.dev/docs/reference/configuration/sources/internal_metrics/#buffer_events. If that is the case, I can open a separate issue, since I think waiting is unnecessary for disk buffers given the data will be persisted.
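If it helps, here is one small sketch of a way to surface that metric, assuming an `internal_metrics` source wired to a `prometheus_exporter` sink; the component names are arbitrary and the exact exported metric name can vary by version:

```yaml
sources:
  vector_internal:
    type: internal_metrics

sinks:
  internal_metrics_exporter:
    type: prometheus_exporter
    inputs: ["vector_internal"]
    address: "0.0.0.0:9598"   # buffer gauges appear on this endpoint under /metrics
```

Watching the buffer gauge for the S3 sink's component just before and during shutdown should show whether events are still sitting in the disk buffer.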
@jszwedko can you confirm that sinks try to drain disk buffers when SIGTERM is sent? Can you direct me to the code responsible for that?
Correct, Vector will wait for buffers to drain. If that doesn't happen within 60 seconds (the default timeout), Vector will still shut down anyway. You can disable the shutdown timeout via …
Use Cases
Motivated by https://discord.com/channels/742820443487993987/746070591097798688/943246759294038017
The user in question has an `aws_s3` sink with a large batch timeout (as would be common). In this case, if a SIGTERM is sent to Vector, Vector won't try to flush the batch; it will wait the 60 seconds and then just terminate.
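For illustration only, a hypothetical sink of the shape being described (placeholder inputs, bucket, and region; the long `batch.timeout_secs` is the relevant part) might look like:

```yaml
sinks:
  s3_logs:
    type: aws_s3
    inputs: ["kubernetes_logs"]   # hypothetical source
    bucket: "example-bucket"      # placeholder
    region: "us-east-1"           # placeholder
    encoding:
      codec: json
    batch:
      timeout_secs: 900           # 15-minute batches, far longer than the 60-second shutdown grace period
```

With a batch window like that, a SIGTERM almost always arrives while a partially filled batch is still waiting on its timer.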
Attempted Solutions
No response
Proposal
During shutdown, Vector should attempt to:
References
Version
No response