Description
We have a large number of fluent-bit clients that deliver logging data to our fluentd backend, which processes and then forwards this data. We observed that they open a lot of connections, because net.max_worker_connections is unlimited by default. For each chunk that is flushed to the output plugin we see one TCP connection being opened, which effectively results in a DoS on the backend. The documentation says that I can set net.max_worker_connections to 1 to limit outgoing connections. However, this results in a large number of retries, which sometimes even fail although there is no other data in flight.
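That is, in the [OUTPUT] section (full configuration further below):
[OUTPUT]
    net.max_worker_connections 1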
With three chunks and net.max_worker_connections=1, the sequence looks like this:
1. The input plugin collects three chunks.
2. The input plugin flushes the three chunks to the output.
3. The output creates a new connection to the backend and delivers one of the three chunks. The other two chunks receive "no upstream connections available" and are scheduled for retry in a few seconds.
4. Delivery of the first chunk finishes; the connection is now idle (net.keepalive is on), but the other two chunks are idling around waiting for their retry (see the ss check right after this list).
5. The retry of the two chunks fires. Typically one of them starts transferring successfully at this point, while the other one is again scheduled for retry in a few seconds.
6. The second chunk is delivered. The connection is idle again and chunk three could be delivered immediately, yet it keeps waiting for its retry.
7. The retry of the last chunk fires.
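To confirm step 4, i.e. that the single keepalive connection stays established between chunks, I watch the connections towards the backend (192.168.104.138:24224 from the configuration below):
ss -tn state established '( dport = :24224 )'
The count stays at one established connection the whole time, even while the other chunks are waiting for their retry.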
Especially at point 5, if I have more than three chunks, retries will fail with the following message: "[engine] chunk '20448-1747400805.0.flb' cannot be retried: task_id=2, input=systemd.0 > output=forward.0". retries_failed and dropped_records increase on the metrics API.
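Those counters come from the built-in HTTP server (enabled in the configuration below). I read them roughly like this, although the exact JSON layout may differ between fluent-bit versions:
curl -s http://127.0.0.1:2020/api/v1/metrics | jq '.output."forward.0"'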
I condensed it down to the following minimal reproduction setup:
# fluent-bit.conf
[SERVICE]
    Flush             15
    Log_Level         info
    Parsers_File      /etc/fluent-bit/parsers.conf
    Hot_Reload        On       # only added for debugging purposes
    HTTP_Server       On       # only added for debugging purposes
    HTTP_Listen       0.0.0.0  # only added for debugging purposes
    HTTP_PORT         2020     # only added for debugging purposes

[INPUT]
    Name              systemd
    Tag               systemd.*
    Read_From_Tail    On
    Strip_Underscores On
    DB                /data/fluentbit/flb_cursor.db
    Mem_Buf_Limit     2mb
    Max_Entries       500

[OUTPUT]
    #Name             http
    Name              forward
    Match             *
    Host              192.168.104.138
    Port              24224
    #Port             8888     # for http
    net.keepalive              on
    net.keepalive_idle_timeout 30
    net.max_worker_connections 1
    workers                    1
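To rule out configuration mistakes I validate the file with fluent-bit's dry-run mode first (assuming a recent version that supports the flag; the path is simply where I keep the file above):
fluent-bit --dry-run -c /etc/fluent-bit/fluent-bit.conf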
The fluentd backend configuration:
<source>
  @type http
  port 8888
  bind 0.0.0.0
</source>

<source>
  @type forward
  @log_level error
  bind 0.0.0.0
  port 24224
</source>

<match **>
  @type stdout
</match>
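For the test the backend runs as a plain foreground process (the config path is just where I keep the file above):
fluentd -c /etc/fluentd/fluentd.conf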
To generate logs I use:
for i in {1..100000}; do
  logger "[$(date +%s)] Test log message $i"
  sleep 0.1
done
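To verify that these messages actually reach the journal that the systemd input reads, I follow it in parallel:
journalctl -f | grep "Test log message"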
To investigate the TCP connections I use:
sudo tcpdump -i any -n '(port 8888 or port 24224) and ((tcp[tcpflags] & (tcp-syn) != 0) or (tcp[tcpflags] & (tcp-rst) != 0) or (tcp[tcpflags] & (tcp-fin) != 0))'
My expectation would be that when the input flushes, e.g., three chunks to the output, the remaining chunks are queued while the first one is being transferred, and as soon as it finishes the next one is sent immediately on the same connection, without going through the retry mechanism at all. How can I achieve this behavior?
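For completeness: the closest workaround I found so far is shrinking the retry window via the scheduler settings in [SERVICE] (assuming scheduler.base and scheduler.cap also govern these per-chunk retries), but that only makes the waits shorter instead of removing them:
[SERVICE]
    # retry waits start around scheduler.base seconds and are capped at scheduler.cap seconds
    scheduler.base 1
    scheduler.cap  5
So the question stands: is there a way to make pending chunks queue on the single connection instead of going through the retry path?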