Many retries that sometimes fail when net.max_worker_connections is 1 #10342

Closed
@johannes-vg

Description

We have a large number of fluent-bit clients that deliver logging data to our fluentd backend, which processes and then forwards this data. We observed that the clients spawn a lot of connections because, by default, net.max_worker_connections is unlimited. For every chunk that is flushed to the output plugin we see one TCP connection being opened, which effectively turns into a DoS on the backend. The documentation claims that I can set net.max_worker_connections to 1 to limit outgoing connections. However, doing so results in a large number of retries, which also sometimes fail even though there is no data in flight.
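
For reference, this is the setting the documentation points to; it goes into the [OUTPUT] section (the full reproduction config follows below):

[OUTPUT]
    Name                       forward
    Match                      *
    # limit this worker to a single outgoing connection
    net.max_worker_connections 1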

I condensed it down to the following example with three chunks and net.max_worker_connections=1:

  1. The input plugin collects three chunks.
  2. The input plugin flushes the three chunks to the output.
  3. The output creates a new connection to the backend and delivers one of the three chunks. The other two chunks receive "no upstream connections available" and are scheduled for a retry in a few seconds (see the scheduler sketch after this list).
  4. Delivery of the first chunk finishes -- the connection is now idle (net.keepalive is on), but the other two chunks are idling around, waiting for their retry.
  5. The retry of the two chunks triggers. Typically, at this point, one of them starts to be transferred successfully while the other is scheduled for yet another retry a few seconds later.
  6. The second chunk is now delivered successfully. The connection is idle again and we could immediately deliver chunk three - however, it keeps waiting for its retry.
  7. The retry of the last chunk triggers.
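
If I understand the engine correctly, the "retry in a few seconds" waits above come from the exponential-backoff retry scheduler rather than from the output plugin itself. A minimal sketch of tightening that backoff, assuming the scheduler.base and scheduler.cap [SERVICE] options are available in this version (I have not verified that this avoids the failed retries):

[SERVICE]
    # base (seconds) of the exponential backoff between retries
    scheduler.base 1
    # upper bound (seconds) on the backoff
    scheduler.cap  10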

Especially at point 5, if I have more than three chunks, the retries will eventually fail with the following message: "[engine] chunk '20448-1747400805.0.flb' cannot be retried: task_id=2, input=systemd.0 > output=forward.0". retries_failed and dropped_records then increase on the metrics API.
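
My guess (an assumption on my side, not confirmed) is that the dropped chunks are hitting the output's Retry_Limit, which to my knowledge defaults to a single retry attempt. A sketch of raising it for the forward output:

[OUTPUT]
    Name        forward
    Match       *
    # allow more retry attempts per chunk before the engine gives up;
    # Retry_Limit False would remove the limit entirely
    Retry_Limit 10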

The minimal configuration I use to reproduce this:

# fluent-bit.conf
[SERVICE]
    Flush             15
    Log_Level         info
    Parsers_File      /etc/fluent-bit/parsers.conf
    Hot_Reload        On      # added only for debugging purposes
    HTTP_Server       On      # added only for debugging purposes
    HTTP_Listen       0.0.0.0 # added only for debugging purposes
    HTTP_PORT         2020    # added only for debugging purposes

[INPUT]
    Name              systemd
    Tag               systemd.*
    Read_From_Tail    On
    Strip_Underscores On
    DB                /data/fluentbit/flb_cursor.db
    Mem_Buf_Limit     2mb
    Max_Entries       500

[OUTPUT]
    #Name              http
    Name              forward
    Match             *
    Host 192.168.104.138
    Port 24224
    #Port 8888  # for http

    net.keepalive on
    net.keepalive_idle_timeout 30
    net.max_worker_connections 1
    workers 1

The fluentd backend configuration:

<source>
  @type http
  port 8888
  bind 0.0.0.0
</source>

<source>
  @type forward
  @log_level error
  bind 0.0.0.0
  port 24224
</source>

<match **>
  @type stdout
</match>

To generate logs I use:

# emit one syslog test message every 100 ms
for i in {1..100000}; do
  logger "[$(date +%s)] Test log message $i"
  sleep 0.1
done

To investigate the TCP connections I use:

# show only SYN/RST/FIN packets so that every opened and closed connection is visible
sudo tcpdump -i any -n '(port 8888 or port 24224) and ((tcp[tcpflags] & (tcp-syn) != 0) or (tcp[tcpflags] & (tcp-rst) != 0) or (tcp[tcpflags] & (tcp-fin) != 0))'

My expectation would be that, when the input flushes e.g. three chunks to the output, the remaining two chunks are queued while the first chunk is transferred, and are then sent immediately once the connection becomes free again, without going through the retry mechanism at all. How can I achieve this behavior?
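
The only middle ground I can think of so far is to allow a small pool of keepalive connections instead of exactly one, so that fewer concurrent flushes run into "no upstream connections available" in the first place. This is only a sketch; I have not verified how much it helps in practice, and it does not answer the queuing question above:

[OUTPUT]
    Name                        forward
    Match                       *
    Host                        192.168.104.138
    Port                        24224
    net.keepalive               on
    net.keepalive_idle_timeout  30
    # a few reusable connections instead of exactly one
    net.max_worker_connections  4
    workers                     1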
