bug: Vector stops sending logs after StatefulSet restart due to headless service #1938
Comments
hey @StianOvrevage
just issued release 0.8.15, you can override
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Chart name and version
chart: victoria-logs-single
version: v0.8.13
Describe the bug
TL;DR: Vector gets stuck and cannot send logs to VictoriaLogs after a restart of the VL StatefulSet, because Vector connects directly to the VL Pod's IP, which changes after the restart, and Vector never reconnects.
We have just deployed VictoriaLogs and Vector using the helm chart and as many default values as possible.
In the cluster (GKE) we also have our own deployment of istio (v1.24).
As defined in the helm chart, we get a StatefulSet and a headless Service (request: add a `-headless` suffix to the Service name, since this stumped us for a few minutes). It does not appear to be possible to make the Service not headless (i.e. to have it obtain a ClusterIP and thus let K8s manage routing).

By default the helm chart produces config for Vector with the endpoint `statefulset-pod-name-0.victorialogs-namespace.svc.cluster.local`. This makes Vector resolve the IP of `statefulset-pod-name-0` directly. However, if `statefulset-pod-name-0` is ever restarted or rescheduled for any reason, its IP will change. This change is not picked up by Vector (or possibly by the istio-proxy sidecar), leaving it stuck and unable to send logs until all Vector pods are restarted.

The logs emitted by Vector look like this:
I'm not sure if this is 100% Vector's fault (its docs say it performs a complete reconnect every time), istio's fault (this may indicate it's not entirely unrelated: istio/istio#54539), or a combination.
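For reference, a rough sketch of what the problematic Vector sink configuration looks like in this setup; the sink type, resource names, namespace, and port are illustrative assumptions, not the chart's exact output:

```yaml
# Illustrative sketch only; sink type, names, namespace and port are assumptions.
# The key point is that the endpoint is the pod-level DNS name behind the
# headless Service, which resolves straight to the pod IP.
sinks:
  victorialogs:
    type: elasticsearch
    inputs: [parse_logs]
    # Resolves to the IP of the ...-0 pod; that IP changes when the pod is
    # rescheduled, and the stuck connection is never re-resolved.
    endpoints:
      - http://statefulset-pod-name-0.victorialogs-namespace.svc.cluster.local:9428/insert/elasticsearch/
    mode: bulk
```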
Proposed fix
Add an option to the helm chart to generate a Service that is not headless, or to create an additional non-headless Service alongside it; either would "fix" the problem.
I'm sure you have your reasons for using a headless Service with regard to HA, clustering, etc. But for the time being, with VictoriaLogs being a "single instance" service, I don't see any significant drawbacks to using a regular non-headless Service.
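To illustrate the workaround, here is a minimal sketch of an additional non-headless (ClusterIP) Service that could sit in front of the VictoriaLogs pod; the name, namespace, selector labels and port are assumptions rather than the chart's actual values:

```yaml
# Hypothetical workaround manifest; name, namespace, selector and port are
# assumptions for illustration. A ClusterIP Service gets a stable virtual IP,
# so kube-proxy keeps routing to the new pod IP after a restart.
apiVersion: v1
kind: Service
metadata:
  name: victorialogs-clusterip
  namespace: logging
spec:
  type: ClusterIP
  selector:
    app: victoria-logs-single   # must match the VictoriaLogs pod labels in your release
  ports:
    - name: http
      port: 9428
      targetPort: 9428
```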
Custom values
Relevant excerpts of our values.yaml. They contain workarounds for the problems above: the default Vector endpoint is overridden to point at a custom VictoriaLogs Service we've deployed.
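A hedged sketch of the shape such an override can take; the key paths (the Vector subchart's `customConfig`), the sink type, and the Service name are assumptions, not our literal values.yaml:

```yaml
# Sketch only; key paths, sink type and names are assumptions. Only the sink
# portion is shown. The idea: point Vector at a stable, non-headless Service
# instead of the pod-level DNS name generated by default.
vector:
  enabled: true
  customConfig:
    sinks:
      victorialogs:
        type: elasticsearch
        inputs: [parse_logs]
        endpoints:
          - http://victorialogs-clusterip.logging.svc.cluster.local:9428/insert/elasticsearch/
        mode: bulk
```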