Description
Steps to reproduce:
- Workload with no deviation from long-term trends
- No outside action (e.g. increased writes to buckets or a read query returning many datapoints)
Expected behaviour:
No outage of the Influx service, given that no outside action was taken
Actual behaviour:
Once every one or two weeks, around 00:00 UTC, most write (Telegraf) and read (Grafana) requests end with timeout errors on the clients, and memory consumption grows until the influxd process is reaped by the OOM killer. I suspect some background task on the Telegraf bucket (compaction, shard rotation, or similar), because the outages appear with a fixed period, traffic volume is within long-term trends and should not cause increased use of system resources, and write/read requests to other buckets return normal (2xx) HTTP statuses.
The Telegraf bucket I suspect of causing the problem reports ~350G storage size (storage_shard_disk_size) and has a "forever" retention policy, which means a 7-day shard group duration, as I haven't configured a custom shard duration.
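For reference, the bucket's retention period and shard group duration can be inspected, and an explicit shard group duration set, with the influx CLI. This is only a sketch, assuming the org/bucket names from the logs below and a CLI configured with a valid token; `<bucket-id>` is a placeholder:

```shell
# Show retention period and shard-group duration of the suspected bucket
influx bucket list --org custom_org --name telegraf

# Optionally pin an explicit shard-group duration (the implicit default for an
# infinite retention period is 7d); <bucket-id> comes from the output above
influx bucket update --id <bucket-id> --shard-group-duration 7d
```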
A similar issue has been discussed in #24406. I couldn't reopen that issue, but the problem seems very similar.
Environment info:
- Influx runs in podman from the image docker.io/library/influxdb:2.7 (amd64, version 2.7.10); no other containers run on the host
- The podman host is a virtual machine with 16 vCPUs and 32G RAM
- Custom environment variables (an example launch command is sketched after this list):
INFLUXD_LOG_LEVEL=debug
INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=2
INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=10331648
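For completeness, a minimal sketch of the container launch with these variables (container name, published port, and volume path are assumptions, not the exact command used):

```shell
podman run -d --name influxdb \
  -p 8086:8086 \
  -v /var/lib/influxdb2:/var/lib/influxdb2 \
  -e INFLUXD_LOG_LEVEL=debug \
  -e INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=2 \
  -e INFLUXD_STORAGE_COMPACT_THROUGHPUT_BURST=10331648 \
  docker.io/library/influxdb:2.7
```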
Logs:
Heap profile images captured before the problem occurs, 10 minutes after it starts, and in the last minute before the OOM killer reaped the process
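For reproducibility, heap profiles like these can be captured from influxd's Go pprof endpoint; a minimal sketch, assuming the pprof endpoints are enabled (the default) and reachable on port 8086, and that a local Go toolchain with graphviz is available for rendering:

```shell
# Grab a heap profile from the running influxd instance
curl -o heap.pprof http://influxhost:8086/debug/pprof/heap
# Render it as an image
go tool pprof -png heap.pprof > heap.png
```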
Logs from Telegraf agent (write)
2025-06-16T00:00:18Z E! [outputs.influxdb_v2] When writing to [http://influxhost:8086/api/v2/write]: Post "http://influxhost:8086/api/v2/write?bucket=telegraf&org=custom_org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2025-06-16T00:00:18Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
...
2025-06-16T00:28:03Z E! [outputs.influxdb_v2] When writing to [http://influxhost:8086/api/v2/write]: Post "http://influxhost:8086/api/v2/write?bucket=telegraf&org=custom_org": dial tcp influxhost:8086: connect: connection refused
2025-06-16T00:28:03Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
Logs from Grafana (read)
2025-06-16T00:00:40.726475259Z logger=tsdb.influx_flux endpoint=queryData pluginId=influxdb dsName=influxdb-v2 dsUID=xxx uname=grafana_scheduler rule_uid=xxx org_id=1 t=2025-06-16T00:00:40.726475259Z level=warn msg="Flux query failed" err="Post \"http://influxhost:8086/api/v2/query?org=custom_org\": context deadline exceeded" query="from(bucket: \"telegraf\")..."
...
2025-06-16T00:28:06.297792863Z logger=tsdb.influx_flux endpoint=queryData pluginId=influxdb dsName=influxdb-v2 dsUID=xxx uname=grafana_scheduler rule_uid=xxx org_id=1 t=2025-06-16T00:28:06.297792863Z level=warn msg="Flux query failed" err="Post \"http://influxhost:8086/api/v2/query?org=custom_org\": dial tcp influxhost:8086: connect: connection refused" query="from(bucket: \"telegraf\")..."
Logs from Influx
ts=2025-06-16T00:00:13.147125Z lvl=info msg="Write failed" log_id=0wwmuKel000 service=storage-engine service=write shard=6443 error="engine: context canceled"
ts=2025-06-16T00:00:13.147342Z lvl=debug msg=Request log_id=0wwmuKel000 service=http method=POST host=influxhost:8086 path=/api/v2/write query="bucket=telegraf&org=custom_org" proto=HTTP/1.1 status_code=499 response_size=107 content_length=-1 referrer= remote=linuxserver:44892 user_agent=Telegraf authenticated_id=0b26d84447adf000 user_id=0ac7beb338afa000 took=10117.641ms error="internal error" error_code="internal error"
...
ts=2025-06-16T00:26:36.086156Z lvl=debug msg=Request log_id=0wwmuKel000 service=http method=POST host=influxhost:8086 path=/api/v2/write query="bucket=telegraf&org=custom_org" proto=HTTP/1.1 status_code=499 response_size=90 content_length=-1 referrer= remote=linuxhost:44006 user_agent=Telegraf authenticated_id=0b26d84447adf000 user_id=0ac7beb338afa000 took=10024.616ms error="internal error" error_code="internal error"
...
2025-06-16T00:26:36 systemd[1]: libpod-9d2217b610be3ef850f79a65a2d743487c9cf71d806c53c39320039fe05fd325.scope: A process of this unit has been killed by the OOM killer.