vmauth crashes - panic: runtime error: invalid memory address or nil pointer dereference #8051

Closed
bweglarz opened this issue Jan 16, 2025 · 14 comments
Labels: bug, vmauth

Comments

@bweglarz commented Jan 16, 2025

Describe the bug

We upgraded vmauth from 1.97.1 to 1.108.1. When we increased the load on the cluster to 1M datapoints per second and 100+ queries per second, we started seeing errors in the "Request error rate" chart of the VictoriaMetrics dashboard - see the attached screenshot.

After some investigation it turned out that vmauth was crashing with the following stack trace:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x2c pc=0x6b42e4]
goroutine 25756 [running]:
net/http.(*body).Read(0xc0002769d0?, {0xc000130000?, 0x0?, 0x1?})
	net/http/transfer.go:831 +0x44
main.(*readTrackingBody).Read(0xc0002b1630, {0xc000130000, 0x46c405?, 0x2000})
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmauth/main.go:622 +0x122
io.discard.ReadFrom({}, {0x7f2d05c49d18, 0xc0002b1630})
	io/io.go:666 +0x6d
io.copyBuffer({0x97a080, 0xc210e0}, {0x7f2d05c49d18, 0xc0002b1630}, {0xc000184000, 0x8000, 0x8000})
	io/io.go:415 +0x151
io.CopyBuffer({0x97a080?, 0xc210e0?}, {0x7f2d05c49d18?, 0xc0002b1630?}, {0xc000184000?, 0xc40?, 0xc0000b6e08?})
	io/io.go:402 +0x36
net/http.(*transferWriter).doBodyCopy(0xc000166320, {0x97a080, 0xc210e0}, {0x7f2d05c49d18, 0xc0002b1630})
	net/http/transfer.go:416 +0xe5
net/http.(*transferWriter).writeBody(0xc000166320, {0x9796a0, 0xc0002441c0})
	net/http/transfer.go:376 +0x3a5
net/http.(*Request).write(0xc0001fe3c0, {0x9796a0, 0xc0002441c0}, 0x0, 0x0, 0x0)
	net/http/request.go:771 +0xaed
net/http.(*persistConn).writeLoop(0xc00026e000)
	net/http/transport.go:2522 +0x174
created by net/http.(*Transport).dialConn in goroutine 25752
	net/http/transport.go:1875 +0x15a5

We reverted the version to 1.97.1 and the error disappeared.
Later on we tried upgrading to 1.101.0 and 1.102.1.
It looks like the bug was introduced in 1.102.1, or somewhere between 1.101.0 and 1.102.1.

The trace from 1.102.1 looks as follows:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x2c pc=0x65cd44]

goroutine 2031993 [running]:
net/http.(*body).Read(0xc0001fca38?, {0xc000364000?, 0x47db1d?, 0xc000258f60?})
	net/http/transfer.go:827 +0x44
main.(*readTrackingBody).Read(0xc000185b30, {0xc000364000, 0x63bfab?, 0x2000})
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmauth/main.go:599 +0x122
io.discard.ReadFrom({}, {0x7f378c9d54e8, 0xc000185b30})
	io/io.go:666 +0x6d
io.copyBuffer({0x8d6040, 0xb6cec0}, {0x7f378c9d54e8, 0xc000185b30}, {0xc00029e000, 0x8000, 0x8000})
	io/io.go:415 +0x151
io.CopyBuffer({0x8d6040?, 0xb6cec0?}, {0x7f378c9d54e8?, 0xc000185b30?}, {0xc00029e000?, 0x18?, 0xc000100008?})
	io/io.go:402 +0x36
net/http.(*transferWriter).doBodyCopy(0xc000170fa0, {0x8d6040, 0xb6cec0}, {0x7f378c9d54e8, 0xc000185b30})
	net/http/transfer.go:416 +0xe8
net/http.(*transferWriter).writeBody(0xc000170fa0, {0x8d5900, 0xc0001b1880})
	net/http/transfer.go:376 +0x3ac
net/http.(*Request).write(0xc000298d80, {0x8d5900, 0xc0001b1880}, 0x0, 0x0, 0x0)
	net/http/request.go:755 +0xb0d
net/http.(*persistConn).writeLoop(0xc000298ea0)
	net/http/transport.go:2461 +0x18e
created by net/http.(*Transport).dialConn in goroutine 2031989
	net/http/transport.go:1800 +0x1585

To Reproduce

Deploy vmauth v1.102.1 in front of vminsert and vmselect, and run some decent load against the cluster. In our case it was around 1M datapoints per second and 100+ queries per second; vmauth was co-located with vminsert and vmselect as a sidecar, and only the one next to vminsert crashed. We used https://github.com/VictoriaMetrics/prometheus-benchmark to generate the load.

NAME                        READY   STATUS    RESTARTS        AGE
vminsert-5c58898557-467fj   4/4     Running   0               12h
vminsert-5c58898557-5t2sv   4/4     Running   1 (10h ago)     12h
vminsert-5c58898557-6cw2d   4/4     Running   1 (11h ago)     12h
vminsert-5c58898557-6znr6   4/4     Running   1 (11h ago)     12h
vminsert-5c58898557-72kx8   4/4     Running   0               12h
vminsert-5c58898557-7p9s7   4/4     Running   1 (3h49m ago)   12h
vminsert-5c58898557-d6wwc   4/4     Running   1 (133m ago)    12h
vminsert-5c58898557-g6prb   4/4     Running   1 (140m ago)    12h
vminsert-5c58898557-j8qcw   4/4     Running   0               12h
vminsert-5c58898557-jxdmz   4/4     Running   3 (3h30m ago)   12h
vminsert-5c58898557-ks7gb   4/4     Running   0               12h
vminsert-5c58898557-nz8tq   4/4     Running   3 (5h35m ago)   12h
vminsert-5c58898557-q768d   4/4     Running   1 (3h35m ago)   12h
vminsert-5c58898557-vgpt5   4/4     Running   1 (3h49m ago)   12h
vminsert-5c58898557-xhg78   4/4     Running   1 (12h ago)     12h

Version

~ $ /vmauth-prod --version
vmauth-20240801-125905-tags-v1.102.1-0-g996b623585

Logs

No response

Screenshots

(two screenshots attached)

Used command-line flags

No response

Additional information

No response

bweglarz added the bug label on Jan 16, 2025
@bweglarz (Author) commented Jan 16, 2025

When vminsert was reporting the errors, we could see the following logs:

{"ts":"2025-01-14T11:05:59.898Z","level":"warn","caller":"VictoriaMetrics/app/vminsert/main.go:283","msg":"remoteAddr: "127.0.0.1:44494, X-Forwarded-For: 139.144.210.24, 10.2.5.3, 127.0.0.1"; requestURI: /insert/1/prometheus/api/v1/write?extra_label=url_replica%3D0; cannot read compressed request in 0 seconds: unexpected EOF"}

jiekun added the vmauth label on Jan 16, 2025
@jiekun (Contributor) commented Jan 16, 2025

Some context based on a discussion in Slack:

@bweglarz (Author) commented Jan 16, 2025

Latest update - we deployed v1.102.0 and can see the same issue.

~ $ /vmauth-prod --version
vmauth-20240717-185516-tags-v1.102.0-0-g65bb429b81

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x2c pc=0x65cd44]
goroutine 30098 [running]:
net/http.(*body).Read(0xc000291a38?, {0xc00013c000?, 0x47db1d?, 0xc0001f4660?})
	net/http/transfer.go:827 +0x44
main.(*readTrackingBody).Read(0xc000184d70, {0xc00013c000, 0x63bfab?, 0x2000})
	github.com/VictoriaMetrics/VictoriaMetrics/app/vmauth/main.go:566 +0x122
io.discard.ReadFrom({}, {0x7fa55266db78, 0xc000184d70})
	io/io.go:666 +0x6d
io.copyBuffer({0x8d5e20, 0xb6bec0}, {0x7fa55266db78, 0xc000184d70}, {0xc0003d0000, 0x8000, 0x8000})
	io/io.go:415 +0x151
io.CopyBuffer({0x8d5e20?, 0xb6bec0?}, {0x7fa55266db78?, 0xc000184d70?}, {0xc0003d0000?, 0x18?, 0xc0001c4808?})
	io/io.go:402 +0x36
net/http.(*transferWriter).doBodyCopy(0xc0001c21e0, {0x8d5e20, 0xb6bec0}, {0x7fa55266db78, 0xc000184d70})
	net/http/transfer.go:416 +0xe8
net/http.(*transferWriter).writeBody(0xc0001c21e0, {0x8d56e0, 0xc0001ac400})
	net/http/transfer.go:376 +0x3ac
net/http.(*Request).write(0xc000211440, {0x8d56e0, 0xc0001ac400}, 0x0, 0x0, 0x0)
	net/http/request.go:755 +0xb0d
net/http.(*persistConn).writeLoop(0xc000211560)
	net/http/transport.go:2461 +0x18e
created by net/http.(*Transport).dialConn in goroutine 30094
	net/http/transport.go:1800 +0x1585

@f41gh7 (Contributor) commented Jan 16, 2025

Most probably, the issue is related to this commit: 7ee5797

It was introduced in the 1.97.7 release.

It adds an object pool for readTrackingBody, and it looks like the net/http package holds the request body longer than the main function does.
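
To illustrate what that pattern can lead to, here is a minimal, hypothetical Go sketch - the names (trackedBody, bodyPool, proxy) and the timing are assumptions made for the example, not vmauth's actual implementation. It shows how recycling a body wrapper through a sync.Pool, while another goroutine (the transport's writeLoop in the real case) may still be reading it, ends in a nil pointer dereference:

package main

import (
	"io"
	"strings"
	"sync"
	"time"
)

// trackedBody stands in for a pooled request-body wrapper.
type trackedBody struct {
	r io.ReadCloser // set to nil when the wrapper is reset for reuse
}

func (tb *trackedBody) Read(p []byte) (int, error) {
	return tb.r.Read(p) // nil pointer dereference if tb.r was already reset
}

func (tb *trackedBody) Close() error { return tb.r.Close() }

var bodyPool = sync.Pool{New: func() any { return &trackedBody{} }}

// proxy wraps the body, hands it to a consumer (net/http in the real case),
// then immediately resets the wrapper and returns it to the pool.
func proxy(body io.ReadCloser, consume func(io.ReadCloser)) {
	tb := bodyPool.Get().(*trackedBody)
	tb.r = body
	consume(tb) // the consumer may keep reading tb from another goroutine...
	tb.r = nil  // ...so resetting it here races with those reads
	bodyPool.Put(tb)
}

func main() {
	body := io.NopCloser(strings.NewReader("payload"))
	proxy(body, func(rc io.ReadCloser) {
		// Simulate net/http's writeLoop draining the body asynchronously,
		// after the handler has already recycled the wrapper.
		go func() {
			time.Sleep(10 * time.Millisecond)
			buf := make([]byte, 4)
			rc.Read(buf) // panics: invalid memory address or nil pointer dereference
		}()
	})
	time.Sleep(50 * time.Millisecond)
}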

@f41gh7 (Contributor) commented Jan 17, 2025

Hello, could you please provide the following information:

  • How is vmauth collocated and configured as a sidecar? Is it an unauthorized_user config with a list of other vminsert servers, or is it used for authentication only?
  • Were any default flags changed?
  • Which architecture is the host system (amd64/arm/...)?
  • Is it an official Docker image or manually built?

This error should never happen in vmauth unless there is an unknown bug in the Go net/http standard library. We have really big setups with vmauth and have never faced this kind of error.

Another thing that could be related to this issue is the collocation with vminsert.

@bweglarz (Author) commented Jan 17, 2025

Hi @f41gh7,
Here are the answers to your questions:

  • setup/collocation - mTLS to an nginx sidecar from the k8s ingress (CN extracted by the k8s ingress and attached as a header) -> vmauth (localhost over http) -> vminsert (localhost over http)
    Sample vmauth config
  users:
  - bearer_token: CN1
    url_map:
      - src_paths:
          - /api/v1/write
        url_prefix: 'http://localhost:8480/insert/0/prometheus'
      - src_paths:
          - /opentelemetry/v1/metrics
        url_prefix: 'http://localhost:8480/insert/0'

  • flags: /vmauth-prod -auth.config=/config/auth.yml -envflag.enable=true -envflag.prefix=VM_ -loggerFormat=json -httpListenAddr=127.0.0.1:8427 -configCheckInterval=10s -maxConcurrentPerUserRequests=350 -maxConcurrentRequests=1750
  • host system: Linux version 6.1.0-27-cloud-amd64 ([email protected]) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01)
  • official Docker image downloaded from the VictoriaMetrics GitHub repo - the versions I compared (1.108.1, 1.102.1 and 1.101.0) were all official images

@f41gh7 (Contributor) commented Jan 17, 2025

Related golang issue:
golang/go#51907

@f41gh7 (Contributor) commented Jan 17, 2025

Minimal reproducible example:

func TestMalfunctional(t *testing.T) {
	cfgStr := `
unauthorized_user:
  url_prefix: {BACKEND}/foo?bar=baz`
	requestURL := "http://some-host.com/abc/def?some_arg=some_value"

	// Backend that echoes the proxied request back to the client.
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("Connection", "close")
		h.Set("Foo", "bar")

		var bb bytes.Buffer
		if err := r.Header.Write(&bb); err != nil {
			panic(fmt.Errorf("unexpected error when marshaling headers: %w", err))
		}
		fmt.Fprintf(w, "requested_url=http://%s%s\n%s", r.Host, r.URL, bb.String())
	}))
	defer ts.Close()

	cfgStr = strings.ReplaceAll(cfgStr, "{BACKEND}", ts.URL)
	if _, err := reloadAuthConfigData([]byte(cfgStr)); err != nil {
		t.Fatalf("cannot load config data: %s", err)
	}

	body := bytes.NewReader([]byte(strings.Repeat("abcd", 1000)))
	r, err := http.NewRequest(http.MethodGet, requestURL, body)
	if err != nil {
		t.Fatalf("cannot initialize http request: %s", err)
	}

	// Reuse the same request and body for many proxied requests; the pooled
	// readTrackingBody eventually gets recycled while the transport is still
	// reading it, which triggers the panic.
	for range 1000 {
		r.RequestURI = r.URL.RequestURI()
		r.RemoteAddr = "42.2.3.84:6789"
		w := &fakeResponseWriter{}
		if !requestHandler(w, r) {
			t.Fatalf("unexpected false is returned from requestHandler")
		}
		body.Seek(0, 0)
	}
}

f41gh7 added a commit that referenced this issue Jan 21, 2025
A sync.Pool for readTrackingBody was added in order to reduce potential load on the garbage collector. But the Go net/http standard library does not allow reusing the request body, since it closes the body asynchronously after the request returns.
Related issue: golang/go#51907

This commit removes the sync.Pool in order to fix a potential panic and data race during request processing.

Related issue:
#8051

Signed-off-by: f41gh7 <[email protected]>
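
For context, here is a rough, hypothetical Go sketch of the safer pattern this commit message describes - allocating the body wrapper per request instead of recycling it through a sync.Pool. The names (trackedBody, wrapRequestBody) are made up for the example and are not vmauth's actual code:

package sketch

import (
	"io"
	"net/http"
)

// trackedBody is a per-request body wrapper. It is never recycled, so it can
// safely outlive the handler: net/http may keep reading it asynchronously and
// the garbage collector reclaims it once nothing references it anymore.
type trackedBody struct {
	r           io.ReadCloser
	readStarted bool // example of state such a wrapper might track
}

func (tb *trackedBody) Read(p []byte) (int, error) {
	tb.readStarted = true
	return tb.r.Read(p)
}

func (tb *trackedBody) Close() error { return tb.r.Close() }

// wrapRequestBody replaces the request body with a freshly allocated wrapper.
// The trade-off is extra allocations and therefore potentially higher CPU/GC
// load compared with the pooled version, as noted later in this thread.
func wrapRequestBody(r *http.Request) *trackedBody {
	tb := &trackedBody{r: r.Body}
	r.Body = tb
	return tb
}
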
@hagen1778 (Collaborator) commented Jan 21, 2025

Hey @bweglarz! Could you please try building vmauth from #8104 and see if the issue still remains? I also built an image from that pull request. The pull request removes the sync.Pool for the request body, so CPU usage could increase with this new version of vmauth.

@bweglarz (Author):

Hey @hagen1778! I can definitely give it a try with the image you built, since it is ready. I hope to have some feedback by tomorrow afternoon - I have an important benchmark running until this evening ;-)

@bweglarz (Author):

Hi @hagen1778. I have good news - I didn't see the issue occur after I updated the vmauth image to the one you provided yesterday.

I ran a check against the vmauth binary:

~ $ /vmauth-prod --version
vmauth-20250121-142712-heads-gh-8051-0-g1ba7df8599

Attaching two screenshots from the test with the fixed vmauth - no errors from vminsert and no pod restarts.

hagen1778 added a commit that referenced this issue Jan 22, 2025
A sync.Pool for readTrackingBody was added in order to reduce potential load on the garbage collector. But the Go net/http standard library does not allow reusing the request body, since it closes the body asynchronously after the request returns.
Related issue: golang/go#51907

This commit removes the sync.Pool in order to fix a potential panic and data race during request processing.

Affected releases:
- all releases after v1.97.7

Related issue:
#8051

Signed-off-by: f41gh7 <[email protected]>
Co-authored-by: Roman Khavronenko <[email protected]>
hagen1778 added the "waiting for release" label on Jan 22, 2025
hagen1778 pushed a commit that referenced this issue Jan 22, 2025 (same commit message as above, cherry picked from commit 80ead7c)
@bweglarz (Author):

Hey @hagen1778, just curious - which release will include the fix?

@hagen1778 (Collaborator):

> Hey @hagen1778, just curious - which release will include the fix?

It will get into the next releases, v1.110 and v1.102.11, which are expected next week.

@f41gh7 (Contributor) commented Jan 27, 2025

The bugfix was included in v1.110.0 and v1.102.11.

f41gh7 closed this as completed on Jan 27, 2025
f41gh7 removed the "waiting for release" label on Jan 27, 2025