OTLP exporter does not retry after many send errors in a short time #1411

Closed
shabicheng opened this issue Dec 18, 2020 · 2 comments
Labels: bug (Something isn't working), pkg:exporter:otlp (Related to the OTLP exporter package)

@shabicheng

General

Environment

  1. opentelemetry-go version: v0.15.0

Reproduce

This can be reproduced every time.

  1. If the otlp.Exporter's endpoint fails, the exporter returns an error like 'rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp xx.xx.xx.xx:55680: connect: connection refused"'
  2. If the SDK keeps sending data during this time, the exporter stops sending and never resumes.

Problem

After debugging the code, I found that the gRPC indefiniteBackgroundConnection no longer tries to connect.

The root cause is that grpcConnection uses an unbuffered channel for disconnectedCh:

https://github.com/open-telemetry/opentelemetry-go/blob/master/exporters/otlp/connection.go#L73

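If setStateDisconnected is called from within indefiniteBackgroundConnection, the send on the channel does not go through, because the loop that would receive the signal is not waiting on the channel at that moment. indefiniteBackgroundConnection then blocks in its select forever.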
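
The channel behaviour behind this can be seen in isolation. The following is a minimal sketch, not the exporter's code: trySend is a hypothetical helper that mirrors a select/default send, and nothing is receiving on the channel when it runs.

```go
package main

import "fmt"

// trySend is a hypothetical helper that mirrors a select/default send:
// it never blocks and reports whether the value was handed off.
func trySend(ch chan bool) bool {
	select {
	case ch <- true:
		return true
	default:
		return false
	}
}

func main() {
	// Unbuffered: a send only succeeds if a receiver is already waiting.
	ch := make(chan bool)

	// Nothing is receiving on ch right now, so the signal is dropped.
	fmt.Println("sent:", trySend(ch))

	// A receiver that shows up later finds nothing, because the dropped
	// signal was never stored anywhere.
	select {
	case <-ch:
		fmt.Println("received a signal")
	default:
		fmt.Println("no signal pending")
	}
}
```

Running this prints "sent: false" followed by "no signal pending": with an unbuffered channel, a signal sent while the receiver is busy is simply lost.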

The debug code:

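// setStateDisconnected with temporary debug prints added around the signal.
func (oc *grpcConnection) setStateDisconnected(err error) {
	oc.saveLastConnectError(err)
	fmt.Println("connect fail")
	// Non-blocking send: if it cannot proceed immediately, the default
	// branch runs and the signal is dropped.
	select {
	case oc.disconnectedCh <- true:
		fmt.Println("signal success")
	default:
		fmt.Println("signal fail")
	}
	_ = oc.newConnectionHandler(nil)
}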

Logs (after "signal fail", the exporter no longer works):

connect success 
ts=2020-12-18T06:14:01.793914Z caller=middlewares.go:129 method=GetCards id=all result=0 took=1.16µs
ts=2020-12-18T06:14:01.793959Z caller=error_handler.go:27 err="db failed"
connect fail
signal success
2020/12/18 14:14:01 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:55680: connect: connection refused"
connect 
connect success 
ts=2020-12-18T06:14:05.86812Z caller=middlewares.go:129 method=GetCards id=all result=0 took=978ns
ts=2020-12-18T06:14:05.86816Z caller=error_handler.go:27 err="db failed"
connect fail
signal success
2020/12/18 14:14:05 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:55680: connect: connection refused"
connect 
connect success 
ts=2020-12-18T06:14:06.450889Z caller=middlewares.go:129 method=GetCards id=all result=0 took=1.234µs
ts=2020-12-18T06:14:06.450928Z caller=error_handler.go:27 err="db failed"
connect fail
signal fail
2020/12/18 14:14:06 rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:55680: connect: connection refused"
ts=2020-12-18T06:14:07.955822Z caller=middlewares.go:129 method=GetCards id=all result=0 took=1.037µs
ts=2020-12-18T06:14:07.955854Z caller=error_handler.go:27 err="db failed"
ts=2020-12-18T06:14:08.552295Z caller=middlewares.go:129 method=GetCards id=all result=0 took=930ns
ts=2020-12-18T06:14:08.552326Z caller=error_handler.go:27 err="db failed"
ts=2020-12-18T06:15:27.996795Z caller=middlewares.go:129 method=GetCards id=all result=0 took=1.071µs
ts=2020-12-18T06:15:27.996841Z caller=error_handler.go:27 err="db failed"

How To Fix

After changing disconnectedCh to a buffered channel (oc.disconnectedCh = make(chan bool, 1)), the connection behaviour returns to normal.
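
To illustrate why a capacity of 1 helps, here is a rough, self-contained sketch. It is a simplified model under stated assumptions, not the exporter's implementation: simulate and all of its channels are hypothetical names, and the loop only imitates "wait for a disconnect signal, then be busy for a while". A second signal is sent without blocking while the loop is busy.

```go
package main

import "fmt"

// simulate models a hypothetical background loop that waits for a
// "disconnected" signal and is then busy until released (as a reconnect
// attempt would be). It returns whether a second, non-blocking signal sent
// during the busy period was accepted, and how many signals the loop handled.
func simulate(capacity int) (accepted bool, handled int) {
	signals := make(chan bool, capacity)
	busy := make(chan struct{})    // loop -> caller: "I picked up a signal"
	release := make(chan struct{}) // caller -> loop: "finish the current attempt"
	stop := make(chan struct{})
	done := make(chan int)

	go func() {
		n := 0
		for {
			select {
			case <-stop:
				done <- n
				return
			case <-signals:
				n++
				busy <- struct{}{}
				<-release
			}
		}
	}()

	signals <- true // first disconnect: blocking send, always delivered
	<-busy          // the loop is now busy and not receiving on signals

	// Second disconnect arrives while the loop is busy (non-blocking send).
	select {
	case signals <- true:
		accepted = true
	default:
	}
	release <- struct{}{}

	if accepted {
		// The buffered signal is still pending, so the loop wakes up again.
		<-busy
		release <- struct{}{}
	}
	close(stop)
	return accepted, <-done
}

func main() {
	for _, c := range []int{0, 1} {
		accepted, handled := simulate(c)
		fmt.Printf("capacity=%d second signal accepted=%v handled=%d\n", c, accepted, handled)
	}
}
```

With capacity 0 the second signal is dropped and the loop handles only one signal. With capacity 1 the signal waits in the buffer and the loop handles both, which matches the behaviour change described above.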

@MrAlias added the bug and pkg:exporter labels on Jan 7, 2021
@Aneurysm9
Member

@shabicheng the OTLP exporter's gRPC connection handling was just reworked in #1420. Are you able to test against the current HEAD to see if this issue remains?

@MrAlias added the pkg:exporter:otlp label and removed the pkg:exporter label on Apr 6, 2021
@MrAlias
Contributor

MrAlias commented Oct 15, 2021

I was able to reproduce this with v0.15.0. I ran a client that continuously sent spans to a collector and when the collector was stopped and restarted, the client never reconnected.

This behavior was not reproduced with v1.0.1. Once the collector came back up the client reconnected and spans started being transmitted.

I'm going to close this based on these findings.

@MrAlias closed this as completed on Oct 15, 2021
@pellared moved this to Closed in Go: Triage on Nov 2, 2023
@pellared added this to the untracked milestone on Nov 8, 2024