
Watchers on Custom Resources throw RuntimeError("Session is closed") and permanently die #980

Open · jake-delorme opened this issue Nov 18, 2022 · 19 comments · May be fixed by #1031

Labels: bug (Something isn't working)

@jake-delorme

Long story short

The operator starts up and runs normally as expected. After running for some time, some of the watch streams throw a RuntimeError("Session is closed"). Once this happens, that watch stream never restarts until the operator is restarted. This only appears to happen with custom resources (configmaps are fine).

Kopf version

1.35.6

Kubernetes version

v1.23.13 eks

Python version

3.9.13

Code

import kopf
import kubernetes

@kopf.on.login()
def login_fn(**kwargs):
    return kopf.login_via_client(**kwargs)

@kopf.on.startup()
def operator_configuration(settings: kopf.OperatorSettings, **kwargs):
    settings.watching.server_timeout = 300
    settings.watching.client_timeout = 300
    settings.peering.name = "operator"
    settings.peering.mandatory = True
    settings.peering.stealth = True

@kopf.on.create("MyCustomResource")
@kopf.on.update("MyCustomResource")
def mycustomresource(**_):
    print("Im Alive!")

Logs

[01:55:09,761] asyncio              [DEBUG   ] Getting address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32> took 14.625ms: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.1.1.1', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2.2.2.2', 443))]
[01:55:09,842] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b06945e0> starts SSL handshake
[01:55:10,002] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b06945e0>: SSL handshake took 159.6 ms
[01:55:10,003] asyncio              [DEBUG   ] <asyncio.TransportSocket fd=8, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.10.10.10', 41046), raddr=('1.1.1.1', 443)> connected to 1.1.1.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f13b072b6a0>, <aiohttp.client_proto.ResponseHandler object at 0x7f13b079ae80>)
[01:55:36,674] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
[01:55:36,674] kopf.activities.auth [DEBUG   ] Activity 'login_fn' is invoked.
[01:55:36,751] asyncio              [DEBUG   ] <_SelectorSocketTransport fd=6 read=polling write=<idle, bufsize=0>> received EOF
[01:55:36,752] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b06f4820> received EOF
[01:55:36,754] kopf._cogs.clients.w [ERROR   ] Request attempt #1/9 failed; will retry: GET https://mycluster.eks.amazonaws.com/apis/kopf.dev/v1/namespaces/mynamespace/kopfpeerings?watch=true&resourceVersion=13763208&timeoutSeconds=300 -> ServerDisconnectedError('Server disconnected')
[01:55:37,484] kopf.activities.auth [DEBUG   ] Client is configured via kubeconfig file.
[01:55:37,485] kopf.activities.auth [INFO    ] Activity 'login_fn' succeeded.
[01:55:37,485] kopf._core.engines.a [INFO    ] Re-authentication has finished.
[01:55:37,489] asyncio              [DEBUG   ] Get address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32>
[01:55:37,490] asyncio              [DEBUG   ] Getting address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32> took 0.790ms: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.1.1.1', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2.2.2.2', 443))]
[01:55:37,571] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c57f0> starts SSL handshake
[01:55:37,730] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c57f0>: SSL handshake took 159.4 ms
[01:55:37,731] asyncio              [DEBUG   ] <asyncio.TransportSocket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.10.10.10', 44038), raddr=('1.1.1.1', 443)> connected to 1.1.1.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f13b078d460>, <aiohttp.client_proto.ResponseHandler object at 0x7f13b059a640>)
[01:55:37,756] kopf._cogs.clients.w [DEBUG   ] Request attempt #2/9: GET https://mycluster.eks.amazonaws.com/apis/kopf.dev/v1/namespaces/mynamespace/kopfpeerings?watch=true&resourceVersion=13763208&timeoutSeconds=300
[01:55:37,756] kopf._cogs.clients.w [DEBUG   ] Stopping the watch-stream for kopfpeerings.v1.kopf.dev in 'mynamespace'.
[01:55:37,759] kopf._core.reactor.o [ERROR   ] Peering observer for operator@mynamespace has failed: Session is closed
Traceback (most recent call last):
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/venv/lib/python3.9/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 82, in infinite_watch
    async for raw_event in stream:
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 186, in continuous_watch
    async for raw_input in stream:
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/watching.py", line 251, in watch_objs
    async for raw_input in api.stream(
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
    response = await request(
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/venv/lib/python3.9/site-packages/kopf/_cogs/clients/api.py", line 78, in request
    response = await context.session.request(
  File "/venv/lib/python3.9/site-packages/aiohttp/client.py", line 400, in _request
    raise RuntimeError("Session is closed")
RuntimeError: Session is closed
[01:56:04,118] asyncio              [DEBUG   ] Get address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32>
[01:56:04,119] asyncio              [DEBUG   ] Getting address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32> took 0.731ms: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.1.1.1', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2.2.2.2', 443))]
[01:56:04,200] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c58b0> starts SSL handshake
[01:56:04,359] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c58b0>: SSL handshake took 158.7 ms
[01:56:04,360] asyncio              [DEBUG   ] <asyncio.TransportSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.10.10.10', 47442), raddr=('1.1.1.1', 443)> connected to 1.1.1.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f13b07622e0>, <aiohttp.client_proto.ResponseHandler object at 0x7f13b06a14c0>)
[01:56:56,485] asyncio              [DEBUG   ] Get address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32>
[01:56:56,489] asyncio              [DEBUG   ] Getting address info mycluster.eks.amazonaws.com:443, type=<SocketKind.SOCK_STREAM: 1>, flags=<AddressInfo.AI_ADDRCONFIG: 32> took 3.814ms: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('1.1.1.1', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('2.2.2.2', 443))]
[01:56:56,570] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c5f10> starts SSL handshake
[01:56:56,730] asyncio              [DEBUG   ] <asyncio.sslproto.SSLProtocol object at 0x7f13b05c5f10>: SSL handshake took 159.4 ms
[01:56:56,730] asyncio              [DEBUG   ] <asyncio.TransportSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.10.10.10', 53874), raddr=('1.1.1.1', 443)> connected to 1.1.1.1:443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f13b06a1b20>, <aiohttp.client_proto.ResponseHandler object at 0x7f13b07624c0>)

Additional information

This is an EKS cluster and is using aws eks get-token in the kubeconfig to authenticate.

Using aiohttp version 3.8.3
Using kubernetes client version 24.2.0
aws-cli/2.2.43 Python/3.8.8 (used by kubeconfig)

Not all of the operator's watch streams die at the same time. This is not running in a container on the cluster, but on a server outside of AWS.

jake-delorme added the bug label on Nov 18, 2022
@alexnagy

Noticing the same behaviour on EKS with Python 3.11.2, aiohttp 3.8.4, kopf 1.36.0. In my case I am monitoring delete events for Job resources.

What seems to happen is: a request fails because of token expiry -> the watcher goes to sleep because of backoff -> while it sleeps, the session gets closed by the vault when the previous credentials are invalidated -> upon waking up and retrying, the request fails with RuntimeError: Session is closed because it still references the closed session.
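
A minimal sketch of that race in plain aiohttp (an illustration, not Kopf's actual internals): one task sleeps through its retry backoff while another task closes the shared session, so the retried request hits the already-closed session:

import asyncio

import aiohttp

async def retry_after_backoff(session: aiohttp.ClientSession) -> None:
    await asyncio.sleep(1.0)  # the backoff sleep; the session gets closed meanwhile
    await session.get("https://example.com")  # raises RuntimeError("Session is closed")

async def invalidate_credentials(session: aiohttp.ClientSession) -> None:
    await session.close()  # what the credentials vault does when invalidating

async def main() -> None:
    session = aiohttp.ClientSession()
    closer = asyncio.create_task(invalidate_credentials(session))
    try:
        await retry_after_backoff(session)
    except RuntimeError as e:
        print(f"Retry failed: {e}")  # prints: Retry failed: Session is closed
    await closer

asyncio.run(main())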

@Spacca

Spacca commented Jun 11, 2023

We are also noticing the same issue, but on AKS.

Has anybody found a workaround or solution for this problem? Maybe a way to surface this in the probes?

Python version 3.11.3
Kopf version 1.36.1

@nolar (Owner)

nolar commented Jun 11, 2023

What seems to happen is: request fails because of token expiry -> goes to sleep because of backoff -> while sleeping the session gets closed by vault when invalidating previous credentials -> upon waking up and retrying the request it fails with RuntimeError: Session is closed because it still references the closed session.

Thanks for this hint! Indeed, this is a possible scenario in some (rare?) cases of the API failing several concurrent requests.

Can you please try a fix from #1031 — the branch is session-closed-in-reauth.

Depending on your packaging system, that can be installed like this (pip or requirements.txt; different for Poetry):

pip install git+https://github.com/nolar/kopf.git@session-closed-in-reauth

That would be Kopf 1.36.1 plus this fix only.

@Spacca

Spacca commented Jun 13, 2023

Hi @nolar

I think the patch doesn't work. I have introduced a couple of probes to check whether the watcher is alive, and it seems to always fail.

This happens to us regularly, maybe depending on the cluster setup.

In particular, I think it all started when we configured:
automountServiceAccountToken=false

and started mounting the token using volume projection with a duration of 3600s.

It seems like Kopf doesn't handle the token refresh well.

Below are my logs:

[2023-06-13 11:08:41,259] aiohttp.access       [INFO    ] 10.251.132.1 [13/Jun/2023:11:08:41 +0000] "GET /healthz HTTP/1.1" 200 229 "-" "kube-probe/1.26"
[2023-06-13 11:09:41,216] kopf.activities.prob [INFO    ] Retrieving all key for probe
[2023-06-13 11:09:41,238] kopf.activities.prob [INFO    ] Activity 'key' succeeded.
[2023-06-13 11:09:41,238] kopf.activities.prob [INFO    ] Retrieving all trust for probe
[2023-06-13 11:09:41,262] kopf.activities.prob [INFO    ] Activity 'trust' succeeded.
[2023-06-13 11:09:41,262] aiohttp.access       [INFO    ] 10.251.132.1 [13/Jun/2023:11:09:41 +0000] "GET /healthz HTTP/1.1" 200 229 "-" "kube-probe/1.26"
[2023-06-13 11:10:21,045] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
[2023-06-13 11:10:21,046] kopf._cogs.clients.w [ERROR   ] Request attempt #1/9 failed; will retry: GET https://10.251.0.1:443/apis/com.acme/v1/trusts?watch=true&resourceVersion=1773126&timeoutSeconds=60 -> ClientOSError(1, '[SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2702)')
[2023-06-13 11:10:21,046] kopf._cogs.clients.w [ERROR   ] Request attempt #1/9 failed; will retry: GET https://10.251.0.1:443/apis/com.acme/v1/keys?watch=true&resourceVersion=1773126&timeoutSeconds=60 -> ClientOSError(1, '[SSL: APPLICATION_DATA_AFTER_CLOSE_NOTIFY] application data after close notify (_ssl.c:2702)')
[2023-06-13 11:10:21,048] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2023-06-13 11:10:21,048] kopf._core.engines.a [INFO    ] Re-authentication has finished.
[2023-06-13 11:10:22,048] kopf._cogs.clients.w [ERROR   ] Request attempt #2/9 failed; will try re-authenticating: GET https://10.251.0.1:443/apis/com.acme/v1/trusts?watch=true&resourceVersion=1773126&timeoutSeconds=60
[2023-06-13 11:10:22,048] kopf._cogs.clients.w [ERROR   ] Request attempt #2/9 failed; will try re-authenticating: GET https://10.251.0.1:443/apis/com.acme/v1/keys?watch=true&resourceVersion=1773126&timeoutSeconds=60
[2023-06-13 11:10:22,049] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
[2023-06-13 11:10:22,050] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
[2023-06-13 11:10:22,050] kopf._core.engines.a [INFO    ] Re-authentication has finished.
[2023-06-13 11:10:22,051] kopf._core.reactor.o [ERROR   ] Watcher for keys.v1.com.acme@none has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 78, in request
    response = await context.session.request(
  File "/opt/venv/lib/python3.10/site-packages/aiohttp/client.py", line 400, in _request
    raise RuntimeError("Session is closed")
RuntimeError: Session is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 94, in request
    raise errors.APISessionClosed("Session is closed.") from e
kopf._cogs.clients.errors.APISessionClosed: Session is closed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 210, in stream
    response = await request(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 47, in wrapper
    await vault.invalidate(key, exc=e)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 291, in invalidate
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
[2023-06-13 11:10:22,053] kopf._core.reactor.o [ERROR   ] Watcher for trusts.v1.com.acme@none has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 78, in request
    response = await context.session.request(
  File "/opt/venv/lib/python3.10/site-packages/aiohttp/client.py", line 400, in _request
    raise RuntimeError("Session is closed")
RuntimeError: Session is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 94, in request
    raise errors.APISessionClosed("Session is closed.") from e
kopf._cogs.clients.errors.APISessionClosed: Session is closed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 210, in stream
    response = await request(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 47, in wrapper
    await vault.invalidate(key, exc=e)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 291, in invalidate
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
[2023-06-13 11:10:41,216] kopf.activities.prob [INFO    ] Retrieving all key for probe
[2023-06-13 11:10:41,216] kopf.activities.prob [ERROR   ] Activity 'key' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/src/ctl/kopf_operator.py", line 166, in get_keys
    await check_tasks("keys")
  File "/src/ctl/kopf_operator.py", line 187, in check_tasks
    raise GenericException(f'Missing watcher for {resource_type} resource type')
utils.exceptions.GenericException: Missing watcher for keys resource type
[2023-06-13 11:10:41,217] kopf.activities.prob [INFO    ] Retrieving all trust for probe
[2023-06-13 11:10:41,217] kopf.activities.prob [ERROR   ] Activity 'trust' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/src/ctl/kopf_operator.py", line 175, in get_trusts
    await check_tasks("trusts")
  File "/src/ctl/kopf_operator.py", line 187, in check_tasks
    raise GenericException(f'Missing watcher for {resource_type} resource type')
utils.exceptions.GenericException: Missing watcher for trusts resource type
[2023-06-13 11:11:41,218] kopf.activities.prob [INFO    ] Retrieving all key for probe
[2023-06-13 11:11:41,218] kopf.activities.prob [ERROR   ] Activity 'key' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/src/ctl/kopf_operator.py", line 166, in get_keys
    await check_tasks("keys")
  File "/src/ctl/kopf_operator.py", line 187, in check_tasks
    raise GenericException(f'Missing watcher for {resource_type} resource type')
utils.exceptions.GenericException: Missing watcher for keys resource type
[2023-06-13 11:11:41,218] kopf.activities.prob [INFO    ] Retrieving all trust for probe
[2023-06-13 11:11:41,218] kopf.activities.prob [ERROR   ] Activity 'trust' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/src/ctl/kopf_operator.py", line 175, in get_trusts
    await check_tasks("trusts")
  File "/src/ctl/kopf_operator.py", line 187, in check_tasks
    raise GenericException(f'Missing watcher for {resource_type} resource type')
utils.exceptions.GenericException: Missing watcher for trusts resource type
[2023-06-13 11:11:51,842] kopf._core.reactor.r [ERROR   ] Resource observer has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/observation.py", line 127, in resource_observer
    await queueing.watcher(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 210, in stream
    response = await request(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 43, in wrapper
    async for key, info, context in vault.extended(APIContext, 'contexts'):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 158, in extended
    async for key, item in self._items():
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 195, in _items
    yielded_key, yielded_item = self.select()
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 214, in select
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
  File "/opt/venv/bin/kopf", line 8, in <module>
    sys.exit(main())
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/kopf/cli.py", line 60, in wrapper
    return fn(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/click/decorators.py", line 84, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/kopf/cli.py", line 109, in run
    return running.run(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/running.py", line 81, in run
    asyncio.run(coro)
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/running.py", line 138, in operator
    await run_tasks(operator_tasks, ignored=existing_tasks)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/running.py", line 419, in run_tasks
    await aiotasks.reraise(root_done | root_cancelled | hung_done | hung_cancelled)
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 238, in reraise
    task.result()  # can raise the regular (non-cancellation) exceptions.
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 108, in guard
    await coro
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/observation.py", line 127, in resource_observer
    await queueing.watcher(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
    async for raw_event in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
    async for raw_input in stream:
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
    async for raw_input in api.stream(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 210, in stream
    response = await request(
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 43, in wrapper
    async for key, info, context in vault.extended(APIContext, 'contexts'):
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 158, in extended
    async for key, item in self._items():
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 195, in _items
    yielded_key, yielded_item = self.select()
  File "/opt/venv/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 214, in select
    raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/

These are my dependencies:

typing_extensions == 4.5.0
requests == 2.28.2
click == 8.1.3
jsonpickle == 3.0.1
pyinstaller == 5.8.0
git+https://github.com/nolar/kopf.git@session-closed-in-reauth
kubernetes == 26.1.0
pyOpenSSL == 23.0.0
pyjks == 20.0.0
pytest == 7.2.1
pytest-mock == 3.10.0
requests-mock == 1.10.0
importlib_resources == 5.12.0

@dmitrytretyakov

Faced it as well; it's a big problem for us.

@Spacca, did you find some workaround, or do you just live with it?

@jake-delorme (Author)

We ended up adding a very rudimentary check to the re-authentication hook to force the operator to restart when watcher tasks disappear. For us, this hook fires every 10 minutes. Mileage may vary.

import asyncio
import sys

import kopf

@kopf.on.login()
async def update_credentials(logger, memo: kopf.Memo, **_):
    # This is a dirty hack to try to get the operator to restart when resource observers die.
    count = 0
    interesting_types = ("peering observer", "watcher for", "peering keep-alive", "resource observer", "pause-waiter for")
    for task in asyncio.all_tasks():
        # The first two words of a Kopf task's name identify its kind, e.g. "watcher for".
        task_type = " ".join(task.get_name().split()[:2])
        if task_type in interesting_types:
            count += 1
    if memo.get("task_count", 0) <= count:
        logger.debug(f"Watching {count} interesting tasks, previously {memo.get('task_count')}.")
        memo["task_count"] = count
    else:
        logger.error(f"An interesting task went away (were {memo.get('task_count')}, now {count}). Exiting the operator.")
        sys.exit(1)

@creste

creste commented Oct 24, 2024

I started seeing this issue after upgrading to Azure Kubernetes Service (AKS) 1.30. The release notes for AKS 1.30 say:

Starting 1.30 Kubernetes version, apiserver will have --service-account-extend-token-expiration set to false on OIDC issuer enabled clusters. In versions prior to 1.30, service account tokens injected into pods were given an extended lifetime so they remain valid even after a new refreshed token is provided. Prior to upgrading to 1.30, the metric serviceaccount_stale_tokens_total and the audit annotation authentication.k8s.io/stale-token can be used to monitor for workloads that depend on the extended lifetime and are continuing to use tokens even after a refreshed token is provided to the container. If no action is taken, workloads depending on the extended lifetime will break once the cluster is upgraded to 1.30. See reference for details.

Anyone else who tries to use kopf on AKS 1.30+ will also experience this issue. AKS 1.29 goes end-of-life in March 2025, so fixing this is becoming more urgent.

Alternatively, is there a workaround that doesn't require modifying kopf? I am using dask-operator, which depends on kopf 1.35.3. A workaround that doesn't require modifying kopf would be much easier for me to implement.

@cpnielsen

We used to see this as a rare occurrence (which we had not yet pinned on this bug). It escalated after a cluster upgrade to AKS 1.30, matching the description provided by @creste.

Does anyone have a working solution? The original fix PR #1031 looks promising, but is apparently not sufficient?

@nolar (Owner)

nolar commented Mar 7, 2025

A sudden update: I am now jobless, so after 1-2-3 weeks of rest or so, I will be able to look into several issues accumulated here over the past 1-2-3 years —this one included— and maybe catch up with new K8s features if needed.

… but apparently not sufficient?

Is there some clear repro with 10-100% probability, or a hypothesis about what causes the issue? Specifically, what makes PR #1031 insufficient? I could not find any specifics in the comments above.

@cpnielsen

@nolar Sorry about the job situation 😬 My comment was based on the one by @Spacca, so I cannot quantify it any further. Is there a chance I can get you to rebase it onto the latest version? Then I will try it out.

@nolar (Owner)

nolar commented Mar 7, 2025

@cpnielsen Rebasing should be easy. Let me see… Done. Pushed. The CI is fixed: green again.

@cpnielsen

@nolar I finally had time to test this out. In short, the issue is not fixed.

Before:

Watcher for crd@namespace has failed: Session closed.

Now:

Watcher for crd@namespace has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/

For completeness, this is what our login handler looks like (we use kubernetes_asyncio):

from typing import Any, Optional, Sequence

import kopf
from kubernetes_asyncio import client, config

@kopf.on.login()
async def authenticate(**_: Any) -> kopf.ConnectionInfo:
    try:
        config.load_incluster_config()
    except config.ConfigException:
        await config.load_kube_config()

    cfg = client.Configuration.get_default_copy()

    # Taken from kopf.piggybacking for sync kubernetes library
    header: Optional[str] = cfg.get_api_key_with_prefix("BearerToken")
    parts: Sequence[str] = header.split(" ", 1) if header else []
    scheme, token = (
        (None, None) if len(parts) == 0 else (None, parts[0]) if len(parts) == 1 else (parts[0], parts[1])
    )  # RFC-7235, Appendix C.

    ci = kopf.ConnectionInfo(
        server=cfg.host,
        ca_path=cfg.ssl_ca_cert,
        insecure=not cfg.verify_ssl,
        username=cfg.username or None,
        password=cfg.password or None,
        scheme=scheme,
        token=token,
        certificate_path=cfg.cert_file,
        private_key_path=cfg.key_file,
        priority=1,
    )

    return ci

@nolar (Owner)

nolar commented Mar 22, 2025

So far, I can confirm: reproducible 100% of the time. The setup is somewhat sophisticated, so I am dumping it here so as not to forget if I get tired this time:


Repro

Preparations:

  • Deploy a new AKS 1.30 cluster (create accounts, etc).
  • Apply RBAC from Kopf docs, replace {{NAMESPACE}} with default.
  • kubectl apply -f examples/crd.yaml
  • kubectl apply -f _pod.yaml

_pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: mydev
spec:
  containers:
  - image: python:3.13
    command: ["/bin/bash"]
#    args: ["-c", "pip install kopf && exec sleep 10d"]
    args: ["-c", "pip install git+https://github.com/nolar/kopf.git@session-closed-in-reauth && exec sleep 10d"]
    name: main
    volumeMounts:
    - mountPath: /var/run/secrets/tokens
      name: vault-token
  serviceAccountName: kopfexample-account
  volumes:
  - name: vault-token
    projected:
      sources:
      - serviceAccountToken:
          path: vault-token
          expirationSeconds: 600  # min 10min

It cannot be made faster than every 10 mins, so there are long waiting times for the repro.

The local operator code:

import datetime
import os

import kopf


@kopf.on.startup()
def s(settings: kopf.OperatorSettings, **_):
    settings.watching.client_timeout = 10
    settings.networking.request_timeout = 15


@kopf.on.login()
def mylgn(**_) -> kopf.ConnectionInfo:
    with open("/var/run/secrets/tokens/vault-token", encoding="utf-8") as f:
        token = f.read().strip()
    return kopf.ConnectionInfo(
        server=f"https://{os.environ['KUBERNETES_SERVICE_HOST']}:443",
        ca_path="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
        scheme="Bearer",
        token=token,
        priority=int(datetime.datetime.utcnow().timestamp()),
    )


@kopf.on.create('kopfexamples')
def create_fn(spec, **kwargs):
    print(f"And here we are! Creating: {spec}")
    return {'message': 'hello world'}  # will be the new status

Copy the local Kopf & operator code to the pod on every change:

kubectl cp ~/src/kopf/kopf/ default/mydev:/usr/local/lib/python3.13/site-packages/
kubectl cp ~/src/kopf/examples/01-minimal/example.py default/mydev:/tmp/

Run as:

kubectl exec -it pod/mydev -- kopf run -v -n default /tmp/example.py

Observe for a minimum of 10, usually 11-12 minutes. Since the operator attempts to reconnect every 10 seconds, it will hit the token invalidation soon after the token expires, plus some internal delays.


Suspicion:

The suspected sequence is this:

  • The operator tries to reconnect every once in a while.
  • The old ConnectionInfo leads to the HTTP 401, which invalidates the credentials in the vault.
  • The whole operator pauses to re-authenticate.
  • A new ConnectionInfo object is produced, let's call it "A".
  • It is presumably (not proven) used, and fails with 401 again. OR: a parallel task initiates a new re-authentication.
  • Either way, 1s after "A", a new ConnectionInfo is produced, let's call it "B", with exactly the same token content but different priorities as markers. The old "A" is moved to the "invalid" credentials list.
  • Somewhere here, the stream-watching sequences begin to fail on vault.select() as if no valid credentials are left, even though "B" is still there.
  • And the operator exits.

The most suspected place (again, not proven): authenticate() — there is a gap between when it learns that the vault is empty and when it populates it back. In between, it calls the handlers. During this time, the vault remains unlocked, which means that new incoming API requests/streams will see no credentials available (in Vault._items(), where Vault.select() is called).

It is difficult to wrap my head around this concurrent sequence of events, since I have mostly forgotten how it works. But the most promising way of fixing it would be this:


Possible fixes:

In authenticate(), try to refactor the wait-event into a wait-condition, and put the generation+population of credentials under this condition as a lock.

Alternatively, when invalidating the old credentials, mark the vault as still having hope to get something (pending?), and in populate(), mark it as final. If pending, the select() must wait until it is marked as final. If final, then indeed fail.

All in all, there should be no unprotected gap where the vault believes there are no credentials for the next request.
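
For illustration, a minimal sketch of the second idea (a hypothetical SketchVault, not Kopf's actual Vault class): select() waits while a re-authentication is pending, and fails only when the emptiness is final:

import asyncio
from typing import Any, Dict

class SketchVault:
    def __init__(self) -> None:
        self._creds: Dict[str, Any] = {}
        self._pending = False          # True while a re-authentication is in flight
        self._cond = asyncio.Condition()

    async def invalidate(self, key: str) -> None:
        async with self._cond:
            self._creds.pop(key, None)
            self._pending = True       # keep hope: a re-auth will repopulate soon
            self._cond.notify_all()

    async def populate(self, key: str, info: Any) -> None:
        async with self._cond:
            self._creds[key] = info
            self._pending = False      # final: credentials are available again
            self._cond.notify_all()

    async def select(self) -> Any:
        async with self._cond:
            # Wait while empty-but-pending; fail only when empty-and-final.
            await self._cond.wait_for(lambda: bool(self._creds) or not self._pending)
            if not self._creds:
                raise LookupError("Ran out of valid credentials.")
            return next(iter(self._creds.values()))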

To be revisited a bit later…

@nolar (Owner)

nolar commented Mar 26, 2025

For the record: it seems to be the same issue as in #1158 (after the proposed fix is applied).

@nolar (Owner)

nolar commented Mar 26, 2025

There is an updated fix (4 commits) in #1031, rebased on the latest main branch.

I had to significantly change the credentials vault internals (locks, locks, locks…). For me, the operator now works for 60+ mins in the artificial environment where it was failing 100% every ≈10-20 mins (the lowest possible token lifetime, on the 1st or 2nd re-auth).

I would appreciate it if you could test it somewhere where the error was happening. Please mind that the code is still drafty and experimental, so do not push it to production.

Either way, I am off for now. If there are any issues, I can take a look next week at the earliest.

@DCkQ6

DCkQ6 commented Apr 3, 2025

I prepared a test image of my operator with Kopf from the branch session-closed-in-reauth and deployed it to our production-like staging environment, where we were experiencing the related issue #1158 on AKS 1.30.

It looks like this helped; below are logs from a re-authentication event that previously was causing a restart.

2025-04-03 16:38:00.525	[2025-04-03 14:38:00,524] kopf.objects         [INFO    ] [mycompany-someobject/someobject-public] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:00.534	[2025-04-03 14:38:00,534] kopf.objects         [INFO    ] [mycompany-someobject/someobject-ui] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:00.551	[2025-04-03 14:38:00,551] kopf.objects         [INFO    ] [mycompany-anotherobject/anotherobject-ui] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:00.553	[2025-04-03 14:38:00,553] kopf.objects         [INFO    ] [mycompany-anotherobject/profiles-api] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:00.556	[2025-04-03 14:38:00,555] kopf.objects         [INFO    ] [mycompany-someobject/jetanotherobject] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:00.560	[2025-04-03 14:38:00,560] kopf.objects         [INFO    ] [mycompany-someobject/statics] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:00.560	[2025-04-03 14:38:00,560] kopf.objects         [INFO    ] [mycompany-someobject/anotherobject-ui] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:00.589	[2025-04-03 14:38:00,589] kopf._core.engines.a [INFO    ] Re-authentication has been initiated.
2025-04-03 16:38:00.589	[2025-04-03 14:38:00,589] kopf.objects         [ERROR   ] [mycompany-someobject/someobject-public] Request attempt #1/9 failed; SSL closed; will re-authenticate: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/someobject-public
2025-04-03 16:38:00.589	[2025-04-03 14:38:00,589] kopf.objects         [ERROR   ] [mycompany-anotherobject/anotherobject-ui] Request attempt #1/9 failed; SSL closed; will re-authenticate: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-anotherobject/publicendpoints/anotherobject-ui
2025-04-03 16:38:00.591	[2025-04-03 14:38:00,591] kopf.objects         [ERROR   ] [mycompany-someobject/jetanotherobject] Request attempt #1/9 failed; will retry: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/jetanotherobject -> ServerDisconnectedError('Server disconnected')
2025-04-03 16:38:00.591	[2025-04-03 14:38:00,591] kopf.objects         [ERROR   ] [mycompany-someobject/someobject-public-api] Request attempt #1/9 failed; will retry: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/someobject-public-api -> ServerDisconnectedError('Server disconnected')
2025-04-03 16:38:00.591	[2025-04-03 14:38:00,591] kopf.activities.auth [INFO    ] Activity 'login_via_client' succeeded.
2025-04-03 16:38:00.591	[2025-04-03 14:38:00,591] kopf._core.engines.a [INFO    ] Re-authentication has finished.
2025-04-03 16:38:00.591	[2025-04-03 14:38:00,591] kopf.objects         [ERROR   ] [mycompany-someobject/statics] Request attempt #1/9 failed; will retry: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/statics -> ServerDisconnectedError('Server disconnected')
2025-04-03 16:38:00.591	[2025-04-03 14:38:00,591] kopf.objects         [ERROR   ] [mycompany-someobject/someobject-ui] Request attempt #1/9 failed; will retry: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/someobject-ui -> ServerDisconnectedError('Server disconnected')
2025-04-03 16:38:00.591	[2025-04-03 14:38:00,591] kopf.objects         [ERROR   ] [mycompany-anotherobject/anotherobject-request-someobject] Request attempt #1/9 failed; SSL closed; will re-authenticate: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-anotherobject/publicendpoints/anotherobject-request-someobject
2025-04-03 16:38:00.593	[2025-04-03 14:38:00,593] kopf.objects         [ERROR   ] [mycompany-someobject/anotherobject-ui] Request attempt #1/9 failed; will retry: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/anotherobject-ui -> ServerDisconnectedError('Server disconnected')
2025-04-03 16:38:01.358	[2025-04-03 14:38:01,345] kopf.objects         [INFO    ] [mycompany-someobject/someobject] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:01.410	[2025-04-03 14:38:01,410] kopf.objects         [INFO    ] [mycompany-someobject/someobject-proxy] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:01.627	[2025-04-03 14:38:01,623] kopf.objects         [ERROR   ] [mycompany-someobject/jetanotherobject] Request attempt #2/9 failed; TCP closed; will re-authenticate: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/jetanotherobject
2025-04-03 16:38:01.627	[2025-04-03 14:38:01,624] kopf.objects         [ERROR   ] [mycompany-someobject/someobject-public-api] Request attempt #2/9 failed; TCP closed; will re-authenticate: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/someobject-public-api
2025-04-03 16:38:01.627	[2025-04-03 14:38:01,625] kopf.objects         [ERROR   ] [mycompany-someobject/statics] Request attempt #2/9 failed; TCP closed; will re-authenticate: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/statics
2025-04-03 16:38:01.627	[2025-04-03 14:38:01,625] kopf.objects         [ERROR   ] [mycompany-someobject/someobject-ui] Request attempt #2/9 failed; TCP closed; will re-authenticate: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/someobject-ui
2025-04-03 16:38:01.627	[2025-04-03 14:38:01,626] kopf.objects         [ERROR   ] [mycompany-someobject/anotherobject-ui] Request attempt #2/9 failed; TCP closed; will re-authenticate: PATCH https://10.0.0.1:443/apis/my.kind/v1alpha1/namespaces/mycompany-someobject/publicendpoints/anotherobject-ui
2025-04-03 16:38:01.934	[2025-04-03 14:38:01,934] kopf.objects         [INFO    ] [mycompany-someobject/api-gateway] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:01.971	[2025-04-03 14:38:01,970] kopf.objects         [INFO    ] [mycompany-someobject/administration-ui] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:03.508	[2025-04-03 14:38:03,507] kopf.objects         [INFO    ] [mycompany-someobject/someobject-js] Timer 'important-stuff-reload' succeeded.
2025-04-03 16:38:09.782	[2025-04-03 14:38:09,781] kopf.objects         [INFO    ] [mycompany-someobject/anotherobject] Timer 'important-stuff-reload' succeeded.
// All consecutive executions of Timer 'important-stuff-reload' succeeded.

Now, after re-authentication finishes, all resources are handled correctly 👍

One thing that could be improved before the production release is the log level/noticeability. Right now, ERROR looks pretty alarming, but it is not: everything works as expected.

@nolar (Owner)

nolar commented Apr 3, 2025

@DCkQ6 Thank you for the feedback! It is really helpful to see how it works and that it solves the issue. Therefore, I promote the fix from the "supposed" to the "working" level.


You are right, the massive errors do not look good. On my typical scale of log levels, ERRORs are something that requires the near- or mid-term attention of humans (maybe even an alarm in the monitoring system). This case does not require any attention/intervention, so at most WARNING, but probably just INFO/DEBUG (an expected, self-resolving case).

I will think about what can be done with that. Thoughts written down, for the record:

When we have N watch-streams using the same credentials object (ConnectionInfo/VaultItem), the first one to fail —which is essentially random— will "invalidate" the credentials in the vault (via Vault.invalidate(info)). This, under the hood, releases all resources owned by that credentials object, including the attached aiohttp.Session — by closing it. Closing the session, in turn, closes the TCP-level connections, either from the client side or from the server side — which pops up in the other watch-streams as these "server disconnected" errors.

I see 2 ways out:

  • Gracefully notify all other watch-streams of the same credentials/session object about its invalidated state, and take this state into account when retrying/reconnecting in the error-retry cycle, or even when logging the errors. Right now, session.closed is de facto used as such a flag, but it is still treated as an error. And the session can be closed for other reasons (tcp-/http-/aiohttp-specific), so we should better have our own flag.
  • Keep a counter of the actual consumers of the credentials, and release the resources only when the counter reaches zero: basically, a classic garbage collector. Risks: not getting a signal that a consumer (a watch-stream) failed and is no longer using the session, unless it is explicitly intercepted and notified. Seems complex & unreliable.

I am currently unsure about the pros & cons of each approach; I will try them in practice in the dev env where I had this case reproduced, and will choose something. Probably as a separate PR, just for the sanity of code changesets.
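
For illustration, a minimal sketch of the first option (hypothetical types, not Kopf's API): an explicit invalidation flag on the shared session, set before closing it, so consumers can tell an on-purpose closure apart from genuine transport errors:

import asyncio

import aiohttp

class CredentialsInvalidated(Exception):
    """The session was closed on purpose during re-authentication."""

class SharedSession:
    # Our own flag, instead of overloading session.closed, which can become
    # True for unrelated tcp-/http-/aiohttp-specific reasons.
    def __init__(self, session: aiohttp.ClientSession) -> None:
        self.session = session
        self.invalidated = asyncio.Event()

    async def invalidate(self) -> None:
        self.invalidated.set()      # notify all consumers first...
        await self.session.close()  # ...then actually release the resources

async def watch_stream(shared: SharedSession, url: str) -> None:
    while not shared.invalidated.is_set():
        try:
            async with shared.session.get(url) as resp:
                async for line in resp.content:  # consume the watch-stream
                    ...
        except aiohttp.ClientError:
            await asyncio.sleep(1)  # a genuine transport error: back off & retry
        except RuntimeError:
            continue  # "Session is closed": loop back and re-check the flag
    # An expected state: hand over to re-authentication instead of an ERROR.
    raise CredentialsInvalidated()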

@DCkQ6

DCkQ6 commented Apr 4, 2025

Bad news: there is probably a memory leak. Below is a memory-usage chart of my operator:
[Image: memory-usage chart of the operator pods]

  • The blue line is the version before the fix, Kopf 1.37.2; the spikes are restarts due to failures in re-authentication.
  • The green and yellow lines are after the upgrade. It is the same version; Grafana just color-coded the different pod names.

Could you please take a look?

@nolar (Owner)

nolar commented Apr 4, 2025

@DCkQ6 Sure. I’ll take a look. Thanks for catching this.
