Watchers on Custom Resources throw RuntimeError("Session is closed") and permanently die #980
Comments
Noticing the same behaviour on EKS with Python 3.11.2, aiohttp 3.8.4, kopf 1.36.0. In my case I am monitoring delete events for Job resources. What seems to happen is: the request fails because of token expiry -> the watcher goes to sleep because of backoff -> while it sleeps, the session gets closed by the vault when invalidating the previous credentials -> upon waking up and retrying the request, it fails with `RuntimeError("Session is closed")`.
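A minimal standalone illustration of that sequence, assuming a shared aiohttp session that is closed by another task during the backoff sleep; this is not kopf's actual code, just a sketch of the failure mode (the URL is a placeholder):

```python
import asyncio

import aiohttp


async def watcher(session: aiohttp.ClientSession) -> None:
    try:
        # First attempt fails, e.g. 401 Unauthorized due to an expired token.
        await session.get("https://kubernetes.example/apis")
    except Exception:
        await asyncio.sleep(2)  # back off before retrying
    # Meanwhile the re-authentication path has closed the shared session:
    try:
        await session.get("https://kubernetes.example/apis")
    except RuntimeError as e:
        print(f"Retry failed: {e!r}")  # RuntimeError('Session is closed')


async def reauthenticator(session: aiohttp.ClientSession) -> None:
    await asyncio.sleep(1)  # pretend the old credentials are invalidated here
    await session.close()   # the session tied to them is closed while the watcher sleeps


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(watcher(session), reauthenticator(session))


asyncio.run(main())
```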
We are also noticing the same issue, but on AKS. Has anybody found a workaround or solution for this problem? Maybe a way to include this in the probes? Python version 3.11.3.
Thanks for this hint! Indeed, this is a possible scenario in some (rare?) cases of the API failing several concurrent requests. Can you please try the fix from #1031; the branch is `session-closed-in-reauth`. Depending on your packaging system, it can be installed like this (pip or requirements.txt; different for Poetry):
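Presumably something like the following for pip or requirements.txt (the exact command was not preserved in this thread; the branch name is taken from the repro snippets further below):

```bash
# Install kopf directly from the fix branch of PR #1031:
pip install "git+https://github.com/nolar/kopf.git@session-closed-in-reauth"
```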
That would be Kopf 1.36.1 plus this fix only.
Hi @nolar, I think the patch doesn't work. I have introduced a couple of probes to check whether the watcher is alive, and it seems to always fail. This happens to us regularly, maybe depending on the cluster setup. In particular, I think it all started when we configured:

And started mounting the token using volume projection with a duration of 3600s. It seems like kopf doesn't handle the refresh well. Below are my logs:
These are my dependencies:
Faced it as well; it's a big problem for us. @Spacca
We ended up loading the re-authentication hook with a very rudimentary check that forces the operator to restart. For us this was every 10 minutes. Mileage may vary.
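A minimal sketch of that kind of workaround, assuming an in-cluster service-account token and a restart policy (or external supervisor) that brings the operator back up; the handler name, threshold, and paths are illustrative, not the commenter's actual code:

```python
import os
import sys
import time

import kopf

_STARTED_AT = time.monotonic()
_MAX_LIFETIME_SECONDS = 10 * 60  # rudimentary threshold: force a restart every ~10 minutes


@kopf.on.login()
def login_with_forced_restart(**_) -> kopf.ConnectionInfo:
    # If the operator has been alive too long, hard-exit and let the orchestrator restart it.
    # os._exit() is used so the exit works even if the handler runs in a worker thread.
    if time.monotonic() - _STARTED_AT > _MAX_LIFETIME_SECONDS:
        os._exit(1)

    with open("/var/run/secrets/kubernetes.io/serviceaccount/token", encoding="utf-8") as f:
        token = f.read().strip()
    return kopf.ConnectionInfo(
        server=f"https://{os.environ['KUBERNETES_SERVICE_HOST']}:443",
        ca_path="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
        scheme="Bearer",
        token=token,
    )
```

This is crude by design: it simply caps the operator's lifetime below the token lifetime, so the stuck-watcher state never lasts long.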
I started seeing this issue after upgrading to Azure Kubernetes Service (AKS) 1.30. The release notes for AKS 1.30 say:
Anyone else who tries to use kopf on AKS 1.30+ will also experience this issue. AKS 1.29 goes end-of-life in March 2025, so fixing this is becoming more urgent. Alternatively, is there a workaround that doesn't require modifying …?
A sudden update: I am now jobless, so after 1-2-3 weeks of rest or so, I will be able to look into several issues accumulated here over the past 1-2-3 years, this one included, and maybe catch up with new K8s features if needed.
Is there some clear repro with 10-100% probability, or a hypothesis about what causes the issue? Specifically, what makes the PR #1031 insufficient? I could not find any specifics in the comments above.
@cpnielsen Rebasing should be easy. Let me see… Done. Pushed. The CI is fixed: green again.
@nolar I finally had time to test this out. In short, the issue is not fixed. Before:
Now:
For completeness, this is what our login handler looks like (we use kubernetes_asyncio): import kopf
from kubernetes_asyncio import client, config
# other imports omitted for brevity
@kopf.on.login()
async def authenticate(**_: Any) -> kopf.ConnectionInfo:
try:
config.load_incluster_config()
except config.ConfigException:
await config.load_kube_config()
cfg = client.Configuration.get_default_copy()
# Taken from kopf.piggybacking for sync kubernetes library
header: Optional[str] = cfg.get_api_key_with_prefix("BearerToken")
parts: Sequence[str] = header.split(" ", 1) if header else []
scheme, token = (
(None, None) if len(parts) == 0 else (None, parts[0]) if len(parts) == 1 else (parts[0], parts[1])
) # RFC-7235, Appendix C.
ci = kopf.ConnectionInfo(
server=cfg.host,
ca_path=cfg.ssl_ca_cert,
insecure=not cfg.verify_ssl,
username=cfg.username or None,
password=cfg.password or None,
scheme=scheme,
token=token,
certificate_path=cfg.cert_file,
private_key_path=cfg.key_file,
priority=1,
)
return ci |
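As a side note, a tiny standalone check of the RFC-7235 splitting logic above, with made-up header values (not from the original comment):

```python
def split_auth_header(header):
    # Same scheme/token splitting as in the login handler above.
    parts = header.split(" ", 1) if header else []
    return (
        (None, None) if len(parts) == 0 else
        (None, parts[0]) if len(parts) == 1 else
        (parts[0], parts[1])
    )


assert split_auth_header(None) == (None, None)                           # no header at all
assert split_auth_header("sometoken") == (None, "sometoken")             # bare token, no scheme
assert split_auth_header("Bearer sometoken") == ("Bearer", "sometoken")  # scheme + token
```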
So far, I can confirm: reproducible with 100% probability. The setup is somewhat sophisticated, so I am dumping it here so as not to forget if I get tired this time:

Repro

Preparations:

Pod.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mydev
spec:
  containers:
    - image: python:3.13
      command: ["/bin/bash"]
      # args: ["-c", "pip install kopf && exec sleep 10d"]
      args: ["-c", "pip install git+https://github.com/nolar/kopf.git@session-closed-in-reauth && exec sleep 10d"]
      name: main
      volumeMounts:
        - mountPath: /var/run/secrets/tokens
          name: vault-token
  serviceAccountName: kopfexample-account
  volumes:
    - name: vault-token
      projected:
        sources:
          - serviceAccountToken:
              path: vault-token
              expirationSeconds: 600  # min 10min
```

It cannot be made faster than every 10 minutes, so there are long waiting times for the repro.

The local operator code:

```python
import datetime
import os

import kopf


@kopf.on.startup()
def s(settings: kopf.OperatorSettings, **_):
    settings.watching.client_timeout = 10
    settings.networking.request_timeout = 15


@kopf.on.login()
def mylgn(**_) -> kopf.ConnectionInfo:
    with open("/var/run/secrets/tokens/vault-token", encoding="utf-8") as f:
        token = f.read().strip()
    return kopf.ConnectionInfo(
        server=f"https://{os.environ['KUBERNETES_SERVICE_HOST']}:443",
        ca_path="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
        scheme="Bearer",
        token=token,
        priority=int(datetime.datetime.utcnow().timestamp()),
    )


@kopf.on.create('kopfexamples')
def create_fn(spec, **kwargs):
    print(f"And here we are! Creating: {spec}")
    return {'message': 'hello world'}  # will be the new status
```

Copy the local Kopf & operator code to the pod on every change:

```bash
kubectl cp ~/src/kopf/kopf/ default/mydev:/usr/local/lib/python3.13/site-packages/
kubectl cp ~/src/kopf/examples/01-minimal/example.py default/mydev:/tmp/
```

Run as:

```bash
kubectl exec -it pod/mydev -- kopf run -v -n default /tmp/example.py
```

Observe for a minimum of 10, usually 11-12 minutes. Since the operator attempts to re-connect every 10 seconds, it will hit the token invalidation soon after the token expires, plus some internal delays.

Suspicion:

The suspected sequence is this:

The most suspected place (again, not proven):

It is difficult to wrap my head around this concurrent sequence of events, since I mostly forgot how it works. But the most promising way of fixing would be this:

Possible fixes:

In … Alternatively, when invalidating the old credentials, mark the vault as still having hope to get something (pending?), and in …

All in all, there should be no unprotected gap where the vault believes there are no credentials for the next request.

To be revisited a bit later…
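A rough sketch of that last idea, using a simplified stand-in for the credentials vault (this is not kopf's actual vault class; the names and structure are invented for illustration): invalidation keeps a "still hoping" flag set, so consumers wait for fresh credentials instead of concluding that none exist.

```python
import asyncio
from typing import Optional


class CredentialsHolder:
    """A simplified stand-in for a credentials vault with a 'pending' state."""

    def __init__(self) -> None:
        self._credentials: Optional[str] = None  # e.g. a bearer token; simplified to a string
        self._pending = True                     # True while new credentials are still expected
        self._condition = asyncio.Condition()

    async def provide(self, credentials: str) -> None:
        # Called when a login handler (re-)authenticates successfully.
        async with self._condition:
            self._credentials = credentials
            self._pending = False
            self._condition.notify_all()

    async def invalidate(self) -> None:
        # Drop the old credentials, but keep hoping: re-authentication is in progress,
        # so consumers should wait rather than fail with "no credentials".
        async with self._condition:
            self._credentials = None
            self._pending = True

    async def give_up(self) -> None:
        # Only if all login handlers failed: wake the waiters so they can raise.
        async with self._condition:
            self._pending = False
            self._condition.notify_all()

    async def get(self) -> str:
        # Consumers (watch-streams) block here during re-authentication instead of
        # hitting the unprotected gap where the vault believes there are no credentials.
        async with self._condition:
            await self._condition.wait_for(
                lambda: self._credentials is not None or not self._pending)
            if self._credentials is None:
                raise LookupError("No credentials, and none are expected anymore.")
            return self._credentials
```

A watch-stream would then `await holder.get()` before each request; during a token rotation the call simply blocks until `provide()` delivers the new token.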
For the record: it seems to be the same issue as in #1158 (after the proposed fix is applied).
There is an updated fix (4 commits) in #1031, rebased on the freshest main branch. I had to significantly change the credentials vault internals (locks, locks, locks…). For me, the operator now works for 60+ minutes in the artificial environment where it was failing 100% every ≈10-20 minutes (the lowest possible token lifetime, on the 1st or 2nd re-auth). I would appreciate it if you could test it somewhere where the error was happening. Please mind that the code is still drafty and experimental, so do not push it to production. Either way, I am off for now. If there are any issues, I can take a look next week at the earliest.
I prepared a test image of my operator with kopf from the `session-closed-in-reauth` branch. It looks like this helped; below are logs from a re-authentication event that previously was causing a restart.
Now, after the re-authentication finishes, all resources are handled correctly 👍 One thing that could be improved before the production release is the log level/noticeability. Right now,
@DCkQ6 Thank you for the feedback! That is really helpful to see how it works and solves the issue. Therefore, I promote the fix from the "supposed" to the "working" level.

You are right, the massive errors do not look good. In my typical scale of log levels, ERRORs are something that require the near- or mid-term attention of humans (maybe even an alarm in the monitoring system). This case does not require any attention/intervention, so at most a WARNING, but probably just INFO/DEBUG (an expected, self-resolving case). I will think about what can be done with that.

Thoughts written down, for the record: when we have N watch-streams using the same credentials object (

I see 2 ways out:
I am currently unsure about the pros & cons of each approach; I will try them in practice in the dev env where I had this case reproduced, and will choose something. Probably as a separate PR, just for the sanity of code changesets.
@DCkQ6 Sure. I'll take a look. Thanks for catching this.
Long story short
Operator starts up and runs normally as expected. After running for some time, some of the watch streams will throw a `RuntimeError("Session is closed")`. Once this happens, that watch stream will never restart until the operator is restarted. This only appears to happen with custom resources (ConfigMaps are fine).

Kopf version
1.35.6
Kubernetes version
v1.23.13 eks
Python version
3.9.13
Code
Logs
Additional information
This is an EKS cluster, using `aws eks get-token` in the kubeconfig to authenticate.
Using aiohttp version 3.8.3.
Using kubernetes client version 24.2.0.
aws-cli/2.2.43 Python/3.8.8 (used by the kubeconfig).
Not all of the operator's watch streams die at the same time. This is not running in a container on the cluster but on a server outside of AWS.