Description/Context
There are times when we need to ssh into a pod in order to run (potentially long-running) scripts via the Django shell. Currently the pods seem to be culled frequently, which forces me to re-run
kubectl get pods -n mitlearn
to find a valid pod name and ssh back in, only to have that pod culled again moments later.
We need some way to handle cases where a developer needs to ssh in and run one or more long-running scripts.
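For context, the interactive workaround today looks roughly like this (the pod name is a placeholder and the manage.py invocation is an assumption based on a standard Django setup):

# Open a Django shell in an existing pod; the session is lost whenever that pod is culled
kubectl exec -it <pod-name> -n mitlearn -- python manage.py shell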
The other potential problem case (I haven't confirmed this) is the Celery worker pods. Some tasks, such as the ETL pipelines and embeddings generation, take a while to run. The tasks themselves are resilient to restarts, but if the pods are too ephemeral I can see this causing certain Celery tasks to restart endlessly. Something similar happened on RC on Heroku, though that was due to resource constraints.
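If we want to confirm that hypothesis, the restart counts and ages of the worker pods should show it (the label selector below is a guess at how the Celery workers are labeled in this namespace, so adjust as needed):

# RESTARTS column and pod ages indicate how often the workers are being recycled
kubectl get pods -n mitlearn -l app=celery-worker

# Shows the termination reason for a specific worker pod (OOMKilled, evicted, etc.)
kubectl describe pod <celery-worker-pod-name> -n mitlearn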
OK, one response: There is a way to schedule long-running jobs in Kubernetes that can't be killed. You can find the recipe here. If you want the process NOT to be attached to your tty (so you can log out, walk away, etc.), omit the --tty flag, but then you won't get immediate interactive output and will need to query the logs.
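A minimal sketch of what that can look like with kubectl run (the pod names, image name, and management command below are placeholders; the linked recipe may differ in its details):

# Interactive: attach stdin and a tty so you can drive the Django shell directly
kubectl run one-off-shell -n mitlearn --image=<app-image> --restart=Never --stdin --tty \
  --command -- python manage.py shell

# Detached: omit --tty (and --stdin), let the command run on its own, and read the output from the logs
kubectl run one-off-task -n mitlearn --image=<app-image> --restart=Never \
  --command -- python manage.py <long_running_command>
kubectl logs -f one-off-task -n mitlearn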