Description/Context
There are times when we need to ssh into a pod in order to run (potentially long-running) scripts via the Django shell. Currently the pods seem to be culled frequently, which forces me to re-run
kubectl get pods -n mitlearn
to find a valid pod name and ssh back in, only to have that pod culled again moments later.
We need some way to handle cases where a developer needs to ssh in and run one or more long-running scripts.
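For context, the interactive workaround today looks roughly like this (the pod name is a placeholder and the manage.py invocation is an assumption based on a standard Django setup):

# Open a Django shell in an existing pod; the session is lost whenever that pod is culled
kubectl exec -it <pod-name> -n mitlearn -- python manage.py shell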
The other potential problem case (I haven't confirmed this) is the Celery worker pods. Some tasks, such as the ETL pipelines and embeddings generation, take a while to run. The tasks themselves are resilient to restarts, but if the pods are too ephemeral I can see this causing certain Celery tasks to restart endlessly. Something similar happened on RC on Heroku, though that was due to resource constraints.
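If we want to confirm that hypothesis, the restart counts and ages of the worker pods should show it (the label selector below is a guess at how the Celery workers are labeled in this namespace, so adjust as needed):

# RESTARTS column and pod ages indicate how often the workers are being recycled
kubectl get pods -n mitlearn -l app=celery-worker

# Shows the termination reason for a specific worker pod (OOMKilled, evicted, etc.)
kubectl describe pod <celery-worker-pod-name> -n mitlearn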
OK, one response: There is a way to schedule long-running jobs in Kubernetes that can't be killed. You can find the recipe here. If you want the process NOT to be attached to your tty (so you can log out, walk away, etc.), omit the --tty flag, but then you won't get immediate interactive output and will need to query the logs.
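A minimal sketch of what that can look like with kubectl run (the pod names, image name, and management command below are placeholders; the linked recipe may differ in its details):

# Interactive: attach stdin and a tty so you can drive the Django shell directly
kubectl run one-off-shell -n mitlearn --image=<app-image> --restart=Never --stdin --tty \
  --command -- python manage.py shell

# Detached: omit --tty (and --stdin), let the command run on its own, and read the output from the logs
kubectl run one-off-task -n mitlearn --image=<app-image> --restart=Never \
  --command -- python manage.py <long_running_command>
kubectl logs -f one-off-task -n mitlearn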