Description
Is your feature request related to a problem? Please describe.
Over the last 633 successful runs, the `cron-conda` job has had a maximum runtime of 40 minutes (mean=23, std=2) across all matrix combinations.
However, some runs fail only after reaching the 6-hour limit that GitHub imposes. In other words, these jobs seem to get stuck, possibly for external or random reasons.
One such example is this job run, which failed after 6 hours. More stuck jobs have been observed over the last six months, the first on 11-Jan-2025 and the last on 17-Apr-2025; more recent occurrences are also possible because our dataset has a cutoff date around late May. With the proposed changes (see below), a total of 145 hours would have been saved over the last six months, clearing the queue for other workflows and speeding up the CI of the project, while also saving resources in general 🌱.
Describe the solution you'd like
The idea is to set a timeout to stop jobs that run much longer than their historical maximum, because such jobs are probably stuck and will simply fail with a timeout at 6 hours.
Our PR will propose setting the timeout to `max + 3*std = 46 minutes`, where `max` and `std` (standard deviation) are derived from the history of 633 successful runs. This provides sufficient margin if the workflow gets naturally slower in the future, but if you would prefer a lower or higher threshold, we would be happy to adjust it.
Note that the timeout applies to each matrix job individually, not to their combined runtime, and overrides GitHub's default 6-hour timeout.
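For illustration, a minimal sketch of what the change could look like in the workflow file. The job id, runner, matrix values, and steps below are placeholders we assume for the example, not a copy of the project's actual workflow; the only relevant line is `timeout-minutes`, the standard job-level setting in GitHub Actions.

```yaml
# Hypothetical excerpt: job id, matrix, and steps are placeholders.
jobs:
  cron-conda:
    runs-on: ${{ matrix.os }}
    # Stop any single matrix job that runs longer than 46 minutes
    # (max 40 + 3 * std 2) instead of waiting for the default 360 minutes.
    timeout-minutes: 46
    strategy:
      matrix:
        os: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v4
```

Because the setting sits at the job level, every job generated from the matrix gets its own 46-minute budget.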
Additional context
Hi,
We are a team of researchers from University of Zurich and we are currently working on energy optimizations in GitHub Actions workflows.
Thanks for taking the time to read this.
Feel free to let us know (here or at the email below) if you have any questions.
Best regards,
Konstantinos Kitsios
[email protected]