Skip to content

Retry pr_time_benchmarks when it fails #6005

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 4, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions torchci/lib/bot/retryBot.ts
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,13 @@ import { CachedConfigTracker } from "./utils";
const SUCCESS_CONCLUSIONS = ["success"];
const FAILURE_CONCLUSIONS = ["failure", "cancelled", "timed_out"];

// If these jobs fail, they will always be retried
const KNOWN_FLAKY_JOBS = [
// From @laithsakka, we want to retry this job in a different runner as it could
// fail flakily sometimes
"pr_time_benchmarks",
];

async function getFlakyJobsFromPreviousWorkflow(
owner: string,
repo: string,
Expand Down Expand Up @@ -120,6 +127,13 @@ async function retryCurrentWorkflow(
return false;
}

for (const flakyJobName of KNOWN_FLAKY_JOBS) {
// if the job is a known flaky one, we want to retry it
if (job.name.toLocaleLowerCase().includes(flakyJobName)) {
return true;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused as to what the intention is here? Specifically, why is it after all the other checks, is it just that this wants to rerun even if the test step failed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm trying to make this job eligible for retry even if its test step fails. We don't retry that atm.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In another word, if the job is pr_time_benchmarks job, it will always be retried one more time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In that case maybe a list name like "ALWAYS_RETRY" would make more sense then, since KNOWN_FLAKY_JOBS makes me think that you know it's probably flaky so it's ok to never retry it. Doesn't really matter tho, since the comment explains


// if no test steps failed, can rerun
return !doesLookLikeUserFailure(job, (step) =>
step.name.toLowerCase().includes("test")
Expand Down
40 changes: 40 additions & 0 deletions torchci/test/retryBot.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -172,6 +172,46 @@ describe("retry-bot", () => {
handleScope(scope);
});

test("rerun known flaky jobs", async () => {
const event = requireDeepCopy("./fixtures/workflow_run.completed.json");
event.payload.workflow_run.name = "pull";
const workflow_jobs = requireDeepCopy("./fixtures/workflow_jobs.json");
workflow_jobs.jobs[4].name = `${workflow_jobs.jobs[0].name} (pr_time_benchmarks)`;
workflow_jobs.jobs[4].conclusion = "failure";
workflow_jobs.jobs[4].steps[0].conclusion = "failure";

const owner = event.payload.repository.owner.login;
const repo = event.payload.repository.name;
const attempt_number = event.payload.workflow_run.run_attempt;
const run_id = event.payload.workflow_run.id;

const scope = nock("https://api.github.com")
.get(
`/repos/${owner}/${repo}/actions/runs/${run_id}/attempts/${attempt_number}/jobs?page=1&per_page=100`
)
.reply(200, workflow_jobs)
.get(
`/repos/${owner}/${repo}/contents/${encodeURIComponent(
".github/pytorch-probot.yml"
)}`
)
.reply(
200,
'{retryable_workflows: ["pull", "trunk", "linux-binary", "windows-binary"]}'
)
.post(
`/repos/${owner}/${repo}/actions/jobs/${workflow_jobs.jobs[4].id}/rerun`
)
.reply(200);

const mock = jest.spyOn(clickhouse, "queryClickhouseSaved");
mock.mockImplementation(() => Promise.resolve([]));

await probot.receive(event);

handleScope(scope);
});

test("rerun previous workflow if it has more than one flaky jobs in trunk", async () => {
const event = requireDeepCopy("./fixtures/workflow_run.completed.json");
event.payload.workflow_run.name = "pull";
Expand Down
Loading