
Rearrange CI tasks for safety + efficiency #21639


Merged: 1 commit, Feb 22, 2024

Conversation

cevich
Member

@cevich cevich commented Feb 13, 2024

There are sometimes conflicting purposes in podman CI:

  1. Have the pipeline proceed in an orderly and progressive manner
    to sometimes save resources and avoid unnecessary runtime.
  2. Complete all testing as quickly as possible in support of
    human-developers moving on to other areas of work.
  3. Ideally/hopefully, accomplish both items above safely,
    preventing untested and/or unintended changes from merging.

This commit shifts the balance of these slightly more toward the second
point. It rearranges most CI tasks into essentially three buckets with
a single (new) aggregation task between the first two:

  1. Build + Verify all the things
  2. Test all the things
  3. Minor/accessory things

The intention is that while we may unnecessarily spin up some number of
testing tasks while others have failed, the best-case scenario
(everything passes) has a much shorter runtime. In other words, it
potentially wastes more resources in favor of a chance to have
developers wait less.
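
For concreteness, a minimal sketch of the shape this takes in .cirrus.yml; the task names, image, and scripts below are illustrative stand-ins, not the exact entries in this PR:

    # Bucket 1: build + verify all the things
    build_task:
      container:
        image: fedora:latest    # illustrative; real tasks use ${FEDORA_CONTAINER_FQIN}
      build_script: make podman

    # New aggregation task: a cheap no-op that gates bucket 2
    build_success_task:
      depends_on:
        - build
      container:
        image: fedora:latest
      noop_script: /bin/true

    # Bucket 2: test all the things, fanned out in parallel
    integration_test_task:
      depends_on:
        - build_success
      container:
        image: fedora:latest
      test_script: make localintegration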

Does this PR introduce a user-facing change?

None

@openshift-ci bot added the do-not-merge/work-in-progress and do-not-merge/release-note-label-needed labels on Feb 13, 2024
@cevich
Member Author

cevich commented Feb 13, 2024

Option: It will bring slightly more complexity, but I could also make it require the mac, windows, bsd, and other cross builds. That would haul in a requirement on validate, though.


Ephemeral COPR build failed. @containers/packit-build please check.

@cevich
Member Author

cevich commented Feb 13, 2024

[image: cirrus yml]

@cevich
Member Author

cevich commented Feb 13, 2024

@mheon @edsantiago PTAL when you have a sec. What do you think: should I make 'Total Build Success' also depend on the auxiliary (mac, windows, bsd, etc.) builds?

Context: The tide bot honors the branch-protection rules, as its documentation declares.

Assumption: The tide bot is working (prow is VERY complex).

@openshift-ci bot added the release-note-none label and removed the do-not-merge/release-note-label-needed label on Feb 13, 2024
@mheon
Member

mheon commented Feb 13, 2024

Um. I would vote yes, because breaking Mac/Windows is becoming a serious showstopper.

@cevich
Member Author

cevich commented Feb 13, 2024

Okay, I suspected as much. I'll work on re-arranging things...

@edsantiago
Member

I've struggled to reconcile the YAML changes here against the commit message. Forgetting the commit message, my understanding is that this PR:

If either "build" or "build_aarch64" fails, prevent wasting CI cycles on a doomed commit.

Is that accurate? If so, could you please update the commit message and github description? All this "branch protection" and "unbuildable" talk confuses me, because that is ZERO change from the world today, in which no PR can merge because both build and build_aarch64 are already required to pass.

@edsantiago
Member

And, if my interpretation is correct, why not make build_success depend on ALL builds? This diff:

@@ -212,6 +212,9 @@ build_success_task:
     depends_on:
         - build
         - build_aarch64
+        - alt_build
+        - freebsd_alt_build
+        - osx_alt_build
     env:
         CTR_FQIN: ${FEDORA_CONTAINER_FQIN}

(with further diffs trimmed, but it obviously needs some more s/build_success/build/ to avoid circular dependencies) converts the graph from
[image: cirrus-map-pr21639-as-is]
to
[image: cirrus-map-pr21639-suggested]
(diff is the blue boxes, upper left)
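
In Cirrus terms, untangling the cycle means the auxiliary builds keep depending on build (never on build_success), while build_success aggregates them all. A sketch under that assumption, with task bodies elided:

    alt_build_task:
      depends_on:
        - build    # must reference build, not build_success, or a dependency cycle forms

    build_success_task:
      depends_on:
        - build
        - build_aarch64
        - alt_build
        - freebsd_alt_build
        - osx_alt_build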

@cevich
Member Author

cevich commented Feb 14, 2024

Is that accurate?

No, sorry; it's probably badly worded.

The point of this PR is to add a new "It builds" gate[0] for the github branch-protection settings, which should in turn also sensitize the tide bot. In other words, another "hoop" for the bot to jump through that will block merging of a PR early in the pipeline, rather than only at the end.

[0] Note: I'm working to add mac/windows/bsd builds as a requirement as per Matt's request.

@cevich
Member Author

cevich commented Feb 14, 2024

Ugh, our updates crossed. Yes, the second graph is where this is headed.

@cevich
Member Author

cevich commented Feb 14, 2024

I should clarify something: I have some mistrust in tide being able to block merge based on branch-protection settings. But github definitely will block merge. So having this happen early in the workflow is significant in that way.

@edsantiago
Member

Let me try to ask a different way. WTF does "github branch protection" mean? As a nonprivileged user, I have no idea what rules are in place for this repo. Your newly-added comment reads:

# ....  It relies on github branch-protection blocking based
# on the status of this task _NAME_ (not alias).

but leaves the details to one's imagination. My first guess is that the rule is:

If task name includes the word "Success", and that job fails, block merge

Obviously this is untrue because there's a final task, success_task, with "Success" in the name, and if any build task fails today it will prevent success_task from going green. So adding a new "Success" job does not seem like it will cure anything.

I am missing something very basic, but I don't even know how to guess at what I am missing.

@cevich
Member Author

cevich commented Feb 14, 2024

I am missing something very basic, but I don't even know how to guess at what I am missing.

Ahh, thanks for providing background from your viewpoint. Branch protection has a LOT of settings, many of which we discussed as a team recently and which are anti-helpful to developers working on this repository. For example, there are check boxes to suggest "behind" PRs rebase, and to require them to be rebased. Both would seriously hurt getting changes in at scale (several reasons, but mainly that CI would constantly churn on the same set of PRs).

So merge-safety wise, we're almost 100% reliant on another branch-protection setting: "Required Status Checks". In this setting you can pick out previous checks from history, by exact name, that have run and passed, and mark them as merge-blocking. Meaning, github will gray out the merge button if those specific named tasks do not pass. There's another setting that makes this apply even to those with admin rights.
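
In Cirrus CI terms, the status-check name that branch protection matches comes from the task's name, while depends_on references use the short name or alias. A hedged sketch of how such a merge-blocking aggregation task might be declared; the exact name string is an assumption, not copied from this repo's settings:

    build_success_task:
      name: "Total Build Success"    # this displayed name is what "Required Status Checks" pins
      alias: build_success           # what other tasks reference in depends_on
      depends_on:
        - build
        - build_aarch64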

Supposedly the tide merge bot honors the github branch-protection rules, including the "Required Status Checks" items. But in the past, the bot has allowed (at least one) PR to merge before all tests had passed. More recently, PRs have merged that couldn't even compile. I believe some of this bot-behavior was ultimately my fault (I'll spare the details) but I can't be 100% sure (evidence is not logged). So I don't 100% trust the bot, but I do 100% trust github graying out the merge button.

So all that said, what I want to do here I will spell out more completely in an updated commit message. Hopefully this background and a new picture will help it make more sense...

@cevich changed the title from "[CI:DOCS] Require build" to "Rearrange CI tasks for safety + efficiency" on Feb 14, 2024
@openshift-merge-robot added the needs-rebase label on Feb 14, 2024
@openshift-merge-robot removed the needs-rebase label on Feb 14, 2024
@cevich
Member Author

cevich commented Feb 14, 2024

[image: cirrus yml]

@edsantiago
Member

Strong disagreement. Why did you remove all the "validate" dependencies? validate is quick and cheap and a sure-fire requirement for anything. All tests should depend on validate.

@edsantiago
Member

[image: cirrus-map-pr21639]

@cevich
Member Author

cevich commented Feb 14, 2024

validate is quick and cheap

What I observed is validate blocking everything else for nearly 12 minutes.

All tests should depend on validate.

Ultimately they do. If left alone with an invalid change, primary_success will block it. This way is simply optimized more in favor of total runtime vs saving money on VMs. Besides, github will sort any failures toward the top of the checks list (in the PR view). When a developer observes that, they can fix and re-push. All then-obsolete tasks will get automatically canceled.
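
The auto-cancel behavior is native Cirrus: on a re-push, tasks of the now-obsolete build are canceled automatically for non-default branches (i.e. PRs), and the behavior can also be pinned per task. A sketch, with an illustrative task name:

    int_podman_fedora_test_task:
      auto_cancellation: true    # a re-push cancels this task if it is still running
      depends_on:
        - build_success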

@cevich
Member Author

cevich commented Feb 14, 2024

Oh and: See my e-mail about adding an "on-failure" widget. Assuming it's possible, that will also bring the train to a rapid stop if certain key tasks fail.

@edsantiago
Member

Okay, then maybe include validate* in build_success?
[image: cirrus-map-]

Reason:

  • validate is a blocker. Failing this will require a repush; flakes are not likely here.
  • validate is also a very common PR failure mode: missing 'NO NEW TESTS', or lint, or build-each-commit.
  • there is literally no point whatsoever in running real tests if validate fails.

I see this as a basic fail-fast best practice.
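
For illustration, that fail-fast gating would be a few added lines in the aggregation task's depends_on; the validate task names here are assumptions based on the graph:

    build_success_task:
      depends_on:
        - build
        - build_aarch64
        - validate            # fail fast: no expensive test starts if lint/commit checks fail
        - validate_aarch64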

And of course I still don't understand the point of this PR, since I still don't understand all this "branch protection" talk; from what little I do understand, it seems like branch protection doesn't work today, so adding a new rule is unlikely to make anything work anyway?

@edsantiago
Member

I just noticed an IRC thread, but it's internal so I won't copy-paste it (even though I see nothing sensitive in it).

I have no idea what all the words mean, but one interpretation is that my hypothesis is correct (the "branch protection" thing is broken, and adding a new "branch protection" thing is likely to suffer from the same breakage). If that is a correct interpretation, maybe we should focus on fixing the breakage? There is still value in rearranging CI jobs, but (IMO) little value in the new build_success task. And if my interpretation is wrong, I would appreciate clarification of what's going on.

@cevich
Member Author

cevich commented Feb 15, 2024

okay thanks. TBH, I'm not thrilled about all the waste either; it hurts the Earth 😞

I'm letting my subconscious mull over a way we could set up some kind of 'on-failure' E-Stop system for critical tasks. So, for example, a validate failure could (somehow) cause the entire Cirrus run to abort. Maybe I/someone can come up with a sexy way to do it.
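
Cirrus does support a per-task on_failure block, so one conceivable E-Stop shape is sketched below. The abort script itself is entirely hypothetical; whether and how a task can abort the rest of its build through the Cirrus API is exactly the open question:

    validate_task:
      validate_script: make validate
      on_failure:
        # HYPOTHETICAL helper: abort the remaining tasks of this build.
        # $CIRRUS_BUILD_ID is a real Cirrus-provided variable; the script is not.
        estop_script: ./contrib/cirrus/abort_build.sh "$CIRRUS_BUILD_ID"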

@cevich
Member Author

cevich commented Feb 16, 2024

Force push: Worked verification into the build section. Updated commit message. Updated graph:

[image: cirrus yml]

I have half-a-mind to combine the last section of "extra" tests into the big center blob. But maybe I'll leave that to a future commit if it's necessary to shave a few more minutes.

@edsantiago
Member

This is going to make everything slower, not faster: bud tests are 33-35 minutes. As of today, they key on local int, and can run in parallel with other tests. With this PR, they will not even start until all other tests have finished, adding 35 minutes to run time.

Step back: I believe primary_success is no longer necessary. If you agree, please remove it. I do think it's safe and sensible to gate bud_test on local-and-remote-system-test. Perhaps farm, minikube, and upgrade also.
[image: cirrus-map-pr21639]
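
Under that reading, the gating would look something like the sketch below; the dependency names are assumed from the graph, not copied from the PR:

    bud_test_task:
      depends_on:
        - local_system_test
        - remote_system_test
        # perhaps also: farm_test, minikube_test, upgrade_test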

@cevich
Member Author

cevich commented Feb 19, 2024

There should be a small ~10min gain by moving the machine tests into "the big blob". But yeah, my half-mind + Ed's full mind suggest we should have just one big, main-test blob. Let me do that up and see how it rolls...

@cevich
Member Author

cevich commented Feb 19, 2024

...okay, I'm going to forego the picture since by now we can all just imagine what it looks like 😄

@edsantiago
Member

Cirrus clockDurationInSeconds reports 1:11:46, which is (checks calculator) less than the typical 2:00-2:30. Of course this doesn't count the time wasted in retrying flakes.

@cevich
Member Author

cevich commented Feb 20, 2024

Lol, I just went and looked up that number, then saw your comment 😞 🤣

So best-case scenario CI is almost 50% faster. Worst-case, it still requires humans to notice failed tasks and take action + we probably waste a bunch of VM-time.

@containers/podman-maintainers PTAL, what do you think?

@cevich cevich marked this pull request as ready for review February 20, 2024 14:59
@openshift-ci bot removed the do-not-merge/work-in-progress label on Feb 20, 2024
@cevich
Member Author

cevich commented Feb 20, 2024

(re-ran PM hyperv test)

@mheon
Member

mheon commented Feb 20, 2024

After 5.0 we really need a more thorough look at what we're testing and whether we should keep testing it, but this should tide us over until then.

@cevich
Member Author

cevich commented Feb 20, 2024

Update: After re-running the PM hyperv test, I noticed cirrus reports a total runtime increase. So it appears Cirrus is counting re-run times. Meaning, the best-case runtime is probably better than an hour.

@edsantiago
Member

LGTM. If this meets with team approval, I'd feel more comfortable if you rebase before final merge.

Current diagram:
[image: cirrus-map-pr21639]

There are sometimes conflicting purposes in podman CI:

1. Have the pipeline proceed in an orderly and progressive manner
   to sometimes save resources and avoid unnecessary runtime.
2. Complete all testing as quickly as possible in support of
   human-developers moving on to other areas of work.
3. Ideally/hopefully, accomplish both items above safely,
   preventing untested and/or unintended changes from merging.

This commit shifts the balance of these slightly more toward the second
point.  It rearranges most CI tasks into essentially three buckets with
a single (new) aggregation task between the first two:

1. Build + Verify all the things
2. Test all the things
3. Minor/accessory things

The intention is that while we may unnecessarily spin up some number of
testing tasks while others have failed, the best-case scenario
(everything passes) has a much shorter runtime.  In other words, it
potentially wastes more resources in favor of a chance to have
developers wait less.

Signed-off-by: Chris Evich <[email protected]>
@cevich
Member Author

cevich commented Feb 21, 2024

Force push: Rebased

@TomSweeneyRedHat
Member

LGTM
one test, "machine-hyperv podman windows rootless host sqlite", stands between you and happy green test buttons.

@cevich
Member Author

cevich commented Feb 21, 2024

Thanks to whoever smashed the re-run button.

@cevich
Member Author

cevich commented Feb 22, 2024

This is ready to go now.

@mheon
Member

mheon commented Feb 22, 2024

LGTM

@rhatdan
Member

rhatdan commented Feb 22, 2024

/lgtm
/approve

@openshift-ci bot added the lgtm label on Feb 22, 2024
Contributor

openshift-ci bot commented Feb 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label on Feb 22, 2024
@openshift-merge-bot bot merged commit bb44510 into containers:main on Feb 22, 2024
@stale-locking-app bot added the locked - please file new issue/PR label on May 23, 2024
@stale-locking-app bot locked as resolved and limited conversation to collaborators on May 23, 2024