-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Rearrange CI tasks for safety + efficiency #21639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Option: It will bring slightly more complexity, but I could make it also require mac, windows, bsd, and the other cross builds as well. But that will haul in a requirement on |
Ephemeral COPR build failed. @containers/packit-build please check. |
992b154
to
c6002a4
Compare
@mheon @edsantiago PTAL when you have a sec., what do you think: Should I make 'Total Build Success' also depend on the auxiliary (mac, windows, bsd, etc.) builds? Context: The tide bot honors the branch-protection rules like its documentation declares. Assumption: The tide bot is working (prow is VERY complex). |
Um. I would vote yes, because breaking Mac/Windows is becoming a serious showstopper |
Okay, I suspected as much. I'll work on re-arranging things... |
I've struggled to reconcile the YAML changes here against the commit message. Forgetting the commit message, my understanding is that this PR:
Is that accurate? If so, could you please update the commit message and github description? All this "branch protection" and "unbuildable" talk confuses be, because that is ZERO change from the world today, in which no PR can merge because both build and build_aarch64 are required to pass already. |
No, sorry, probably it's badly worded. The point of this PR is to add a new "It builds" gate[0] for the github branch-protection settings, which should in turn also sensitize the [0] Note: I'm working to add mac/windows/bsd builds as a requirement as per Matt's request. |
Ugh, our updates crossed. Yes, the second graph is where this is headed. |
I should clarify something: I have some mistrust in tide being able to block merge based on branch-protection settings. But github definitely will block merge. So having this happen early in the workflow is significant in that way. |
Let me try to ask a different way. WTF does "github branch protection" mean? As a nonprivileged user, I have no idea what rules are in place for this repo. Your newly-added comment reads:
but leaves the details to one's imagination. My first guess is that the rule is:
Obviously this is untrue because there's a final task, I am missing something very basic, but I don't even know how to guess at what I am missing. |
Ahh thanks for providing background from your viewpoint. Branch protection has a LOT of settings, many of which we discussed as a team recently and are anti-helpful to developers working on this repository. For example, there's a check box for: Suggest "behind" PRs should rebase, and "require" them to be rebased. Both would seriously hurt getting changes in at scale (several reasons, but mainly that CI would constantly churn on the same set of PRs). So merge-safety wise, we're almost 100% reliant on another branch-protection setting: "Required Status Checks". In this setting you can pick out previous checks from history, by exact name, that have run and passed, and mark them as merge-blocking. Meaning, github will gray-out the merge button if those specific named tasks do not pass. There's another setting that makes this apply even to those with admin rights. Supposedly the So all that said, what I want to do here I will spell out more completely in an updated commit message. Hopefully this background and a new picture will help it make more sense... |
c6002a4
to
057f907
Compare
057f907
to
810936d
Compare
Strong disagreement. Why did you remove all the "validate"? validate is quick and cheap and a sure-fire requirement for anything. All tests should depend on validate. |
What I observed is validate blocking everything else for nearly 12 minutes.
Ultimately they do. If left alone with an invalid change, |
Oh and: See my e-mail about adding an "on-failure" widget. Assuming it's possible, that will also bring the train to a rapid stop if certain key tasks fail. |
I just noticed an IRC thread, but it's internal so I won't copy-pasted it (even though I see nothing sensitive in it). I have no idea what all the words mean, but one interpretation is that my hypothesis is correct (the "branch protection" thing is broken, and adding a new "branch protection" thing is likely to suffer from the same breakage). If that is a correct interpretation, maybe we should focus on fixing the breakage? There is still value in rearranging CI jobs, but (IMO) little value in the new |
okay thanks. TBH, I'm not thrilled about all the waste either, it hurts the Earth 😞 I'm letting my subconscious mull over a way we could setup some kind of 'on-failure' E-Stop system for critical tasks. So, for example a |
810936d
to
0519430
Compare
There should be a small ~10min gain by moving the machine tests into "the big blob". But yeah, my half-mind + Ed's full mind suggest we should have just one big, main-test blob. Let me do that up and see how it rolls... |
0519430
to
ca398b2
Compare
...okay, I'm going to forego the picture since by now we can all just imagine what what it looks like 😄 |
Cirrus |
Lol, I just went and looked up that number, then saw your comment 😞 🤣 So best-case scenario CI is almost 50% faster. Worst-case, it still requires humans to notice failed tasks and take action + we probably waste a bunch of VM-time. @containers/podman-maintainers PTAL what do you think? |
(re-ran PM hyperv test) |
After 5.0 we really need a more thorough look at what we're testing and whether we should keep testing, but this should tide us over until then |
Update: After re-running the PM hyperv test, I noticed cirrus reports a total runtime increase. So it appears Cirrus is counting re-run times. Meaning, the best-case runtime is probably better than an hour. |
There's are sometimes conflicting purposes in podman CI: 1. Have the pipeline proceed in an orderly and progressive manner to sometimes save resources and unnecessary runtime. 2. Complete all testing as quickly as possible in support of human-developers moving on to other areas of work. 3. Ideally/hopefully, accomplish both items above safely, preventing untested and/or unintended changes from merging. This commit shifts the balance of these slightly more toward the second point. It rearranges most CI tasks into essentially three buckets with a single (new) aggregation task in-between the first two: 1. Build + Verify all the things 2. Test all the things 3. Minor/accessory things The intention is that while we may unnecessarily spin some number of testing tasks while others have failed, the best-case scenario (everything passes) has a much shorter runtime. In other words, it potentially wastes more resources in favor of a chance to have developers wait less. Signed-off-by: Chris Evich <[email protected]>
ca398b2
to
f7d1726
Compare
Force push: Rebased |
LGTM |
Thanks whomever smashed the re-run button. |
This is ready to go now. |
LGTM |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cevich, rhatdan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There's are sometimes conflicting purposes in podman CI:
to sometimes save resources and unnecessary runtime.
human-developers moving on to other areas of work.
preventing untested and/or unintended changes from merging.
This commit shifts the balance of these slightly more toward the second
point. It rearranges most CI tasks into essentially three buckets with
a single (new) aggregation task in-between the first two:
The intention is that while we may unnecessarily spin some number of
testing tasks while others have failed, the best-case scenario
(everything passes) has a much shorter runtime. In other words, it
potentially wastes more resources in favor of a chance to have
developers wait less.
Does this PR introduce a user-facing change?