
Rearrange CI tasks for safety + efficiency #21639


Merged: 1 commit, Feb 22, 2024

Conversation

cevich
Member

@cevich cevich commented Feb 13, 2024

There are sometimes conflicting purposes in podman CI:

  1. Have the pipeline proceed in an orderly and progressive manner
    to sometimes save resources and avoid unnecessary runtime.
  2. Complete all testing as quickly as possible in support of
    human-developers moving on to other areas of work.
  3. Ideally/hopefully, accomplish both items above safely,
    preventing untested and/or unintended changes from merging.

This commit shifts the balance of these slightly more toward the second
point. It rearranges most CI tasks into essentially three buckets with
a single (new) aggregation task between the first two:

  1. Build + Verify all the things
  2. Test all the things
  3. Minor/accessory things

The intention is that while we may unnecessarily spin up some number of
testing tasks while others have failed, the best-case scenario
(everything passes) has a much shorter runtime. In other words, it
potentially wastes more resources in favor of a chance to have
developers wait less.
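
For concreteness, a minimal sketch of the shape this takes in .cirrus.yml; the task names, image, and scripts below are illustrative stand-ins, not the exact entries in this PR:

    # Bucket 1: build + verify all the things
    build_task:
      container:
        image: fedora:latest    # illustrative; real tasks use ${FEDORA_CONTAINER_FQIN}
      build_script: make podman

    # New aggregation task: a cheap no-op that gates bucket 2
    build_success_task:
      depends_on:
        - build
      container:
        image: fedora:latest
      noop_script: /bin/true

    # Bucket 2: test all the things, fanned out in parallel
    integration_test_task:
      depends_on:
        - build_success
      container:
        image: fedora:latest
      test_script: make localintegration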

Does this PR introduce a user-facing change?

None

@openshift-ci bot added the do-not-merge/work-in-progress and do-not-merge/release-note-label-needed labels on Feb 13, 2024
@cevich
Member Author

cevich commented Feb 13, 2024

Option: It will bring slightly more complexity, but I could also make it require the mac, windows, bsd, and other cross builds. That would haul in a requirement on validate, though.


Ephemeral COPR build failed. @containers/packit-build please check.

@cevich
Member Author

cevich commented Feb 13, 2024

[image: cirrus yml]

@cevich
Member Author

cevich commented Feb 13, 2024

@mheon @edsantiago PTAL when you have a sec. What do you think: should I make 'Total Build Success' also depend on the auxiliary (mac, windows, bsd, etc.) builds?

Context: The tide bot honors the branch-protection rules, as its documentation declares.

Assumption: The tide bot is working (prow is VERY complex).

@openshift-ci bot added the release-note-none label and removed the do-not-merge/release-note-label-needed label on Feb 13, 2024
@mheon
Member

mheon commented Feb 13, 2024

Um. I would vote yes, because breaking Mac/Windows is becoming a serious showstopper.

@cevich
Member Author

cevich commented Feb 13, 2024

Okay, I suspected as much. I'll work on re-arranging things...

@edsantiago
Member

I've struggled to reconcile the YAML changes here against the commit message. Forgetting the commit message, my understanding is that this PR:

If either "build" or "build_aarch64" fails, prevent wasting CI cycles on a doomed commit.

Is that accurate? If so, could you please update the commit message and github description? All this "branch protection" and "unbuildable" talk confuses me, because that is ZERO change from the world today, in which no PR can merge because both build and build_aarch64 are already required to pass.

@edsantiago
Member

And, if my interpretation is correct, why not make build_success depend on ALL builds? This diff:

@@ -212,6 +212,9 @@ build_success_task:
     depends_on:
         - build
         - build_aarch64
+        - alt_build
+        - freebsd_alt_build
+        - osx_alt_build
     env:
         CTR_FQIN: ${FEDORA_CONTAINER_FQIN}

(with further diffs trimmed, but it obviously needs some more s/build_success/build/ to avoid circular dependencies) converts the graph from
[image: cirrus-map-pr21639-as-is]
to
[image: cirrus-map-pr21639-suggested]
(diff is the blue boxes, upper left)
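
In Cirrus terms, untangling the cycle means the auxiliary builds keep depending on build (never on build_success), while build_success aggregates them all. A sketch under that assumption, with task bodies elided:

    alt_build_task:
      depends_on:
        - build    # must reference build, not build_success, or a dependency cycle forms

    build_success_task:
      depends_on:
        - build
        - build_aarch64
        - alt_build
        - freebsd_alt_build
        - osx_alt_build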

@cevich
Member Author

cevich commented Feb 14, 2024

Is that accurate?

No, sorry; it's probably badly worded.

The point of this PR is to add a new "It builds" gate[0] for the github branch-protection settings, which should in turn also sensitize the tide bot. In other words, another "hoop" for the bot to jump through that will block merging of a PR early in the pipeline, rather than only at the end.

[0] Note: I'm working to add mac/windows/bsd builds as a requirement as per Matt's request.

@cevich
Member Author

cevich commented Feb 14, 2024

Ugh, our updates crossed. Yes, the second graph is where this is headed.

@cevich
Member Author

cevich commented Feb 14, 2024

I should clarify something: I have some mistrust in tide being able to block merge based on branch-protection settings. But github definitely will block merge. So having this happen early in the workflow is significant in that way.

@edsantiago
Member

Let me try to ask a different way. WTF does "github branch protection" mean? As a nonprivileged user, I have no idea what rules are in place for this repo. Your newly-added comment reads:

# ....  It relies on github branch-protection blocking based
# on the status of this task _NAME_ (not alias).

but leaves the details to one's imagination. My first guess is that the rule is:

If task name includes the word "Success", and that job fails, block merge

Obviously this is untrue because there's a final task, success_task, with "Success" in the name, and if any build task fails today it will prevent success_task from going green. So adding a new "Success" job does not seem like it will cure anything.

I am missing something very basic, but I don't even know how to guess at what I am missing.

@cevich
Member Author

cevich commented Feb 14, 2024

I am missing something very basic, but I don't even know how to guess at what I am missing.

Ahh, thanks for providing background from your viewpoint. Branch protection has a LOT of settings, many of which we discussed as a team recently and which are anti-helpful to developers working on this repository. For example, there are check boxes to suggest "behind" PRs rebase, and to require them to be rebased. Both would seriously hurt getting changes in at scale (several reasons, but mainly that CI would constantly churn on the same set of PRs).

So merge-safety wise, we're almost 100% reliant on another branch-protection setting: "Required Status Checks". In this setting you can pick out previous checks from history, by exact name, that have run and passed, and mark them as merge-blocking. Meaning, github will gray out the merge button if those specific named tasks do not pass. There's another setting that makes this apply even to those with admin rights.
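
In Cirrus CI terms, the status-check name that branch protection matches comes from the task's name, while depends_on references use the short name or alias. A hedged sketch of how such a merge-blocking aggregation task might be declared; the exact name string is an assumption, not copied from this repo's settings:

    build_success_task:
      name: "Total Build Success"    # this displayed name is what "Required Status Checks" pins
      alias: build_success           # what other tasks reference in depends_on
      depends_on:
        - build
        - build_aarch64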

Supposedly the tide merge bot honors the github branch-protection rules, including the "Required Status Checks" items. But in the past, the bot has allowed (at least one) PR to merge before all tests had passed. More recently, PRs have merged that couldn't even compile. I believe some of this bot-behavior was ultimately my fault (I'll spare the details) but I can't be 100% sure (evidence is not logged). So I don't 100% trust the bot, but I do 100% trust github graying out the merge button.

So all that said, what I want to do here I will spell out more completely in an updated commit message. Hopefully this background and a new picture will help it make more sense...

@cevich changed the title from "[CI:DOCS] Require build" to "Rearrange CI tasks for safety + efficiency" on Feb 14, 2024
@openshift-merge-robot added the needs-rebase label on Feb 14, 2024
@openshift-merge-robot removed the needs-rebase label on Feb 14, 2024
@cevich
Member Author

cevich commented Feb 14, 2024

[image: cirrus yml]

@edsantiago
Member

Strong disagreement. Why did you remove all the "validate" dependencies? validate is quick and cheap and a sure-fire requirement for anything. All tests should depend on validate.

@edsantiago
Member

[image: cirrus-map-pr21639]

@cevich
Member Author

cevich commented Feb 14, 2024

validate is quick and cheap

What I observed is validate blocking everything else for nearly 12 minutes.

All tests should depend on validate.

Ultimately they do. If left alone with an invalid change, primary_success will block it. This way is simply optimized more in favor of total runtime vs saving money on VMs. Besides, github will sort any failures toward the top of the checks list (in the PR view). When a developer observes that, they can fix and re-push. All then-obsolete tasks will get automatically canceled.
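
The auto-cancel behavior is native Cirrus: on a re-push, tasks of the now-obsolete build are canceled automatically for non-default branches (i.e. PRs), and the behavior can also be pinned per task. A sketch, with an illustrative task name:

    int_podman_fedora_test_task:
      auto_cancellation: true    # a re-push cancels this task if it is still running
      depends_on:
        - build_success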

@cevich
Member Author

cevich commented Feb 14, 2024

Oh and: See my e-mail about adding an "on-failure" widget. Assuming it's possible, that will also bring the train to a rapid stop if certain key tasks fail.

@edsantiago
Member

Okay, then maybe include validate* in build_success?
[image: cirrus-map-]

Reason:

  • validate is a blocker. Failing this will require a repush; flakes are not likely here.
  • validate is also a very common PR failure mode: missing 'NO NEW TESTS', or lint, or build-each-commit.
  • there is literally no point whatsoever in running real tests if validate fails.

I see this as a basic fail-fast best practice.
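
For illustration, that fail-fast gating would be a few added lines in the aggregation task's depends_on; the validate task names here are assumptions based on the graph:

    build_success_task:
      depends_on:
        - build
        - build_aarch64
        - validate            # fail fast: no expensive test starts if lint/commit checks fail
        - validate_aarch64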

And of course I still don't understand the point of this PR, since I still don't understand all this "branch protection" talk; from what little I do understand, it seems like branch protection doesn't work today, so adding a new rule is unlikely to make anything work anyway?

@edsantiago
Member

I just noticed an IRC thread, but it's internal so I won't copy-paste it (even though I see nothing sensitive in it).

I have no idea what all the words mean, but one interpretation is that my hypothesis is correct (the "branch protection" thing is broken, and adding a new "branch protection" thing is likely to suffer from the same breakage). If that is a correct interpretation, maybe we should focus on fixing the breakage? There is still value in rearranging CI jobs, but (IMO) little value in the new build_success task. And if my interpretation is wrong, I would appreciate clarification of what's going on.

@cevich
Member Author

cevich commented Feb 15, 2024

okay thanks. TBH, I'm not thrilled about all the waste either; it hurts the Earth 😞

I'm letting my subconscious mull over a way we could set up some kind of 'on-failure' E-Stop system for critical tasks. So, for example, a validate failure could (somehow) cause the entire Cirrus run to abort. Maybe I/someone can come up with a sexy way to do it.
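
Cirrus does support a per-task on_failure block, so one conceivable E-Stop shape is sketched below. The abort script itself is entirely hypothetical; whether and how a task can abort the rest of its build through the Cirrus API is exactly the open question:

    validate_task:
      validate_script: make validate
      on_failure:
        # HYPOTHETICAL helper: abort the remaining tasks of this build.
        # $CIRRUS_BUILD_ID is a real Cirrus-provided variable; the script is not.
        estop_script: ./contrib/cirrus/abort_build.sh "$CIRRUS_BUILD_ID"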

@cevich
Member Author

cevich commented Feb 16, 2024

Force push: Worked verification into the build section. Updated commit message. Updated graph:

[image: cirrus yml]

I have half-a-mind to combine the last section of "extra" tests into the big center blob. But maybe I'll leave that to a future commit if it's necessary to shave a few more minutes.

@edsantiago
Member

This is going to make everything slower, not faster: bud tests are 33-35 minutes. As of today, they key on local int, and can run in parallel with other tests. With this PR, they will not even start until all other tests have finished, adding 35 minutes to run time.

Step back: I believe primary_success is no longer necessary. If you agree, please remove it. I do think it's safe and sensible to gate bud_test on local-and-remote-system-test. Perhaps farm, minikube, and upgrade also.
[image: cirrus-map-pr21639]
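
Under that reading, the gating would look something like the sketch below; the dependency names are assumed from the graph, not copied from the PR:

    bud_test_task:
      depends_on:
        - local_system_test
        - remote_system_test
        # perhaps also: farm_test, minikube_test, upgrade_test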

@cevich
Member Author

cevich commented Feb 19, 2024

There should be a small ~10min gain by moving the machine tests into "the big blob". But yeah, my half-mind + Ed's full mind suggest we should have just one big, main-test blob. Let me do that up and see how it rolls...

@cevich
Member Author

cevich commented Feb 19, 2024

...okay, I'm going to forego the picture since by now we can all just imagine what it looks like 😄

@edsantiago
Member

Cirrus clockDurationInSeconds reports 1:11:46, which is (checks calculator) less than the typical 2:00-2:30. Of course this doesn't count the time wasted in retrying flakes.

@cevich
Member Author

cevich commented Feb 20, 2024

Lol, I just went and looked up that number, then saw your comment 😞 🤣

So best-case scenario CI is almost 50% faster. Worst-case, it still requires humans to notice failed tasks and take action + we probably waste a bunch of VM-time.

@containers/podman-maintainers PTAL, what do you think?

@cevich cevich marked this pull request as ready for review February 20, 2024 14:59
@openshift-ci bot removed the do-not-merge/work-in-progress label on Feb 20, 2024
@cevich
Member Author

cevich commented Feb 20, 2024

(re-ran PM hyperv test)

@mheon
Member

mheon commented Feb 20, 2024

After 5.0 we really need a more thorough look at what we're testing and whether we should keep testing it, but this should tide us over until then.

@cevich
Member Author

cevich commented Feb 20, 2024

Update: After re-running the PM hyperv test, I noticed cirrus reports a total runtime increase. So it appears Cirrus is counting re-run times. Meaning, the best-case runtime is probably better than an hour.

@edsantiago
Member

LGTM. If this meets with team approval, I'd feel more comfortable if you rebase before final merge.

Current diagram:
[image: cirrus-map-pr21639]

There are sometimes conflicting purposes in podman CI:

1. Have the pipeline proceed in an orderly and progressive manner
   to sometimes save resources and avoid unnecessary runtime.
2. Complete all testing as quickly as possible in support of
   human-developers moving on to other areas of work.
3. Ideally/hopefully, accomplish both items above safely,
   preventing untested and/or unintended changes from merging.

This commit shifts the balance of these slightly more toward the second
point.  It rearranges most CI tasks into essentially three buckets with
a single (new) aggregation task between the first two:

1. Build + Verify all the things
2. Test all the things
3. Minor/accessory things

The intention is that while we may unnecessarily spin up some number of
testing tasks while others have failed, the best-case scenario
(everything passes) has a much shorter runtime.  In other words, it
potentially wastes more resources in favor of a chance to have
developers wait less.

Signed-off-by: Chris Evich <[email protected]>
@cevich
Member Author

cevich commented Feb 21, 2024

Force push: Rebased

@TomSweeneyRedHat
Member

LGTM
one test, "machine-hyperv podman windows rootless host sqlite", stands between you and happy green test buttons.

@cevich
Member Author

cevich commented Feb 21, 2024

Thanks to whoever smashed the re-run button.

@cevich
Member Author

cevich commented Feb 22, 2024

This is ready to go now.

@mheon
Member

mheon commented Feb 22, 2024

LGTM

@rhatdan
Member

rhatdan commented Feb 22, 2024

/lgtm
/approve

@openshift-ci bot added the lgtm label on Feb 22, 2024
Contributor

openshift-ci bot commented Feb 22, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label on Feb 22, 2024
@openshift-merge-bot bot merged commit bb44510 into containers:main on Feb 22, 2024
@stale-locking-app bot added the locked - please file new issue/PR label on May 23, 2024
@stale-locking-app bot locked as resolved and limited conversation to collaborators on May 23, 2024