ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random Azure regions #5081
Conversation
@@ -304,7 +302,7 @@ objects:
         - name: HOME
           value: /tmp
         command:
-        - /bin/sh
+        - /bin/bash
this is fine, sh sucks anyway
lgtm
azure reads from cluster to get region? or do we still need to have region set in the test container?
Force-pushed from b9e011f to 757133a.
No
Force-pushed from 757133a to 0a29515.
I want it.
… Random Azure regions

Shard over four regions to increase our capacity, because Azure allows auto-bumps for vCPU limits up to 200 in each region, but it's a longer process to get the limits raised beyond that in a single region. This sets us up for more lease-restriction relaxing after 78ade84 (Prow: drop max azure CI clusters to 5, 2019-09-18, openshift#5074) and 594930f (Prow: increase max Azure CI clusters to 20, 2019-09-18, openshift#5084).

We may want per-region quotas, but for the moment I'm just hoping that our random choices are even enough that an Azure-wide quota is sufficient. And conveniently, $RANDOM runs from 0 through 32767 [1], so our modulo 4 value is evenly weighted :).

The change to Bash is because Bash supports $RANDOM [1], but POSIX does not [2]. We already use RANDOM in openshift-installer-master-presubmits.yaml since ca464ed (openshift/installer: add IaaS-agnostic E2E test, 2019-06-20, openshift#4148).

Ideally we'd be loading the set of region choices (and possibly weights) from some shared location somewhat like ca464ed has things. And we'd be reporting the chosen region in a structured way for convenient monitoring. But the plan is to break up these templates into more composable chunks soon anyway, so I'm ok if we aren't all that DRY in the short term.

[1]: https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#index-RANDOM
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html#tag_23_02_05_03
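As a rough sketch of the sharding the commit message describes (the region list and the AZURE_REGION variable name are illustrative assumptions, not the template's exact contents), the even four-way draw looks something like:

```sh
#!/bin/bash
# $RANDOM is a bashism: it expands to an integer in 0..32767, and 32768 is
# divisible by 4, so the modulo-4 result is evenly weighted across regions.
case $((RANDOM % 4)) in
0) AZURE_REGION=centralus;;
1) AZURE_REGION=eastus;;
2) AZURE_REGION=eastus2;;
3) AZURE_REGION=westus;;
esac
echo "Deploying to Azure region ${AZURE_REGION}"
```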
Force-pushed from 0a29515 to 4d08a9d.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
All green :). What's the hold for? Can we pull it?
/hold cancel
@wking: Updated the following 4 configmaps:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
… Drop eastus Azure region

Azure seems to have a shortage of VMs there today, with failures like [1]:

    level=error msg="Error: Code=\"ZonalAllocationFailed\" Message=\"Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance\""

You can see that failure was for eastus with:

    $ curl -s https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/244/artifacts/e2e-azure-serial/installer/terraform.tfstate | jq -r '.resources[] | select(.name == "ignition_bootstrap").instances[].attributes.content | fromjson | .storage.files[] | select(.path == "/opt/openshift/manifests/cluster-config.yaml").contents.source' | sed 's/.*,//' | base64 -d | grep region
      region: eastus

Jerry also found an eastus2 failure like this. While we could drop to an even weighting among the three remaining regions, in this commit I'm triple-weighting centralus because we have extra capacity there. Before 4d08a9d (ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random Azure regions, 2019-09-18, openshift#5081) we were serving all 20 of our Azure leases out of centralus.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/244
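A sketch of the triple-weighted draw this commit describes (the exact region names and the AZURE_REGION variable are assumptions; only the weighting scheme comes from the commit message):

```sh
#!/bin/bash
# Five equally likely slots, three of which map to centralus, giving it
# triple the weight of each of the two other remaining regions.
case $((RANDOM % 5)) in
0|1|2) AZURE_REGION=centralus;;
3) AZURE_REGION=eastus2;;
4) AZURE_REGION=westus;;
esac
echo "Deploying to Azure region ${AZURE_REGION}"
```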
… Random AWS regions for IPI

I'm leaving the UPI templates and such alone for now, because they mention regions in more places and I don't care as much about them ;). The approach mirrors what we did for Azure with 4d08a9d (ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random Azure regions, 2019-09-18, openshift#5081). This should reduce resource contention (both for limits and for API throttling) in CI. We could theoretically raise our Boskos caps as well, but I'm leaving that off for now because we're getting hammered by AMI-copy issues [1].

I've also added ZONE_* variables in case we need to dodge broken zones like we have in the past with e8921c3 (ci-operator/templates/openshift: Get e2e-aws out of us-east-1b, 2019-03-22, openshift#3204). I have nothing against the other us-east-1 zones at the moment; the current ZONE_* overrides there are just to demonstrate the idea in case folks need to pivot later. Without explicit ZONE_* choices, we'll fall back to using zones a and b within the chosen region.

The DevProductivity Platform (DPP) folks bumped our quotas in the new regions back in 2019-11 [2]. We should be good for up to 25% of our current us-east-1 capacity in those regions. Increases beyond that, or sharding into additional regions, will require additional DPP <-> AWS work to get limits raised.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1752370#c3
[2]: https://issues.redhat.com/browse/DPP-3108
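A sketch of the AWS variant (the region list and the ZONE_1/ZONE_2 names are illustrative assumptions; the fallback to the region's a and b zones is described in the commit message):

```sh
#!/bin/bash
# Pick an AWS region at random, then honor any explicit ZONE_* overrides,
# falling back to the chosen region's "a" and "b" availability zones.
case $((RANDOM % 4)) in
0) AWS_REGION=us-east-1;;
1) AWS_REGION=us-east-2;;
2) AWS_REGION=us-west-1;;
3) AWS_REGION=us-west-2;;
esac
ZONE_1="${ZONE_1:-${AWS_REGION}a}"
ZONE_2="${ZONE_2:-${AWS_REGION}b}"
echo "Deploying to ${AWS_REGION} (zones ${ZONE_1}, ${ZONE_2})"
```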
Builds on #5079; land that first.
Shard over four regions to increase our capacity, because Azure allows auto-bumps for vCPU limits up to 200 in each region, but it's a longer process to get the limits raised beyond that in a single region. This allows us to relax the lease restriction from 78ade84 (Prow: drop max azure CI clusters to 5, 2019-09-18, #5074).
The change to Bash is because Bash supports $RANDOM, but POSIX does not. We already use RANDOM in openshift-installer-master-presubmits.yaml since ca464ed (#4148).

This is a hack to green up CI today. Ideally we'd be loading the set of region choices (and possibly weights) from some shared location somewhat like ca464ed has things. But the plan is to break up these templates into more composable chunks soon anyway, so I'm ok if we aren't all that DRY in the short term.
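To see why the interpreter change matters, here's a quick illustration (not from the PR itself): under Bash, RANDOM expands to an integer in 0..32767, while a strictly POSIX shell such as dash leaves it unset, so the same arithmetic always yields 0.

```sh
bash -c 'echo $((RANDOM % 4))'   # prints 0, 1, 2, or 3
dash -c 'echo $((RANDOM % 4))'   # prints 0 every time (RANDOM is unset)
```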