
ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random Azure regions #5081


Conversation

@wking (Member) commented Sep 18, 2019

Builds on #5079; land that first.

Shard over four regions to increase our capacity, because Azure allows auto-bumps for vCPU limits up to 200 in each region, but it's a longer process to get the limits raised beyond that in a single region. This allows us to relax the lease restriction from 78ade84 (Prow: drop max azure CI clusters to 5, 2019-09-18, #5074).

The change to Bash is because Bash supports $RANDOM, but POSIX does not. We already use RANDOM in openshift-installer-master-presubmits.yaml since ca464ed (#4148).
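
For concreteness, the shard is just a small case statement keyed on $RANDOM in the template's Bash step; a minimal sketch (the region names and the AZURE_REGION variable here are illustrative, not the exact template contents):

  # Region shard sketch.  $RANDOM is uniform over 0..32767 and
  # 32768 % 4 == 0, so each branch is equally likely.
  case $((RANDOM % 4)) in
  0) AZURE_REGION=centralus;;
  1) AZURE_REGION=eastus;;
  2) AZURE_REGION=eastus2;;
  3) AZURE_REGION=westus;;
  esac
  echo "Chose Azure region ${AZURE_REGION}"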

This is a hack to green up CI today. Ideally we'd be loading the set of region choices (and possibly weights) from some shared location somewhat like ca464ed has things. But the plan is to break up these templates into more composable chunks soon anyway, so I'm ok if we aren't all that DRY in the short term.

@openshift-ci-robot added the size/M label Sep 18, 2019
@openshift-ci-robot added the approved label Sep 18, 2019
@@ -304,7 +302,7 @@ objects:
         - name: HOME
           value: /tmp
         command:
-        - /bin/sh
+        - /bin/bash
A Contributor commented on the diff:

this is fine, sh sucks anyway
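
This is also why the interpreter swap matters: in a POSIX sh such as dash, RANDOM is an ordinary (normally unset) variable, so the arithmetic silently degrades to zero. A hypothetical comparison, assuming RANDOM is not already exported in the environment (the bash output shown is just one possible value):

  $ dash -c 'echo $((RANDOM % 4))'
  0
  $ bash -c 'echo $((RANDOM % 4))'
  2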

@smarterclayton (Contributor) commented:

lgtm

@smarterclayton (Contributor) commented:

azure reads from cluster to get region? or do we still need to have region set in the test container?

@wking force-pushed the shard-azure-over-multiple-regions branch from b9e011f to 757133a on September 18, 2019 at 19:48
@openshift-ci-robot removed the approved label Sep 18, 2019
@wking (Member, Author) commented Sep 18, 2019

> azure reads from cluster to get region? or do we still need to have region set in the test container?

No AZURE_REGION in test in master. So I dunno how that works, but I don't think I'm breaking it ;).

@wking force-pushed the shard-azure-over-multiple-regions branch from 757133a to 0a29515 on September 18, 2019 at 21:04
@wking (Member, Author) commented Sep 18, 2019

Moved the case statement from the aws block to the azure4 block in cluster-launch-installer-e2e.yaml with 757133a5a -> 0a2951539.

But #5084 takes some pressure off this PR, so I'm not sure whether we still want it (and bump our cap another 4x on top of #5084) or not.

@smarterclayton (Contributor) commented:

I want it.

ci-operator/templates/openshift/installer/cluster-launch-installer-*: Random Azure regions

Shard over four regions to increase our capacity, because Azure allows
auto-bumps for vCPU limits up to 200 in each region, but it's a longer
process to get the limits raised beyond that in a single region.  This
sets us up for more lease-restriction relaxing after 78ade84 (Prow:
drop max azure CI clusters to 5, 2019-09-18, openshift#5074) and 594930f
(Prow: increase max Azure CI clusters to 20, 2019-09-18, openshift#5084).  We
may want per-region quotas, but for the moment I'm just hoping that
our random choices are even enough that an Azure-wide quota is
sufficient.  And conveniently, $RANDOM runs from 0 through 32767 [1],
so our modulo 4 value is evenly weighted :).

The change to Bash is because Bash supports $RANDOM [1], but POSIX
does not [2].  We already use RANDOM in
openshift-installer-master-presubmits.yaml since ca464ed
(openshift/installer: add IaaS-agnostic E2E test, 2019-06-20, openshift#4148).

Ideally we'd be loading the set of region choices (and possibly
weights) from some shared location somewhat like ca464ed has
things.  And we'd be reporting the chosen region in a structured way
for convenient monitoring.  But the plan is to break up these
templates into more composable chunks soon anyway, so I'm ok if we
aren't all that DRY in the short term.

[1]: https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#index-RANDOM
[2]: https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html#tag_23_02_05_03
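
As a quick empirical check of the even-weighting claim (not part of the commit, just a sketch), tallying RANDOM % 4 over many draws should put each bucket near 25%:

  # Each bucket should land close to 25000 out of 100000 draws.
  declare -A count
  for _ in $(seq 1 100000); do
    idx=$((RANDOM % 4))
    count[$idx]=$(( ${count[$idx]:-0} + 1 ))
  done
  for r in 0 1 2 3; do
    echo "region index ${r}: ${count[$r]:-0}"
  done
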
@wking force-pushed the shard-azure-over-multiple-regions branch from 0a29515 to 4d08a9d on September 18, 2019 at 22:38
@openshift-ci-robot added the size/S and approved labels and removed the size/M label Sep 18, 2019
@wking (Member, Author) commented Sep 18, 2019

Both Azure jobs made it through setup with the sharded regions. So I've rebased onto master around #5084 with 0a2951539 -> 4d08a9d, and we'll wait out the next round of rehearsals.

@stevekuznetsov (Contributor) commented:

/lgtm
/hold

@openshift-ci-robot added the do-not-merge/hold and lgtm labels Sep 18, 2019
@openshift-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stevekuznetsov, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking (Member, Author) commented Sep 19, 2019

All green :). What's the hold for? Can we pull it?

@smarterclayton (Contributor) commented:

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label Sep 19, 2019
@openshift-merge-robot merged commit c790aaa into openshift:master Sep 19, 2019
@openshift-ci-robot commented:

@wking: Updated the following 4 configmaps:

  • prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml

In response to this:

Builds on #5079; land that first.

Shard over four regions to increase our capacity, because Azure allows auto-bumps for vCPU limits up to 200 in each region, but it's a longer process to get the limits raised beyond that in a single region. This allows us to relax the lease restriction from 78ade84 (Prow: drop max azure CI clusters to 5, 2019-09-18, #5074).

The change to Bash is because Bash supports $RANDOM, but POSIX does not. We already use RANDOM in openshift-installer-master-presubmits.yaml since ca464ed (#4148).

This is a hack to green up CI today. Ideally we'd be loading the set of region choices (and possibly weights) from some shared location somewhat like ca464ed has things. But the plan is to break up these templates into more composable chunks soon anyway, so I'm ok if we aren't all that DRY in the short term.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking deleted the shard-azure-over-multiple-regions branch September 19, 2019 18:04
wking added a commit to wking/openshift-release that referenced this pull request Sep 25, 2019
… Drop eastus Azure region

Azure seems to have a shortage of VMs there today, with failures like
[1]:

  level=error msg="Error: Code=\"ZonalAllocationFailed\" Message=\"Allocation failed. We do not have sufficient capacity for the requested VM size in this zone. Read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance\""

You can see that failure was for eastus with:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/244/artifacts/e2e-azure-serial/installer/terraform.tfstate | jq -r '.resources[] | select(.name == "ignition_bootstrap").instances[].attributes.content | fromjson | .storage.files[] | select(.path == "/opt/openshift/manifests/cluster-config.yaml").contents.source' | sed 's/.*,//' | base64 -d | grep region
          region: eastus

Jerry also found an eastus2 failure like this.

While we could drop to an even weighting among the three remaining
zones, in this commit I'm triple-weighting centralus because we have
extra capacity there.  Before 4d08a9d
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random Azure regions, 2019-09-18, openshift#5081) we were serving all 20 of our
Azure leases out of centralus.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/244
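
The triple-weighting described above amounts to mapping five equally likely values onto three regions. A sketch, with the two non-centralus region names as placeholders since the commit message doesn't list the remaining set:

  # Weighted pick: centralus takes 3 of the 5 slots.  (32768 % 5 != 0,
  # so modulo 5 is very slightly biased, but the skew is negligible
  # for CI scheduling.)
  case $((RANDOM % 5)) in
  0|1|2) AZURE_REGION=centralus;;
  3) AZURE_REGION=eastus2;;   # placeholder region name
  4) AZURE_REGION=westus;;    # placeholder region name
  esac
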
wking added a commit to wking/openshift-release that referenced this pull request Jan 23, 2020
… Random AWS regions for IPI

I'm leaving the UPI templates and such alone for now, because they
mention regions in more places and I don't care as much about them ;).
The approach mirrors what we did for Azure with 4d08a9d
(ci-operator/templates/openshift/installer/cluster-launch-installer-*:
Random Azure regions, 2019-09-18, openshift#5081).  This should reduce resource
contention (both for limits and for API throttling) in CI.  We could
theoretically raise our Boskos caps as well, but I'm leaving that off
for now because we're getting hammered by AMI-copy issues [1].

I've also added ZONE_* variables in case we need to dodge broken zones
like we have in the past with e8921c3
(ci-operator/templates/openshift: Get e2e-aws out of us-east-1b,
2019-03-22, openshift#3204).  I have nothing against the other us-east-1 zones
at the moment, the current ZONE_* overrides there are just to
demonstrate the idea in case folks need to pivot later.  Without
explicit ZONE_* choices, we'll fall back to using zones a and b within
the chosen region.

The DevProductivity Platform (DPP) folks bumped our quotas in the new
regions back in 2019-11 [2].  We should be good for up to 25% of our
current us-east-1 capacity in those regions.  Increases beyond that,
or sharding into additional regions, will require additional DPP <->
AWS work to get limits raised.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1752370#c3
[2]: https://issues.redhat.com/browse/DPP-3108
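
A rough sketch of the shape described above; the region list and the specific us-east-1 zone overrides are illustrative, and only the a/b fallback behavior comes from the commit message:

  # Pick an AWS region at random; us-east-1 demonstrates explicit zone
  # overrides, everything else falls back to zones a and b.
  case $((RANDOM % 4)) in
  0) AWS_REGION=us-east-1
     ZONE_1=us-east-1c
     ZONE_2=us-east-1d
     ;;
  1) AWS_REGION=us-east-2;;
  2) AWS_REGION=us-west-1;;
  3) AWS_REGION=us-west-2;;
  esac
  ZONE_1="${ZONE_1:-${AWS_REGION}a}"
  ZONE_2="${ZONE_2:-${AWS_REGION}b}"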