
OCS Functional Test Suite #114


Merged

Conversation

davidvossel (Contributor)

Functional Tests

Our functional test suite uses the ginkgo testing framework.

Prerequisites for running Functional Tests

  • ocs must already be installed
  • KUBECONFIG env var must be set

Running functional tests

make functest

Below is some sample output of what to expect.

Building functional tests
hack/build-functest.sh
GINKO binary found at /home/dvossel/go/bin/ginkgo
Compiling functests...
    compiled functests.test
Running functional test suite
hack/functest.sh
Running Functional Test Suite
Running Suite: Tests Suite
==========================
Random Seed: 1568299067
Will run 1 of 1 specs

•
Ran 1 of 1 Specs in 7.961 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Developing Functional Tests

All the functional test code lives in the functests/ directory. For an
example of how a functional test is structured, look at the functests/pvc_creation_test.go
file.
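
For orientation, here is a minimal sketch of the shape such a test takes with ginkgo. The helper name createAndWaitForBoundPVC is hypothetical and stands in for whatever shared helpers the suite provides.

package functests

import (
    "testing"

    . "github.com/onsi/ginkgo"
    . "github.com/onsi/gomega"
)

// TestFunctional is the entry point "go test" (or the ginkgo CLI) uses to run the suite.
func TestFunctional(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Tests Suite")
}

var _ = Describe("PVC creation", func() {
    It("creates a PVC and waits for it to become Bound", func() {
        // createAndWaitForBoundPVC is a hypothetical helper; the real test keeps
        // this kind of logic in the shared functests helper files.
        Expect(createAndWaitForBoundPVC()).To(Succeed())
    })
})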

The tests themselves should invoke simple-to-understand steps. Put any complex
logic in the functests/utils.go file so test flows are easy to follow.

Running a single test
When developing a test, it's common to just want to run a single functional test
rather than the whole suite. This can be done using ginkgo's "focus" feature.

All you have to do is put an F in front of the test's declaration to force
only that test to run. So, if you have a spec like It("some test")
defined, you just need to change it to FIt("some test") to force the test
suite to execute only that single test.
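
For example (the spec name here is illustrative):

// Runs as part of the whole suite:
It("creates a PVC", func() { /* ... */ })

// With the focus prefix, ginkgo runs only this spec:
FIt("creates a PVC", func() { /* ... */ })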

Make sure to remove the focus from your test before creating the pull request.
Otherwise the test suite will fail in CI.

openshift-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Sep 12, 2019
@davidvossel (Contributor, Author)

/cc @jarrpa @obnoxxx

@davidvossel (Contributor, Author)

I see the golint/gotest failures. I'll address them. I want to get e2e-aws results first before I push another update, though.

@obnoxxx (Contributor) commented Sep 12, 2019

Thanks @davidvossel !

@obnoxxx (Contributor) left a comment

It might be nice to have the commit message of the "Introduction of functional test suite" commit mention that you are adding a single test "PVC Creation".

Other than that, this looks nice!
Of course, I'm also requesting that the linting errors be fixed.

@obnoxxx (Contributor) commented Sep 12, 2019

That was not a thorough review yet, just a first superficial look. I'll let @jarrpa and others do a more thorough review too.

@jarrpa (Member) left a comment

An overall comment: Anything called utils getting too large is often a bad thing. Could you chop that up into more descriptive files like create or config?

return t.k8sClient
}

func (t *TestClient) CreateStorageClusterInitialization() (*ocsv1.StorageClusterInitialization, error) {
(Member)

This function doesn't seem to be used anywhere... and really, the StorageCluster reconcile creates the StorageClusterInit already. Is this necessary?

(Contributor, Author)

Good catch. I noticed that as well when I went to add documentation to fix golint; I'll remove it.

@@ -0,0 +1 @@
package tests
(Member)

Can we rename this to something like func_tests? Or is this a ginkgo thing?

(Contributor, Author)

fixed

(Contributor, Author)

I named it functests instead of func_tests, btw.

@davidvossel (Contributor, Author)

PR updated.

It might be nice to have the commit message of the "Introduction of functional test suite" commit mention that you are adding a single test "PVC Creation".

I added something like that to the commit message description now.

An overall comment: Anything called utils getting too large is often a bad thing. Could you chop that up into more descriptive files like create or config?

yep, makes sense. I broke it up into a few files.

I see the golint/gotest failures. I'll address them. I want to get e2e-aws results first before I push another update, though.

So, for the gotest prow failure, I added a workaround into the PR, but I'll ultimately need to solve this by altering the CI config. The issue is that ginkgo tests can be run either with go test -v ./... or with the ginkgo command. So, when we attempt to run our unit tests, go test attempts to execute the functional tests as well.

For now, I have the functional tests skip themselves if they detect that they weren't invoked by make functest, but the better solution is to have the prow gotest entry not test the functests/ directory. I'll follow up with something related to that soon.
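
A minimal sketch of that kind of guard, assuming a hypothetical OCS_FUNCTEST environment variable exported by the make target (the PR's actual mechanism may differ):

package functests

import (
    "os"
    "testing"

    . "github.com/onsi/ginkgo"
    . "github.com/onsi/gomega"
)

// TestFunctional skips the whole suite unless the make target opted in, so a plain
// "go test ./..." run by the prow gotest job does not execute the functional tests.
func TestFunctional(t *testing.T) {
    // OCS_FUNCTEST is a hypothetical variable name used purely for illustration.
    if os.Getenv("OCS_FUNCTEST") == "" {
        t.Skip("functional tests are only run via 'make functest'")
    }
    RegisterFailHandler(Fail)
    RunSpecs(t, "Tests Suite")
}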

Also, for the e2e test itself, I saw the PVC creation timeout. I'm working through what's going on there now. Since the test passes locally, I suspect something about the AWS environment in CI is preventing us from provisioning PVCs. Hopefully it's something we can work around. I added debug dumping to the functest script to try to gather more information.

@davidvossel (Contributor, Author)

make functest passes now :)

@obnoxxx (Contributor) left a comment

Nice!
I think @jarrpa's request to split up utils.go still remains. Otherwise looks good to me.

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Sep 13, 2019
@davidvossel (Contributor, Author)

I think @jarrpa's request to split up utils.go still remains. Otherwise looks good to me.

Yeah, I broke up the file a bit, but looking back at it I think I can do better. I'll make another update.

davidvossel force-pushed the functional-test-suite branch 2 times, most recently from e54bba0 to 48600d3 on September 13, 2019 13:16
@davidvossel (Contributor, Author)

Update pushed. Here's how functests/utils.go has been broken up.

functests/common.go - common functions used by functional tests. For now this is just two helpers involved with PVC generation and waiting for the PVC Bound phase (a sketch of such a helper appears after this list).

functests/config.go - constants and restclient initialization

functests/setup_teardown.go - any logic surrounding setting up and tearing down the test suite that occurs before all tests are executed and after all tests are complete.

If more obvious patterns emerge for grouping helper functions after more tests are written, we might revisit common.go, but for now I don't have a better suggestion.
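
As a rough illustration of the kind of helper that lives in functests/common.go, here is a sketch of waiting for a PVC to reach the Bound phase; the function name, client accessor, and timeouts are illustrative rather than the PR's exact code.

package functests

import (
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// WaitForPVCBound polls until the named PVC reaches the Bound phase or the timeout expires.
// Uses the client-go Get signature of that era; newer client-go versions also take a context.
func WaitForPVCBound(client kubernetes.Interface, namespace, name string) error {
    return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
        pvc, err := client.CoreV1().PersistentVolumeClaims(namespace).Get(name, metav1.GetOptions{})
        if err != nil {
            return false, err
        }
        return pvc.Status.Phase == corev1.ClaimBound, nil
    })
}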

@@ -1,5 +1,7 @@
#!/bin/bash

set -e
(Member)

Do we still want this in the final commit?

(Contributor, Author)

Well, yes, we do want it, but I probably didn't put it in a commit that makes sense for the change.

The issue I hit was that 'make ocs-registry' would "pass" even if an error occurred during the docker build process. Without 'set -e', the exit code of make ocs-registry doesn't indicate anything useful.

I'll break it into a separate commit at least.

(Contributor, Author)

Fixed. I broke this into a separate commit.

@davidvossel (Contributor, Author)

I think the functests uncovered a possible issue. It looks like the StorageCluster is reporting that it is available before the OSDs have started. This causes the PVC test to fail occasionally. Whether the test fails or not appears to be related to timing.

Below is a snapshot of what the e2e cluster looked like at the time of a failure.

csi-cephfsplugin-l9jxx                          3/3     Running    0          3m28s
csi-cephfsplugin-p7klm                          3/3     Running    0          3m28s
csi-cephfsplugin-provisioner-678db5bf66-6nxnb   4/4     Running    0          3m28s
csi-cephfsplugin-provisioner-678db5bf66-kpjpf   4/4     Running    0          3m28s
csi-cephfsplugin-qhzps                          3/3     Running    0          3m28s
csi-rbdplugin-52czt                             3/3     Running    0          3m28s
csi-rbdplugin-provisioner-7bf4556986-f4dmk      5/5     Running    0          3m28s
csi-rbdplugin-provisioner-7bf4556986-wdxpc      5/5     Running    0          3m28s
csi-rbdplugin-schp5                             3/3     Running    0          3m28s
csi-rbdplugin-sxpkx                             3/3     Running    0          3m28s
local-storage-operator-bcfd5765f-88tf6          1/1     Running    0          4m37s
noobaa-core-0                                   0/2     Pending    0          3m27s
noobaa-operator-5c5cb5bb67-rn5r8                1/1     Running    0          4m24s
ocs-operator-677fdf9547-xd9gt                   0/1     Running    0          4m24s
rook-ceph-mon-a-7987747df5-t245v                1/1     Running    0          94s
rook-ceph-mon-b-5657556cfb-7zssz                1/1     Running    0          70s
rook-ceph-mon-c-68c9874db9-7shm4                0/1     Init:0/2   0          27s
rook-ceph-operator-9b567fdc-c8595               1/1     Running    0          4m24s

To make the test pass consistently, I can wait in the setup function for the OSDs to come online. There might need to be some more discussion about that though.

@davidvossel (Contributor, Author)

I think the functests uncovered a possible issue. It looks like the StorageCluster is reporting that it is available before the OSDs have started.

I think I traced the issue back to how the functional tests wait for the StorageCluster startup condition. If not all expected conditions existed in the list, it looked like it was possible for the "BeforeTestSuite" function to exit earlier than it should have.

I added logic to address this; we'll see if it improves the situation. I was able to verify that the StorageCluster's conditions were set correctly to indicate that it was still progressing while the OSDs were not online, so it seems likely that what I saw was test-related.
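
In essence, the added logic has to treat a missing condition as "not ready" rather than silently passing over it. A simplified sketch of that idea (the condition struct and the condition type names here are illustrative, not the exact StorageCluster API):

package functests

import corev1 "k8s.io/api/core/v1"

// condition is a simplified stand-in for a StorageCluster status condition.
type condition struct {
    Type   string
    Status corev1.ConditionStatus
}

// expectedConditions lists every condition the suite requires and the status each must report.
var expectedConditions = map[string]corev1.ConditionStatus{
    "Available":   corev1.ConditionTrue,
    "Upgradeable": corev1.ConditionTrue,
    "Progressing": corev1.ConditionFalse,
    "Degraded":    corev1.ConditionFalse,
}

// storageClusterReady returns true only when every expected condition is present and
// reports the expected status; an absent condition counts as "not ready", which is the
// check the earlier version of the setup code was effectively missing.
func storageClusterReady(conditions []condition) bool {
    for wantType, wantStatus := range expectedConditions {
        satisfied := false
        for _, c := range conditions {
            if c.Type == wantType {
                satisfied = c.Status == wantStatus
                break
            }
        }
        if !satisfied {
            return false
        }
    }
    return true
}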

@davidvossel (Contributor, Author)

The latest CSV appears to be broken. The functional test failure looks real.

2019-09-13 17:01:13.574026 E | op-osd: failed to create provision job for node example-deviceset-0-jhkfl. Job.batch "rook-ceph-osd-prepare-example-deviceset-0-jhkfl" is invalid: [spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Required value: can not be empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must be non-empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]
2019-09-13 17:01:13.811323 E | op-osd: failed to create provision job for node example-deviceset-1-bqqv6. Job.batch "rook-ceph-osd-prepare-example-deviceset-1-bqqv6" is invalid: [spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Required value: can not be empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must be non-empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]
2019-09-13 17:01:15.818636 E | op-osd: failed to create provision job for node example-deviceset-2-6kv9t. Job.batch "rook-ceph-osd-prepare-example-deviceset-2-6kv9t" is invalid: [spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Required value: can not be empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must be non-empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]

and

2019-09-13 17:01:59.350313 E | op-cluster: failed to create cluster in namespace openshift-storage. failed to start the osds. 3 failures encountered while running osds in namespace openshift-storage: failed to create provision job for node example-deviceset-0-jhkfl. Job.batch "rook-ceph-osd-prepare-example-deviceset-0-jhkfl" is invalid: [spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Required value: can not be empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must be non-empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]
failed to create provision job for node example-deviceset-1-bqqv6. Job.batch "rook-ceph-osd-prepare-examp

rook-ceph can't start the OSD pods. It seems likely this is a regression in the new rook-ceph image we introduced a few hours ago in PR #121.

@jarrpa (Member) commented Sep 13, 2019

How did the test pass on that PR then?

@davidvossel (Contributor, Author)

How did the test pass on that PR then?

It passed yesterday, before #121 was merged.

Note that when the e2e test runs, it actually merges with master. So even if I don't rebase this PR, it's this PR plus master that's reflected in the results.

@davidvossel (Contributor, Author)

How did the test pass on that PR then?

I might have misunderstood the question. If you meant how the e2e test passed on PR #121, it's because no functional tests are executed until this PR is merged. The e2e logic as it stands, without the functional test suite, just verifies that the ocs-operator installs and that the operator is online, not that a storage cluster comes online. That's the part this PR adds, along with verifying PVC creation.

@davidvossel (Contributor, Author)

/test ocs-operator-e2e-aws

@davidvossel (Contributor, Author)

The tests fail waiting for OSDs to come up. From the few test logs I saw, the failure happened just as OSDs were being created. So maybe bump up the timeout for OSDs.

I'll up the timeout for now so we can get this PR in.

Independently of this PR, I'm going to circle back and try to understand why this is occurring. It appears the StorageCluster CR may be reporting a stable state (available, upgradeable, not progressing, and not degraded) at some point before the OSDs actually come online, yet the CR is clearly reflecting the correct status of the StorageCluster by the time the timeout occurs.

openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) on Sep 16, 2019
@davidvossel (Contributor, Author)

I bumped the setup timeout to 30 minutes; we'll see if this improves the situation.

@davidvossel (Contributor, Author)

I changed the logic to wait an entire hour for the OSDs to come online, and the e2e test is still timing out.

I'm trying to reproduce this locally using a cluster similar to what the e2e tests use. One difference I noticed is that they're only using two AZs, but after talking with @jarrpa it appears that, theoretically, this should be fine.

@davidvossel (Contributor, Author)

I was able to reproduce what's happening in Openshift CI by limiting my OCP deployment to only 2 availability zones in AWS.

Only 1 or 2 OSDs ever come online. The OSDs that don't come online complain about not being able to attach a PVC to the OSD pod. Below is an example.

Events:
  Type     Reason              Age                 From                                   Message
  ----     ------              ----                ----                                   -------
  Normal   Scheduled           46m                 default-scheduler                      Successfully assigned openshift-storage/rook-ceph-osd-0-574bb9858b-kfltj to ip-10-0-129-254.ec2.internal
  Warning  FailedAttachVolume  46m                 attachdetach-controller                Multi-Attach error for volume "pvc-230b857f-d8af-11e9-b766-0e15af88a5cc" Volume is already used by pod(s) rook-ceph-osd-prepare-example-deviceset-2-khhxc-22t6n
  Warning  FailedMount         88s (x20 over 44m)  kubelet, ip-10-0-129-254.ec2.internal  Unable to mount volumes for pod "rook-ceph-osd-0-574bb9858b-kfltj_openshift-storage(43ee9141-d8af-11e9-bed1-0a89cbfd94e0)": timeout expired waiting for volumes to attach or mount for pod "openshift-storage"/"rook-ceph-osd-0-574bb9858b-kfltj". list of unmounted volumes=[example-deviceset-2-khhxc]. list of unattached volumes=[rook-data rook-config-override rook-ceph-log devices example-deviceset-2-khhxc example-deviceset-2-khhxc-bridge rook-binaries run-udev rook-ceph-osd-token-fcbk9]

In order to stabilize the functional tests so we can get at least something merged that verifies basic functionality, I'm experimenting with reducing the OSD replica count to 1 just for the functional test environment.

davidvossel force-pushed the functional-test-suite branch 5 times, most recently from 1c30366 to 9f4df25 on September 17, 2019 16:32
@davidvossel (Contributor, Author)

@jarrpa @obnoxxx Functional tests pass again. I had to deploy a workaround for issue #131.

The OSD issue occurs when we have multiple worker nodes within the same AZ. OpenShift CI gives us only 2 AZs and 3 worker nodes, which means we'll always have at least one extra node in an AZ. For the time being, in order to work around #131, I only label a single node from each AZ and limit the functional test to replica 2.

Using replica 2 for the e2e test suite should only be considered a stopgap. I'm tracking the root cause in #131, and I've also opened a discussion with OpenShift CI about increasing the AZ count to 3: openshift/release#5047.

Here's my suggestion for moving forward.

  1. Merge this PR so we have an added level of confidence when developing ocs-operator changes.
  2. Consider removing the e2e test as a mandatory test. As maintainers, this lets you use the test to gain confidence about risky changes while not being held back by a long-running (and complex) test for simple changes. Basically, this is an escape hatch to keep the e2e test from blocking progress.
  3. Change the test to use replica 3 once either #131 (OSDs fail to come online with multiple worker nodes in the same AZ) or openshift/release#5047 (expand the AWS e2e cluster across 3 availability zones) is finalized.

@obnoxxx (Contributor) commented Sep 18, 2019

I'm fine with moving forward like this.

@jarrpa (Member) left a comment

Overall looks good. Can you squash the non-vendor commits?

@obnoxxx (Contributor) commented Sep 18, 2019

@jarrpa @davidvossel :

Overall looks good. Can you squash the non-vendor commits?

Maybe leave the commit that is taking replica down to 2 separate so that we can revert it later?

Signed-off-by: David Vossel <[email protected]>
Includes initial example test in functests/pvc_create_test.go

Signed-off-by: David Vossel <[email protected]>
@davidvossel (Contributor, Author)

Maybe leave the commit that is taking replica down to 2 separate so that we can revert it later?

@obnoxxx @jarrpa I squashed down the commits into what I thought looked like good logical chunks. I think there were a few changes that made sense as standalone commits.

@jarrpa (Member) left a comment

/lgtm

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Sep 18, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel, jarrpa, obnoxxx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jarrpa (Member) commented Sep 18, 2019

/hold cancel

openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 18, 2019
openshift-merge-robot merged commit d86f2b9 into red-hat-storage:master on Sep 18, 2019