
OCS Functional Test Suite #114


Merged

Conversation

davidvossel (Contributor)

Functional Tests

Our functional test suite uses the ginkgo testing framework.

Prerequisites for running Functional Tests

  • ocs must already be installed
  • KUBECONFIG env var must be set

Running functional tests

make functest

Below is some sample output of what to expect.

Building functional tests
hack/build-functest.sh
GINKO binary found at /home/dvossel/go/bin/ginkgo
Compiling functests...
    compiled functests.test
Running functional test suite
hack/functest.sh
Running Functional Test Suite
Running Suite: Tests Suite
==========================
Random Seed: 1568299067
Will run 1 of 1 specs

•
Ran 1 of 1 Specs in 7.961 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
PASS

Developing Functional Tests

All the functional test code lives in the functests/ directory. For an
example of how a functional test is structured, look at the functests/pvc_creation_test.go
file.
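
For orientation, here is a minimal sketch of the shape such a test takes with ginkgo. The helper name createAndWaitForBoundPVC is hypothetical and stands in for whatever shared helpers the suite provides.

package functests

import (
    "testing"

    . "github.com/onsi/ginkgo"
    . "github.com/onsi/gomega"
)

// TestFunctional is the entry point "go test" (or the ginkgo CLI) uses to run the suite.
func TestFunctional(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Tests Suite")
}

var _ = Describe("PVC creation", func() {
    It("creates a PVC and waits for it to become Bound", func() {
        // createAndWaitForBoundPVC is a hypothetical helper; the real test keeps
        // this kind of logic in the shared functests helper files.
        Expect(createAndWaitForBoundPVC()).To(Succeed())
    })
})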

The tests themselves should invoke simple-to-understand steps. Put any complex
logic in the functests/utils.go file so test flows are easy to follow.

Running a single test
When developing a test, it's common to just want to run a single functional test
rather than the whole suite. This can be done using ginkgo's "focus" feature.

All you have to do is put an F in front of the test's declaration to force
only that test to run. So, if you have a spec like It("some test")
defined, you just need to change it to FIt("some test") to force the test
suite to execute only that single test.
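
For example (the spec name here is illustrative):

// Runs as part of the whole suite:
It("creates a PVC", func() { /* ... */ })

// With the focus prefix, ginkgo runs only this spec:
FIt("creates a PVC", func() { /* ... */ })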

Make sure to remove the focus from your test before creating the pull request.
Otherwise the test suite will fail in CI.

openshift-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Sep 12, 2019
@davidvossel (Contributor, Author)

/cc @jarrpa @obnoxxx

@davidvossel (Contributor, Author)

I see the golint/gotest failures. I'll address them. I want to get e2e-aws results first before I push another update, though.

@obnoxxx (Contributor) commented Sep 12, 2019

Thanks @davidvossel !

@obnoxxx (Contributor) left a comment

It might be nice to have the commit message of the "Introduction of functional test suite" commit mention that you are adding a single test "PVC Creation".

Other than that, this looks nice!
Of course, I'm also requesting that the linting errors be fixed.

@obnoxxx (Contributor) commented Sep 12, 2019

That was not a thorough review yet, just a first superficial look. I'll let @jarrpa and others do a more thorough review too.

@jarrpa (Member) left a comment

An overall comment: Anything called utils getting too large is often a bad thing. Could you chop that up into more descriptive files like create or config?

return t.k8sClient
}

func (t *TestClient) CreateStorageClusterInitialization() (*ocsv1.StorageClusterInitialization, error) {
(Member)

This function doesn't seem to be used anywhere... and really, the StorageCluster reconcile creates the StorageClusterInit already. Is this necessary?

(Contributor, Author)

Good catch. I noticed that as well when I went to add documentation to fix golint; I'll remove it.

@@ -0,0 +1 @@
package tests
(Member)

Can we rename this to something like func_tests? Or is this a ginkgo thing?

(Contributor, Author)

fixed

(Contributor, Author)

I named it functests instead of func_tests, btw.

@davidvossel (Contributor, Author)

PR updated.

It might be nice to have the commit message of the "Introduction of functional test suite" commit mention that you are adding a single test "PVC Creation".

I added something like that to the commit message description now.

An overall comment: Anything called utils getting too large is often a bad thing. Could you chop that up into more descriptive files like create or config?

yep, makes sense. I broke it up into a few files.

I see the golint/gotest failures. I'll address them. I want to get e2e-aws results first before I push another update, though.

So, for the gotest prow failure, I added a workaround into the PR, but I'll ultimately need to solve this by altering the CI config. The issue is that ginkgo tests can be run either with go test -v ./... or with the ginkgo command. So, when we attempt to run our unit tests, go test attempts to execute the functional tests as well.

For now, I have the functional tests skip themselves if they detect that they weren't invoked by make functest, but the better solution is to have the prow gotest entry not test the functests/ directory. I'll follow up with something related to that soon.
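
A minimal sketch of that kind of guard, assuming a hypothetical OCS_FUNCTEST environment variable exported by the make target (the PR's actual mechanism may differ):

package functests

import (
    "os"
    "testing"

    . "github.com/onsi/ginkgo"
    . "github.com/onsi/gomega"
)

// TestFunctional skips the whole suite unless the make target opted in, so a plain
// "go test ./..." run by the prow gotest job does not execute the functional tests.
func TestFunctional(t *testing.T) {
    // OCS_FUNCTEST is a hypothetical variable name used purely for illustration.
    if os.Getenv("OCS_FUNCTEST") == "" {
        t.Skip("functional tests are only run via 'make functest'")
    }
    RegisterFailHandler(Fail)
    RunSpecs(t, "Tests Suite")
}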

Also, for the e2e test itself, I saw the PVC creation timeout. I'm working through what's going on there now. Since the test passes locally, I suspect something about the AWS environment in CI is preventing us from provisioning PVCs. Hopefully it's something we can work around. I added debug dumping to the functest script to try to gather more information.

@davidvossel (Contributor, Author)

make functest passes now :)

@obnoxxx (Contributor) left a comment

Nice!
I think @jarrpa's request to split up utils.go still remains. Otherwise looks good to me.

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Sep 13, 2019
@davidvossel (Contributor, Author)

I think @jarrpa's request to split up utils.go still remains. Otherwise looks good to me.

Yeah, I broke up the file a bit, but looking back at it I think I can do better. I'll make another update.

davidvossel force-pushed the functional-test-suite branch 2 times, most recently from e54bba0 to 48600d3 on September 13, 2019 13:16
@davidvossel (Contributor, Author)

Update pushed. Here's how functests/utils.go has been broken up.

functests/common.go - common functions used by functional tests. For now this is just two helpers involved with PVC generation and waiting for the PVC Bound phase (a sketch of such a helper appears after this list).

functests/config.go - constants and restclient initialization

functests/setup_teardown.go - any logic surrounding setting up and tearing down the test suite that occurs before all tests are executed and after all tests are complete.

If more obvious patterns emerge for grouping helper functions after more tests are written, we might revisit common.go, but for now I don't have a better suggestion.
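
As a rough illustration of the kind of helper that lives in functests/common.go, here is a sketch of waiting for a PVC to reach the Bound phase; the function name, client accessor, and timeouts are illustrative rather than the PR's exact code.

package functests

import (
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// WaitForPVCBound polls until the named PVC reaches the Bound phase or the timeout expires.
// Uses the client-go Get signature of that era; newer client-go versions also take a context.
func WaitForPVCBound(client kubernetes.Interface, namespace, name string) error {
    return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
        pvc, err := client.CoreV1().PersistentVolumeClaims(namespace).Get(name, metav1.GetOptions{})
        if err != nil {
            return false, err
        }
        return pvc.Status.Phase == corev1.ClaimBound, nil
    })
}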

@@ -1,5 +1,7 @@
#!/bin/bash

set -e
(Member)

Do we still want this in the final commit?

(Contributor, Author)

Well, yes, we do want it, but I probably didn't put it in a commit that makes sense for the change.

The issue I hit was that 'make ocs-registry' would "pass" even if an error occurred during the docker build process. Without 'set -e', the exit code of make ocs-registry doesn't indicate anything useful.

I'll break it into a separate commit at least.

(Contributor, Author)

Fixed. I broke this into a separate commit.

@davidvossel (Contributor, Author)

I think the functests uncovered a possible issue. It looks like the StorageCluster is reporting that it is available before the OSDs have started. This causes the PVC test to fail occasionally. Whether the test fails or not appears to be related to timing.

Below is a snapshot of what the e2e cluster looked like at the time of a failure.

csi-cephfsplugin-l9jxx                          3/3     Running    0          3m28s
csi-cephfsplugin-p7klm                          3/3     Running    0          3m28s
csi-cephfsplugin-provisioner-678db5bf66-6nxnb   4/4     Running    0          3m28s
csi-cephfsplugin-provisioner-678db5bf66-kpjpf   4/4     Running    0          3m28s
csi-cephfsplugin-qhzps                          3/3     Running    0          3m28s
csi-rbdplugin-52czt                             3/3     Running    0          3m28s
csi-rbdplugin-provisioner-7bf4556986-f4dmk      5/5     Running    0          3m28s
csi-rbdplugin-provisioner-7bf4556986-wdxpc      5/5     Running    0          3m28s
csi-rbdplugin-schp5                             3/3     Running    0          3m28s
csi-rbdplugin-sxpkx                             3/3     Running    0          3m28s
local-storage-operator-bcfd5765f-88tf6          1/1     Running    0          4m37s
noobaa-core-0                                   0/2     Pending    0          3m27s
noobaa-operator-5c5cb5bb67-rn5r8                1/1     Running    0          4m24s
ocs-operator-677fdf9547-xd9gt                   0/1     Running    0          4m24s
rook-ceph-mon-a-7987747df5-t245v                1/1     Running    0          94s
rook-ceph-mon-b-5657556cfb-7zssz                1/1     Running    0          70s
rook-ceph-mon-c-68c9874db9-7shm4                0/1     Init:0/2   0          27s
rook-ceph-operator-9b567fdc-c8595               1/1     Running    0          4m24s

To make the test pass consistently, I can wait in the setup function for the OSDs to come online. There might need to be some more discussion about that though.

@davidvossel (Contributor, Author)

I think the functests uncovered a possible issue. It looks like the StorageCluster is reporting that it is available before the OSDs have started.

I think I traced the issue back to how the functional tests wait for the StorageCluster startup condition. If not all expected conditions existed in the list, it looked like it was possible for the "BeforeTestSuite" function to exit earlier than it should have.

I added logic to address this; we'll see if it improves the situation. I was able to verify that the StorageCluster's conditions were set correctly to indicate that it was still progressing while the OSDs were not online, so it seems likely that what I saw was test-related.
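
In essence, the added logic has to treat a missing condition as "not ready" rather than silently passing over it. A simplified sketch of that idea (the condition struct and the condition type names here are illustrative, not the exact StorageCluster API):

package functests

import corev1 "k8s.io/api/core/v1"

// condition is a simplified stand-in for a StorageCluster status condition.
type condition struct {
    Type   string
    Status corev1.ConditionStatus
}

// expectedConditions lists every condition the suite requires and the status each must report.
var expectedConditions = map[string]corev1.ConditionStatus{
    "Available":   corev1.ConditionTrue,
    "Upgradeable": corev1.ConditionTrue,
    "Progressing": corev1.ConditionFalse,
    "Degraded":    corev1.ConditionFalse,
}

// storageClusterReady returns true only when every expected condition is present and
// reports the expected status; an absent condition counts as "not ready", which is the
// check the earlier version of the setup code was effectively missing.
func storageClusterReady(conditions []condition) bool {
    for wantType, wantStatus := range expectedConditions {
        satisfied := false
        for _, c := range conditions {
            if c.Type == wantType {
                satisfied = c.Status == wantStatus
                break
            }
        }
        if !satisfied {
            return false
        }
    }
    return true
}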

@davidvossel (Contributor, Author)

The latest CSV appears to be broken. The functional test failure looks real.

2019-09-13 17:01:13.574026 E | op-osd: failed to create provision job for node example-deviceset-0-jhkfl. Job.batch "rook-ceph-osd-prepare-example-deviceset-0-jhkfl" is invalid: [spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Required value: can not be empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must be non-empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]
2019-09-13 17:01:13.811323 E | op-osd: failed to create provision job for node example-deviceset-1-bqqv6. Job.batch "rook-ceph-osd-prepare-example-deviceset-1-bqqv6" is invalid: [spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Required value: can not be empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must be non-empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]
2019-09-13 17:01:15.818636 E | op-osd: failed to create provision job for node example-deviceset-2-6kv9t. Job.batch "rook-ceph-osd-prepare-example-deviceset-2-6kv9t" is invalid: [spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Required value: can not be empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must be non-empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]

and

2019-09-13 17:01:59.350313 E | op-cluster: failed to create cluster in namespace openshift-storage. failed to start the osds. 3 failures encountered while running osds in namespace openshift-storage: failed to create provision job for node example-deviceset-0-jhkfl. Job.batch "rook-ceph-osd-prepare-example-deviceset-0-jhkfl" is invalid: [spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Required value: can not be empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must be non-empty, spec.template.spec.affinity.podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution[0].podAffinityTerm.topologyKey: Invalid value: "": name part must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]')]
failed to create provision job for node example-deviceset-1-bqqv6. Job.batch "rook-ceph-osd-prepare-examp

rook-ceph can't start the OSD pods. It seems likely this is a regression in the new rook-ceph image we introduced a few hours ago in PR #121.

@jarrpa (Member) commented Sep 13, 2019

How did the test pass on that PR then?

@davidvossel (Contributor, Author)

How did the test pass on that PR then?

It passed yesterday, before #121 was merged.

Note that when the e2e test runs, it actually merges with master. So even if I don't rebase this PR, it's this PR plus master that's reflected in the results.

@davidvossel (Contributor, Author)

How did the test pass on that PR then?

I might have misunderstood the question. If you meant how the e2e test passed on PR #121, it's because no functional tests are executed until this PR is merged. The e2e logic as it stands, without the functional test suite, just verifies that the ocs-operator installs and that the operator is online, not that a storage cluster comes online. That's the part this PR adds, along with verifying PVC creation.

@davidvossel (Contributor, Author)

/test ocs-operator-e2e-aws

@davidvossel (Contributor, Author)

The tests fail waiting for OSDs to come up. From the few test logs I saw, the failure happened just as OSDs were being created. So maybe bump up the timeout for OSDs.

I'll up the timeout for now so we can get this PR in.

Independently of this PR, I'm going to circle back and try to understand why this is occurring. It appears the StorageCluster CR may be reporting a stable state (available, upgradeable, not progressing, and not degraded) at some point before the OSDs actually come online, yet the CR is clearly reflecting the correct status of the StorageCluster by the time the timeout occurs.

openshift-ci-robot removed the lgtm label (indicates that a PR is ready to be merged) on Sep 16, 2019
@davidvossel (Contributor, Author)

I bumped the setup timeout to 30 minutes; we'll see if this improves the situation.

@davidvossel (Contributor, Author)

I changed the logic to wait an entire hour for the OSDs to come online, and the e2e test is still timing out.

I'm trying to reproduce this locally using a cluster similar to what the e2e tests use. One difference I noticed is that they're only using two AZs, but after talking with @jarrpa it appears that, theoretically, this should be fine.

@davidvossel (Contributor, Author)

I was able to reproduce what's happening in Openshift CI by limiting my OCP deployment to only 2 availability zones in AWS.

Only 1 or 2 OSDs ever come online. The OSDs that don't come online complain about not being able to attach a PVC to the OSD pod. Below is an example.

Events:
  Type     Reason              Age                 From                                   Message
  ----     ------              ----                ----                                   -------
  Normal   Scheduled           46m                 default-scheduler                      Successfully assigned openshift-storage/rook-ceph-osd-0-574bb9858b-kfltj to ip-10-0-129-254.ec2.internal
  Warning  FailedAttachVolume  46m                 attachdetach-controller                Multi-Attach error for volume "pvc-230b857f-d8af-11e9-b766-0e15af88a5cc" Volume is already used by pod(s) rook-ceph-osd-prepare-example-deviceset-2-khhxc-22t6n
  Warning  FailedMount         88s (x20 over 44m)  kubelet, ip-10-0-129-254.ec2.internal  Unable to mount volumes for pod "rook-ceph-osd-0-574bb9858b-kfltj_openshift-storage(43ee9141-d8af-11e9-bed1-0a89cbfd94e0)": timeout expired waiting for volumes to attach or mount for pod "openshift-storage"/"rook-ceph-osd-0-574bb9858b-kfltj". list of unmounted volumes=[example-deviceset-2-khhxc]. list of unattached volumes=[rook-data rook-config-override rook-ceph-log devices example-deviceset-2-khhxc example-deviceset-2-khhxc-bridge rook-binaries run-udev rook-ceph-osd-token-fcbk9]

In order to stabilize the functional tests so we can get at least something merged that verifies basic functionality, I'm experimenting with reducing the OSD replica count to 1 just for the functional test environment.

davidvossel force-pushed the functional-test-suite branch 5 times, most recently from 1c30366 to 9f4df25 on September 17, 2019 16:32
@davidvossel (Contributor, Author)

@jarrpa @obnoxxx Functional tests pass again. I had to deploy a workaround for issue #131.

The OSD issue occurs when we have multiple worker nodes within the same AZ. OpenShift CI gives us only 2 AZs and 3 worker nodes, which means we'll always have at least one extra node in an AZ. For the time being, in order to work around #131, I only label a single node from each AZ and limit the functional test to replica 2.

Using replica 2 for the e2e test suite should only be considered a stopgap. I'm tracking the root cause in #131, and I've also opened a discussion with OpenShift CI about increasing the AZ count to 3: openshift/release#5047.

Here's my suggestion for moving forward.

  1. Merge this PR so we have an added level of confidence when developing ocs-operator changes.
  2. Consider removing the e2e test as a mandatory test. As maintainers, this lets you use the test to gain confidence about risky changes while not being held back by a long-running (and complex) test for simple changes. Basically, this is an escape hatch to keep the e2e test from blocking progress.
  3. Change the test to use replica 3 once either #131 (OSDs fail to come online with multiple worker nodes in the same AZ) or openshift/release#5047 (expand the AWS e2e cluster across 3 availability zones) is finalized.

@obnoxxx (Contributor) commented Sep 18, 2019

I'm fine with moving forward like this.

@jarrpa (Member) left a comment

Overall looks good. Can you squash the non-vendor commits?

@obnoxxx (Contributor) commented Sep 18, 2019

@jarrpa @davidvossel :

Overall looks good. Can you squash the non-vendor commits?

Maybe leave the commit that is taking replica down to 2 separate so that we can revert it later?

Signed-off-by: David Vossel <[email protected]>
Includes initial example test in functests/pvc_create_test.go

Signed-off-by: David Vossel <[email protected]>
@davidvossel (Contributor, Author)

Maybe leave the commit that is taking replica down to 2 separate so that we can revert it later?

@obnoxxx @jarrpa I squashed down the commits into what I thought looked like good logical chunks. I think there were a few changes that made sense as standalone commits.

@jarrpa (Member) left a comment

/lgtm

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Sep 18, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel, jarrpa, obnoxxx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jarrpa (Member) commented Sep 18, 2019

/hold cancel

openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 18, 2019
openshift-merge-robot merged commit d86f2b9 into red-hat-storage:master on Sep 18, 2019