Intermittent MountVolume.MountDevice errors on Pod creation #32
I've experienced this problem again and here are my observations:
So there is some sort of deterioration over time somewhere in the system.
Small update to my observations:
Pinging @fatih, I guess?
@zarbis thanks for all the reports. I know how frustrating this can be, so thanks for providing all this information. It seems like the plugin might fail to attach the volumes to the droplets. There was an issue that was fixed in #34 and released on May 29: https://github.com/digitalocean/csi-digitalocean/blob/master/CHANGELOG.md#v011-alpha---may-29th-2018 So I definitely recommend upgrading to that release. I'm on this and will try to fix things once I can reproduce it myself as well. As always, please keep posting whatever you find.
@lohazo this is a completely different error. Can you please open a new issue with your Kubernetes version and the steps you took before seeing this? Thanks.
@lohazo also, please provide the following information:
I can confirm the original issue. Funny enough, I can reliably reproduce it by mounting multiple DO volumes in a single pod simultaneously. Mounting other volumes such as CephFS/RBD/HostPath alongside does not seem to cause the issue. I first noticed this while migrating/restoring data from another cluster. Event log:
Experiencing the same issue: https://github.com/confluentinc/cp-helm-charts (just as an example) does not appear to work with the CSI plugin, largely because it mounts two volumes for the ZooKeeper instance. The plugin fails to attach the volumes to the node, and manually attaching does not correct the issue. Update: in the above-mentioned chart I used the old driver (https://github.com/kubernetes-incubator/external-storage/tree/master/digitalocean) in conjunction with the current CSI driver (one volume on CSI, the other on the old driver) on the same pod to get around this issue (see the sketch below). The pod started, at least, and appears to be working.
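Roughly, the workaround amounts to binding the two claims to different provisioners. A minimal sketch; the storage class names here are assumptions and depend on how each provisioner was installed in your cluster:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data-csi
spec:
  storageClassName: do-block-storage          # class served by the CSI driver (name may differ)
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: data-legacy
spec:
  storageClassName: do-block-storage-legacy   # class served by the old external-storage provisioner (name may differ)
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi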
Thanks for the reports. I'm not sure if you're all seeing the same issue. As I understand it, the issue happens when two volumes are mounted to the same pod. I'm now going to try to reproduce this and will add more information. Meanwhile, please always try to share the example manifests (PVCs, deployments, etc.) so I can use them on my test cluster to reproduce. I'll add a CONTRIBUTING.md file so people are aware of it.
I tried the following setup, where a pod refers to two volumes. They are mounted at two different mount paths: pod.yaml:
pvc-multiple.yaml:
These are the events I see when I do a
Everything works fine in this scenario. I'm going to try out other kinds of setups. Please let me know if the one you're using is similar to this. I would also appreciate it if you could test the setup above and then maybe provide a failing case built from the manifests above so I can reproduce it as well.
Alright, I found a way to make it fail with the following setup: pod.yaml:
pvc.yaml:
These are the events I see:
It seems to happen if the volumes are mounted under subdirectories of the same root, such as the layout sketched below.
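A minimal sketch of that failing shape; the image, pod name, and claim names are illustrative, and csi-pvc1/csi-pvc2 would be ordinary ReadWriteOnce claims against the CSI storage class. The earlier setup with unrelated mount paths such as /data1 and /data2 worked fine; only the shared /data root triggers the error for me:
apiVersion: v1
kind: Pod
metadata:
  name: my-csi-app
spec:
  containers:
  - name: my-app
    image: busybox
    command: ["sleep", "1000000"]
    volumeMounts:
    - name: vol1
      mountPath: /data/vol1    # both volumes live under the common root /data
    - name: vol2
      mountPath: /data/vol2
  volumes:
  - name: vol1
    persistentVolumeClaim:
      claimName: csi-pvc1
  - name: vol2
    persistentVolumeClaim:
      claimName: csi-pvc2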
Thanks for following up on this. My setup was with volumes mounted under a common ancestor:
apiVersion: v1
kind: Pod
metadata:
name: transfer-pod
spec:
nodeSelector:
kubernetes.io/hostname: k8s-xi-worker-1
containers:
- name: transfer-pod
image: debian
command: ["sleep", "360000"]
volumeMounts:
- name: hostfs
mountPath: /hostfs
- name: pv1
mountPath: /pv/pv1
- name: pv2
mountPath: /pv/pv2
- name: pv3
mountPath: /pv/pv3
volumes:
- name: hostfs
hostPath:
path: /
- name: pv1
persistentVolumeClaim:
claimName: pvc1
- name: pv2
persistentVolumeClaim:
claimName: pvc2
- name: pv3
persistentVolumeClaim:
      claimName: pvc3
PVCs look similar to this (I have a different name for the storage class than you because we are using a custom Helm chart instead of deploying manually as described in this repo):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pv1
labels:
...
spec:
storageClassName: digitalocean-blockstorage
accessModes:
- ReadWriteOnce
resources:
requests:
      storage: 30Gi
I found the issue, which is a race condition that could happen very often due to how the DigitalOcean API works. This is fixed with #61. I tested it and it works fine. I pushed a new image to test with.
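A sketch of the kind of command that swaps in the image; the resource/container names and the tag are placeholders, so check what your cluster actually runs first:
kubectl -n kube-system get statefulsets,daemonsets | grep csi    # find the CSI components (names vary by deployment)
kubectl -n kube-system set image daemonset/csi-doplugin csi-doplugin=digitalocean/do-csi-plugin:<test-tag>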
The above command should replace the current image with the new one.
@fatih, I just tested and still see the same error.
@Azuka can you please try this example: #32 (comment)? If not, please provide your manifests in a fully deployable version so I can test it myself. This might or might not be the same error. Thanks!
@fatih I'm using the postgres Helm chart, which mounts its data volume at /var/lib/postgresql/data/pgdata. I'm using it as-is, which uses the default storage provider.
Thanks @Azuka. I've never used Helm. Is the link you provided something I can deploy via kubectl, or is it a template we need to convert first?
@fatih I just exported the templates. Please see below:
---
# Source: postgresql/templates/secrets.yaml
apiVersion: v1
kind: Secret
metadata:
name: postgres-postgresql
labels:
app: postgresql
chart: postgresql-0.15.0
release: postgres
heritage: Tiller
type: Opaque
data:
postgres-password: "QXplNlJkV1hBcw=="
---
# Source: postgresql/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-postgresql
labels:
app: postgresql
chart: postgresql-0.15.0
release: postgres
heritage: Tiller
data:
---
# Source: postgresql/templates/pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: postgres-postgresql
labels:
app: postgresql
chart: postgresql-0.15.0
release: postgres
heritage: Tiller
spec:
accessModes:
- "ReadWriteOnce"
resources:
requests:
storage: "8Gi"
---
# Source: postgresql/templates/svc.yaml
apiVersion: v1
kind: Service
metadata:
name: postgres-postgresql
labels:
app: postgresql
chart: postgresql-0.15.0
release: postgres
heritage: Tiller
spec:
type: ClusterIP
ports:
- name: postgresql
port: 5432
targetPort: postgresql
selector:
app: postgresql
release: postgres
---
# Source: postgresql/templates/deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: postgres-postgresql
labels:
app: postgresql
chart: postgresql-0.15.0
release: postgres
heritage: Tiller
spec:
selector:
matchLabels:
app: postgresql
release: postgres
strategy:
type: Recreate
template:
metadata:
labels:
app: postgresql
release: postgres
spec:
containers:
- name: postgres-postgresql
image: "postgres:9.6.2"
imagePullPolicy: ""
args:
env:
- name: POSTGRES_USER
value: "postgres"
# Required for pg_isready in the health probes.
- name: PGUSER
value: "postgres"
- name: POSTGRES_DB
value: ""
- name: POSTGRES_INITDB_ARGS
value: ""
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-postgresql
key: postgres-password
- name: POD_IP
valueFrom: { fieldRef: { fieldPath: status.podIP } }
ports:
- name: postgresql
containerPort: 5432
livenessProbe:
exec:
command:
- sh
- -c
- exec pg_isready --host $POD_IP
initialDelaySeconds: 60
timeoutSeconds: 5
failureThreshold: 6
readinessProbe:
exec:
command:
- sh
- -c
- exec pg_isready --host $POD_IP
initialDelaySeconds: 5
timeoutSeconds: 3
periodSeconds: 5
resources:
requests:
cpu: 100m
memory: 256Mi
volumeMounts:
- name: data
mountPath: /var/lib/postgresql/data/pgdata
subPath: postgresql-db
volumes:
- name: data
persistentVolumeClaim:
claimName: postgres-postgresql
---
# Source: postgresql/templates/networkpolicy.yaml
Detailed actions if it helps (I'm using a new volume now).
Thanks a lot, @fatih
We should focus on a simple use case for everyone and go from there. It seems like not all of us in this issue are seeing the same problem. The race condition is now fixed; please share templates that are simple to follow and let's start from there, otherwise it's very hard for me to track everyone's private or public manifests. I'll release a new version (v0.1.4) now, and then let's start debugging with that version, as it contains a lot of bug fixes.
I ran the case from #32 (comment) and still got the same error:
@sayhell true, the issue is not the mount path at all. A fix was pushed; can you please try the updated image?
Can anyone here please comment after updating to the new version? There is a race condition that can happen when there are multiple requests against a Droplet. We fixed it with #61.
I cleaned up my StorageClass and recreated it with the new version.
One of the volumes mounted successfully, and the other one failed.
@sayhell can you please provide the logs of the csi-attacher, csi-provisioner and node? Here are the commands you can use to get the logs:
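(The pod and container names below are assumptions based on a default kube-system deployment; list the pods first and adjust accordingly.)
kubectl -n kube-system get pods | grep csi
kubectl -n kube-system logs csi-attacher-doplugin-0 -c csi-attacher > attacher.log
kubectl -n kube-system logs csi-provisioner-doplugin-0 -c csi-provisioner > provisioner.log
kubectl -n kube-system logs <csi-doplugin-node-pod> -c csi-doplugin > node.log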
attacher.log
provisioner.log
node.log
Lately we introduced many fixes, one of which also fixes this issue (see #61). The problem @sayhell had was an internal DO issue, which has been resolved (thanks @sayhell for providing access to your cluster). I'm closing this issue, as it's very hard to track multiple problems in a single GH issue. If any of you still see this even after upgrading to the latest stable CSI version, please open a new issue.
I'm facing the same issue. Restarting the nodes helped me solve it. Thanks @zarbis for the tip 👍 :)
@fatih we are facing this issue at the moment. How do we check the version of the CSI driver in the cluster, to make sure that we're using the version that contains the fix? Alternatively, the issue may have been reintroduced recently, unless what we're seeing is caused by a different problem. Thanks.
Just following up on the comment above: I was able to check the image tags for the running CSI containers, and the images + tags are:
It seems that the image versions are newer, so could this issue have been reintroduced in a later release?
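For reference, one way to list the image tags of the running CSI containers; the namespace and filter string are assumptions and may need adjusting for your cluster:
kubectl -n kube-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' | grep csi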
The issue should have long been resolved. If you see the same symptoms, it's likely a different problem; could you please file a new ticket? As Fatih (who doesn't maintain this project anymore) indicated, we shouldn't continue to pile onto this issue, as things have become quite unwieldy. Thanks!
I've set up a Rancher 2.0 cluster on DO to test CSI-DO. On the first attempt I followed README.md and succeeded with the example app. However, after trying to do my own stuff I started consistently getting Pod creation errors. To rule out the unknowns I wiped all my stuff and went back to the example app, and confirmed that I'm consistently getting this result:
However, after waiting a couple of hours the problem is gone. I wonder if this was Block Storage degradation that I just happened to witness, or something related to CSI-DO. Unfortunately I've wiped out that cluster, and after setting up a new one the example app deploys just fine. I will provide any additional info you might need if I witness this problem again.
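In case it helps anyone reproducing this, the errors can be inspected with something along these lines (the pod name is a placeholder and depends on the example app):
kubectl describe pod <example-app-pod>                                          # MountVolume.MountDevice errors show up under Events
kubectl get events --sort-by=.metadata.creationTimestamp | grep -i mountdevice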