We regularly see failures when deploying the CloudFormation stack. Some investigation has already happened: an experiment deploying 10 clusters at the same time showed a failure rate of around 60%.
The main symptom shows up as a "DRA did not stabilize" error:
Resource handler returned message:
"Resource of type 'AWS::FSx::DataRepositoryAssociation' with identifier 'dra-[...]' did not stabilize."
(RequestToken: [...], HandlerErrorCode: NotStabilized)
A closer look at the DRA failure state reveals the true error to be:
Amazon FSx does not have the required permissions for the S3 bucket.
Verify that FSx has correct permissions by checking the bucket policy.
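For reference, the underlying failure message can be pulled from the FSx API rather than from the CloudFormation event. The following is a minimal sketch, assuming boto3 is installed and credentials for the affected account are configured; the function names `failed_associations` and `fetch_failed_associations` are hypothetical helpers, not part of our tooling.

```python
def failed_associations(response):
    """Return (association_id, failure_message) pairs for DRAs that are
    in the FAILED or MISCONFIGURED lifecycle state.

    `response` is the dict returned by the FSx
    DescribeDataRepositoryAssociations call.
    """
    failed = []
    for assoc in response.get("Associations", []):
        if assoc.get("Lifecycle") in ("FAILED", "MISCONFIGURED"):
            # FailureDetails may be absent; fall back to an empty message.
            message = assoc.get("FailureDetails", {}).get("Message", "")
            failed.append((assoc.get("AssociationId"), message))
    return failed


def fetch_failed_associations(association_ids):
    """Query FSx for the given DRA ids and report the failed ones."""
    # Imported lazily so failed_associations() can be used without AWS access.
    import boto3

    fsx = boto3.client("fsx")
    response = fsx.describe_data_repository_associations(
        AssociationIds=association_ids
    )
    return failed_associations(response)
```

Calling `fetch_failed_associations` with the DRA identifiers from the failed stack events should surface messages like the permissions error quoted above, assuming the association still exists after the rollback.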
All buckets that get mounted now have this statement in their bucket policy:
{
    "Sid": "AllowFSX",
    "Effect": "Allow",
    "Principal": {
        "Service": "fsx.amazonaws.com"
    },
    "Action": "s3:*",
    "Resource": [
        "arn:aws:s3:::[...]/**",
        "arn:aws:s3:::[...]"
    ]
}
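To confirm that a given bucket really carries this statement, a quick check along the following lines may help. This is only a sketch: `allows_fsx` is a hypothetical helper, and it only looks for an `Allow` statement naming the `fsx.amazonaws.com` service principal. It does not evaluate `Deny` statements, conditions, or the listed actions and resources, so a passing check does not prove FSx has effective access.

```python
import json


def allows_fsx(policy_document):
    """Return True if the policy has an Allow statement whose Principal
    includes the fsx.amazonaws.com service.

    `policy_document` may be a JSON string (as returned in the "Policy"
    field of S3 GetBucketPolicy) or an already-parsed dict.
    """
    if isinstance(policy_document, str):
        policy = json.loads(policy_document)
    else:
        policy = policy_document

    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        # A policy may contain a single statement object instead of a list.
        statements = [statements]

    for stmt in statements:
        principal = stmt.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        if stmt.get("Effect") == "Allow" and "fsx.amazonaws.com" in services:
            return True
    return False
```

The policy string can be obtained with boto3 via `boto3.client("s3").get_bucket_policy(Bucket=...)["Policy"]` and passed directly to `allows_fsx`.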
A second issue is that when deploying multiple clusters at the same time, the DRA does not stabilize within 4 hours. It is unclear whether that is expected and, if so, what the timeout should be.
To be discussed in the AWS call this Thursday.