
Cluster deployment issues #32

Open
4 of 4 issues completed
@heerener

Description

We regularly see failures when deploying the CloudFormation stack. Some investigation has already happened. An experiment deploying 10 clusters at the same time showed a failure rate of around 60%.

The main symptom shows up as a "DRA did not stabilize" error:

Resource handler returned message: 
"Resource of type 'AWS::FSx::DataRepositoryAssociation' with identifier 'dra-[...]' did not stabilize." 
(RequestToken: [...], HandlerErrorCode: NotStabilized)

A closer look at the DRA failure state reveals the underlying error:

Amazon FSx does not have the required permissions for the S3 bucket. 
Verify that FSx has correct permissions by checking the bucket policy.
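
For reference, a minimal sketch of how the underlying message could be pulled out of the DRA state instead of relying on the CloudFormation "did not stabilize" text. It assumes the response has the shape returned by the boto3 FSx client call `describe_data_repository_associations` (an `Associations` list whose entries carry `AssociationId`, `Lifecycle` and `FailureDetails.Message`). The helper name `dra_failure_messages` and the sample IDs are hypothetical.

```python
from __future__ import annotations

from typing import Any


def dra_failure_messages(response: dict[str, Any]) -> dict[str, tuple[str, str]]:
    """Map association ID -> (lifecycle, failure message).

    Only associations that carry a non-empty FailureDetails message are
    returned; healthy associations are skipped.
    """
    failures: dict[str, tuple[str, str]] = {}
    for assoc in response.get("Associations", []):
        message = (assoc.get("FailureDetails") or {}).get("Message")
        if message:
            lifecycle = assoc.get("Lifecycle", "UNKNOWN")
            failures[assoc["AssociationId"]] = (lifecycle, message)
    return failures


# Hypothetical sample data, shaped like the API response (not real output).
sample_response = {
    "Associations": [
        {"AssociationId": "dra-example-ok", "Lifecycle": "AVAILABLE"},
        {
            "AssociationId": "dra-example-bad",
            "Lifecycle": "FAILED",
            "FailureDetails": {
                "Message": "Amazon FSx does not have the required "
                "permissions for the S3 bucket."
            },
        },
    ]
}

for dra_id, (lifecycle, message) in dra_failure_messages(sample_response).items():
    print(f"{dra_id} [{lifecycle}]: {message}")
```

With real credentials, the response would presumably come from something like `boto3.client("fsx").describe_data_repository_associations(AssociationIds=[...])`; the sketch keeps that call out so it runs without AWS access.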

All buckets that get mounted now have this in their permission set:

{
    "Sid": "AllowFSX",
    "Effect": "Allow",
    "Principal": {
         "Service": "fsx.amazonaws.com"
    },
    "Action": "s3:*",
    "Resource": [
         "arn:aws:s3:::[...]/**",
         "arn:aws:s3:::[...]"
    ]
}
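
To rule out a bucket whose policy is missing this statement, a check along the following lines could be run against each bucket's policy document before deployment. This is a rough sketch, not a policy evaluator: it only looks for `Allow` statements whose principal is the `fsx.amazonaws.com` service and checks that both the bucket ARN and an object-level ARN are listed. It ignores `Action`, `Condition` and any `Deny` statements. The function names and the bucket name `example-bucket` are hypothetical.

```python
from __future__ import annotations

from typing import Any

FSX_SERVICE = "fsx.amazonaws.com"


def _as_list(value: Any) -> list:
    """Policy fields may be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def fsx_allowed_resources(policy: dict[str, Any]) -> set[str]:
    """Collect resources from Allow statements granted to the FSx service."""
    resources: set[str] = set()
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        services = _as_list(principal.get("Service")) if isinstance(principal, dict) else []
        if FSX_SERVICE in services:
            resources.update(_as_list(stmt.get("Resource")))
    return resources


def covers_bucket_and_objects(policy: dict[str, Any], bucket: str) -> bool:
    """True if FSx is granted both the bucket ARN and an object-level ARN."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    resources = fsx_allowed_resources(policy)
    has_bucket = bucket_arn in resources
    has_objects = any(r.startswith(bucket_arn + "/") for r in resources)
    return has_bucket and has_objects


# Hypothetical policy mirroring the statement above, with a made-up bucket name.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowFSX",
            "Effect": "Allow",
            "Principal": {"Service": "fsx.amazonaws.com"},
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::example-bucket/**",
                "arn:aws:s3:::example-bucket",
            ],
        }
    ],
}

print(covers_bucket_and_objects(example_policy, "example-bucket"))
```

A passing check here would not prove that FSx can actually access the bucket; it would only confirm that the statement is present in the policy that was inspected.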

The second issue is that, when deploying multiple clusters at the same time, the DRA does not stabilize within 4 hours. It is unclear whether that is expected and, if so, what the timeout should be.

To be discussed in the AWS call this Thursday.
