Skip to content

Investigate and fix GCP deployment failure in nightly workflow run #3103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
1 task
gurevichdmitry opened this issue Mar 17, 2025 · 1 comment · May be fixed by #3264
Open
1 task

Investigate and fix GCP deployment failure in nightly workflow run #3103

gurevichdmitry opened this issue Mar 17, 2025 · 1 comment · May be fixed by #3264
Assignees
Labels
automation Team:Cloud Security Cloud Security team related
Milestone

Comments

@gurevichdmitry
Copy link
Collaborator

Motivation

In the latest runs of the test serverless workflow, the GCP CSPM deployment is failing with the following error:
ERROR: (gcloud.deployment-manager.deployments.describe) ResponseError: code=404, message=The object 'projects/xxx/global/deployments/prd-env-17mar0231' is not found.

This task aims to investigate the issue and fix this error.

Definition of done

  • GCP CSPM deployment completes successfully without errors.
@gurevichdmitry gurevichdmitry added the Team:Cloud Security Cloud Security team related label Mar 17, 2025
@gurevichdmitry gurevichdmitry self-assigned this Mar 17, 2025
@gurevichdmitry gurevichdmitry added this to the 9.0 milestone Mar 17, 2025
@gurevichdmitry gurevichdmitry changed the title Investigate and fix GCP deployment failure in daily workflow run Investigate and fix GCP deployment failure in nightly workflow run Mar 17, 2025
@gurevichdmitry
Copy link
Collaborator Author

Since the error is not easily reproducible, I created a script that repeatedly executes the installation and destruction of the Elastic Agent using the deploy.sh script. In 20 runs, the issue occurred twice.

Debug logs did not reveal anything new. During deployment, Deployment Manager performs status checks to determine if the operation is still running or completed. In the failed cases, the issue was related to binding a policy to a service account that was not yet available. I verified in the cloud console that the service account was created, which suggests this is likely a timing issue — the flow fails because the service account isn’t fully available at the time the binding is attempted.

According to Google’s notice, Deployment Manager will be deprecated on December 31, 2025. Given that, I don't think further investigation is necessary. To resolve this issue, we can use our existing feature to create the service account before executing the Deployment Manager script. This should work around the flakiness and enable stable workflow execution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automation Team:Cloud Security Cloud Security team related
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant