fix(workflow/sync): use RWMutex to prevent concurrent map access #14321

ryancurrah (Contributor) commented Mar 21, 2025

This resolves a fatal "concurrent map iteration and map write" error by synchronizing reads and writes to syncLockMap with a read-write mutex. This change has been running in our environment for 7 days (8,764 workflows) without recurrence.

Fixes #14300

Motivation

To stop the controller from crashing with a fatal error when syncLockMap is read and written at the same time.

Modifications

  • Added a read lock in the CheckWorkflowExistence function.
  • Replaced the exclusive lock with a read-write mutex.
  • For functions that only read syncLockMap, a read lock is used to allow concurrent reads safely.
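
The pattern described in the list above can be sketched as follows. This is a minimal illustration, not the code from `workflow/sync/sync_manager.go`: the `lockRegistry` type, its fields, and the `Acquire` and `Names` methods are hypothetical names chosen for the example. It assumes only that the map is shared between goroutines and that some callers iterate over it while others write to it.

```go
package main

import (
	"fmt"
	"sync"
)

// lockRegistry is a hypothetical stand-in for a manager that owns a shared map.
type lockRegistry struct {
	mu    sync.RWMutex
	locks map[string]int // hypothetical: lock name -> number of holders
}

func newLockRegistry() *lockRegistry {
	return &lockRegistry{locks: make(map[string]int)}
}

// Acquire writes to the map, so it takes the exclusive (write) lock.
func (r *lockRegistry) Acquire(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.locks[name]++
}

// Names only iterates over the map, so a shared (read) lock is enough.
// Many readers may hold the read lock at once; writers are excluded.
func (r *lockRegistry) Names() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	names := make([]string, 0, len(r.locks))
	for name := range r.locks {
		names = append(names, name)
	}
	return names
}

func main() {
	r := newLockRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(2)
		go func(i int) { // writer
			defer wg.Done()
			r.Acquire(fmt.Sprintf("lock-%d", i))
		}(i)
		go func() { // reader iterating concurrently with the writers
			defer wg.Done()
			_ = r.Names()
		}()
	}
	wg.Wait()
	fmt.Println(len(r.Names())) // 50 distinct names were inserted
}
```

Without the mutex, the concurrent `range` and map write in this program would be a data race, which the Go runtime may report as a fatal "concurrent map iteration and map write" error. Running the sketch with `go run -race` is one way to check that the locking is sufficient.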

Verification

  • Built the Argo Workflows controller image with make workflow-controller-image.
  • Pushed it to our Artifactory container registry.
  • Deployed in a non-production cluster (running production-like workloads in “shadow mode”).
  • Monitored logs for 7 days (8,764 workflows) with alerts on the known error message.
  • Observed no recurrence of the concurrency issue.

Documentation

N/A

Issue: argoprojGH-14300
Signed-off-by: Ryan Currah <[email protected]>
ryancurrah changed the title from "fix(workflow/sync): use RWMutex to prevent concurrent map access" to "fix(workflow/sync): use RWMutex to prevent concurrent map access Fixes #14300" on Mar 21, 2025
ryancurrah changed the title from "fix(workflow/sync): use RWMutex to prevent concurrent map access Fixes #14300" to "fix(workflow/sync): use RWMutex to prevent concurrent map access" on Mar 21, 2025
Joibel requested review from Joibel and Copilot on March 21, 2025 15:38
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR addresses a fatal "concurrent map iteration and map write" error by switching from a basic mutex to a read-write mutex in order to better synchronize access when workflows are checked, released, or released all at once.

  • The Manager struct now uses a sync.RWMutex.
  • CheckWorkflowExistence now acquires a read lock.
  • The Release and ReleaseAll methods have been updated to use read locks.
Comments suppressed due to low confidence (1)

workflow/sync/sync_manager.go:342

  • If the ReleaseAll method modifies the syncLockMap or its contents, consider using a write lock to ensure safe concurrent modifications.
sm.lock.RLock()
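
The general rule behind the suppressed comment above can be illustrated with a small sketch. It does not reproduce the actual `ReleaseAll` in `sync_manager.go`, and it makes no claim about whether that method mutates the map. The `holderSet` type and its methods are hypothetical. The sketch assumes a method that deletes or rewrites map entries, in which case a write lock is needed because a read lock only excludes writers and does not protect against other readers mutating the same data.

```go
package main

import (
	"fmt"
	"sync"
)

// holderSet is a hypothetical map of lock name -> workflow keys holding it.
type holderSet struct {
	mu      sync.RWMutex
	holders map[string][]string
}

// ReleaseAll drops wfKey from every entry and deletes entries that become
// empty. It rewrites and deletes map entries, so it takes the write lock.
func (h *holderSet) ReleaseAll(wfKey string) int {
	h.mu.Lock()
	defer h.mu.Unlock()
	released := 0
	for name, keys := range h.holders {
		kept := keys[:0] // filter in place, reusing the backing array
		for _, k := range keys {
			if k == wfKey {
				released++
				continue
			}
			kept = append(kept, k)
		}
		if len(kept) == 0 {
			delete(h.holders, name) // deleting during range is allowed in Go
		} else {
			h.holders[name] = kept
		}
	}
	return released
}

// Len only reads the map, so the shared read lock is sufficient.
func (h *holderSet) Len() int {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return len(h.holders)
}

func main() {
	h := &holderSet{holders: map[string][]string{
		"a": {"wf1", "wf2"},
		"b": {"wf1"},
	}}
	fmt.Println(h.ReleaseAll("wf1"), h.Len()) // prints "2 1"
}
```

If a method only looks up entries and delegates mutation to values that carry their own synchronization, a read lock on the map can still be appropriate. The sketch covers only the case where the map itself changes.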

@Joibel Joibel enabled auto-merge (squash) March 21, 2025 15:45
Joibel (Member) commented Mar 21, 2025

/cherry-pick release-3.6

@Joibel Joibel merged commit 73a8e70 into argoproj:main Mar 21, 2025
37 checks passed
gcp-cherry-pick-bot bot pushed a commit that referenced this pull request Mar 21, 2025
jakkubu pushed a commit to splunk/argo-workflows that referenced this pull request Mar 24, 2025
Joibel pushed a commit that referenced this pull request Mar 24, 2025
kim-codefresh added a commit to codefresh-io/argo-workflows that referenced this pull request May 20, 2025
…abilities fixes (Cr 28355) (#358)

* fix: bump deps for k8schain to fix ecr-login (argoproj#14008) (release-3.6 cherry-pick) (argoproj#14174)

* fix(ci): python sdk release process (release-3.6) (argoproj#14183)

Signed-off-by: Alan Clucas <[email protected]>

* docs: clarify qps/burst on controller (cherry-pick argoproj#14190) (argoproj#14192)

Signed-off-by: Tim Collins <[email protected]>
Co-authored-by: Tim Collins <[email protected]>

* fix(api/jsonschema): use unchanging JSON Schema version (cherry-pick argoproj#14092) (argoproj#14256)

Signed-off-by: Roger Peppe <[email protected]>
Co-authored-by: Roger Peppe <[email protected]>

* fix(api/jsonschema): use working `$id` (cherry-pick argoproj#14257) (argoproj#14258)

Signed-off-by: Roger Peppe <[email protected]>
Co-authored-by: Roger Peppe <[email protected]>

* docs: autogenerate tested k8s versions and centralize config (argoproj#14176) (release-3.6) (argoproj#14262)

Signed-off-by: Mason Malone <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Mason Malone <[email protected]>

* chore(deps): bump minio-go to newer version (argoproj#14185) (release-3.6) (argoproj#14261)

Co-authored-by: Vaibhav Kaushik <[email protected]>

* fix: split pod controller from workflow controller (argoproj#14129) (release-3.6) (argoproj#14263)

* chore(deps): fix snyk (argoproj#14264) (release-3.6) (argoproj#14268)

* chore: revert to correct k8s versions

Accidental bump from argoproj#14176 cherry-pick

Signed-off-by: Alan Clucas <[email protected]>

* chore(deps): bump github.com/go-jose/go-jose/v3 from 3.0.3 to 3.0.4 in the go_modules group (cherry-pick argoproj#14231) (argoproj#14269)

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: wait for workflow informer to sync before pod informer (cherry-pick argoproj#14248) (argoproj#14266)

Signed-off-by: Rohan K <[email protected]>
Co-authored-by: Rohan K <[email protected]>

* fix(cli): remove red from log colour selection. Fixes argoproj#6740 (cherry-pick argoproj#14215) (argoproj#14278)

Signed-off-by: Prabakaran Kumaresshan <[email protected]>
Co-authored-by: Prabakaran Kumaresshan <[email protected]>

* fix: correct semaphore configmap keys for multiple semaphores (argoproj#14184) (release-3.6) (argoproj#14281)

* fix: don't print help for non-validation errors. Fixes argoproj#14234 (cherry-pick argoproj#14249) (argoproj#14283)

Signed-off-by: Koichi Shimada <[email protected]>
Signed-off-by: Mason Malone <[email protected]>
Co-authored-by: koichi <[email protected]>
Co-authored-by: Mason Malone <[email protected]>

* docs: fix kubernetes versions (release-3.6) (argoproj#14273)

Signed-off-by: Alan Clucas <[email protected]>

* fix(workflow/sync): use RWMutex to prevent concurrent map access (cherry-pick argoproj#14321) (argoproj#14322)

Signed-off-by: Ryan Currah <[email protected]>
Co-authored-by: Ryan Currah <[email protected]>

* chore(lint): update golangci-lint to 2.1.1 (argoproj#14390) (cherry-pick release-3.6) (argoproj#14417)

* chore: bump golang 1.23->1.24 (argoproj#14385) (cherry-pick release-3.6) (argoproj#14418)

* fix: gracefully handle invalid CronWorkflows and simplify logic.  (cherry-pick argoproj#14197) (argoproj#14419)

Signed-off-by: Mason Malone <[email protected]>

* fix: prevent dfs sorter infinite recursion on cycle. Fixes argoproj#13395 (cherry-pick argoproj#14391) (argoproj#14420)

Signed-off-by: Adrien Delannoy <[email protected]>
Co-authored-by: Adrien Delannoy <[email protected]>

* chore(deps): bump github.com/expr-lang/expr from 1.16.9 to 1.17.0 (argoproj#14307) (release-3.6) (argoproj#14421)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps)!: update k8s and argo-events (release-3.6) (argoproj#14424)

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: william.vanhevelingen <[email protected]>
Signed-off-by: Mason Malone <[email protected]>
Signed-off-by: William Van Hevelingen <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: William Van Hevelingen <[email protected]>
Co-authored-by: Mason Malone <[email protected]>

* fix: correct retry logic (argoproj#13734) (release-3.6) (argoproj#14428)

Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Isitha Subasinghe <[email protected]>

* fix: manual retries exit handler cleanup. Fixes argoproj#14180 (argoproj#14181) (release-3.6) (argoproj#14429)

Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: Isitha Subasinghe <[email protected]>

* fix: correct manual retry logic. Fixes argoproj#14124 (argoproj#14328) (release-3.6) (argoproj#14430)

Signed-off-by: oninowang <[email protected]>
Signed-off-by: Alan Clucas <[email protected]>
Co-authored-by: jswxstw <[email protected]>

* fix: disable ALPN in argo-server as a workaround (argoproj#14433)

Signed-off-by: Alan Clucas <[email protected]>

* result of codegen

Signed-off-by: Kim <[email protected]>

* fix:lint

Signed-off-by: Kim <[email protected]>

---------

Signed-off-by: Alan Clucas <[email protected]>
Signed-off-by: Tim Collins <[email protected]>
Signed-off-by: Roger Peppe <[email protected]>
Signed-off-by: Mason Malone <[email protected]>
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Rohan K <[email protected]>
Signed-off-by: Prabakaran Kumaresshan <[email protected]>
Signed-off-by: Koichi Shimada <[email protected]>
Signed-off-by: Ryan Currah <[email protected]>
Signed-off-by: Adrien Delannoy <[email protected]>
Signed-off-by: william.vanhevelingen <[email protected]>
Signed-off-by: William Van Hevelingen <[email protected]>
Signed-off-by: isubasinghe <[email protected]>
Signed-off-by: oninowang <[email protected]>
Signed-off-by: Kim <[email protected]>
Co-authored-by: Alan Clucas <[email protected]>
Co-authored-by: gcp-cherry-pick-bot[bot] <98988430+gcp-cherry-pick-bot[bot]@users.noreply.github.com>
Co-authored-by: Tim Collins <[email protected]>
Co-authored-by: Roger Peppe <[email protected]>
Co-authored-by: Mason Malone <[email protected]>
Co-authored-by: Vaibhav Kaushik <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Rohan K <[email protected]>
Co-authored-by: Prabakaran Kumaresshan <[email protected]>
Co-authored-by: koichi <[email protected]>
Co-authored-by: Ryan Currah <[email protected]>
Co-authored-by: Adrien Delannoy <[email protected]>
Co-authored-by: William Van Hevelingen <[email protected]>
Co-authored-by: Isitha Subasinghe <[email protected]>
Co-authored-by: jswxstw <[email protected]>
Development

Successfully merging this pull request may close these issues.

Argo Controller Crash Due to Concurrent Map Access