Fix: Introduce `StagingLockfile` to resolve overlay staging lock memory leak #2336

Honny1 · 2025-05-27T09:39:22Z

This PR introduces StagingLockfile, a new internal lockfile designed to manage locking for temporary directories more effectively and to resolve a memory leak. The leak was caused by pruned lockfiles from overlay staging remaining in the global cache.

Key methods provided by StagingLockfile:

- CreateAndLock: Creates and locks a new unique file.
- TryLockPath: Attempts to lock an existing file. If the file does not exist, it will be created.
- UnlockAndDelete: Unlocks and deletes the file (fixes the memory leak).

An internal filelock component has been introduced to encapsulate common primitives for lock file operations. This allows StagingLockfile and Lockfile to share core locking mechanisms, reducing code duplication.

mtrmac

Thanks!

The mechanism of the implementation looks good overall.

The primary purpose of doing this separate implementation was to ensure that the global stagingLockFile map does not grow without bounds. So far, I don’t see that this package reliably achieves that — it does seem to be true for the drivers/overlay caller, but there is no guarantee.

I’d expect

- explicit documentation on the constructors (GetStagingLockFile/CreateAndLock/TryLockExisting?) for what the caller must do to release memory again.
- All tests that test the expected ways to use the API (not AssertPanics) to end with an assertion that stagingLockFile is empty = we deallocated everything.
  - And I rather suspect that some of the implementation is leaking entries, at least on error paths - e.g. the retry path in CreateAndLock.
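The two expectations above can be sketched as a test pattern. This is only an illustration under stated assumptions: `registry`, `acquire`, `release`, `createWithRetry` and `errBusy` are hypothetical names, the "lock" is an in-memory stand-in rather than a real file lock, and none of it is taken from the PR's code. It shows a retry loop whose failed attempts remove their entries again, so a final "the map is empty" assertion can hold.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errBusy is a hypothetical error returned when a candidate name is taken.
var errBusy = errors.New("busy")

// registry stands in for a package-level map of staging locks (hypothetical).
type registry struct {
	mu      sync.Mutex
	entries map[string]struct{}
}

func newRegistry() *registry {
	return &registry{entries: map[string]struct{}{}}
}

// acquire records name, then runs lockFn (a stand-in for taking the file lock).
// If lockFn fails, the entry is removed again so the error path does not leak.
func (r *registry) acquire(name string, lockFn func() error) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.entries[name]; ok {
		return errors.New("already held")
	}
	r.entries[name] = struct{}{}
	if err := lockFn(); err != nil {
		delete(r.entries, name) // cleanup on the error path
		return err
	}
	return nil
}

// release drops the entry for name.
func (r *registry) release(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.entries, name)
}

// size reports how many entries are currently recorded.
func (r *registry) size() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.entries)
}

// createWithRetry tries up to attempts candidate names, mimicking a
// create-unique-file loop; failed attempts must leave no entries behind.
func createWithRetry(r *registry, attempts int, lockFn func(name string) error) (string, error) {
	for i := 0; i < attempts; i++ {
		name := fmt.Sprintf("staging-%d", i)
		err := r.acquire(name, func() error { return lockFn(name) })
		if err == nil {
			return name, nil
		}
	}
	return "", errors.New("no free name")
}
```

A test built on this would drive `createWithRetry` through a few failing attempts, check that exactly one entry exists while the lock is held, call `release`, and end by asserting that `size()` is zero.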

I think it’s very likely that the staging lock package doesn’t need all of the existing operations (no need for blocking Lock or plain Unlock, and maybe more such opportunities?); along with removal of the recursive read locking capability, that could allow simplifying things.

Note to self: I didn’t review the test coverage as a whole, checking whether it covers the primary use cases.

internal/staging_lockfile/staging_lockfile.go

internal/staging_lockfile/staging_lockfile_test.go

mtrmac · 2025-05-27T20:10:43Z

drivers/overlay/overlay.go

@@ -2233,7 +2229,7 @@ func (d *Driver) ApplyDiffWithDiffer(options *graphdriver.ApplyDiffWithDifferOpt
 		return graphdriver.DriverWithDifferOutput{}, err
 	}

-	lock, err := lockfile.GetLockFile(filepath.Join(layerDir, stagingLockFile))
+	lock, err := staging_lockfile.GetStagingLockFile(filepath.Join(layerDir, stagingLockFile))


This has the create lock vs. cleanup race, doesn’t it? (Sure, it’s pre-existing.)

internal/staging_lockfile/staging_lockfile.go

giuseppe

could you please squash patches that refactor code added as part of a previous patch in the PR?

internal/staging_lockfile/staging_lockfile.go

mtrmac · 2025-05-28T16:13:10Z

> could you please squash patches that refactor code added as part of a previous patch in the PR?

That might have been my fault, I have asked for copies / moves of pre-existing code to be extra commits. There are various ways to structure this (maybe start with extracting filelock)… I don’t know how much effort it is worth now that the commits exist.

mtrmac · 2025-05-28T16:15:47Z

> I think it’s very likely that the staging lock package doesn’t need all of the existing operations (no need for blocking Lock or plain Unlock, and maybe more such opportunities?); along with removal of the recursive read locking capability, that could allow simplifying things.

As a guess, consider the following invariant:

> Unless globalMapMutex is currently locked, an entry for a file exists in globalMap iff the current process currently holds the lock for that file.

I think that might suffice to do everything we need, at the cost of 1 in-process mutex + file lock; but it’s very possible I have overlooked something.
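A minimal sketch of how that invariant could be maintained, assuming a Unix-like system where `syscall.Flock` is available. The function names `tryLockPath`, `unlockAndDelete`, `heldLocks` and `example` are hypothetical and are not the package's actual API; paths are used as map keys without normalization, which a real implementation would probably need to handle.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sync"
	"syscall"
)

var (
	globalMapMutex sync.Mutex
	// globalMap holds path -> open, locked file. Entries exist only while
	// this process holds the file lock.
	globalMap = map[string]*os.File{}

	errAlreadyHeld = errors.New("lock already held by this process")
)

// tryLockPath takes an exclusive, non-blocking flock on path, creating the
// file if needed. The map entry is added only once the lock is held.
func tryLockPath(path string) error {
	globalMapMutex.Lock()
	defer globalMapMutex.Unlock()

	if _, held := globalMap[path]; held {
		return errAlreadyHeld
	}
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close() // nothing was recorded, so the error path leaks nothing
		return err
	}
	globalMap[path] = f
	return nil
}

// unlockAndDelete drops the map entry, removes the file while it is still
// locked, and then releases the lock by closing the descriptor.
func unlockAndDelete(path string) error {
	globalMapMutex.Lock()
	defer globalMapMutex.Unlock()

	f, held := globalMap[path]
	if !held {
		return errors.New("lock not held by this process")
	}
	delete(globalMap, path)
	rmErr := os.Remove(path)
	if err := f.Close(); err != nil {
		return err
	}
	return rmErr
}

// heldLocks reports the number of entries in globalMap.
func heldLocks() int {
	globalMapMutex.Lock()
	defer globalMapMutex.Unlock()
	return len(globalMap)
}

// example walks one lock through its lifecycle and returns the number of
// map entries left afterwards.
func example() (int, error) {
	dir, err := os.MkdirTemp("", "staging-lock-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "staging.lock")
	if err := tryLockPath(path); err != nil {
		return 0, err
	}
	if err := tryLockPath(path); !errors.Is(err, errAlreadyHeld) {
		return 0, fmt.Errorf("second lock attempt: got %v", err)
	}
	if err := unlockAndDelete(path); err != nil {
		return 0, err
	}
	return heldLocks(), nil
}
```

Because every map update happens under `globalMapMutex` and only after the file lock has been acquired (or just before it is released), an observer who takes the mutex sees an entry exactly when the process holds the lock. Whether this is enough for all callers is the open question raised above.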

giuseppe · 2025-05-28T19:46:54Z

> That might have been my fault, I have asked for copies / moves of pre-existing code to be extra commits. There are various ways to structure this (maybe start with extracting filelock)… I don’t know how much effort it is worth now that the commits exist.

ok then no problem, having the move on its own commit could be useful for git bisect

internal/rawfilelock/rawfilelock_test.go

internal/staging_lockfile/staging_lockfile.go

internal/staging_lockfile/staging_lockfile_test.go

internal/staging_lockfile/staging_lockfile.go

Honny1 · 2025-06-04T13:20:11Z

@mtrmac I have addressed your comments and added a test that tests TryLockPath from another process.

mtrmac

Thanks, this looks great.

One important problem in CreateAndLock; and it might be more convenient to change the CreateAndLock API [but we can tune that later when adding new users]. Otherwise, mostly nits.

internal/staging_lockfile/staging_lockfile.go

internal/staging_lockfile/staging_lockfile_test.go

internal/staging_lockfile/staging_lockfile.go

mtrmac

Close to done I think.

@giuseppe PTANL.

internal/staging_lockfile/staging_lockfile.go

Signed-off-by: Jan Rodák <[email protected]>

This commit refactors the StagingLockfile component:
- Fix test and function names.
- Removed blocking Lock, plain Unlock and Read lock mechanism.
- Updated comments to reflect the current logic and usage.

Signed-off-by: Jan Rodák <[email protected]>

Signed-off-by: Jan Rodák <[email protected]>

Honny1 · 2025-06-09T07:33:44Z

I fixed the last nits.

mtrmac

LGTM, nice work.

(Hoping for another review due to how critical this is.)

giuseppe

great work!

/lgtm

openshift-ci · 2025-06-09T10:32:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Honny1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [giuseppe]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the do-not-merge/work-in-progress label May 27, 2025

Honny1 force-pushed the staging-lock branch from 7e2e3d2 to 887e34d May 27, 2025 12:11

Honny1 marked this pull request as ready for review May 27, 2025 12:53

openshift-ci bot removed the do-not-merge/work-in-progress label May 27, 2025

mtrmac reviewed May 27, 2025

giuseppe reviewed May 28, 2025

Honny1 force-pushed the staging-lock branch 3 times, most recently from b15a9af to ac2aba9 June 2, 2025 16:49

mtrmac reviewed Jun 2, 2025

Honny1 force-pushed the staging-lock branch 2 times, most recently from 40a3d3e to 360f6a3 June 4, 2025 09:26

mtrmac reviewed Jun 4, 2025

Honny1 force-pushed the staging-lock branch from 360f6a3 to 6b0f9f1 June 5, 2025 09:50

mtrmac reviewed Jun 5, 2025

Honny1 added 6 commits June 9, 2025 09:27

Create staging_lockfile from lockfile

de8761d

Signed-off-by: Jan Rodák <[email protected]>

Remove deprecated API and LastWrite functionality

02398d4

Signed-off-by: Jan Rodák <[email protected]>

Deduplicate code

2a216f8

Signed-off-by: Jan Rodák <[email protected]>

Add TryLockPath, CreateAndLock, UnlockAndDelete functions

bcbecba

Signed-off-by: Jan Rodák <[email protected]>

Replace LockFile with StagingLockfile for overlay staging

b3638ef

Signed-off-by: Jan Rodák <[email protected]>

Honny1 force-pushed the staging-lock branch from 6b0f9f1 to b3638ef June 9, 2025 07:32

mtrmac reviewed Jun 9, 2025

giuseppe approved these changes Jun 9, 2025

openshift-ci bot assigned giuseppe Jun 9, 2025

openshift-ci bot added the lgtm label Jun 9, 2025

openshift-ci bot added the approved label Jun 9, 2025

openshift-merge-bot bot merged commit 5aa4986 into containers:main Jun 9, 2025
20 checks passed
