Skip to content

Fix: Introduce StagingLockfile to resolve overlay staging lock memory leak #2336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 9, 2025

Conversation

Honny1
Copy link
Member

@Honny1 Honny1 commented May 27, 2025

This PR introduces StagingLockfile, a new internal lockfile designed to manage locking for temporary directories more effectively and to resolve a memory leak. The leak was caused by pruned lockfiles from overlay staging remaining in the global cache.

Key methods provided by StagingLockfile:

  • CreateAndLock: Creates and locks a new unique file.
  • TryLockPath: Attempts to lock an existing file. If the file does not exist, it will be created.
  • UnlockAndDelete: Unlocks and deletes the file (fixes the memory leak).

An internal filelock component has been introduced to encapsulate common primitives for lock file operations. This allows StagingLockfile and Lockfile to share core locking mechanisms, reducing code duplication.

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

The mechanism of the implementation looks good overall.

The primary purpose of doing this separate implementation was to ensure that the global stagingLockFile map does not grow without bounds. So far, I don’t see that this package reliably achieves that — it does seem to be true for the drivers/overlay caller, but there is no guarantee.

I’d expect

  • explicit documentation on the constructors (GetStagingLockFile/CreateAndLock/TryLockExisting?) for what the caller must do to release memory again.
  • ~All tests that test the expected ways to use the API (not AssertPanics) to end with an assertion that stagingLockFile is empty = we deallocated everything.
    • And I rather suspect that some of the implementation is leaking entries, at least on error paths - e.g. the retry path in CreateAndLock.

I think it’s very likely that the staging lock package doesn’t need all of the existing operations (no need for blocking Lock or plain Unlock, and maybe more such opportunities?); along with removal of the recursive read locking capability, that could allow simplifying things.


Note to self: I didn’t review the test coverage as a whole, checking whether it covers the primary use cases.

@@ -2233,7 +2229,7 @@ func (d *Driver) ApplyDiffWithDiffer(options *graphdriver.ApplyDiffWithDifferOpt
return graphdriver.DriverWithDifferOutput{}, err
}

lock, err := lockfile.GetLockFile(filepath.Join(layerDir, stagingLockFile))
lock, err := staging_lockfile.GetStagingLockFile(filepath.Join(layerDir, stagingLockFile))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the create lock vs. cleanup race, doesn’t it? (Sure, it’s pre-existing.)

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please squash patches that refactor code added as part of a previous patch in the PR?

@mtrmac
Copy link
Collaborator

mtrmac commented May 28, 2025

could you please squash patches that refactor code added as part of a previous patch in the PR?

That might have been my fault, I have asked for copies / moves of pre-existing code to be extra commits. There are various ways to structure this (maybe start with extracting filelock)… I don’t know how much effort is it worth now that the commits exist.

@mtrmac
Copy link
Collaborator

mtrmac commented May 28, 2025

I think it’s very likely that the staging lock package doesn’t need all of the existing operations (no need for blocking Lock or plain Unlock, and maybe more such opportunities?); along with removal of the recursive read locking capability, that could allow simplifying things.

As a guess, consider the following invariant

Unless globalMapMutex is currently locked, an entry for a file exists in globalMap iff the current process currently holds the lock for that file.

I think that might suffice to do everything we need, at the cost of 1 in-process mutex + file lock; but it’s very possible I have overlooked something.

@giuseppe
Copy link
Member

That might have been my fault, I have asked for copies / moves of pre-existing code to be extra commits. There are various ways to structure this (maybe start with extracting filelock)… I don’t know how much effort is it worth now that the commits exist.

ok then no problem, having the move on its own commit could be useful for git bisect

@Honny1 Honny1 force-pushed the staging-lock branch 3 times, most recently from b15a9af to ac2aba9 Compare June 2, 2025 16:49
@Honny1 Honny1 force-pushed the staging-lock branch 2 times, most recently from 40a3d3e to 360f6a3 Compare June 4, 2025 09:26
@Honny1
Copy link
Member Author

Honny1 commented Jun 4, 2025

@mtrmac I have addressed your comments and added a test that tests TryLockPath from another process.

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks great.

One important problem in CreateAndLock; and it might be more convenient to change the CreateAndLock API [but we can tune that later when adding new users]. Otherwise, mostly nits.

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close to done I think.

@giuseppe PTANL.

Honny1 added 6 commits June 9, 2025 09:27
This commit refactors the StagingLockfile component:
- Fix test, functions names.
- Removed blocking Lock, plain Unlock and Read lock  mechanism.
- Updated comments to reflect the current logic and usage.

Signed-off-by: Jan Rodák <[email protected]>
Signed-off-by: Jan Rodák <[email protected]>
@Honny1
Copy link
Member Author

Honny1 commented Jun 9, 2025

I fixed the last nits.

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice work.

(Hoping for another review due to how critical this is.)

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work!

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jun 9, 2025
Copy link
Contributor

openshift-ci bot commented Jun 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Honny1

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jun 9, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 5aa4986 into containers:main Jun 9, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants