-
Notifications
You must be signed in to change notification settings - Fork 258
Fix: Introduce StagingLockfile
to resolve overlay staging lock memory leak
#2336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
The mechanism of the implementation looks good overall.
The primary purpose of doing this separate implementation was to ensure that the global stagingLockFile
map does not grow without bounds. So far, I don’t see that this package reliably achieves that — it does seem to be true for the drivers/overlay
caller, but there is no guarantee.
I’d expect
- explicit documentation on the constructors (
GetStagingLockFile
/CreateAndLock
/TryLockExisting
?) for what the caller must do to release memory again. - ~All tests that test the expected ways to use the API (not
AssertPanics
) to end with an assertion thatstagingLockFile
is empty = we deallocated everything.- And I rather suspect that some of the implementation is leaking entries, at least on error paths - e.g. the retry path in
CreateAndLock
.
- And I rather suspect that some of the implementation is leaking entries, at least on error paths - e.g. the retry path in
I think it’s very likely that the staging lock package doesn’t need all of the existing operations (no need for blocking Lock
or plain Unlock
, and maybe more such opportunities?); along with removal of the recursive read locking capability, that could allow simplifying things.
Note to self: I didn’t review the test coverage as a whole, checking whether it covers the primary use cases.
drivers/overlay/overlay.go
Outdated
@@ -2233,7 +2229,7 @@ func (d *Driver) ApplyDiffWithDiffer(options *graphdriver.ApplyDiffWithDifferOpt | |||
return graphdriver.DriverWithDifferOutput{}, err | |||
} | |||
|
|||
lock, err := lockfile.GetLockFile(filepath.Join(layerDir, stagingLockFile)) | |||
lock, err := staging_lockfile.GetStagingLockFile(filepath.Join(layerDir, stagingLockFile)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has the create lock vs. cleanup race, doesn’t it? (Sure, it’s pre-existing.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please squash patches that refactor code added as part of a previous patch in the PR?
That might have been my fault, I have asked for copies / moves of pre-existing code to be extra commits. There are various ways to structure this (maybe start with extracting |
As a guess, consider the following invariant
I think that might suffice to do everything we need, at the cost of 1 in-process mutex + file lock; but it’s very possible I have overlooked something. |
ok then no problem, having the move on its own commit could be useful for git bisect |
b15a9af
to
ac2aba9
Compare
40a3d3e
to
360f6a3
Compare
@mtrmac I have addressed your comments and added a test that tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks great.
One important problem in CreateAndLock
; and it might be more convenient to change the CreateAndLock
API [but we can tune that later when adding new users]. Otherwise, mostly nits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close to done I think.
@giuseppe PTANL.
Signed-off-by: Jan Rodák <[email protected]>
Signed-off-by: Jan Rodák <[email protected]>
This commit refactors the StagingLockfile component: - Fix test, functions names. - Removed blocking Lock, plain Unlock and Read lock mechanism. - Updated comments to reflect the current logic and usage. Signed-off-by: Jan Rodák <[email protected]>
Signed-off-by: Jan Rodák <[email protected]>
Signed-off-by: Jan Rodák <[email protected]>
Signed-off-by: Jan Rodák <[email protected]>
I fixed the last nits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice work.
(Hoping for another review due to how critical this is.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: giuseppe, Honny1 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR introduces
StagingLockfile
, a new internal lockfile designed to manage locking for temporary directories more effectively and to resolve a memory leak. The leak was caused by pruned lockfiles from overlay staging remaining in the global cache.Key methods provided by
StagingLockfile
:CreateAndLock
: Creates and locks a new unique file.TryLockPath
: Attempts to lock an existing file. If the file does not exist, it will be created.UnlockAndDelete
: Unlocks and deletes the file (fixes the memory leak).An internal
filelock
component has been introduced to encapsulate common primitives for lock file operations. This allowsStagingLockfile
andLockfile
to share core locking mechanisms, reducing code duplication.