pyramid: fix file handle and goroutine leaks, add proper cleanup #9787
Conversation
arielshaqed
left a comment
reviewing before it's time - but these comments might be relevant no matter what we do.
if err := os.Remove(p); err != nil {
    tfs.logger.WithError(err).WithField("path", p).Error("Removing file failed")
    errorsTotal.WithLabelValues(tfs.fsName, "FileRemoval")
    errorsTotal.WithLabelValues(tfs.fsName, "FileRemoval").Inc()
😱
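For context on why the missing .Inc() mattered, here is a minimal sketch assuming errorsTotal is a Prometheus CounterVec (the metric name and labels below are assumptions, not the actual lakeFS definitions): WithLabelValues only resolves the labeled child counter, so without calling Inc() on it nothing is recorded.

// A minimal sketch, not the lakeFS metrics code.
package metricsdemo

import "github.com/prometheus/client_golang/prometheus"

var errorsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "tier_fs_errors_total", // assumed name
        Help: "TierFS errors by type.",
    },
    []string{"fs_name", "type"},
)

func recordRemovalError(fsName string) {
    // Bug pattern: resolves the labeled child counter, but never increments it.
    errorsTotal.WithLabelValues(fsName, "FileRemoval")

    // Fix: actually increment the counter.
    errorsTotal.WithLabelValues(fsName, "FileRemoval").Inc()
}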
// dirDeleteChSize is the buffer size for the directory deletion channel
dirDeleteChSize = 1000
I am of two minds about this: deletion can take a long time, so I'd understand a really short queue. OTOH, the longer we delay deletion, the more we risk going over the allowed disk space. 1000 directories to delete is a lot.
At a minimum, perhaps monitor the length of the queue?
Currently we do not delete asynchronously; I learned this when I looked at what we are actually doing in the context of the goroutine.
The new implementation makes a best effort to have a background worker check and clean up empty folders.
Now I just wonder whether a periodic check that scans the local directory structure and cleans up empty directories as needed would be a better option.
PS: we log when we try to push work to the queue and the queue is full - I'll add metrics too for easy monitoring.
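To make the queue-full behavior concrete, here is a minimal sketch of a non-blocking enqueue that drops, logs and counts when the queue is full. The names (dirCleaner, queueDirDelete, dropped) are assumptions for illustration, not the actual pyramid code; the dropped counter stands in for the errorsTotal{type="DirDeleteQueueFull"} metric mentioned in this PR.

package pyramidsketch

import (
    "log"
    "sync/atomic"
)

// dirDeleteChSize is the buffer size under discussion.
const dirDeleteChSize = 1000

// dirCleaner is a sketch, not the pyramid implementation; names are assumed.
type dirCleaner struct {
    dirDeleteCh chan string
    dropped     atomic.Int64 // stands in for errorsTotal{type="DirDeleteQueueFull"}
}

func newDirCleaner() *dirCleaner {
    return &dirCleaner{dirDeleteCh: make(chan string, dirDeleteChSize)}
}

// queueDirDelete enqueues a directory for best-effort cleanup without blocking
// the caller; when the queue is full the request is dropped, logged and counted.
func (c *dirCleaner) queueDirDelete(dirPath string) {
    select {
    case c.dirDeleteCh <- dirPath: // the worker will remove it later if it is empty
    default:
        c.dropped.Add(1)
        log.Printf("dir delete queue full, skipping cleanup of %s", dirPath)
    }
}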
tfs.logger.WithError(err).Error("Failed deleting empty dir")
errorsTotal.WithLabelValues(tfs.fsName, "DirRemoval")
// Queue directory deletion (non-blocking)
dirPath := filepath.Dir(string(rPath))
I don't understand why we would ever want to delete an entire directory tree at a time. It is not faster to delete multiple files than it is to delete a single file. AFAICT the worker could delete each file separately; it would be no slower.
I didn't modify the functionality. The link is to the current code that runs in the goroutine; it calls deleteDirRecIfEmpty, which only ensures that the directory is removed if it is empty.
Lines 144 to 148 in 9016dbf
go func() {
    if err := tfs.syncDir.deleteDirRecIfEmpty(path.Dir(string(rPath))); err != nil {
        tfs.logger.WithError(err).Error("Failed deleting empty dir")
        errorsTotal.WithLabelValues(tfs.fsName, "DirRemoval")
    }
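For readers unfamiliar with that helper, here is a rough sketch of what "remove the directory only if it is empty, walking up toward the cache root" can look like. It is an illustration of the technique, not the actual deleteDirRecIfEmpty implementation, and the stop-at-root condition is an assumption.

package pyramidsketch

import (
    "os"
    "path/filepath"
)

// deleteDirIfEmptyUpTo is an illustrative stand-in for deleteDirRecIfEmpty:
// remove dir only when it is empty, then try its parent, stopping at root.
// No directory tree is ever deleted; a non-empty directory ends the walk.
func deleteDirIfEmptyUpTo(dir, root string) error {
    for len(dir) > len(root) {
        entries, err := os.ReadDir(dir)
        if err != nil {
            return err
        }
        if len(entries) > 0 {
            return nil // not empty, so its parents are not empty either
        }
        if err := os.Remove(dir); err != nil {
            return err
        }
        dir = filepath.Dir(dir)
    }
    return nil
}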
…ing (#9789) * imporve pyramid delete file tracking * apply code review changes
Merged approved smaller PR into this one. d3558b8
How does this PR solve the mergeIntoBranch errors? This seems to solve other issues we have in pyramid.
The final PR solves multiple issues. The error in mergeIntoBranch occurs when we request metadata during merge: in pyramid, when we fetch data from S3 during openWithLock, we make sure that one caller (the first) performs the download. When that caller's request is cancelled (in our case, timed out), the download stops and a cancellation error is returned to all callers.
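To illustrate the described fix, here is a minimal sketch of sharing one download across callers while detaching it from any single caller's context. It uses golang.org/x/sync/singleflight as a stand-in for the ChanOnlyOne.Compute pattern, and all names below are assumptions rather than the pyramid code.

package pyramidsketch

import (
    "context"

    "golang.org/x/sync/singleflight"
)

var fetchGroup singleflight.Group

// fetchShared runs download once per key even with many concurrent callers.
// The download deliberately uses context.Background(): if it ran under the
// first caller's ctx, that caller timing out would fail the fetch for every
// other caller waiting on the same key.
func fetchShared(ctx context.Context, key string, download func(context.Context, string) error) error {
    resCh := fetchGroup.DoChan(key, func() (interface{}, error) {
        return nil, download(context.Background(), key) // detached on purpose
    })
    select {
    case res := <-resCh:
        return res.Err
    case <-ctx.Done():
        // This caller gives up, but the shared download keeps running for the others.
        return ctx.Err()
    }
}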
AliRamberg
left a comment
I'm pleased, and it does look very good to me :content:
defer tfs.wg.Done()
for {
    select {
    case <-tfs.done:
Is this intentional? Doesn't it mean we could Close TierFS while dirDeleteCh is still not empty?
Yes, when we shut down (close TierFS) we stop managing the cache, and when we start we read/track everything back.
We keep the metrics to monitor issues related to how we clean up empty directories.
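Here is a minimal, self-contained sketch of the shutdown behavior being discussed (all names assumed, not the actual TierFS code): the worker exits as soon as done is closed, so Close can return while dirDeleteCh still holds queued paths, which is acceptable because cleanup is best-effort and the cache is re-tracked on startup.

package pyramidsketch

import (
    "os"
    "sync"
)

// tierCleanup is a self-contained sketch with assumed names, not the TierFS code.
type tierCleanup struct {
    dirDeleteCh chan string
    done        chan struct{}
    wg          sync.WaitGroup
}

func (c *tierCleanup) start() {
    c.wg.Add(1)
    go func() {
        defer c.wg.Done()
        for {
            select {
            case <-c.done:
                return // exit immediately, even if dirDeleteCh still holds queued paths
            case dir := <-c.dirDeleteCh:
                _ = os.Remove(dir) // best effort: Remove fails, and is ignored, if dir is not empty
            }
        }
    }()
}

// Close signals the worker and waits for it to exit; queued deletions may be
// skipped, which is acceptable because the cache is re-tracked on startup.
func (c *tierCleanup) Close() error {
    close(c.done)
    c.wg.Wait()
    return nil
}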
Summary
This PR fixes and improves the pyramid package, which handles the local caching layer for range and meta-range files.
Key Changes
Resource Management:
- Added a Close() method to the FS interface and the TierFS implementation to properly clean up resources
- Added a Close() method to the Eviction interface and the ristretto eviction implementation
- Close rangeFS and metaRangeFS in catalog initialization on error and during shutdown
- Fixed a file handle leak in store() by ensuring files are closed after adapter.Put() (a minimal sketch appears after this summary)
Goroutine Leak Fix:
- Added a dirDeleteWorker background goroutine with proper shutdown signaling via a done channel and sync.WaitGroup
- Added a buffered channel (dirDeleteChSize=1000) to queue directory deletion requests
- Added the errorsTotal{type="DirDeleteQueueFull"} metric to track when directory deletions are skipped due to a full queue
Context Cancellation Fix:
- Use context.Background() in openWithLock when fetching from block storage inside Compute()
- The ChanOnlyOne.Compute pattern shares results across multiple callers; using a caller's context could cause one caller's cancellation to fail the fetch for all waiting callers
Bug Fixes:
- Added missing .Inc() calls on error metric counters (errorsTotal.WithLabelValues(...).Inc())
- Fixed storeLocalFile() - it was using a relative path instead of the full path for file removal
- Use filepath.Join instead of path.Join for OS-specific path handling throughout the codebase
Code Quality:
- Simplified the numCounters calculation in eviction using the max() builtin
- Replaced interface{} with the any type alias
- Removed else after return statements
- Documented the localFileRef struct fields
- Documented the Close() method
Fix https://github.com/treeverse/lakeFS-Enterprise/issues/1426
Fix #9824
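As an illustration of the store() file-handle fix listed under Resource Management, here is a sketch under assumed signatures (the putter interface and its parameters are not the real block adapter API): the file opened for upload is closed on every return path, including when Put fails.

package pyramidsketch

import (
    "context"
    "fmt"
    "os"
)

// putter stands in for the block adapter's Put; the signature is assumed.
type putter interface {
    Put(ctx context.Context, key string, f *os.File) error
}

// store uploads a local file and always releases the file handle, illustrating
// the "close the file after adapter.Put()" fix; this is not the pyramid code.
func store(ctx context.Context, adapter putter, key, localPath string) error {
    f, err := os.Open(localPath)
    if err != nil {
        return fmt.Errorf("open %s: %w", localPath, err)
    }
    defer f.Close() // runs on every return path, including Put errors

    if err := adapter.Put(ctx, key, f); err != nil {
        return fmt.Errorf("put %s: %w", key, err)
    }
    return nil
}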