
Conversation

Contributor

@nopcoder nopcoder commented Dec 13, 2025

Summary

This PR fixes and improves the pyramid package, which handles the local caching layer for range and meta-range files.

Key Changes

Resource Management:

  • Add Close() method to FS interface and TierFS implementation to properly clean up resources
  • Add Close() method to Eviction interface and ristretto eviction implementation
  • Properly close rangeFS and metaRangeFS in catalog initialization on error and during shutdown
  • Fix file handle leak in store() by ensuring files are closed after adapter.Put()

Goroutine Leak Fix:

  • Replace unbounded goroutine spawning for directory deletion with a bounded worker pattern (a sketch follows this list)
  • Add dirDeleteWorker background goroutine with proper shutdown signaling via done channel and sync.WaitGroup
  • Add buffered channel (dirDeleteChSize=1000) to queue directory deletion requests
  • Add metric errorsTotal{type="DirDeleteQueueFull"} to track when directory deletions are skipped due to full queue
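
A minimal sketch of the bounded-worker shape described above; the type and function names are illustrative, not the actual TierFS code (the real worker calls deleteDirRecIfEmpty, and the DirDeleteQueueFull counter is incremented where enqueue returns false):

package pyramid

import (
    "os"
    "sync"
)

type dirCleaner struct {
    dirDeleteCh chan string   // buffered queue of directories to check
    done        chan struct{} // closed on shutdown
    wg          sync.WaitGroup
}

func newDirCleaner(queueSize int) *dirCleaner {
    c := &dirCleaner{
        dirDeleteCh: make(chan string, queueSize),
        done:        make(chan struct{}),
    }
    c.wg.Add(1)
    go c.worker()
    return c
}

func (c *dirCleaner) worker() {
    defer c.wg.Done()
    for {
        select {
        case <-c.done:
            return
        case dir := <-c.dirDeleteCh:
            // Stand-in for deleteDirRecIfEmpty: os.Remove succeeds only when the
            // directory is already empty; other outcomes are ignored (best effort).
            _ = os.Remove(dir)
        }
    }
}

// enqueue never blocks; when the queue is full the deletion is skipped.
func (c *dirCleaner) enqueue(dir string) bool {
    select {
    case c.dirDeleteCh <- dir:
        return true
    default:
        return false
    }
}

func (c *dirCleaner) Close() {
    close(c.done)
    c.wg.Wait()
}

Close() signals the done channel and waits for the worker, which is what lets TierFS shut down without leaking the goroutine.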

Context Cancellation Fix:

  • Use context.Background() in openWithLock when fetching from block storage inside Compute()
  • The ChanOnlyOne.Compute pattern shares results across multiple callers; using a caller's context could cause one caller's cancellation to fail the fetch for all waiting callers
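
The same idea expressed with golang.org/x/sync/singleflight, which is analogous to (but not the same helper as) ChanOnlyOne; fetchFromStore is a hypothetical stand-in for the block-adapter read:

package pyramid

import (
    "context"

    "golang.org/x/sync/singleflight"
)

var fetchGroup singleflight.Group

// fetchShared shows why the shared download runs with context.Background():
// the result is shared by every caller waiting on the same key, so one caller's
// cancellation (e.g. a timed-out merge) must not abort the fetch for the rest.
func fetchShared(key string, fetchFromStore func(context.Context, string) ([]byte, error)) ([]byte, error) {
    v, err, _ := fetchGroup.Do(key, func() (any, error) {
        // Deliberately not the caller's context.
        return fetchFromStore(context.Background(), key)
    })
    if err != nil {
        return nil, err
    }
    return v.([]byte), nil
}

The trade-off is that the download may outlive the callers that requested it; the result still lands in the local cache for future readers.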

Bug Fixes:

  • Fix missing .Inc() calls on error metric counters (errorsTotal.WithLabelValues(...).Inc())
  • Fix incorrect path handling in storeLocalFile(): it was using the relative path instead of the full path for file removal (see the sketch after this list)
  • Use filepath.Join instead of path.Join for OS-specific path handling throughout the codebase
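
For illustration, the metric and path fixes combined look roughly like this; fsName, baseDir, and relPath are placeholders rather than the actual TierFS fields, and the metric/label names are assumptions for the sketch:

package pyramid

import (
    "os"
    "path/filepath"

    "github.com/prometheus/client_golang/prometheus"
)

// errorsTotal mirrors the counter quoted in the review snippets below.
var errorsTotal = prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "pyramid_errors_total", Help: "pyramid errors by type"},
    []string{"fs_name", "type"},
)

// removeLocalFile illustrates both fixes: build the full, OS-specific path with
// filepath.Join (not path.Join), and actually call .Inc() on the counter.
func removeLocalFile(fsName, baseDir, relPath string) {
    p := filepath.Join(baseDir, relPath)
    if err := os.Remove(p); err != nil {
        errorsTotal.WithLabelValues(fsName, "FileRemoval").Inc()
    }
}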

Code Quality:

  • Simplify numCounters calculation in eviction using max() builtin
  • Replace interface{} with any type alias
  • Remove unnecessary else after return statements
  • Add documentation comments to localFileRef struct fields
  • Update mock implementations to include new Close() method

Fix https://github.com/treeverse/lakeFS-Enterprise/issues/1426
Fix #9824

@nopcoder nopcoder self-assigned this Dec 13, 2025
@github-actions github-actions bot added area/cataloger Improvements or additions to the cataloger area/testing Improvements or additions to tests labels Dec 13, 2025
Contributor

@arielshaqed arielshaqed left a comment

reviewing before it's time - but these comments might be relevant no matter what we do.

if err := os.Remove(p); err != nil {
tfs.logger.WithError(err).WithField("path", p).Error("Removing file failed")
errorsTotal.WithLabelValues(tfs.fsName, "FileRemoval")
errorsTotal.WithLabelValues(tfs.fsName, "FileRemoval").Inc()
Contributor

😱

Comment on lines +47 to +48
// dirDeleteChSize is the buffer size for the directory deletion channel
dirDeleteChSize = 1000
Contributor

I am of two minds about this: deletion can take a long time, so I'd understand a really short queue. OTOH, the longer we delay deletion, the more we risk going over the allowed disk space. 1000 directories to delete is a lot.

At the minimum, perhaps monitor the length of the queue?
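
One possible way to monitor it (a sketch, not something this PR implements) is a GaugeFunc over the channel length:

package pyramid

import "github.com/prometheus/client_golang/prometheus"

// registerQueueGauge is a hypothetical helper that exposes the current length of
// the directory-deletion queue, so a filling queue is visible before deletions
// start being skipped.
func registerQueueGauge(reg prometheus.Registerer, dirDeleteCh chan string) {
    reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
        Name: "pyramid_dir_delete_queue_length", // illustrative name
        Help: "Directories currently queued for empty-directory cleanup",
    }, func() float64 {
        return float64(len(dirDeleteCh))
    }))
}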

Contributor Author

@nopcoder nopcoder Dec 13, 2025

Currently we do not delete asynchronously; I learned that when I looked at what we actually do inside the goroutine.

The new implementation does a best-effort cleanup: a background worker checks for and removes empty folders.

Now I just wonder whether a periodic check that scans the local directory structure and cleans up empty directories as needed would be a better option (a rough sketch follows below).

PS: we log when we try to push work to the queue and the queue is full; I'll add metrics too for easy monitoring.
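
A rough sketch of that periodic option, assuming a cache root directory and a fixed interval (names are illustrative, not anything in this PR):

package pyramid

import (
    "os"
    "path/filepath"
    "time"
)

// periodicEmptyDirCleanup walks the cache root on a timer and removes directories
// that happen to be empty. Directories that only become empty after a child is
// removed get picked up on a later pass; everything here is best effort.
func periodicEmptyDirCleanup(root string, interval time.Duration, done <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-done:
            return
        case <-ticker.C:
            _ = filepath.Walk(root, func(p string, info os.FileInfo, err error) error {
                if err != nil || !info.IsDir() || p == root {
                    return nil // best effort: skip files, the root, and unreadable entries
                }
                // os.Remove succeeds only when the directory is already empty.
                _ = os.Remove(p)
                return nil
            })
        }
    }
}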

tfs.logger.WithError(err).Error("Failed deleting empty dir")
errorsTotal.WithLabelValues(tfs.fsName, "DirRemoval")
// Queue directory deletion (non-blocking)
dirPath := filepath.Dir(string(rPath))
Contributor

I don't understand why we would ever want to delete an entire directory tree at a time. It is not faster to delete multiple files than it is to delete a single file. AFAICT the worker could delete each file separately; it would be no slower.

Contributor Author

I didn't modify the functionality. The link is to the current code that runs in the goroutine; it calls deleteDirRecIfEmpty, which only ensures that the directory is removed if it is empty.

go func() {
    if err := tfs.syncDir.deleteDirRecIfEmpty(path.Dir(string(rPath))); err != nil {
        tfs.logger.WithError(err).Error("Failed deleting empty dir")
        errorsTotal.WithLabelValues(tfs.fsName, "DirRemoval")
    }
}()

@nopcoder nopcoder requested a review from arielshaqed December 13, 2025 13:31
@nopcoder nopcoder changed the title from "Fix and improve the pyramid package" to "pyramid: fix file handle and goroutine leaks, add proper cleanup" Dec 13, 2025
@nopcoder nopcoder added the include-changelog PR description should be included in next release changelog label Dec 14, 2025
…ing (#9789)

* improve pyramid delete file tracking

* apply code review changes
@nopcoder nopcoder requested a review from a team December 16, 2025 06:36
@nopcoder
Contributor Author

Merged the approved smaller PR into this one: d3558b8

@AliRamberg
Contributor

How does this PR solve the mergeIntoBranch errors? This seems to solve other issues we have in pyramid.

@nopcoder
Contributor Author

How does this PR solve the mergeIntoBranch errors? This seems to solve other issues we have in pyramid.

The final PR solves multiple issues. The error in mergeIntoBranch occurs when we request metadata during a merge: in pyramid, when we fetch data from S3 during openWithLock, we make sure that only one caller (the first) performs the download. When that caller's request is cancelled (in our case, timed out), the download stops and a cancellation error is returned to all waiting callers.
This can occur a lot when there are multiple merges into the same branch at the same time.
In order not to fail the "single-flight" request, this PR updates the fetch to use the background context.

Contributor

@AliRamberg AliRamberg left a comment

I'm pleased, and it does look very good to me :content:

defer tfs.wg.Done()
for {
    select {
    case <-tfs.done:
Contributor

Is this intentional? Doesn't it mean we could Close TierFS while dirDeleteCh is still not empty?

Contributor Author

Yes. When we shut down (close the tier FS) we stop managing the cache, and when we start we read/track everything back.

We keep the metrics to monitor issues related to how we clean up empty directories.

@nopcoder nopcoder merged commit 5c93f72 into master Dec 16, 2025
67 of 70 checks passed
@nopcoder nopcoder deleted the fix/pyramid branch December 16, 2025 20:23