Skip to content

Fix remote git tree reuse after prune#13141

Open
sipsma wants to merge 5 commits into
dagger:mainfrom
sipsma:fix-git-prune
Open

Fix remote git tree reuse after prune#13141
sipsma wants to merge 5 commits into
dagger:mainfrom
sipsma:fix-git-prune

Conversation

@sipsma
Copy link
Copy Markdown
Contributor

@sipsma sipsma commented May 13, 2026

Summary

  • Add an integration repro for a remote git tree failing after explicit local cache prune.
  • Remove the separate git-snapshot metadata fast path from core RemoteGitRef.Tree.
  • Rely on the persistable dagql GitRef.tree result as the persistent cache authority for remote git trees.

Details

@vito found and reported a failure where a remote git tree could be fetched successfully, pruned, and then fail on the next fetch with a raw snapshot key not found error.

The repro starts an isolated nested engine, fetches a pinned github.com/dagger/dagger commit in one client session, closes that session, explicitly prunes from another session, then fetches the same git tree again from a fresh session. Before the fix, the second fetch reproduced the stale snapshot lookup failure.

The cause was a redundant snapshot metadata cache path in core remote git tree checkout logic. That path was carried over conceptually from the previous system: it indexed checkout snapshots by resolved git inputs, then reopened matching snapshots directly. That index was not a dagql retention edge. When dagql pruning removed the persisted GitRef.tree result, it correctly released the Directory snapshot lease and the snapshot could be deleted, but the metadata index could still point at the deleted snapshot.

The fix removes that fast path. GitRef.tree is already a persistable dagql field, so if dagql retains the result, the resolver is not run. If dagql has pruned the result, the resolver now performs a fresh checkout instead of consulting stale out-of-band snapshot metadata.

Tests

  • go test ./core -run '^$'
  • go test ./core/integration -run TestEngine -count=0
  • dagger --progress=plain call engine-dev test --pkg ./core/integration --run='TestEngine/TestLocalCachePruneRemoteGitSnapshot' --timeout='30m'
  • dagger --progress=plain call engine-dev test --pkg ./core/integration --run='TestGit/TestCaching' --timeout='30m'

@sipsma sipsma added this to the v0.21.0 milestone May 13, 2026
sipsma added 4 commits May 13, 2026 16:04
The Discord investigation pointed to an engine state where dagql pruning removes the Directory snapshot produced by GitRef.tree while git snapshot metadata still maps the remote git cache key to that removed snapshot. A later remote git checkout can then hit the stale metadata entry and fail with a raw snapshot "not found" error instead of treating the entry as a cache miss and fetching again.

Add an EngineSuite integration repro in localcache_test.go. The test starts an isolated nested engine with automatic GC disabled, fetches a pinned github.com/dagger/dagger commit in one nested client session, closes that session so its persisted result can be pruned, explicitly prunes from a second session, and fetches the same git tree again from a third session.

The test intentionally uses a real remote git repository instead of ExperimentalServiceHost so it exercises the normal Query.git and GitRef.tree path. On the current code it fails after prune with the removed snapshot key, matching the production logs from the investigation.

Verified by running:

dagger --progress=plain call engine-dev test --pkg ./core/integration --run='TestEngine/TestLocalCachePruneRemoteGitSnapshot' --timeout='30m'

That command currently fails with the expected stale snapshot lookup.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
RemoteGitRef.Tree kept a separate snapshot metadata cache keyed by the resolved ref, remote URL, checkout depth, tag inclusion, and discardGitDir option. That path searched git-snapshot metadata before doing a checkout and reopened the indexed snapshot directly by snapshot ID. This was a conceptual carryover from the previous implementation model, separate from dagql's own result retention and pruning semantics.

That metadata index was not a dagql retention edge. When dagql pruning removed a persisted GitRef.tree result, it correctly released the Directory snapshot lease and containerd could delete the snapshot, but the git-snapshot metadata entry could remain visible. A later GitRef.tree evaluation then found the stale metadata entry and failed with the removed snapshot key instead of treating the pruned result as a cache miss.

Remove the core remote-git git-snapshot fast path entirely. GitRef.tree is already a persistable dagql field, so dagql is now the only persistent cache authority for this core API path. If the dagql result is still retained, the resolver is not run. If dagql has pruned it, the resolver performs a fresh checkout and creates a new Directory snapshot.

The remote git mirror lock remains, but it now uses a local lock prefix instead of the deleted cacheref helper constants. The separate legacy git source code still has its own git snapshot metadata path; this change is intentionally scoped to the core remote GitRef.tree resolver.

Verified with:

go test ./core -run '^$'
go test ./core/integration -run TestEngine -count=0
dagger --progress=plain call engine-dev test --pkg ./core/integration --run='TestEngine/TestLocalCachePruneRemoteGitSnapshot' --timeout='30m'
dagger --progress=plain call engine-dev test --pkg ./core/integration --run='TestGit/TestCaching' --timeout='30m'

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Remote git tree results can be materialized concurrently by different clients before dagql has a completed equivalent tree result to reuse. Those independent checkouts have identical Git content, but worktree files get checkout-time mtimes. Snapshot diff checks metadata before falling back to content comparison, so equivalent checkouts can report the whole worktree as modified.

Normalize every checked-out path to the Unix epoch before committing the checkout snapshot. Use UtimesNanoAt with AT_SYMLINK_NOFOLLOW so symlink entries are handled directly instead of following their targets. This keeps the change local to checkout materialization and leaves dagql/egraph identity unchanged.

This is intentionally only the worktree timestamp normalization step. With keepGitDir=true, .git metadata can still differ between independent checkouts and needs separate normalization.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Remote git tree checkouts can still differ after normalizing worktree mtimes when keepGitDir=true. The worktree file contents match, but Git's index records checkout-local stat metadata for tracked paths. Independent checkouts of the same ref can therefore produce different .git/index bytes, which shows up as a .git diff even though the repository content is equivalent.

Run `git read-tree HEAD` after checkout and submodule update when the .git directory is retained. This rewrites the top-level index from the HEAD tree without updating worktree files, because we intentionally do not pass `-u`. The logical index still represents HEAD, but the checkout-local stat cache entries are cleared so independently materialized checkouts converge.

This pairs with checkout mtime normalization: read-tree makes the index content deterministic, then the timestamp normalization pass makes the index file metadata deterministic along with the rest of the checkout. The result keeps the fix local to git checkout materialization and avoids reintroducing the separate snapshot metadata cache that was redundant with dagql.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
TestDirectory/TestDiff/equivalent asserts that a Git tree and the same tree copied into an empty Directory compare as equivalent. The test is about source content, but the default Git tree can include the .git directory.

.git is checkout metadata rather than source content, and it can vary across cached checkouts even when the repository contents are identical. Depending on that metadata makes the equivalence assertion sensitive to cache state instead of the directory behavior the regression test is meant to cover.

Request the Git tree with discardGitDir enabled and leave a short comment at the call site. The test still exercises the original directory diff regression, but it no longer treats unstable checkout metadata as part of the expected content comparison.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant