Keno commented Aug 4, 2025

We have an intermittent Revise test failure in Julia base CI, e.g.
https://buildkite.com/organizations/julialang/pipelines/julia-master/builds/49637/jobs/0198728e-c6d1-4db8-8dd8-84eb87f72c6d/log
```
7-element Vector{Tuple{String, String}}:
 ("/cache/build/builder-amdci5-3/julialang/julia-master/base/./build_h.jl", "/cache/build/builder-amdci5-3/julialang/julia-master/base/build_h.jl")
 ("/cache/build/builder-amdci5-3/julialang/julia-master/base/./version_git.jl", "/cache/build/builder-amdci5-3/julialang/julia-master/base/version_git.jl")
 ("/cache/build/builder-amdci5-3/julialang/julia-master/base/./pcre_h.jl", "/cache/build/builder-amdci5-3/julialang/julia-master/base/pcre_h.jl")
 ("/cache/build/builder-amdci5-3/julialang/julia-master/base/./errno_h.jl", "/cache/build/builder-amdci5-3/julialang/julia-master/base/errno_h.jl")
 ("/cache/build/builder-amdci5-3/julialang/julia-master/base/./uv_constants.jl", "/cache/build/builder-amdci5-3/julialang/julia-master/base/uv_constants.jl")
 ("/cache/build/builder-amdci5-3/julialang/julia-master/base/./file_constants.jl", "/cache/build/builder-amdci5-3/julialang/julia-master/base/file_constants.jl")
 ("/cache/build/builder-amdci5-3/julialang/julia-master/base/features_h.jl", "/cache/build/builder-amdci5-3/julialang/julia-master/base/features_h.jl")
```

The apparent cause is that the Revise test job runs on the same
node that built the original binary. Upon further investigation,
the underlying problem is clear: when setting `juliadir`, we only
check whether the original build dir exists, not whether it
actually belongs to the julia executable we're testing. The
buildbots will happily accept the next build job, causing the path
to exist again (in the same location), even though we're not
actually testing a source build. Fix this by always trying to use
the canonical build directory (which should be created by the
build system), even in a source build. If it doesn't exist (e.g.
because the build system ran incompletely), we try to fall back
to the original directory, if and only if `Sys.BINDIR` points
to where it would in an ordinary source build. Lastly, we fall
back to the ascending search procedure, though I'm a bit
skeptical about it, since it could easily find the wrong
installation. I added a warning asking users to let us know if
they encounter this frequently.
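
For readers following along, the lookup order described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `locate_juliadir`, `has_base`, and the argument names are hypothetical, and the check that a source build places its binaries under `<builddir>/usr/bin` is an assumption made for the example.

```julia
# Illustrative sketch only; `locate_juliadir` and `has_base` are hypothetical names.
has_base(dir) = isdir(joinpath(dir, "base"))

function locate_juliadir(bindir::AbstractString, datarootdir::AbstractString,
                         builddir::AbstractString)
    # 1. Canonical location, expected to be set up by the build system.
    canonical = normpath(joinpath(bindir, datarootdir, "julia"))
    has_base(canonical) && return canonical

    # 2. Original build directory, trusted only if `bindir` is where a source
    #    build of that directory would put it (assumed here: <builddir>/usr/bin).
    expected_bindir = normpath(joinpath(builddir, "usr", "bin"))
    if normpath(bindir) == expected_bindir && has_base(builddir)
        return normpath(builddir)
    end

    # 3. Last resort: walk upward from `bindir`; this may find an unrelated tree,
    #    so the caller is warned.
    dir = normpath(bindir)
    while true
        if has_base(dir)
            @warn "Found Julia sources by ascending search; please report this configuration" dir
            return dir
        end
        parent = dirname(dir)
        parent == dir && return nothing  # reached the top without finding `base`
        dir = parent
    end
end
```

A caller would presumably pass something like `Sys.BINDIR` and `Base.DATAROOTDIR` for the first two arguments. The point of step 2 is that a merely existing build directory is not enough: it also has to match the location of the running executable.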

Keno requested a review from timholy August 5, 2025 07:19

timholy commented Aug 5, 2025

Specifically for the julia buildbots, I suspect the safest option would be to signal "intent" (circumstances) with an ENV flag or something that Revise can check to see where it should expect to find things, and just refuse to proceed if things don't work out as expected.
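
As a rough sketch of what such a flag could look like: the variable name `REVISE_JULIADIR_STRICT` and the helper `checked_juliadir` are invented here for illustration and do not exist in Revise or Julia.

```julia
# Hypothetical: "REVISE_JULIADIR_STRICT" is an invented variable name for illustration.
function checked_juliadir(candidate::AbstractString; env=ENV)
    strict = get(env, "REVISE_JULIADIR_STRICT", "0") == "1"
    # Accept the candidate only if it actually contains a `base` directory.
    isdir(joinpath(candidate, "base")) && return candidate
    # In strict mode, refuse to proceed instead of guessing.
    strict && error("expected Julia sources under $candidate, but no `base` directory was found")
    return nothing  # non-strict callers may try other strategies
end
```

Passing `env` as a keyword argument is only there so the behavior can be exercised without touching the real process environment.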

For the fallback behavior on user systems, it's possible that a lot of this code predates the path cleanups that have occurred in more recent years. (The last time it was touched was #698 but even that largely refactored older code.) I haven't looked into this in quite a while, but if the current Julia lower bound of 1.10 allows us to be safer about how we find the cache of source text, then I'd be all in favor. We have to make it work on any supported version of Julia, but here too the best approach is to eliminate guessing by Revise and have Julia document it correctly.

Keno commented Aug 5, 2025

I don't really like buildbot-only behavior. There is a correct answer here, which is to use the DATAROOTDIR. We symlink it on source builds, so it'll have the correct behavior. We already don't have any kind of fallback for the cache file. The only possible situation here where it makes sense to try a different path is if there was some incomplete build that failed to set up the symlinks. I'd actually be ok with that failing with an error, but since it used to work, might as well keep it. If someone has a weird config out there that needs a fallback, the warning will tell them to let us know. I can't think of any, but the universe of configurations is large, so maybe there is one. If so, we can evaluate what to do (possibly the fix needs to be in base). If nobody complains in the next year or so, we can remove the warning and the fallback.
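
To see whether a given installation resolves through such a symlink, something along these lines could be used. It is only a sketch: `describe_base` is a hypothetical helper, and the assumption that the symlink sits at `<juliadir>/base` is made for the example rather than taken from the build system.

```julia
# Illustrative only: the directory layout checked here is an assumption.
function describe_base(juliadir::AbstractString)
    base = joinpath(juliadir, "base")
    # `isdir` follows symlinks, so this is true for a symlinked source tree too.
    isdir(base) || return (found = false, symlink = false, target = nothing)
    # `realpath` resolves the symlink to the directory it ultimately points at.
    return (found = true, symlink = islink(base), target = realpath(base))
end
```

An incomplete build of the kind mentioned above would show up here as `found == false`.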

Keno merged commit 95ce024 into master Aug 5, 2025
7 of 13 checks passed
aviatesk deleted the kf/intermittentcifix branch August 5, 2025 16:40