Unreasonably poor performance with MSVC and Ninja #1420
-
The 5-15% figure comes from builds on Linux. On Windows the overhead is indeed higher and you shouldn't expect similar performance. I think it's slower mainly because spawning new processes is slower, but partly also because file systems tend to be slower.
No, there is a big difference: on a cache hit, ccache can find a match just by reading files (the direct mode), but on a cache miss it runs the preprocessor once in addition to executing the compiler. As mentioned above, this extra preprocessor call is more costly on Windows in relative terms. However, since version 4.7 the depend mode is available for Windows as well. I suggest trying it out; see the sketch below.
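A minimal sketch of turning depend mode on (assuming ccache ≥ 4.7 is on PATH; I believe ccache derives the include list from MSVC via /showIncludes in this mode, but check the docs):

```powershell
# Persistently enable depend mode so cache misses skip the extra
# preprocessor invocation; depend mode relies on direct mode, which
# is enabled by default:
ccache --set-config depend_mode=true

# Or enable it for the current session only via the environment:
$env:CCACHE_DEPEND = "1"
```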
-
Now I checked the ccache.exe binary and it looks like it's built using MinGW :/. Can you try to compile ccache with MSVC 2022 as a Release build and do the same tests? Compiling ccache is super easy, e.g.:

```powershell
cmake.exe `
    -S O:\Code\c\ccache\ccache `
    -B O:\Code\c\ccache\ccache-builds-cmake\release `
    -G Ninja `
    -D CMAKE_BUILD_TYPE:STRING=Release `
    -D CMAKE_INSTALL_PREFIX:PATH='O:/Code/c/ccache/_install/release'
cmake --build O:\Code\c\ccache\ccache-builds-cmake\release --target install
```

I would be really curious about the perf. difference between MinGW and MSVC.
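To compare the MinGW and MSVC binaries, one approach (just a sketch; the paths and build directory names are illustrative) is to point CMAKE_CXX_COMPILER_LAUNCHER at whichever ccache.exe you want to test and time identical clean builds:

```powershell
# Configure a test project against the MSVC-built ccache:
cmake -S . -B build-test -G Ninja `
    -D CMAKE_BUILD_TYPE:STRING=Release `
    -D CMAKE_CXX_COMPILER_LAUNCHER:FILEPATH='O:/Code/c/ccache/_install/release/bin/ccache.exe'

# Time a clean build; repeat with the MinGW ccache.exe instead to compare:
Measure-Command { cmake --build build-test }
```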
-
I'm gonna spam this thread a little bit, I'm sorry in advance 🙃 I have very interesting results. I tested it too, on my project with ~113 TUs, and these TUs compile pretty fast, which is very important (the quicker the TUs compile, the more accurate the comparisons are between OSes and different ccache configurations). The conclusion is that ccache is slower on Windows because of the […]. The filesystem has 0% impact on my results; I also tested everything with Redis using […]. But what is most interesting is that MSVC is 1-4% faster than Clang 18 with the lld linker on Linux if the […].
Setup: MSVC 2022 17.9.7 on Windows (ccache folder on a 980 Pro NVMe PCIe 4); the Linux build was Clang 18 with the lld linker (ccache folder on an 850 Evo). Build system: […]. Results: restoring from ccache is ~7s slower on Windows (bottleneck) 🤔 It would be interesting to know how much spawning new processes contributes to these bottlenecks. For sure, spawning a new process is slow on Windows, and instead of just spawning cl.exe, ccache.exe is spawned first and then itself spawns cl.exe, so 2 process spawns instead of 1. Question of the day: is the MSVC preprocessor really that slow? (i.e. does it add 20% to the build?)
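One way to get a rough answer (a sketch; the source file path is illustrative, and real numbers depend on your include graph) is to time the preprocessor alone against a full compile of the same TU:

```powershell
# Full compile of one translation unit (writes example.obj):
Measure-Command { cl /nologo /c src\example.cpp }

# Preprocessor only (/P writes example.i and stops); this approximates the
# extra work ccache does on a cache miss when depend mode is off:
Measure-Command { cl /nologo /P src\example.cpp }
```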
-
I tried to compile openmw with a PCH Debug configuration. This is how I configured it:

```powershell
cmake .. -DMYGUI_RENDERSYSTEM=1 -DMYGUI_BUILD_DEMOS=OFF -DMYGUI_BUILD_TOOLS=OFF `
    -DMYGUI_BUILD_PLUGINS=OFF -DCMAKE_INSTALL_PREFIX='E:/tmp/openmw_debug_1' `
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DPRECOMPILE_HEADERS_WITH_MSVC=ON
cmake --build . --config Debug
```

Initial ccache overhead was 4%, which is amazing.
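For reference, a minimal way to reproduce that overhead measurement (a sketch, assuming ccache is on PATH and the build directory is already configured):

```powershell
# Zero the statistics, time a clean build, then inspect hits and misses:
ccache --zero-stats
Measure-Command { cmake --build . --config Debug }
ccache --show-stats
```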
-
I want to mention that I'm also using ccache in my CI pipelines. I suppose ccache is much slower on those runners, but I never examined how much slower, as I was expecting it to be slower anyway. I got to the point where compilation took so long and used up so many hours on these GA runners that I had to migrate to self-hosted runners, which are hosted on my dev machine, and it was a great move. I recommend migrating to self-hosted for your project too; the benefits are huge. You can look at my pipelines here; 75% of them are self-hosted and I can invoke all of them as needed, e.g. through the GitHub UI or using […]
-
I've got some new data. GitLab finally changed their base image from MSVC 2019 to 2022, so I had to time everything again. None of these builds actually finished within the time limit, so for comparisons, I'll use the number of translation units that every job managed to compile, which was 1030. As I'm using the 4.9.1 release, I can't try the inode cache, and it complains if precompiled headers are used (I guess we use a flag that only gained support after 4.9.1 landed), so they're off for the ccache tests. All these tests are with Ninja. I wouldn't read too much into small differences, as there's noise in the build times.
So it's fair to say that the newer VM images: […]
If I'd seen these numbers before, I'd probably have just given up instead of bringing it here as a potential bug, which is a big step in a good direction, even if nothing's changed on ccache's end. I'm still left in an annoying situation: if I can't get a build with Ninja to finish within the time limit at all (despite being able to with MSBuild, which should be slower), I'm not going to manage it with ccache's initial build overhead on top. Hopefully, if I wait for a new release so I can also use the inode cache, and do something on my end so that Ninja builds stop being mysteriously slow without ccache, this will become viable.
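For when that release lands, enabling the inode cache should be a one-line config change (a sketch; I'd double-check the option name against the release notes, though the master branch mentioned elsewhere in this thread does call it inode_cache):

```powershell
# Enable the inode cache, which memoizes file content digests so repeated
# hashing of unchanged inputs is avoided:
ccache --set-config inode_cache=true

# Or per-session (the env-var name drops the underscore, I believe):
$env:CCACHE_INODECACHE = "1"
```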
-
You need to debug where the time is spent. I did the same for my build, using:

```powershell
Measure-Command {
    ccache debug=true debug_level=1 disable=false depend_mode=false recache=true inode_cache=true `
        cl -c -nologo -Zc:wchar_t -FS -Zc:rvalueCast <...all other build options...> `
        -Fodebug\ O:\Code\c\qMedia\TinyORM\TinyORM\src\orm\basegrammar.cpp
}
```

Then I opened […].
Here we can see that […]. But debugging this on CI is even harder: you would need to ssh into the run, invoke these compile commands manually, and examine where they spend most of their time; the problem or bottleneck can be anywhere. I don't think ccache is doing something wrong. I spent ~1 week in the ccache code, and even though it's patched all around because of the different compilers (the code wasn't designed and cleanly refactored for clang-cl, msvc, ...), it still has very good perf characteristics, e.g. no regexes (which is a great perf boost), and it's written pretty well from the perf perspective. But I didn't dive deep enough into the caching and retrieving logic. Doing perf tuning would be a good idea, meaning enabling valgrind and examining the results in kcachegrind, but that's for Linux, and on Windows different code branches are of course invoked (I'm also doing this from time to time for my library); on Windows it isn't that easy. You need to think about better machines for compiling; you can't expect free gcloud instances to be performant enough to compile 1400 TUs quickly, they aren't designed for this. It would be interesting to know where the bottleneck is in your case, but I won't even dare to guess what it could be exactly. If you want, here is the latest ccache master branch compiled (it contains inode_cache); it's a Release build.
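As a follow-up to the Measure-Command run above: debug=true drops per-object debug files next to the object files, and skimming them shows where the time goes (a sketch; I'm assuming the usual <output>.ccache-log naming that debug mode uses, so adjust the paths to your -Fo directory):

```powershell
# List the debug artifacts written next to the objects:
Get-ChildItem debug\*.ccache-log

# Log lines are timestamped, so gaps between consecutive lines reveal the
# expensive steps (preprocessing, hashing, cache lookup, compiler execution):
Get-Content debug\basegrammar.obj.ccache-log | Select-Object -First 40
```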
-
OpenMW would like to use ccache for our Windows CI builds. We've been using it for other platforms for a long time with great success, and have finally got around to setting it up for MSVC jobs. However, the performance when the cache has not yet been initialised is really poor. Here's some timing data from one of GitLab's Windows runners (a slow two-core VM; I think a Google Cloud n1-standard-2): […]
When the hit rate is good, the speedup is excellent, but there has to be a successful build to populate the cache first, and because a clean build with ccache takes 2.18 times as long as one without, we can't fit it within the time limit GitLab's CI imposes on us (that's where the 919 files figure comes from: it's how far it had got before the VM was spun down). I managed to coax it through the process by building a quarter of the project at a time, but that's not a viable solution long-term, as the cache isn't shared between forks or between all branches of a fork.
My understanding is that if everything were working properly, the impact of ccache on a clean build should be around 5-15%, not the 118% I'm seeing. 5-15% would be consistent with the 15.2% cost of the 99.53%-hit run compared with a clean non-ccache run: working out the cache key for a file and copying the result out of the cache when it exists should be about as expensive as working out the cache key and then copying the result into the cache because it didn't exist.
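For concreteness, the overhead arithmetic works out like this (a quick sanity check using only the numbers quoted in this post):

```powershell
# A clean ccache build taking 2.18x as long as a plain build means
# (2.18 - 1) * 100 = 118% overhead, matching the figure above:
$cleanRatio = 2.18
"{0:N0}% overhead on a clean (all-miss) build" -f (($cleanRatio - 1) * 100)

# By the same measure, the 99.53%-hit run at 15.2% over a plain clean build
# sits just above the expected 5-15% band, which is roughly consistent.
```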
One of my jobs that just builds part of the project gave the following output for `ccache --show-stats -v`: […] which doesn't immediately explain to me why it was so much slower than expected.
So far, this has all been done using 4.9.1. I don't see any issues or pull requests discussing bad performance with MSVC/Windows that happened since that release.