CacheLockException: Failed to acquire lock #251
I've just taken a look at our CI builds and seen that our build time jumps from 7 to 23 minutes when enabling clcache and there are no hits. If the build is fully cached, it's roughly equal to not using clcache. Sadly, this is too much of a performance impact and forces us to disable clcache. If you need more data to track down these problems, I could occasionally enable clcache for testing.
Clearly, a factor-three slowdown when using clcache with a cold cache is unacceptable, but it's also a hint that something is not working as intended in your specific use case - I'd love to learn what's going on! And indeed, I agree that even the default timeout is very large (I never needed a timeout that large for my own builds).

For what it's worth, synchronising concurrent accesses to the cache is implemented by dividing the cache into 256 'sections' and then maintaining a lock for each section. The 'name' of a section is just the first two characters of the hash sum computed for the invocation. In your error message you can see the name of the lock being acquired, which also tells you which section path in the cache is being locked.
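To illustrate the section scheme described above (a sketch with made-up inputs; the real hash covers the compiler, the command line and the source file, and the actual cache layout may differ):

```python
import hashlib

# The first two hex characters of the invocation hash select one of 256
# sections; one lock per section serialises access to that section.
invocation_hash = hashlib.md5(b"cl.exe /c /O2 foo.cpp").hexdigest()
section = invocation_hash[:2]   # e.g. "3f"
print(f"section {section} -> locking <cache dir>/{section}")
```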
Okay, so the chance of not acquiring the same lock when executing exactly that many concurrent compiler calls should be fairly high, then.
Indeed, for 10 concurrent calls, there is a ~17% chance that at least two of them access the same of the 256 uniformly distributed sections. However, each section lock should only be held for a very short time (typically on the order of < 150 ms). In particular, note that no lock is held while the compiler runs.

If you find that the default timeout of 10 seconds does not help with 8 concurrent compilers, then I suspect there is something else fishy. My gut feeling is that there may be a deadlock due to a wrong order in which the locks are acquired and released (though I could not find any such place in the code after a review), or that there is some lock handle not being released properly, or maybe some operation which is performed under the lock is a lot slower for you than expected. You don't happen to have the cache directory on a network drive or some aggressive virus scanner which tracks file accesses, do you? :-)
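For reference, the ~17% figure matches the usual birthday-problem estimate for 10 callers hashing into 256 sections; a quick sanity check (not clcache code):

```python
# Probability that at least two of n concurrent invocations land in the
# same of `sections` uniformly distributed cache sections.
def collision_probability(n, sections=256):
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (sections - k) / sections
    return 1.0 - p_all_distinct

print(collision_probability(10))  # ~0.163, i.e. roughly one in six
```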
If anybody would be willing to somehow get me access to a Windows machine which can run eight or more jobs in parallel, that would be great. I always have a hard time testing this, as the only machines I have access to are dual cores. I hoped that maybe Microsoft Azure features something simple for the sake of developing or testing open source projects, but couldn't find anything.
Good remark. Our virus scanner was indeed configured rather aggressively. Disabling this behavior improves runtime when using clcache. I am currently running some benchmarks (also regarding the caching of header files) and will report back soon.
Interesting -- what virus scanner is that (and what setting did you use)? This might be something worth adding to the Caveats section of the Wiki.
Even though this is probably me being overly cautious, I would prefer not to talk about our security setup in public. :) It should generally help to discourage virus scanners from inspecting object files when they are touched by clcache.
Did reconfiguring your antivirus software resolve the cache lock exceptions?
@TiloW ping? I wonder what to do about this ticket. :-)
Can't give any new info yet. I'll have to do some more testing and will report back.
I've run some tests but didn't come up with any issues. Seems I've got some more testing to do.
Happened again; it might be related to un-setting CLCACHE_OBJECT_CACHE_TIMEOUT_MS.
Same problem here on a dual-Xeon machine with 8 (real) cores and msbuild running with /m:8, even with CLCACHE_OBJECT_CACHE_TIMEOUT_MS set to 120000. It usually happens when the cache is cold, which I guess makes sense. The virus scanner is disabled for all clcache directories. Unfortunately these are internal build machines, so I cannot give access to them. Is there some way I could patch clcache to output which other process is currently holding the lock?
Indeed, there was a branch at some point which introduced some additional logging specifically for the locking. Alas, I cannot find it anymore, and now I regret not having integrated that as a real feature. 😢 A quick patch might be to introduce calls to the existing trace output around acquiring and releasing the locks. I'll hopefully get around to building a patch which implements this idea soon, unless you beat me to it.
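Until such a patch lands, a minimal monkey-patch along these lines could show which process waits on which lock and for how long (a sketch assuming a lock class with parameterless acquire()/release(), which is how the locks are used here; the names are illustrative):

```python
import os
import time

def traced(lock_cls):
    # Wrap acquire()/release() of a lock-like class with timing output.
    original_acquire, original_release = lock_cls.acquire, lock_cls.release

    def acquire(self):
        start = time.time()
        print(f"[pid {os.getpid()}] acquiring {self!r}")
        original_acquire(self)
        print(f"[pid {os.getpid()}] acquired {self!r} after {time.time() - start:.3f}s")

    def release(self):
        original_release(self)
        print(f"[pid {os.getpid()}] released {self!r}")

    lock_cls.acquire, lock_cls.release = acquire, release
    return lock_cls
```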
@dengste @TiloW I now got around to adding a patch which adds some fine-grained debug output which hopefully helps with narrowing down locking issues. It's in a separate branch.
Thanks! I've deployed this version to our build slaves and I see lots of mutex-related debug output. I guess I'll also clear the cache regularly to provoke the error.
@frerich good job!
Will do!
I reduced the timeout again to 10 s, which I think is already plenty, and I got the error. I cleaned up the log and reduced it to the one problematic compiler run, which has the id '59': https://gist.github.com/dengste/763fb957a6806e984ac6bf556cde5629 As you can see, the thread TID=3648 goes wild and acquires locks for all manifests and objects and holds them for about 30 seconds.
@dengste Thanks! I think what's happening is that the automatic cleaning of the cache (in case the cache exceeds the defined maximum size) is taking very long (maybe too long), triggering this. TID 3548 is the main thread of process 7632. The invocation builds 18 C++ files and thus launches 18 threads to handle the individual source files concurrently. After all jobs have finished, the cache statistics are updated and, if the cache has grown beyond its configured maximum size, an automatic cleanup is performed - and that cleanup acquires the locks for all manifest and object sections. If this theory is sound, then I believe that increasing the maximum cache size (so that the automatic cleanup triggers less often) should make the error go away.

Furthermore, if this theory makes sense, then it would be good to figure out why the cleanup takes so long in your setup.
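Schematically, the suspected pattern looks like this (a reconstruction of the theory above, not the actual clcache code; all names are invented):

```python
# After all compile jobs of one invocation have finished, the cache size is
# checked and, if it exceeds the configured maximum, a cleanup runs while
# holding every section lock - so concurrent invocations waiting on any
# section can run into the lock timeout.
def maybe_clean(cache, section_locks, max_size_bytes):
    if cache.current_size() <= max_size_bytes:
        return
    for lock in section_locks:          # all 256 section locks are taken
        lock.acquire()
    try:
        # Slow on reliability-tuned I/O: lots of stat() calls and deletions.
        cache.remove_oldest_entries(target_bytes=int(0.9 * max_size_bytes))
    finally:
        for lock in section_locks:
            lock.release()
```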
Ooh, it's the cache cleanup. That makes sense - the object cache is indeed close to 1 GB; I completely forgot about that. As for why it is taking so long: first and foremost, the I/O on those servers is heavily geared towards reliability, not speed, meaning we're dealing with RAIDs that either have no write cache at all, or at least a pretty small one that is regularly flushed. I assume that cleaning up the cache involves a lot of 'stat' calls on small files in a huge directory tree, which is pretty much a worst-case scenario for NTFS and the underlying hardware. Since I have lots of HD space, I will probably just increase the cache size and maybe clean the cache once a month while the servers are idle. Let's see if the problem is gone then.
@frerich I haven't seen the error since, so this definitely fixed it for me. I still had some problems where clcache.exe would sometimes return with error code -1, but without any visible error message. I suspect that this is some problem with the exe generated by pyinstaller and the different Python versions we have installed on our build servers, so I went back to using a simple batch file that directly calls python on clcache.py, and now everything works great! Thanks a lot for your help!
@dengste Thanks for the positive feedback! I wonder how to 'fix' this issue... I'm not sure I can avoid locking the entirety of the cache for the sake of cleaning. I suppose one improvement would be to yield better diagnostics, or to use a larger timeout in case it is detected that a cache cleanup is in progress. I could also add an entry to the Caveats page. What do you think?
Not sure how to really fix this. Maybe you could just disable the timeout while the cache is being cleaned? But then I would suggest reducing the cache size on each clean by much more than just 10% - more like 50% - otherwise you'd get long hangs during compilation pretty frequently. Maybe make it configurable?
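Making the trim target configurable could be as small as reading an environment variable before the cleanup (a hypothetical knob, not an existing clcache setting):

```python
import os

# Trim the cache down to this fraction of the maximum size on each
# automatic clean; 0.9 corresponds to the current 10% trim.
def cleanup_target_bytes(max_size_bytes):
    fraction = float(os.environ.get("CLCACHE_CLEANUP_FACTOR", "0.9"))
    fraction = min(max(fraction, 0.1), 1.0)   # clamp to a sane range
    return int(max_size_bytes * fraction)
```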
Hi, I recently hit this problem on AppVeyor (only 2 threads). I think this could be solved in the following way: the pruning process should signal to the other processes that it is working (alive and progressing), and the waiters should not time out on the lock until the pruner is dead.
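One way to implement the 'pruner signals that it is alive' idea would be a heartbeat file which the pruning process keeps refreshing and which waiters consult before giving up (a sketch with an invented file name; clcache does not do this today):

```python
import os
import time

HEARTBEAT = os.path.join(os.environ.get("CLCACHE_DIR", "."), "cleaning.heartbeat")

def pruner_heartbeat():
    # Called periodically by the pruning process while it removes files.
    with open(HEARTBEAT, "w") as f:
        f.write(str(time.time()))

def pruner_looks_alive(max_age_seconds=30):
    # Waiters keep extending their lock timeout while the heartbeat is fresh.
    try:
        return time.time() - os.path.getmtime(HEARTBEAT) < max_age_seconds
    except OSError:
        return False
```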
@frerich Can we just fall back to the real compiler as a temporary workaround?
@ivaigult I'm not sure, do you have a specific patch in mind?
Does anyone have a solution or a workaround for this bug?
Maybe the cache could be cleaned by locking each object and manifest section independently. It could do a first pass collecting which files should be removed and a second pass acquiring the locks and removing the files. I'm not sure whether the first pass would need to acquire all the locks.

It would not provide the same guarantees as the current implementation, which ensures that the correct files are removed atomically with respect to other clcache processes. This is probably not too important: if the cached files change between the first and the second pass, the result is a suboptimal cleanup, but it would not corrupt the cache. This strategy could work if removing the files is what takes most of the time.
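A sketch of that two-pass idea (names invented for illustration; it assumes a helper section_lock(name) that returns a context manager for that section's lock). The first pass scans without holding any locks and only decides what to evict; the second pass takes one section lock at a time and removes the files that still exist:

```python
import os

def two_pass_clean(cache_dir, section_lock, bytes_to_free):
    # Pass 1: no locks held - collect eviction candidates, oldest first.
    candidates = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except OSError:
                continue
            candidates.append((st.st_atime, st.st_size, path))
    candidates.sort()

    # Pass 2: lock one section at a time and delete. A file that vanished in
    # the meantime only means a suboptimal cleanup, not a corrupted cache.
    freed = 0
    for _atime, size, path in candidates:
        if freed >= bytes_to_free:
            break
        section = os.path.relpath(path, cache_dir).split(os.sep)[0]
        with section_lock(section):
            try:
                os.remove(path)
                freed += size
            except OSError:
                pass
```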
Another crazy idea: when cleaning the cache, we rename the cache directory to a temporary name, perform the cleaning in that temporary location, and then merge whatever cache entries concurrent clcache invocations created while the cache was being cleaned into the cleansed repository. This might have a slightly negative impact on the cache hit rate (since during the cleaning the cache appears to be empty, so objects get recreated), but it would probably decouple the clcache instances quite nicely. I'm not sure how easy it would be to merge two caches, though.
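The rename-and-merge idea could look roughly like this (very much a sketch; it ignores how statistics are merged and what happens if the rename races with another invocation):

```python
import os
import shutil
import tempfile

def clean_by_rename(cache_dir, clean_function):
    # Move the whole cache aside, clean it offline, then fold back anything
    # concurrent invocations created in the meantime.
    work_dir = tempfile.mkdtemp(dir=os.path.dirname(cache_dir))
    old_cache = os.path.join(work_dir, "cache")
    os.rename(cache_dir, old_cache)
    os.makedirs(cache_dir, exist_ok=True)      # fresh, empty cache for others
    clean_function(old_cache)
    for root, _dirs, files in os.walk(old_cache):
        rel = os.path.relpath(root, old_cache)
        for name in files:
            dst = os.path.join(cache_dir, rel, name)
            if not os.path.exists(dst):        # keep newer concurrent entries
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(os.path.join(root, name), dst)
    shutil.rmtree(work_dir, ignore_errors=True)
```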
I am seeing this one pretty often now:
Actually I had single Nuitka builds where it happened for 8 files. Now I am getting these spuriously in my CI, and not just occasionally: my CI often runs all day, and it happens multiple times. I have seen instances where 6 calls in the same Scons compilation all failed with that error, all for different hash codes, which felt pretty remarkable.

I was reading about a daemon mode of clcache and wondered if it makes the locking problem go away. I didn't immediately employ it because of the additional setup. Do you think it would avoid the issue?
@kayhayen The
@frerich I suspect that the clcache default folder I use lives outside of what my Buildbot worker uses, and that Windows Defender is causing the timeouts. I have had similar issues with other files and attempts to unlink them. I will now set the environment variable to point inside the already ignored folder. Even if only to clean things up in terms of that user directory folder, my hope is to avoid contention with Windows Defender.
Can we have CLCACHE_DISABLE_AUTOCLEAN? We could run clcache -c before the build, and if the cache grows a bit over 100%, who cares - it is much better than a crash during the build. I will test the commits more over the following days. Also, if we had clcache -c 70 to reduce the cache size to something other than 90%, we could easily avoid cleaning during the build.
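A CLCACHE_DISABLE_AUTOCLEAN-style switch would essentially be a one-line guard around the automatic cleanup (a hypothetical variable, not an existing option), leaving eviction to an explicit clcache -c run while the machine is idle:

```python
import os

def autoclean_enabled():
    # Set CLCACHE_DISABLE_AUTOCLEAN=1 to skip the size-triggered cleanup
    # entirely and rely on an explicit `clcache -c` before the build.
    return os.environ.get("CLCACHE_DISABLE_AUTOCLEAN", "0").lower() not in ("1", "true", "yes")
```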
I have created 3 commits that implement my above ideas, feel free to use them.
CLCACHE_DIR --> %LOCALAPPDATA%\clcache (e.g. C:\Users\jshi19\AppData\Local\clcache), details: frerich#342
CLCACHE_OBJECT_CACHE_TIMEOUT_MS --> 120 s, details: frerich#251
We ran into this issue almost immediately after enabling clcache in all our CI builds instead of a select few. We have a maximum cache size of about 16 GB, with a single clean build producing about 4 GB of object files. The cache folder is reused between builds on several branches. I am not sure, but if the hypothesis about the long cache cleanup is true, then I don't understand why it didn't happen even once when I had the maximum cache size at the standard 1 GB and the total size of the object files was even larger.
Might be of interest:
Could it be that acquiring the lock fails because processing the data in an excessively large manifest JSON file takes too long?
Okay, so it seems that our issue is more related to #347. We managed to mitigate the problem by increasing the timeout even further.

I believe that clcache accumulates too many additional entries in the manifest file (corresponding to the command line) each time one of the "dependency headers" of the compiled file changes. It also seems that there is some inefficiency due to the hash being recalculated multiple times for the same headers mentioned in different entries.
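If the slowdown really comes from re-hashing the same headers for every manifest entry, memoizing the per-file hash within one invocation would be an obvious mitigation (a sketch, not the actual clcache code; the server mode mentioned below caches header hashes across invocations, if I understand it correctly):

```python
import hashlib
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_file_hash(path):
    # Hash each header at most once per invocation, no matter how many
    # manifest entries reference it.
    hasher = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(128 * 1024), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```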
EDIT: Now, this is not what happens in my case; for me the lock is for stats.txt:
The past 3-5 failures so far have always been about the stats.txt file, and the lock is being acquired at https://github.com/rasjani/clcache/blob/alt-release/clcache/__main__.py#L1867 Our build is driven by Ninja, so there's heavy parallelism. Stats are not that important - would it make sense (as a hack) to just skip the lock when updating the stats and issue a warning that the stats are out of date? Essentially, I don't need this sort of change to land in this repo; I can just add it to my own fork and wait for proper fixes if and when those might come.
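The 'don't block on the stats lock' hack could look like this (a sketch against a generic lock/stats API rather than the exact clcache classes): try the lock and, on a timeout, log a warning and skip the update instead of raising CacheLockException:

```python
def update_stats_best_effort(stats_lock, stats, update_function):
    # Statistics are nice to have but not worth failing a build over: if the
    # lock cannot be taken in time, warn and carry on with stale stats.
    try:
        stats_lock.acquire()   # assumed to raise on timeout, like CacheLock
    except Exception as exc:
        print(f"clcache: warning: skipping stats update ({exc}); stats.txt may be stale")
        return
    try:
        update_function(stats)
    finally:
        stats_lock.release()
```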
@rasjani Do you use the clcache server?
@Kojoley - I've been running with the server enabled for a few weeks and am still seeing random CacheLockExceptions. However, the last 3 were all about updating stats.txt.
Here's some general info about our system:
We are using this machine for CI.
Build duration is usually 5 to 8 minutes and mostly spent on compilation.
Every build triggers up to 4 compiler calls in parallel and there can be two parallel builds (up to 8 concurrent compiler calls).
After trying out clcache for a few days, we have disabled it again because of cache errors that still occur after #244.
Here's the error message we are receiving every now and then:
As suggested in #229, I've upped CLCACHE_OBJECT_CACHE_TIMEOUT_MS to 120,000 ms, which seems to make our builds succeed. I've noticed that the error only seems to occur when there are cache misses. Our cache is pretty much filled.
I've re-enabled clcache in our CI now and will report if any errors arise with CLCACHE_OBJECT_CACHE_TIMEOUT_MS=120000. I don't know exactly how your locking mechanism works, but even the default timeout of 10 seconds seems extremely high to me. You stated that "it might be the (unlikely) case that multiple clcache invocations are accessing the same manifest section". How is this section determined?