-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix double-execution of repo-scanning functions #92
Conversation
thunk.onComplete { attempt => | ||
val fut = thunk // evaluate thunk, evaluate only once! | ||
fut.onComplete { attempt => | ||
val elapsedMs = System.currentTimeMillis() - start | ||
log(s"'$desc' $elapsedMs ms : success=${attempt.isSuccess}") | ||
} | ||
thunk | ||
fut |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the key part of the fix - we just evaluate thunk
once, rather than twice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
import scala.concurrent.{Await, ExecutionContext, Future} | ||
|
||
class DogpileSpec extends AnyFlatSpec with Matchers { | ||
it should "not concurrently execute the side-effecting function" in { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this test while trying to debug this issue. It never reproduced the problem - because the problem wasn't in Dogpile
- and due to the unpredictable nature of concurrency problems, the test wouldn't have been guaranteed to spot one if there was one - but I guess it's a reasonable statement of intent.
#90 introduced additional logging, providing a `logAround()` method that timed the execution of `Future`s. However, it contained a bug! The `thunk` was passed as a 'by-name' parameter (see https://docs.scala-lang.org/tour/by-name-parameters.html) so that it wouldn't start executing until we were ready to start timing it, which is reasonable. 'by-name' parameters are evaluated *every* time they are used though, and the `logAround()` method evaluated it _twice_. So the thunk was executed twice, concurrently, when it was just supposed to be executed once. You can see in the logs below that the 3 pieces of code timed with `logAround()` were executed twice: ``` Jan 31 15:50:48 prout-bot app/web.1 [info] controllers.Api - githubHook repo=guardian/frontend githubDeliveryGuid=Some(0789db88-a17f-11ed-9d54-6d035aa39ad1) xRequestId=Some(78c70f90-622e-4bb0-8d53-c70b0727b4b0) Jan 31 15:50:51 prout-bot app/web.1 [info] lib.RepoUtil - Updating Git repo with fetch... https://github.com/guardian/frontend.git Jan 31 15:50:51 prout-bot app/web.1 [info] lib.RepoUtil - Updating Git repo with fetch... https://github.com/guardian/frontend.git Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Git Repo ref count: Success(393) Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch repo hooks' 196 ms : success=true Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Git Repo ref count: Success(393) Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch git repo' 341 ms : success=true Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - PRs merged to master size=25 Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Merged Pull Requests fetched: Success(List(25871, 25869, 25868, 25865, 25862, 25861, 25860, 25859, 25857, 25856, 25851, 25850, 25849, 25848, 25846, 25845, 25844, 25842, 25841, 25838, 25837, 25836, 25834, 25792, 25749)) Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch PRs' 6160 ms : success=true Jan 31 15:50:58 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - PRs merged to master size=25 Jan 31 15:50:58 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Merged Pull Requests fetched: Success(List(25871, 25869, 25868, 25865, 25862, 25861, 25860, 25859, 25857, 25856, 25851, 25850, 25849, 25848, 25846, 25845, 25844, 25842, 25841, 25838, 25837, 25836, 25834, 25792, 25749)) Jan 31 15:50:58 prout-bot app/web.1 [info] l.RepoLevelDetails - Need to look at guardian/frontend, branch:main commit AnyObjectId[2bddf3a5f95129cf745eb7843b71ce9f8782eeca] ``` Fetching repo PRs and hooks through GitHub API calls can be duplicated without much issue (apart from perhaps doubling API quota consumed), but fetching the git repo itself (cloning/fetching) happens on a fixed folder on the filesystem, and having simultaneous threads trying to write to that folder would often lead to exceptions, trying to lock those files - here are two examples: ``` Jan 30 12:00:01 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/members-data-api - Git Repo ref count: Failure(org.eclipse.jgit.api.errors.TransportException: lock error: /tmp/bot/working-dir/guardian/members-data-api/repo.git/shallow) ``` ``` Jan 30 12:44:10 prout-bot app/web.1 Caused by: org.eclipse.jgit.errors.LockFailedException: Cannot lock /tmp/bot/working-dir/guardian/prout/repo.git/config. Ensure that no other process has an open file handle on the lock file /tmp/bot/working-dir/guardian/prout/repo.git/config.lock, then you may delete the lock file and retry. Jan 30 12:44:10 prout-bot app/web.1 at org.eclipse.jgit.storage.file.FileBasedConfig.save(FileBasedConfig.java:185) Jan 30 12:44:10 prout-bot app/web.1 at org.eclipse.jgit.api.CloneCommand.fetch(CloneCommand.java:303) Jan 30 12:44:10 prout-bot app/web.1 at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:191) Jan 30 12:44:10 prout-bot app/web.1 at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:1) Jan 30 12:44:10 prout-bot app/web.1 at lib.RepoUtil$.invoke$1(RepoUtil.scala:39) Jan 30 12:44:10 prout-bot app/web.1 at lib.RepoUtil$.getUpToDateRepo$1(RepoUtil.scala:61) Jan 30 12:44:10 prout-bot app/web.1 at lib.RepoUtil$.getGitRepo(RepoUtil.scala:68) Jan 30 12:44:10 prout-bot app/web.1 at lib.RepoSnapshot$Factory.$anonfun$fetchLatestCopyOfGitRepo$1(RepoSnapshot.scala:121) ``` Sentry does a reasonable job of showing that these errors only started with PR #90 (looking at the 'First Seen' of 'Jan 26, 5:57 PM'): https://sentry.io/organizations/the-guardian/issues/3899449647/?project=49913&query=is%3Aunresolved&referrer=issue-stream
183b128
to
9dc0980
Compare
@@ -46,7 +46,8 @@ class Dogpile[R](thing: => Future[R]) { | |||
* | |||
* @return a future for a run which has been initiated at or after this call | |||
*/ | |||
def doAtLeastOneMore(): Future[R] = stateRef.updateAndGet { previousState => | |||
def doAtLeastOneMore(): Future[R] = stateRef.updateAndGet { // TODO updateAndGet shouldn't handle side-effects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is concerning to me, but doesn't seem to be currently causing severe problems: the docs for AtomicReference. updateAndGet()
say:
The function should be side-effect-free, since it may be re-applied when attempted updates fail due to contention among threads.
I only realised this while trying to hunt down the cause for double-executions just now, and thing is that the function we're using here is not side-effect-free - it's very very side-effecty, labelling repos, creating GitHub comments, etc.
This code should be updated in another PR to be truly safe, and run the side-effecting code with the desired semantics, which are closer to 'throttle-last'.
val allF = Future.traverse(1 to numExecutions)(_ => dogpile.doAtLeastOneMore()) | ||
Await.ready(allF, 15.seconds) | ||
|
||
executionCount.intValue() should be <= numExecutions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, in this test, what I would hope is that the execution count is more like 2, rather than 20, but due to the behaviour of AtomicReference.updateAndGet()
we're currently getting the full 20.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great fix and write-up! Thanks! I learned a lot
Thanks! Ok, merging now... |
Due to flakey Snyk (see guardian/.github#43 !) this didn't get deployed after merge: Manually deploying now... |
#90 introduced additional logging, providing a
logAround()
method that timed the execution ofFuture
s:prout/app/lib/RepoSnapshot.scala
Lines 60 to 67 in 8ff1519
However, it contained a bug! The
thunk
(the repo-scanning function) was passed as a 'by-name' parameter, with the intention that it wouldn't start executing until we were ready to start timing it, which is good. But 'by-name' parameters are evaluated every time they are used, and thelogAround()
method evaluated it twice (on line 62 and 66). So the thunk was executed twice, concurrently, when it was just supposed to be executed once.There were 3 repo-scanning tasks affected by this:
prout/app/lib/RepoSnapshot.scala
Lines 78 to 80 in 8ff1519
You can see in the logs below that the 3 pieces of code timed with
logAround()
were executed twice:Of those 3 tasks, fetching repo PRs and hooks through GitHub API calls can be duplicated without much issue (apart from perhaps doubling API quota consumed), but fetching the git repo itself (cloning/fetching) happens on a fixed folder on the filesystem, and having simultaneous threads trying to write to that folder would often lead to exceptions, trying to lock those files - here are two examples of errors:
Sentry does a reasonable job of showing that these errors only started with PR #90 (looking at the 'First Seen' of 'Jan 26, 5:57 PM').