Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double-execution of repo-scanning functions #92

Merged
merged 1 commit into from
Feb 1, 2023

Conversation

rtyley
Copy link
Member

@rtyley rtyley commented Feb 1, 2023

#90 introduced additional logging, providing a logAround() method that timed the execution of Futures:

def logAround[T](desc: String)(thunk: => Future[T])(implicit repo: Repo): Future[T] = {
val start = System.currentTimeMillis()
thunk.onComplete { attempt =>
val elapsedMs = System.currentTimeMillis() - start
log(s"'$desc' $elapsedMs ms : success=${attempt.isSuccess}")
}
thunk
}

However, it contained a bug! The thunk (the repo-scanning function) was passed as a 'by-name' parameter, with the intention that it wouldn't start executing until we were ready to start timing it, which is good. But 'by-name' parameters are evaluated every time they are used, and the logAround() method evaluated it twice (on line 62 and 66). So the thunk was executed twice, concurrently, when it was just supposed to be executed once.

There were 3 repo-scanning tasks affected by this:

val mergedPullRequestsF = logAround("fetch PRs")(fetchMergedPullRequests())
val hooksF = logAround("fetch repo hooks")(fetchRepoHooks())
val gitRepoF = logAround("fetch git repo")(fetchLatestCopyOfGitRepo())

You can see in the logs below that the 3 pieces of code timed with logAround() were executed twice:

Jan 31 15:50:48 prout-bot app/web.1 [info] controllers.Api - githubHook repo=guardian/frontend githubDeliveryGuid=Some(0789db88-a17f-11ed-9d54-6d035aa39ad1) xRequestId=Some(78c70f90-622e-4bb0-8d53-c70b0727b4b0)
Jan 31 15:50:51 prout-bot app/web.1 [info] lib.RepoUtil - Updating Git repo with fetch... https://github.com/guardian/frontend.git
Jan 31 15:50:51 prout-bot app/web.1 [info] lib.RepoUtil - Updating Git repo with fetch... https://github.com/guardian/frontend.git
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Git Repo ref count: Success(393)
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch repo hooks' 196 ms : success=true
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Git Repo ref count: Success(393)
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch git repo' 341 ms : success=true
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - PRs merged to master size=25
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Merged Pull Requests fetched: Success(List(25871, 25869, 25868, 25865, 25862, 25861, 25860, 25859, 25857, 25856, 25851, 25850, 25849, 25848, 25846, 25845, 25844, 25842, 25841, 25838, 25837, 25836, 25834, 25792, 25749))
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch PRs' 6160 ms : success=true
Jan 31 15:50:58 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - PRs merged to master size=25
Jan 31 15:50:58 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Merged Pull Requests fetched: Success(List(25871, 25869, 25868, 25865, 25862, 25861, 25860, 25859, 25857, 25856, 25851, 25850, 25849, 25848, 25846, 25845, 25844, 25842, 25841, 25838, 25837, 25836, 25834, 25792, 25749))
Jan 31 15:50:58 prout-bot app/web.1 [info] l.RepoLevelDetails - Need to look at guardian/frontend, branch:main commit AnyObjectId[2bddf3a5f95129cf745eb7843b71ce9f8782eeca]

Of those 3 tasks, fetching repo PRs and hooks through GitHub API calls can be duplicated without much issue (apart from perhaps doubling API quota consumed), but fetching the git repo itself (cloning/fetching) happens on a fixed folder on the filesystem, and having simultaneous threads trying to write to that folder would often lead to exceptions, trying to lock those files - here are two examples of errors:

Jan 30 12:00:01 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/members-data-api - Git Repo ref count: Failure(org.eclipse.jgit.api.errors.TransportException: lock error: /tmp/bot/working-dir/guardian/members-data-api/repo.git/shallow)
Jan 30 12:44:10 prout-bot app/web.1 Caused by: org.eclipse.jgit.errors.LockFailedException: Cannot lock /tmp/bot/working-dir/guardian/prout/repo.git/config. Ensure that no other process has an open file handle on the lock file /tmp/bot/working-dir/guardian/prout/repo.git/config.lock, then you may delete the lock file and retry.
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.storage.file.FileBasedConfig.save(FileBasedConfig.java:185)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.fetch(CloneCommand.java:303)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:191)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:1)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.invoke$1(RepoUtil.scala:39)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.getUpToDateRepo$1(RepoUtil.scala:61)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.getGitRepo(RepoUtil.scala:68)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoSnapshot$Factory.$anonfun$fetchLatestCopyOfGitRepo$1(RepoSnapshot.scala:121)

Sentry does a reasonable job of showing that these errors only started with PR #90 (looking at the 'First Seen' of 'Jan 26, 5:57 PM').

Comment on lines -62 to +67
thunk.onComplete { attempt =>
val fut = thunk // evaluate thunk, evaluate only once!
fut.onComplete { attempt =>
val elapsedMs = System.currentTimeMillis() - start
log(s"'$desc' $elapsedMs ms : success=${attempt.isSuccess}")
}
thunk
fut
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key part of the fix - we just evaluate thunk once, rather than twice!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

import scala.concurrent.{Await, ExecutionContext, Future}

class DogpileSpec extends AnyFlatSpec with Matchers {
it should "not concurrently execute the side-effecting function" in {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this test while trying to debug this issue. It never reproduced the problem - because the problem wasn't in Dogpile - and due to the unpredictable nature of concurrency problems, the test wouldn't have been guaranteed to spot one if there was one - but I guess it's a reasonable statement of intent.

#90 introduced additional logging,
providing a `logAround()` method that timed the execution of `Future`s.

However, it contained a bug! The `thunk` was passed as a 'by-name'
parameter (see https://docs.scala-lang.org/tour/by-name-parameters.html)
so that it wouldn't start executing until we were ready to start timing
it, which is reasonable. 'by-name' parameters are evaluated *every* time
they are used though, and the `logAround()` method evaluated it _twice_.
So the thunk was executed twice, concurrently, when it was just supposed
to be executed once.

You can see in the logs below that the 3 pieces of code timed with
`logAround()` were executed twice:

```
Jan 31 15:50:48 prout-bot app/web.1 [info] controllers.Api - githubHook repo=guardian/frontend githubDeliveryGuid=Some(0789db88-a17f-11ed-9d54-6d035aa39ad1) xRequestId=Some(78c70f90-622e-4bb0-8d53-c70b0727b4b0)
Jan 31 15:50:51 prout-bot app/web.1 [info] lib.RepoUtil - Updating Git repo with fetch... https://github.com/guardian/frontend.git
Jan 31 15:50:51 prout-bot app/web.1 [info] lib.RepoUtil - Updating Git repo with fetch... https://github.com/guardian/frontend.git
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Git Repo ref count: Success(393)
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch repo hooks' 196 ms : success=true
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Git Repo ref count: Success(393)
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch git repo' 341 ms : success=true
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - PRs merged to master size=25
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Merged Pull Requests fetched: Success(List(25871, 25869, 25868, 25865, 25862, 25861, 25860, 25859, 25857, 25856, 25851, 25850, 25849, 25848, 25846, 25845, 25844, 25842, 25841, 25838, 25837, 25836, 25834, 25792, 25749))
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch PRs' 6160 ms : success=true
Jan 31 15:50:58 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - PRs merged to master size=25
Jan 31 15:50:58 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Merged Pull Requests fetched: Success(List(25871, 25869, 25868, 25865, 25862, 25861, 25860, 25859, 25857, 25856, 25851, 25850, 25849, 25848, 25846, 25845, 25844, 25842, 25841, 25838, 25837, 25836, 25834, 25792, 25749))
Jan 31 15:50:58 prout-bot app/web.1 [info] l.RepoLevelDetails - Need to look at guardian/frontend, branch:main commit AnyObjectId[2bddf3a5f95129cf745eb7843b71ce9f8782eeca]
```

Fetching repo PRs and hooks through GitHub API calls can be duplicated
without much issue (apart from perhaps doubling API quota consumed), but
fetching the git repo itself (cloning/fetching) happens on a fixed folder
on the filesystem, and having simultaneous threads trying to write to that
folder would often lead to exceptions, trying to lock those files - here
are two examples:

```
Jan 30 12:00:01 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/members-data-api - Git Repo ref count: Failure(org.eclipse.jgit.api.errors.TransportException: lock error: /tmp/bot/working-dir/guardian/members-data-api/repo.git/shallow)
```

```
Jan 30 12:44:10 prout-bot app/web.1 Caused by: org.eclipse.jgit.errors.LockFailedException: Cannot lock /tmp/bot/working-dir/guardian/prout/repo.git/config. Ensure that no other process has an open file handle on the lock file /tmp/bot/working-dir/guardian/prout/repo.git/config.lock, then you may delete the lock file and retry.
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.storage.file.FileBasedConfig.save(FileBasedConfig.java:185)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.fetch(CloneCommand.java:303)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:191)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:1)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.invoke$1(RepoUtil.scala:39)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.getUpToDateRepo$1(RepoUtil.scala:61)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.getGitRepo(RepoUtil.scala:68)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoSnapshot$Factory.$anonfun$fetchLatestCopyOfGitRepo$1(RepoSnapshot.scala:121)
```

Sentry does a reasonable job of showing that these errors only started with PR #90 (looking
at the 'First Seen' of 'Jan 26, 5:57 PM'):

https://sentry.io/organizations/the-guardian/issues/3899449647/?project=49913&query=is%3Aunresolved&referrer=issue-stream
@rtyley rtyley force-pushed the fix-double-execution-of-scan-thunk branch from 183b128 to 9dc0980 Compare February 1, 2023 15:29
@@ -46,7 +46,8 @@ class Dogpile[R](thing: => Future[R]) {
*
* @return a future for a run which has been initiated at or after this call
*/
def doAtLeastOneMore(): Future[R] = stateRef.updateAndGet { previousState =>
def doAtLeastOneMore(): Future[R] = stateRef.updateAndGet { // TODO updateAndGet shouldn't handle side-effects
Copy link
Member Author

@rtyley rtyley Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is concerning to me, but doesn't seem to be currently causing severe problems: the docs for AtomicReference. updateAndGet() say:

The function should be side-effect-free, since it may be re-applied when attempted updates fail due to contention among threads.

I only realised this while trying to hunt down the cause for double-executions just now, and thing is that the function we're using here is not side-effect-free - it's very very side-effecty, labelling repos, creating GitHub comments, etc.

This code should be updated in another PR to be truly safe, and run the side-effecting code with the desired semantics, which are closer to 'throttle-last'.

val allF = Future.traverse(1 to numExecutions)(_ => dogpile.doAtLeastOneMore())
Await.ready(allF, 15.seconds)

executionCount.intValue() should be <= numExecutions
Copy link
Member Author

@rtyley rtyley Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, in this test, what I would hope is that the execution count is more like 2, rather than 20, but due to the behaviour of AtomicReference.updateAndGet() we're currently getting the full 20.

@rtyley rtyley requested a review from ioannakok February 1, 2023 15:40
Copy link
Contributor

@ioannakok ioannakok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great fix and write-up! Thanks! I learned a lot

@rtyley rtyley merged commit 4d42103 into main Feb 1, 2023
@rtyley
Copy link
Member Author

rtyley commented Feb 1, 2023

Thanks! Ok, merging now...

@rtyley rtyley deleted the fix-double-execution-of-scan-thunk branch February 1, 2023 18:15
@rtyley
Copy link
Member Author

rtyley commented Feb 2, 2023

Due to flakey Snyk (see guardian/.github#43 !) this didn't get deployed after merge:

image

Manually deploying now...

@prout-bot
Copy link
Contributor

Seen on PROD (merged by @rtyley 16 hours, 33 minutes and 1 second ago) Please check your changes!

Sentry Release: prout

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants