Fix double-execution of repo-scanning functions #92

rtyley · 2023-02-01T12:09:01Z

#90 introduced additional logging, providing a logAround() method that timed the execution of Futures:

Lines 60 to 67 in 8ff1519

    
    def logAround[T](desc: String)(thunk: => Future[T])(implicit repo: Repo): Future[T] = {
      val start = System.currentTimeMillis()
      thunk.onComplete { attempt =>
        val elapsedMs = System.currentTimeMillis() - start
        log(s"'$desc' $elapsedMs ms : success=${attempt.isSuccess}")
      }
      thunk
    }

However, it contained a bug! The thunk (the repo-scanning function) was passed as a 'by-name' parameter (see https://docs.scala-lang.org/tour/by-name-parameters.html), with the intention that it wouldn't start executing until we were ready to start timing it, which is good. But 'by-name' parameters are evaluated every time they are used, and the logAround() method evaluated it twice (on lines 62 and 66). So the thunk was executed twice, concurrently, when it was just supposed to be executed once.
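
To make the by-name behaviour concrete, here is a minimal sketch that is separate from prout's code. `ByNameDemo` and its method names are hypothetical; the only thing it relies on is the Scala rule that a by-name parameter is re-evaluated on every reference, whereas capturing it in a `val` evaluates it once.

```scala
object ByNameDemo {
  // Same shape as the bug: the by-name parameter is referenced twice,
  // so the expression passed in is evaluated twice.
  def referencedTwice(thunk: => Int): Int = {
    val first = thunk
    val second = thunk
    first + second
  }

  // Same shape as the fix: capture the by-name parameter in a val,
  // then only ever refer to the val.
  def capturedOnce(thunk: => Int): Int = {
    val value = thunk
    value + value
  }

  // Counts how many times the passed-in expression actually ran.
  def evaluationsWhenReferencedTwice(): Int = {
    var evaluations = 0
    referencedTwice { evaluations += 1; evaluations }
    evaluations
  }

  def evaluationsWhenCapturedOnce(): Int = {
    var evaluations = 0
    capturedOnce { evaluations += 1; evaluations }
    evaluations
  }
}
```

With a `Future`-returning thunk the effect is the same, except that each evaluation starts a separate asynchronous task, which is how the two executions ended up running concurrently.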

There were 3 repo-scanning tasks affected by this:

prout/app/lib/RepoSnapshot.scala

Lines 78 to 80 in 8ff1519

    
    val mergedPullRequestsF = logAround("fetch PRs")(fetchMergedPullRequests())
    val hooksF = logAround("fetch repo hooks")(fetchRepoHooks())
    val gitRepoF = logAround("fetch git repo")(fetchLatestCopyOfGitRepo())

You can see in the logs below that the 3 pieces of code timed with logAround() were executed twice:

Jan 31 15:50:48 prout-bot app/web.1 [info] controllers.Api - githubHook repo=guardian/frontend githubDeliveryGuid=Some(0789db88-a17f-11ed-9d54-6d035aa39ad1) xRequestId=Some(78c70f90-622e-4bb0-8d53-c70b0727b4b0)
Jan 31 15:50:51 prout-bot app/web.1 [info] lib.RepoUtil - Updating Git repo with fetch... https://github.com/guardian/frontend.git
Jan 31 15:50:51 prout-bot app/web.1 [info] lib.RepoUtil - Updating Git repo with fetch... https://github.com/guardian/frontend.git
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Git Repo ref count: Success(393)
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch repo hooks' 196 ms : success=true
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Git Repo ref count: Success(393)
Jan 31 15:50:51 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch git repo' 341 ms : success=true
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - PRs merged to master size=25
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Merged Pull Requests fetched: Success(List(25871, 25869, 25868, 25865, 25862, 25861, 25860, 25859, 25857, 25856, 25851, 25850, 25849, 25848, 25846, 25845, 25844, 25842, 25841, 25838, 25837, 25836, 25834, 25792, 25749))
Jan 31 15:50:57 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - 'fetch PRs' 6160 ms : success=true
Jan 31 15:50:58 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - PRs merged to master size=25
Jan 31 15:50:58 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/frontend - Merged Pull Requests fetched: Success(List(25871, 25869, 25868, 25865, 25862, 25861, 25860, 25859, 25857, 25856, 25851, 25850, 25849, 25848, 25846, 25845, 25844, 25842, 25841, 25838, 25837, 25836, 25834, 25792, 25749))
Jan 31 15:50:58 prout-bot app/web.1 [info] l.RepoLevelDetails - Need to look at guardian/frontend, branch:main commit AnyObjectId[2bddf3a5f95129cf745eb7843b71ce9f8782eeca]

Of those 3 tasks, fetching repo PRs and hooks through GitHub API calls can be duplicated without much issue (apart from perhaps doubling the API quota consumed). Fetching the git repo itself (cloning/fetching), however, happens in a fixed folder on the filesystem, and having simultaneous threads trying to write to that folder would often lead to exceptions when they tried to lock the same files. Here are two examples of the errors:

Jan 30 12:00:01 prout-bot app/web.1 [info] c.m.s.GitHub - guardian/members-data-api - Git Repo ref count: Failure(org.eclipse.jgit.api.errors.TransportException: lock error: /tmp/bot/working-dir/guardian/members-data-api/repo.git/shallow)

Jan 30 12:44:10 prout-bot app/web.1 Caused by: org.eclipse.jgit.errors.LockFailedException: Cannot lock /tmp/bot/working-dir/guardian/prout/repo.git/config. Ensure that no other process has an open file handle on the lock file /tmp/bot/working-dir/guardian/prout/repo.git/config.lock, then you may delete the lock file and retry.
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.storage.file.FileBasedConfig.save(FileBasedConfig.java:185)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.fetch(CloneCommand.java:303)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:191)
Jan 30 12:44:10 prout-bot app/web.1 	at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:1)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.invoke$1(RepoUtil.scala:39)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.getUpToDateRepo$1(RepoUtil.scala:61)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoUtil$.getGitRepo(RepoUtil.scala:68)
Jan 30 12:44:10 prout-bot app/web.1 	at lib.RepoSnapshot$Factory.$anonfun$fetchLatestCopyOfGitRepo$1(RepoSnapshot.scala:121)

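Sentry does a reasonable job of showing that these errors only started with PR #90 (looking at the 'First Seen' of 'Jan 26, 5:57 PM'): https://sentry.io/organizations/the-guardian/issues/3899449647/?project=49913&query=is%3Aunresolved&referrer=issue-stream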

rtyley · 2023-02-01T15:21:06Z

app/lib/RepoSnapshot.scala

-      thunk.onComplete { attempt =>
+      val fut = thunk // evaluate thunk, evaluate only once!
+      fut.onComplete { attempt =>
        val elapsedMs = System.currentTimeMillis() - start
        log(s"'$desc' $elapsedMs ms : success=${attempt.isSuccess}")
      }
-      thunk
+      fut


This is the key part of the fix - we just evaluate thunk once, rather than twice!
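
For anyone who wants to try the fixed shape in isolation, here is a self-contained sketch. It is an approximation rather than the real method: the `implicit repo: Repo` parameter is dropped, `log` is a hypothetical stand-in that just prints, and `executionsOfWrappedTask` is a made-up helper that counts how often the wrapped task runs.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object LogAroundSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for prout's logger.
  def log(message: String): Unit = println(message)

  def logAround[T](desc: String)(thunk: => Future[T]): Future[T] = {
    val start = System.currentTimeMillis()
    val fut = thunk // the only reference to the by-name parameter
    fut.onComplete { attempt =>
      val elapsedMs = System.currentTimeMillis() - start
      log(s"'$desc' $elapsedMs ms : success=${attempt.isSuccess}")
    }
    fut // return the same Future that is being timed
  }

  // Wraps a fake task in logAround and reports how many times it ran.
  def executionsOfWrappedTask(): Int = {
    val executions = new AtomicInteger(0)
    val result = logAround("fake fetch") {
      Future { executions.incrementAndGet() }
    }
    Await.result(result, 5.seconds)
    executions.get
  }
}
```

Returning `fut` rather than `thunk` matters as much as timing `fut`: with the original code the caller received a different `Future` from the one whose duration was logged.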

rtyley · 2023-02-01T15:24:28Z

test/lib/DogpileSpec.scala

+import scala.concurrent.{Await, ExecutionContext, Future}
+
+class DogpileSpec extends AnyFlatSpec with Matchers {
+  it should "not concurrently execute the side-effecting function" in {


I added this test while trying to debug this issue. It never reproduced the problem - because the problem wasn't in Dogpile - and due to the unpredictable nature of concurrency problems, the test wouldn't have been guaranteed to spot one if there was one - but I guess it's a reasonable statement of intent.


rtyley · 2023-02-01T15:34:23Z

app/lib/Dogpile.scala

@@ -46,7 +46,8 @@ class Dogpile[R](thing: => Future[R]) {
   *
   * @return a future for a run which has been initiated at or after this call
   */
-  def doAtLeastOneMore(): Future[R] = stateRef.updateAndGet { previousState =>
+  def doAtLeastOneMore(): Future[R] = stateRef.updateAndGet { // TODO updateAndGet shouldn't handle side-effects


This is concerning to me, but doesn't seem to be currently causing severe problems: the docs for AtomicReference.updateAndGet() say:

The function should be side-effect-free, since it may be re-applied when attempted updates fail due to contention among threads.

I only realised this while trying to hunt down the cause for double-executions just now, and the thing is that the function we're using here is not side-effect-free - it's very very side-effecty, labelling repos, creating GitHub comments, etc.

This code should be updated in another PR to be truly safe, and run the side-effecting code with the desired semantics, which are closer to 'throttle-last'.
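
A minimal sketch of the general concern, which does not reproduce Dogpile's logic: `UpdateAndGetSketch`, `State` and the counters are all hypothetical. It counts how often the function passed to `updateAndGet()` is applied, and compares that with a side effect performed after `updateAndGet()` has returned. Under contention the function may be applied more often than there are successful updates, while the effect performed outside runs exactly once per update.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object UpdateAndGetSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  final case class State(updates: Int)

  /** Returns (successful updates, applications of the update function,
    * side effects performed outside the update function). */
  def run(callers: Int): (Int, Int, Int) = {
    val stateRef = new AtomicReference(State(0))
    val functionApplications = new AtomicInteger(0)
    val effectsOutside = new AtomicInteger(0)

    val all = Future.traverse((1 to callers).toList) { _ =>
      Future {
        stateRef.updateAndGet { previous =>
          // Anything done here may be repeated if the compare-and-set loses a race.
          functionApplications.incrementAndGet()
          State(previous.updates + 1)
        }
        // Done after the update has succeeded, so once per caller.
        effectsOutside.incrementAndGet()
      }
    }
    Await.ready(all, 15.seconds)
    (stateRef.get.updates, functionApplications.get, effectsOutside.get)
  }
}
```

Moving the effect outside the update function only removes the risk of re-application; it does not by itself give the 'throttle-last' behaviour mentioned above.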

rtyley · 2023-02-01T15:37:18Z

test/lib/DogpileSpec.scala

+    val allF = Future.traverse(1 to numExecutions)(_ => dogpile.doAtLeastOneMore())
+    Await.ready(allF, 15.seconds)
+
+    executionCount.intValue() should be <= numExecutions


Actually, in this test, what I would hope is that the execution count is more like 2, rather than 20, but due to the behaviour of AtomicReference.updateAndGet() we're currently getting the full 20.
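
One possible way to get closer to "2 rather than 20" is sketched below. This is not prout's Dogpile and has not been tried against it: `CoalescingRunner`, its states and `CoalescingDemo` are hypothetical names. The idea is to keep state transitions pure by using `compareAndSet()` on immutable states, and to start the side-effecting run only after a transition has succeeded. Callers that arrive while a run is in flight share a single queued follow-up run.

```scala
import java.util.concurrent.atomic.{AtomicInteger, AtomicReference}
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._

// Hypothetical sketch: one way to coalesce calls, not the real Dogpile.
final class CoalescingRunner[R](thing: => Future[R])(implicit ec: ExecutionContext) {
  private sealed trait State
  private case object Idle extends State
  // A run is in flight; `queued` is the promise for the single follow-up run, if any.
  private case class Running(queued: Option[Promise[R]]) extends State

  private val stateRef = new AtomicReference[State](Idle)

  /** Returns a future for a run initiated at or after this call. */
  def doAtLeastOneMore(): Future[R] = stateRef.get match {
    case Idle =>
      if (stateRef.compareAndSet(Idle, Running(None))) {
        val promise = Promise[R]()
        start(promise) // side effect happens only after winning the transition
        promise.future
      } else doAtLeastOneMore()
    case Running(Some(queued)) =>
      queued.future // a follow-up run is already queued and has not started yet
    case current @ Running(None) =>
      val promise = Promise[R]()
      if (stateRef.compareAndSet(current, Running(Some(promise)))) promise.future
      else doAtLeastOneMore()
  }

  private def start(promise: Promise[R]): Unit =
    Future.unit.flatMap(_ => thing).onComplete { result =>
      promise.complete(result)
      finished()
    }

  // Called when a run ends: start the queued follow-up, or go back to Idle.
  private def finished(): Unit = stateRef.get match {
    case current @ Running(Some(next)) =>
      if (stateRef.compareAndSet(current, Running(None))) start(next) else finished()
    case current @ Running(None) =>
      if (!stateRef.compareAndSet(current, Idle)) finished()
    case Idle => () // not expected: finished() is only called while a run is in flight
  }
}

object CoalescingDemo {
  // The gate keeps the first run in flight until every caller has arrived.
  def executionsFor(callers: Int): Int = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    val gate = Promise[Unit]()
    val executionCount = new AtomicInteger(0)
    val runner = new CoalescingRunner[Unit]({ executionCount.incrementAndGet(); gate.future })
    val all = (1 to callers).toList.map(_ => runner.doAtLeastOneMore())
    gate.success(())
    Await.ready(Future.sequence(all), 15.seconds)
    executionCount.get
  }
}
```

In the demo the first caller triggers one run and the remaining callers all share the one follow-up run. Without the gate the count depends on timing, so a real test would probably still need an upper-bound assertion rather than an exact one.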

ioannakok

Great fix and write-up! Thanks! I learned a lot

rtyley · 2023-02-01T18:15:03Z

Thanks! Ok, merging now...

rtyley · 2023-02-02T10:41:36Z

Due to flakey Snyk (see guardian/.github#43 !) this didn't get deployed after merge:

Manually deploying now...

prout-bot · 2023-02-02T10:48:03Z

Seen on PROD (merged by @rtyley 16 hours, 33 minutes and 1 second ago) Please check your changes!

Sentry Release: prout

rtyley force-pushed the fix-double-execution-of-scan-thunk branch from 183b128 to 9dc0980 Compare February 1, 2023 15:29

rtyley requested a review from ioannakok February 1, 2023 15:40

ioannakok approved these changes Feb 1, 2023

rtyley merged commit 4d42103 into main Feb 1, 2023

rtyley deleted the fix-double-execution-of-scan-thunk branch February 1, 2023 18:15

prout-bot added the Pending-on-PROD label Feb 1, 2023

prout-bot added Seen-on-PROD and removed Pending-on-PROD labels Feb 2, 2023

ioannakok mentioned this pull request Feb 3, 2023

Fix issue with Prout taking a long time to notify for frontend guardian/frontend#25828

Closed
