[Investigate] Ada app regularly breaks due to timeout issues #173

Open
sherzinger opened this issue Sep 22, 2020 · 4 comments
Labels
bug Something isn't working

Comments

@sherzinger
Member

This is a normal log:

[info] application - RedCap dictionary inference and import for data set 'lux_park.clinical_raw' successfully finished.
[info] application - Running SQL on Ignite cache: select _val from Category where name in (?,?,?,?,?,?,?,?,?,?,?)  
SQL: select _val from Category where name in (?,?,?,?,?,?,?,?,?,?,?)  , finished in 1
[info] application - Running SQL on Ignite cache: select _val from Field   
SQL: select _val from Field   , finished in 119
[info] application - Running SQL on Ignite cache: select _val from DataSetSetting where dataSetId = ?  
SQL: select _val from DataSetSetting where dataSetId = ?  , finished in 1
Creating Elastic Search based data set repo for 'lux_park.clinical_raw'.
[info] application - Deleting the old data set...
[warn] o.e.c.RestClient - request [PUT http://10.240.6.123:9200/data-lux_park.clinical_raw] returned 1 warnings: [299 Elasticsearch-5.6.16-3a740d1 "field [include_in_all] is deprecated, as [_all] is deprecated, and will be disallowed in 6.0, use [copy_to] instead." "Tue, 22 Sep 2020 07:27:42 GMT"]
[info] application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
SQL: select _val from HtmlSnippet where binEquals(snippetId, ?)  , finished in 2
[info] application - Inheriting fields from the first visit...
[info] application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
SQL: select _val from HtmlSnippet where binEquals(snippetId, ?)  , finished in 2
[info] application - Saving 3 records...
[info] application - Saving 3 records...
[info] application - Saving 3 records...
[info] application - Saving 3 records...
[info] application - Saving 3 records...

This is what it looks like when it breaks:

2020-09-22 01:04:21,318 [INFO] from application - RedCap dictionary inference and import for data set 'lux_park.clinical_raw' successfully finished.
2020-09-22 01:04:21,331 [INFO] from application - Running SQL on Ignite cache: select _val from Category where name in (?,?,?,?,?,?,?,?,?,?,?)  
2020-09-22 01:04:21,333 [INFO] from application - Running SQL on Ignite cache: select _val from Field   
2020-09-22 01:04:21,534 [INFO] from application - Running SQL on Ignite cache: select _val from DataSetSetting where dataSetId = ?  
2020-09-22 01:04:21,558 [INFO] from application - Deleting the old data set...
2020-09-22 01:04:23,140 [WARN] from org.elasticsearch.client.RestClient - request [PUT http://10.240.6.123:9200/data-lux_park.clinical_raw] returned 1 warnings: [299 Elasticsearch-5.6.16-3a740d1 "field [include_in_all] is deprecated, as [_all] is
 deprecated, and will be disallowed in 6.0, use [copy_to] instead." "Mon, 21 Sep 2020 23:04:22 GMT"]
2020-09-22 01:05:10,757 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:05:30,067 [INFO] from application - Inheriting fields from the first visit...
2020-09-22 01:06:59,864 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:06:59,864 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:08:22,919 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:08:22,919 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:09:28,126 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:10:35,220 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:12:50,000 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:12:50,006 [WARN] from org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:726)
        at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:755)
        at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
        at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
        at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:755)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        ... 14 common frames omitted
2020-09-22 01:12:50,086 [WARN] from reactivemongo.core.actors.MongoDBSystem - [Supervisor-2/Connection-1] Node[10.240.6.122:27017: Unknown (0/20/20 available connections), latency=66383867067ns, authenticated={}] hasn't answered in time to last ping! Please check its connectivity (<time:2478729821352742>).
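
The timeout in the stack trace is the executor-to-driver heartbeat RPC; as the exception message states, it is bounded by `spark.executor.heartbeatInterval` (default 10s). A minimal sketch of relaxing it — the value below is an assumption for illustration, not taken from the issue:

```hocon
# Assumed value for illustration only.
# Spark requires the heartbeat interval to stay well below spark.network.timeout,
# otherwise the heartbeat RPC still fails first.
spark.executor.heartbeatInterval = "60s"   # default: 10s
```

Note that raising only `spark.network.timeout` does not help with this particular failure, because the heartbeat RPC has its own, shorter bound.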

@sherzinger sherzinger added bug Something isn't working CRITICAL Prio Highest priority. Drop everything else. labels Sep 22, 2020
@sherzinger sherzinger self-assigned this Sep 22, 2020
@sherzinger
Member Author

This might have been caused by these settings, which were missing after the migration:

spark.executor.memory = "8g" 
spark.network.timeout = "10000000"
spark.rdd.compress = "true"
spark.executor.extraJavaOptions = "-XX:+UseG1GC"
spark.worker.cleanup.appDataTtl = "2700"
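
A caveat about the snippet above (an observation, not stated in the issue): Spark parses unit-less timeout values as seconds, so `spark.network.timeout = "10000000"` is roughly 115 days — effectively disabling that timeout — while the 10-second heartbeat bound from the stack trace is a separate setting not present in this list. Explicit units make the intent clearer, e.g.:

```hocon
# Sketch only; pick a value appropriate for the deployment.
spark.network.timeout = "600s"   # unit-less values are interpreted as seconds
```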

@sherzinger
Member Author

The app crashed again during the IBBL import. Now trying to restart the VM automatically at set times to check whether there is a memory leak.

@sherzinger
Member Author

The issue still occurs. Maybe Mongo is the culprit?

@sherzinger
Member Author

Restarting Mongo regularly didn't help. I'm going to try increasing the channel count to 100; maybe there aren't enough channels for some reason.
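
For reference, and assuming the app connects through ReactiveMongo (the driver that emitted the `MongoDBSystem` warning in the log): the per-node channel count can be raised via the `rm.nbChannelsPerNode` connection option, for example in the connection URI. The database name below is a placeholder; the host is the Mongo node from the log:

```hocon
# Hypothetical URI; raises the channel count to 100 per node.
mongodb.uri = "mongodb://10.240.6.122:27017/mydb?rm.nbChannelsPerNode=100"
```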

@sherzinger sherzinger removed the CRITICAL Prio Highest priority. Drop everything else. label Jul 29, 2021
@sherzinger sherzinger removed their assignment Aug 2, 2021