[Investigate] Ada app regularly breaks due to timeout issues #173

Open
sherzinger opened this issue Sep 22, 2020 · 4 comments
Labels
bug Something isn't working

Comments

@sherzinger
Member

This is a normal log:

[info] application - RedCap dictionary inference and import for data set 'lux_park.clinical_raw' successfully finished.
[info] application - Running SQL on Ignite cache: select _val from Category where name in (?,?,?,?,?,?,?,?,?,?,?)  
SQL: select _val from Category where name in (?,?,?,?,?,?,?,?,?,?,?)  , finished in 1
[info] application - Running SQL on Ignite cache: select _val from Field   
SQL: select _val from Field   , finished in 119
[info] application - Running SQL on Ignite cache: select _val from DataSetSetting where dataSetId = ?  
SQL: select _val from DataSetSetting where dataSetId = ?  , finished in 1
Creating Elastic Search based data set repo for 'lux_park.clinical_raw'.
[info] application - Deleting the old data set...
[warn] o.e.c.RestClient - request [PUT http://10.240.6.123:9200/data-lux_park.clinical_raw] returned 1 warnings: [299 Elasticsearch-5.6.16-3a740d1 "field [include_in_all] is deprecated, as [_all] is deprecated, and will be disallowed in 6.0, use [copy_to] instead." "Tue, 22 Sep 2020 07:27:42 GMT"]
[info] application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
SQL: select _val from HtmlSnippet where binEquals(snippetId, ?)  , finished in 2
[info] application - Inheriting fields from the first visit...
[info] application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
SQL: select _val from HtmlSnippet where binEquals(snippetId, ?)  , finished in 2
[info] application - Saving 3 records...
[info] application - Saving 3 records...
[info] application - Saving 3 records...
[info] application - Saving 3 records...
[info] application - Saving 3 records...

This is what it looks like when it breaks:

2020-09-22 01:04:21,318 [INFO] from application - RedCap dictionary inference and import for data set 'lux_park.clinical_raw' successfully finished.
2020-09-22 01:04:21,331 [INFO] from application - Running SQL on Ignite cache: select _val from Category where name in (?,?,?,?,?,?,?,?,?,?,?)  
2020-09-22 01:04:21,333 [INFO] from application - Running SQL on Ignite cache: select _val from Field   
2020-09-22 01:04:21,534 [INFO] from application - Running SQL on Ignite cache: select _val from DataSetSetting where dataSetId = ?  
2020-09-22 01:04:21,558 [INFO] from application - Deleting the old data set...
2020-09-22 01:04:23,140 [WARN] from org.elasticsearch.client.RestClient - request [PUT http://10.240.6.123:9200/data-lux_park.clinical_raw] returned 1 warnings: [299 Elasticsearch-5.6.16-3a740d1 "field [include_in_all] is deprecated, as [_all] is
 deprecated, and will be disallowed in 6.0, use [copy_to] instead." "Mon, 21 Sep 2020 23:04:22 GMT"]
2020-09-22 01:05:10,757 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:05:30,067 [INFO] from application - Inheriting fields from the first visit...
2020-09-22 01:06:59,864 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:06:59,864 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:08:22,919 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:08:22,919 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:09:28,126 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:10:35,220 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:12:50,000 [INFO] from application - Running SQL on Ignite cache: select _val from HtmlSnippet where binEquals(snippetId, ?)  
2020-09-22 01:12:50,006 [WARN] from org.apache.spark.executor.Executor - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:726)
        at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:755)
        at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
        at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
        at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:755)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        ... 14 common frames omitted
2020-09-22 01:12:50,086 [WARN] from reactivemongo.core.actors.MongoDBSystem - [Supervisor-2/Connection-1] Node[10.240.6.122:27017: Unknown (0/20/20 available connections), latency=66383867067ns, authenticated={}] hasn't answered in time to last ping! Please check its connectivity (<time:2478729821352742>).
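
The timeout in the stack trace is the executor-to-driver heartbeat RPC; as the exception message states, it is bounded by `spark.executor.heartbeatInterval` (default 10s). A minimal sketch of relaxing it — the value below is an assumption for illustration, not taken from the issue:

```hocon
# Assumed value for illustration only.
# Spark requires the heartbeat interval to stay well below spark.network.timeout,
# otherwise the heartbeat RPC still fails first.
spark.executor.heartbeatInterval = "60s"   # default: 10s
```

Note that raising only `spark.network.timeout` does not help with this particular failure, because the heartbeat RPC has its own, shorter bound.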

@sherzinger sherzinger added bug Something isn't working CRITICAL Prio Highest priority. Drop everything else. labels Sep 22, 2020
@sherzinger sherzinger self-assigned this Sep 22, 2020
@sherzinger
Member Author

This might have been caused by these settings, which were missing after the migration:

spark.executor.memory = "8g" 
spark.network.timeout = "10000000"
spark.rdd.compress = "true"
spark.executor.extraJavaOptions = "-XX:+UseG1GC"
spark.worker.cleanup.appDataTtl = "2700"
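
A caveat about the snippet above (an observation, not stated in the issue): Spark parses unit-less timeout values as seconds, so `spark.network.timeout = "10000000"` is roughly 115 days — effectively disabling that timeout — while the 10-second heartbeat bound from the stack trace is a separate setting not present in this list. Explicit units make the intent clearer, e.g.:

```hocon
# Sketch only; pick a value appropriate for the deployment.
spark.network.timeout = "600s"   # unit-less values are interpreted as seconds
```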

@sherzinger
Member Author

The app crashed again during the IBBL import. Now trying to restart the VM automatically at set times to check whether there is a memory leak.

@sherzinger
Member Author

The issue still occurs. Maybe Mongo is the culprit?

@sherzinger
Member Author

Restarting Mongo regularly didn't help. I'm going to try increasing the channel count to 100; maybe there aren't enough channels for some reason.
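
For reference, and assuming the app connects through ReactiveMongo (the driver that emitted the `MongoDBSystem` warning in the log): the per-node channel count can be raised via the `rm.nbChannelsPerNode` connection option, for example in the connection URI. The database name below is a placeholder; the host is the Mongo node from the log:

```hocon
# Hypothetical URI; raises the channel count to 100 per node.
mongodb.uri = "mongodb://10.240.6.122:27017/mydb?rm.nbChannelsPerNode=100"
```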

@sherzinger sherzinger removed the CRITICAL Prio Highest priority. Drop everything else. label Jul 29, 2021
@sherzinger sherzinger removed their assignment Aug 2, 2021