[CELEBORN-1646] Catch exception of Files#getFileStore for DeviceMonitor and StorageManager for input/output error #2809

Open: wants to merge 1 commit into base: main

SteNicholas (Member) commented Oct 14, 2024

What changes were proposed in this pull request?

Catch the exception thrown by Files#getFileStore in DeviceMonitor and StorageManager when a disk reports an input/output error.

Why are the changes needed?

DeviceMonitor currently uses Files#getFileStore to record the DeviceOSTotalBytes and DeviceOSFreeBytes gauges, which can corrupt the metrics when a disk hits an input/output error. In the log below, both GET /metrics and the device check fail with the same FileSystemException.

2024-10-14 12:00:11,805 [WARN] [worker-JettyThreadPool-3119995] - org.apache.celeborn.server.common.http.HttpUtils -Logging.scala(76) -GET /metrics failed: java.nio.file.FileSystemException: /mnt/storage01: Input/output error
java.nio.file.FileSystemException: /mnt/storage01: Input/output error
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:1.8.0_162]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_162]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_162]
	at sun.nio.fs.UnixFileStore.devFor(UnixFileStore.java:57) ~[?:1.8.0_162]
	at sun.nio.fs.UnixFileStore.<init>(UnixFileStore.java:64) ~[?:1.8.0_162]
	at sun.nio.fs.LinuxFileStore.<init>(LinuxFileStore.java:44) ~[?:1.8.0_162]
	at sun.nio.fs.LinuxFileSystemProvider.getFileStore(LinuxFileSystemProvider.java:51) ~[?:1.8.0_162]
	at sun.nio.fs.LinuxFileSystemProvider.getFileStore(LinuxFileSystemProvider.java:39) ~[?:1.8.0_162]
	at sun.nio.fs.UnixFileSystemProvider.getFileStore(UnixFileSystemProvider.java:368) ~[?:1.8.0_162]
	at java.nio.file.Files.getFileStore(Files.java:1461) ~[?:1.8.0_162]
	at org.apache.celeborn.service.deploy.worker.storage.DeviceMonitor$.getDiskUsageInfos(DeviceMonitor.scala:231) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor.usage$1(DeviceMonitor.scala:89) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor.$anonfun$init$7(DeviceMonitor.scala:91) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) ~[scala-library-2.12.10.jar:?]
	at org.apache.celeborn.common.metrics.source.GaugeSupplier$$anon$3.getValue(AbstractSource.scala:470) ~[celeborn-common_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.common.metrics.source.AbstractSource.recordGauge(AbstractSource.scala:346) ~[celeborn-common_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.common.metrics.source.AbstractSource.$anonfun$getMetrics$2(AbstractSource.scala:406) ~[celeborn-common_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.common.metrics.source.AbstractSource.$anonfun$getMetrics$2$adapted(AbstractSource.scala:406) ~[celeborn-common_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at scala.collection.immutable.List.foreach(List.scala:392) ~[scala-library-2.12.10.jar:?]
	at org.apache.celeborn.common.metrics.source.AbstractSource.getMetrics(AbstractSource.scala:406) ~[celeborn-common_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.common.metrics.sink.AbstractServlet.$anonfun$getMetricsSnapshot$1(AbstractServlet.scala:34) ~[celeborn-service_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) ~[scala-library-2.12.10.jar:?]
	at scala.collection.Iterator.foreach(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
	at scala.collection.Iterator.foreach$(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) ~[scala-library-2.12.10.jar:?]
	at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[scala-library-2.12.10.jar:?]
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[scala-library-2.12.10.jar:?]
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[scala-library-2.12.10.jar:?]
	at scala.collection.TraversableLike.map(TraversableLike.scala:238) ~[scala-library-2.12.10.jar:?]
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231) ~[scala-library-2.12.10.jar:?]
	at scala.collection.AbstractTraversable.map(Traversable.scala:108) ~[scala-library-2.12.10.jar:?]
	at org.apache.celeborn.common.metrics.sink.AbstractServlet.getMetricsSnapshot(AbstractServlet.scala:34) ~[celeborn-service_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.common.metrics.sink.PrometheusServlet.$anonfun$createServletHandler$1(PrometheusServlet.scala:38) ~[celeborn-service_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.server.common.http.HttpUtils$$anon$1.doGet(HttpUtils.scala:51) ~[celeborn-service_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:497) ~[jakarta.servlet-api-4.0.4.jar:4.0.4]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:584) ~[jakarta.servlet-api-4.0.4.jar:4.0.4]
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799) ~[jetty-servlet-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:554) ~[jetty-servlet-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505) ~[jetty-servlet-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:234) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.Server.handle(Server.java:516) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277) ~[jetty-server-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104) ~[jetty-io-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338) ~[jetty-util-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315) ~[jetty-util-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173) ~[jetty-util-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131) ~[jetty-util-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409) ~[jetty-util-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883) ~[jetty-util-9.4.52.v20230823.jar:9.4.52.v20230823]
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034) ~[jetty-util-9.4.52.v20230823.jar:9.4.52.v20230823]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162]
2024-10-14 12:00:42,701 [ERROR] [worker-disk-checker] - org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor -Logging.scala(80) -Device check failed.
java.util.concurrent.ExecutionException: java.nio.file.FileSystemException: /mnt/storage01: Input/output error
	at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_162]
	at java.util.concurrent.FutureTask.get(FutureTask.java:206) ~[?:1.8.0_162]
	at org.apache.celeborn.common.util.Utils$.tryWithTimeoutAndCallback(Utils.scala:950) ~[celeborn-common_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.service.deploy.worker.storage.DeviceMonitor$.highDiskUsage(DeviceMonitor.scala:268) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.$anonfun$run$9(DeviceMonitor.scala:137) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.$anonfun$run$9$adapted(DeviceMonitor.scala:136) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at scala.collection.Iterator.foreach(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
	at scala.collection.Iterator.foreach$(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) ~[scala-library-2.12.10.jar:?]
	at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[scala-library-2.12.10.jar:?]
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[scala-library-2.12.10.jar:?]
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[scala-library-2.12.10.jar:?]
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.$anonfun$run$2(DeviceMonitor.scala:136) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.$anonfun$run$2$adapted(DeviceMonitor.scala:111) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at scala.collection.Iterator.foreach(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
	at scala.collection.Iterator.foreach$(Iterator.scala:941) ~[scala-library-2.12.10.jar:?]
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) ~[scala-library-2.12.10.jar:?]
	at scala.collection.IterableLike.foreach(IterableLike.scala:74) ~[scala-library-2.12.10.jar:?]
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73) ~[scala-library-2.12.10.jar:?]
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56) ~[scala-library-2.12.10.jar:?]
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.run(DeviceMonitor.scala:111) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_162]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_162]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_162]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_162]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_162]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_162]
	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_162]
Caused by: java.nio.file.FileSystemException: /mnt/storage01: Input/output error
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:1.8.0_162]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_162]
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_162]
	at sun.nio.fs.UnixFileStore.devFor(UnixFileStore.java:57) ~[?:1.8.0_162]
	at sun.nio.fs.UnixFileStore.<init>(UnixFileStore.java:64) ~[?:1.8.0_162]
	at sun.nio.fs.LinuxFileStore.<init>(LinuxFileStore.java:44) ~[?:1.8.0_162]
	at sun.nio.fs.LinuxFileSystemProvider.getFileStore(LinuxFileSystemProvider.java:51) ~[?:1.8.0_162]
	at sun.nio.fs.LinuxFileSystemProvider.getFileStore(LinuxFileSystemProvider.java:39) ~[?:1.8.0_162]
	at sun.nio.fs.UnixFileSystemProvider.getFileStore(UnixFileSystemProvider.java:368) ~[?:1.8.0_162]
	at java.nio.file.Files.getFileStore(Files.java:1461) ~[?:1.8.0_162]
	at org.apache.celeborn.service.deploy.worker.storage.DeviceMonitor$.getDiskUsageInfos(DeviceMonitor.scala:231) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at org.apache.celeborn.service.deploy.worker.storage.DeviceMonitor$.$anonfun$highDiskUsage$1(DeviceMonitor.scala:248) ~[celeborn-worker_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) ~[scala-library-2.12.10.jar:?]
	at org.apache.celeborn.common.util.Utils$$anon$3.call(Utils.scala:943) ~[celeborn-common_2.12-0.5.0-SNAPSHOT.jar:0.5.0-SNAPSHOT]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_162]
	... 3 more

Meanwhile, StorageManager should not update the space-related fields of diskInfo when a disk has an input/output error.
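A minimal sketch of the idea described above, not the code in this PR: wrap the Files#getFileStore call so that an IOException (FileSystemException is a subclass) yields no value instead of propagating into the gauge or the disk checker. The names DiskUsage and SafeDiskUsage are hypothetical and exist only for this illustration.

```scala
import java.io.IOException
import java.nio.file.{Files, Paths}
import scala.util.{Failure, Success, Try}

// Hypothetical holder for the two values behind the gauges; not a Celeborn type.
final case class DiskUsage(totalBytes: Long, freeBytes: Long)

object SafeDiskUsage {
  // Returns None instead of throwing when the file store cannot be read,
  // for example when the disk reports "Input/output error".
  def read(mountPoint: String): Option[DiskUsage] =
    Try {
      val store = Files.getFileStore(Paths.get(mountPoint))
      DiskUsage(store.getTotalSpace, store.getUsableSpace)
    } match {
      case Success(usage) => Some(usage)
      case Failure(_: IOException) => None
      // Anything that is not an I/O failure is still surfaced to the caller.
      case Failure(other) => throw other
    }
}
```

A caller could then decide what to report when the result is empty, for instance keeping the previous value or skipping the diskInfo update; which of these fits Celeborn best is left open here.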

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Cluster test.

writers.values.asScala.map(_.getDiskFileInfo.getFileLength).sum
disksSnapshot()
  .filter(diskInfo =>
    diskInfo.status != DiskStatus.IO_HANG || diskInfo.status != DiskStatus.READ_OR_WRITE_FAILURE)
Member

Do we need to filter CRITICAL_ERROR as well?

false
!disksSnapshot()
  .filter(diskInfo =>
    diskInfo.status != DiskStatus.IO_HANG || diskInfo.status != DiskStatus.READ_OR_WRITE_FAILURE)
Contributor

Is there a problem with this? Why do we need to filter READ_OR_WRITE_FAILURE?
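For readers following the filter discussion, here is a small sketch of how the quoted predicate evaluates. It uses a stand-in enumeration rather than Celeborn's real DiskStatus, and the object DiskFilter and both method names are hypothetical. Whether CRITICAL_ERROR belongs in the excluded set is the open question raised above, so it is left as a parameter.

```scala
// Stand-in for Celeborn's DiskStatus, reduced to a few values for illustration.
object DiskStatus extends Enumeration {
  val HEALTHY, IO_HANG, READ_OR_WRITE_FAILURE, CRITICAL_ERROR = Value
}

object DiskFilter {
  import DiskStatus._

  // The predicate as quoted in the diff: it holds for every status, because
  // no single status can equal both IO_HANG and READ_OR_WRITE_FAILURE.
  def asWritten(status: DiskStatus.Value): Boolean =
    status != IO_HANG || status != READ_OR_WRITE_FAILURE

  // Assumed intent: keep a disk only when its status is in none of the
  // excluded states. The default set mirrors the two states in the diff.
  def usable(
      status: DiskStatus.Value,
      excluded: Set[DiskStatus.Value] = Set(IO_HANG, READ_OR_WRITE_FAILURE)): Boolean =
    !excluded.contains(status)
}
```

With the `||` form the filter keeps every disk, so the `usable` variant (equivalent to joining the two comparisons with `&&`) may be closer to what the description asks for.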

3 participants