
[CELEBORN-764] Fix Celeborn on HDFS possibly cleaning in-use app directories. #1678

Closed
FMX wants to merge 31 commits into main from FMX/CELEBORN-764

Conversation

@FMX (Contributor) commented Jul 4, 2023

What changes were proposed in this pull request?

Make the Celeborn leader clean expired app dirs on HDFS when an application is lost.

Why are the changes needed?

When Celeborn stores data on HDFS, each worker's storage manager cleans expired app directories on startup, so a newly created worker will try to delete any app directory it does not recognize.
This can cause in-use app directories to be deleted unexpectedly.

Does this PR introduce any user-facing change?

NO.

How was this patch tested?

UT and cluster.

codecov bot commented Jul 4, 2023

Codecov Report

Merging #1678 (063bc9c) into main (de0fd8c) will increase coverage by 0.05%.
The diff coverage is 29.73%.

❗ Current head 063bc9c differs from pull request most recent head e6ff705. Consider uploading reports for the commit e6ff705 to get more accurate results

@@            Coverage Diff             @@
##             main    #1678      +/-   ##
==========================================
+ Coverage   46.22%   46.27%   +0.05%     
==========================================
  Files         161      161              
  Lines        9957     9990      +33     
  Branches      920      924       +4     
==========================================
+ Hits         4602     4622      +20     
- Misses       5051     5061      +10     
- Partials      304      307       +3     
Impacted Files Coverage Δ
...java/org/apache/celeborn/common/meta/FileInfo.java 50.00% <ø> (ø)
...born/common/protocol/message/ControlMessages.scala 1.53% <0.00%> (-<0.01%) ⬇️
.../scala/org/apache/celeborn/common/util/Utils.scala 17.66% <ø> (ø)
...che/celeborn/common/util/CelebornHadoopUtils.scala 19.24% <4.77%> (-66.48%) ⬇️
...cala/org/apache/celeborn/common/CelebornConf.scala 87.15% <83.34%> (+0.02%) ⬆️

... and 4 files with indirect coverage changes


@@ -534,7 +534,8 @@ final private[worker] class StorageManager(conf: CelebornConf, workerSource: Abs
val iter = hadoopFs.listStatusIterator(hdfsWorkPath)
while (iter.hasNext) {
val fileStatus = iter.next()
if (!appIds.contains(fileStatus.getPath.getName)) {
if (!appIds.contains(fileStatus.getPath.getName)
Contributor

getModificationTime will not reflect changes in nested directories. For example, if I have the path /tmp/test/ and upload a new file into /tmp/test, the modification time of /tmp will not change.
IMO, the HDFS directory does not belong to any single worker; maybe we should let the Master clean HDFS. cc @pan3793 @RexXiong @AngersZhuuuu
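
A minimal sketch of the behavior described above, using only standard Hadoop FileSystem calls; the paths are the hypothetical ones from the comment:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ParentMtimeDemo {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val parent = new Path("/tmp")
    val before = fs.getFileStatus(parent).getModificationTime
    // Upload a new file into the nested directory /tmp/test.
    fs.create(new Path("/tmp/test/new-file")).close()
    val after = fs.getFileStatus(parent).getModificationTime
    // On HDFS, before == after: only /tmp/test's mtime is updated, so the
    // parent directory's mtime cannot be used to detect activity in nested
    // app directories.
    println(s"/tmp mtime unchanged: ${before == after}")
  }
}
```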

@FMX FMX force-pushed the CELEBORN-764 branch 2 times, most recently from dec6671 to 998c20d Compare July 5, 2023 03:42
@AngersZhuuuu (Contributor)

We'd better change the doc of the HDFS path configuration to mention this change and let users make sure the configuration is the same on the master and worker sides.

@FMX (Contributor Author) commented Jul 5, 2023

Updated.

@@ -1072,4 +1074,20 @@ object Utils extends Logging {
}
labelPart(0).trim -> labelPart(1).trim
}

def getHadoopFS(conf: CelebornConf): FileSystem = {
Member

How about moving it to CelebornHadoopUtils? And we should use CelebornHadoopUtils#newConfiguration instead of new Configuration().

Contributor Author

This needs an empty configuration so that CelebornConf can override pre-defined settings, so it cannot be moved to CelebornHadoopUtils.
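
A minimal sketch of the helper being discussed, assuming a hypothetical CelebornConf accessor (hadoopSettings) for Hadoop-related entries; the actual diff may differ. The key point is starting from an empty Configuration so CelebornConf values override both hardcoded defaults and any *-site.xml settings:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def getHadoopFS(conf: CelebornConf): FileSystem = {
  // loadDefaults = false: skip core-site.xml/hdfs-site.xml so nothing
  // pre-defined can shadow what CelebornConf specifies.
  val hadoopConf = new Configuration(false)
  hadoopConf.set("dfs.replication", "2") // example hardcoded default
  // CelebornConf entries win over the hardcoded values above
  // (hadoopSettings is a hypothetical accessor, not the actual API).
  conf.hadoopSettings.foreach { case (k, v) => hadoopConf.set(k, v) }
  new Path(conf.hdfsDir).getFileSystem(hadoopConf)
}
```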

@pan3793 (Member) commented Jul 5, 2023

I see, the configuration priority here is

  1. CelebornConf
  2. hardcoded
  3. core-site.xml, hdfs-site.xml

It changes the current behavior but looks reasonable. I'm wondering if we can change CelebornHadoopUtils#newConfiguration to accept a Map to achieve this.
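
One possible shape for that suggestion, sketched under the assumption that newConfiguration currently takes no extra settings; the overload and parameter name are illustrative:

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical overload: callers pass the settings that must take priority.
def newConfiguration(conf: CelebornConf, overrides: Map[String, String]): Configuration = {
  val hadoopConf = new Configuration(false) // start empty, as discussed above
  overrides.foreach { case (k, v) => hadoopConf.set(k, v) }
  hadoopConf
}
```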

Member

And we'd better document this behavior somewhere.

Contributor Author

Comments addressed.

.withAlternative("celeborn.storage.hdfs.dir")
.categories("worker")
buildConf("celeborn.storage.hdfs.dir")
.withAlternative("celeborn.worker.storage.hdfs.dir")
Member

Seems we don't need this alternative; it was called celeborn.storage.hdfs.dir in 0.2.

FMX and others added 3 commits July 5, 2023 14:39
@@ -138,6 +138,12 @@ public void updateAppHeartbeatMeta(String appId, long time, long totalWritten, l
partitionTotalFileCount.add(fileCount);
}

public Set<String> getActiveAppIds() {
Contributor

Better to use the appHeartbeatTime map to get active appIds; otherwise the Master may encounter corner cases.

Contributor

IMO, we can directly use the appHeartbeatTime keySet.

Contributor

> IMO, we can directly use the appHeartbeatTime keySet.

+1, one application may have many shuffleKeys
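
A sketch of the suggestion (the surrounding class is Java in the PR; shown in Scala here for consistency): since one application can own many shuffleKeys, deriving active apps from shuffle metadata would yield duplicates, while the heartbeat map has exactly one entry per live app:

```scala
import java.util.concurrent.ConcurrentHashMap

// appHeartbeatTime: appId -> last heartbeat timestamp, one entry per live app.
def getActiveAppIds(appHeartbeatTime: ConcurrentHashMap[String, java.lang.Long]): java.util.Set[String] =
  appHeartbeatTime.keySet()
```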

@@ -653,11 +656,33 @@ private[celeborn] class Master(
override def run(): Unit = {
statusSystem.handleAppLost(appId, requestId)
logInfo(s"Removed application $appId")
// only leader can clean hdfs dirs
if (conf.hasHDFSStorage && !conf.hdfsDir.isEmpty) {
cleanExpiredAppDirsOnHDFS()
Contributor

This may cost a lot, as applications may be lost frequently in a big cluster; we'd better not clean expired app dirs in handleApplicationLost, and instead do it in timeoutDeadApplications after handleApplicationLost. And if it still costs a lot, I suggest we cache this list and refresh it every few (e.g. 3) minutes?

Contributor Author

Sounds reasonable. I'll move the clean logic to timeoutDeadApplications. I think listing directories won't cost a lot.

Contributor

> This may cost a lot, as applications may be lost frequently in a big cluster; we'd better not clean expired app dirs in handleApplicationLost, and instead do it in timeoutDeadApplications after handleApplicationLost. And if it still costs a lot, I suggest we cache this list and refresh it every few (e.g. 3) minutes?

I also think it's not a good idea to call cleanExpiredAppDirsOnHDFS when an app is lost. IMO we can just use forwardMessageThread.scheduleAtFixedRate to run a check like checkForWorkerTimeOutTask and checkForApplicationTimeOutTask. We also need to change forwardMessageThread from a single thread to multiple threads.
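
A sketch of that scheduled-task alternative, modeled on the existing timeout checks; the task name and interval variable are assumptions, and forwardMessageThread is taken to be a ScheduledExecutorService:

```scala
import java.util.concurrent.TimeUnit

val checkForHDFSRemnantDirsTask = forwardMessageThread.scheduleAtFixedRate(
  new Runnable {
    override def run(): Unit = {
      try {
        cleanExpiredAppDirsOnHDFS() // cleanup method from this PR
      } catch {
        case e: Throwable => logError("Cleaning expired HDFS app dirs failed.", e)
      }
    }
  },
  0, // initial delay
  hdfsRemnantDirsCheckIntervalMs, // hypothetical interval config
  TimeUnit.MILLISECONDS)
```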

Contributor

Also, we need to ensure that the leader has replayed all raft logs before cleanup.

@pan3793 (Member) left a comment

LGTM, except that we should log on deletion.

@@ -1509,6 +1510,14 @@ object CelebornConf extends Logging {
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefaultString("300s")

val HDFS_REMNANTDIRS_TIMEOUT: ConfigEntry[Long] =
buildConf("celeborn.master.hdfs.remnantDirs.timeout")
@pan3793 (Member) commented Jul 5, 2023

The storage namespace should be used, and we'd better emphasize in the name that it applies at the app level.

Suggested change
buildConf("celeborn.master.hdfs.remnantDirs.timeout")
buildConf("celeborn.master.storage.hdfs.appDirs.expiredDuration")

Or celeborn.master.storage.hdfs.appRemnantDirs.expiredDuration
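
For reference, how the entry might look with one of the proposed names, using only the builder calls visible elsewhere in this PR; the name and default value here are placeholders until the naming question is settled:

```scala
import java.util.concurrent.TimeUnit

val HDFS_REMNANT_DIRS_TIMEOUT: ConfigEntry[Long] =
  buildConf("celeborn.master.storage.hdfs.appDirs.expiredDuration") // name under discussion
    .categories("master")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("1d") // placeholder default
```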

Contributor Author

Check the DingTalk group; I explained why I chose this name.

Member

@AngersZhuuuu do you have a suggestion for its name?

val startTime = System.currentTimeMillis()
val fileStatus = iter.next()
if (!statusSystem.appHeartbeatTime.containsKey(fileStatus.getPath.getName)) {
hadoopFs.delete(fileStatus.getPath, true)
Member

We should try/catch the IOException and log the directory that failed to be deleted, in case HDFS is abnormal or there are permission-denied issues.
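
A sketch of that guard around the delete call in the cleanup loop (log wording is an assumption; logError comes from the enclosing Logging trait):

```scala
import java.io.IOException

try {
  hadoopFs.delete(fileStatus.getPath, true) // recursive delete of the app dir
} catch {
  case e: IOException =>
    // Keep iterating even if HDFS is abnormal or permission is denied.
    logError(s"Failed to delete expired app dir ${fileStatus.getPath}.", e)
}
```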

Contributor Author

Updated.

FMX added 2 commits July 5, 2023 19:59
# Conflicts:
#	master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala
waitinfuture pushed a commit that referenced this pull request Jul 5, 2023
### What changes were proposed in this pull request?
Make the Celeborn leader clean expired app dirs on HDFS when an application is lost.

### Why are the changes needed?
When Celeborn stores data on HDFS, each worker's storage manager cleans expired app directories on startup, so a newly created worker will try to delete any app directory it does not recognize. This can cause in-use app directories to be deleted unexpectedly.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and cluster.

Closes #1678 from FMX/CELEBORN-764.

Lead-authored-by: mingji <[email protected]>
Co-authored-by: Cheng Pan <[email protected]>
Co-authored-by: Ethan Feng <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
(cherry picked from commit d0ecf83)
Signed-off-by: zky.zhoukeyong <[email protected]>