HDFS-17658. HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss #7179
Problem
Problem background:
There is logic in the DatanodeAdminManager/DatanodeAdminMonitor to avoid transitioning datanodes to decommissioned state when they have open (i.e. Under Construction) blocks:
- hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java, line 357 (commit cd2cffe)
- hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java, line 305 (commit cd2cffe)
This logic does not work correctly because the DatanodeAdminMonitor only considers blocks in the DatanodeDescriptor StorageInfos, which do not include Under Construction blocks whose DFSOutputStream has not been closed yet; the sketch below illustrates the gap.
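To make the gap concrete, here is a minimal sketch, not the actual Hadoop code, of the kind of scan the DatanodeAdminMonitor performs; checkSufficientlyReplicated is a hypothetical stand-in for the real replication check:

```java
import java.util.Iterator;

import org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo;
import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor;

// Sketch of the decommissioning scan: it only walks blocks already
// attached to the datanode's storages (the StorageInfos).
class DecommissionScanSketch {
  void scan(DatanodeDescriptor datanode) {
    Iterator<BlockInfo> it = datanode.getBlockIterator();
    while (it.hasNext()) {
      // Replicas of blocks whose DFSOutputStream is still open have not
      // been added to any StorageInfo yet, so this loop never sees them
      // & they are never checked for sufficient replication.
      checkSufficientlyReplicated(it.next());
    }
  }

  // Hypothetical stand-in for the real replication-sufficiency check.
  void checkSufficientlyReplicated(BlockInfo block) {
  }
}
```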
There is also logic in the HDFS DataStreamer client which will replace bad/dead datanodes in the block write pipeline (the configuration example below shows the settings that control it). Note, however, that this replacement happens on the client side & is best-effort.
Overall, the Namenode should not put datanodes with open blocks into decommissioned state & hope that the DataStreamer client is able to replace them when the decommissioned datanodes are terminated. Whether that works depends on timing & is therefore not a solution which guarantees correctness.
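For reference, the client-side pipeline replacement behavior referred to above is controlled by existing HDFS client settings along these lines (values shown are the defaults):

```xml
<!-- hdfs-site.xml (client side): replace a bad/dead datanode in the write
     pipeline. Best-effort only; it does not guarantee correctness if the
     decommissioned replicas are terminated at the wrong time. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
</property>
```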
The Namenode needs to honor the rule that "a datanode should only enter decommissioned state if all the blocks on the datanode are sufficiently replicated to other live datanodes", even for blocks which are currently Under Construction.
Potential Solutions
One possible opinion is that if the DFSOutputStream has not been successfully closed yet, then the client should be able to replay all the data if there is a failure; the client should have no expectation that the data is committed to HDFS until the DFSOutputStream is closed. There are a few reasons I do not think this makes sense:
Another possible option is to add blocks to the StorageInfos before they are finalized. However, this change is also likely to have wider implications on block management.
Without modifying any existing block management logic, we can add a new data structure (UnderConstructionBlocks) which temporarily tracks the Under Construction blocks in-memory until they are committed/finalized & added to the StorageInfos.
Solution
Add a new data structure (UnderConstructionBlocks) which temporarily tracks the Under Construction blocks in-memory until they are committed/finalized & added to the StorageInfos.
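A minimal sketch of the shape of this data structure follows; the field and method names are illustrative, not necessarily those in the patch:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.hdfs.protocol.Block;
import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor;

// Sketch: tracks replicas of open (Under Construction) blocks per datanode
// until they are committed/finalized & moved into the StorageInfos.
class UnderConstructionBlocks {
  private final Map<DatanodeDescriptor, Set<Block>> ucBlocksByDatanode =
      new HashMap<>();

  // Called when a datanode reports a replica of a still-open block.
  synchronized void addUcBlock(DatanodeDescriptor dn, Block block) {
    ucBlocksByDatanode.computeIfAbsent(dn, d -> new HashSet<>()).add(block);
  }

  // Called when the block is committed/finalized (or recovered) and the
  // replica is added to the StorageInfos.
  synchronized void removeUcBlock(DatanodeDescriptor dn, Block block) {
    Set<Block> blocks = ucBlocksByDatanode.get(dn);
    if (blocks != null) {
      blocks.remove(block);
      if (blocks.isEmpty()) {
        ucBlocksByDatanode.remove(dn);
      }
    }
  }

  // Consulted by the DatanodeAdminMonitor: a datanode with any tracked
  // open replicas must not transition to decommissioned yet.
  synchronized boolean hasUnderConstructionBlocks(DatanodeDescriptor dn) {
    Set<Block> blocks = ucBlocksByDatanode.get(dn);
    return blocks != null && !blocks.isEmpty();
  }
}
```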
Pros:
Implementation Details
The feature is behind a configuration flag, "dfs.namenode.decommission.track.underconstructionblocks", which is disabled by default; see the example below.
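For example, to enable the feature in hdfs-site.xml:

```xml
<!-- Disabled by default; set to true to have the Namenode track Under
     Construction block replicas for decommissioning decisions. -->
<property>
  <name>dfs.namenode.decommission.track.underconstructionblocks</name>
  <value>true</value>
</property>
```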
In the regular case, when a DFSOutputStream is closed it takes 1-2 seconds for the block replicas to be removed from UnderConstructionBlocks & added to the StorageInfos. Therefore, datanode decommissioning is only blocked until the DFSOutputStream is closed & the write operation is finished; after that there is minimal delay in unblocking decommissioning.
In the unhappy case, when an HDFS client fails & the DFSOutputStream is never closed, it takes dfs.namenode.lease-hard-limit-sec (default 20 minutes) before the lease expires & the Namenode recovers the block. As part of block recovery, the block replicas are removed from UnderConstructionBlocks & added to the StorageInfos. Therefore, if an HDFS client fails it will (by default) take 20 minutes before decommissioning becomes unblocked.
The UnderConstructionBlocks data structure is in-memory only & therefore if the Namenode is restarted it will lose track of any previously reported Under Construction blocks. This means that datanodes can be decommissioned with Under Construction blocks if the Namenode is restarted (which makes HDFS data loss & write failures possible again).
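To illustrate the lifecycle described in the last two paragraphs, here is a sketch of a hypothetical hook point (reusing the UnderConstructionBlocks sketch above) showing how replicas enter and leave the structure as incremental block reports arrive:

```java
import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor;
import org.apache.hadoop.hdfs.server.protocol.ReceivedDeletedBlockInfo;

// Sketch: replicas are tracked while the block is open & dropped once the
// replica is finalized or deleted, when normal block management takes over.
class UcBlockReportHookSketch {
  private final UnderConstructionBlocks ucBlocks = new UnderConstructionBlocks();

  void onIncrementalBlockReport(DatanodeDescriptor dn,
      ReceivedDeletedBlockInfo rdbi) {
    switch (rdbi.getStatus()) {
      case RECEIVING_BLOCK:
        // Open replica reported: decommissioning of this datanode is blocked.
        ucBlocks.addUcBlock(dn, rdbi.getBlock());
        break;
      case RECEIVED_BLOCK:  // finalized & added to the StorageInfos
      case DELETED_BLOCK:   // replica removed from the datanode
        ucBlocks.removeUcBlock(dn, rdbi.getBlock());
        break;
      default:
        break;
    }
  }
}
```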
Testing shows that UnderConstructionBlocks should not leak any Under Construction blocks (i.e. blocks which are never removed). However, as a safeguard to monitor for this issue, any block replica open for over 2 hours will have a WARN log printed by the Namenode every 30 minutes which mentions how long the block has been open; a sketch of that safeguard follows.
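The sketch below is illustrative only; TrackedReplica and its timestamps are hypothetical, and the thresholds follow the description above:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: periodically warn about replicas that stay open suspiciously
// long, so a leak in UnderConstructionBlocks would be visible in the logs.
class OpenReplicaMonitor {
  private static final Logger LOG =
      LoggerFactory.getLogger(OpenReplicaMonitor.class);
  private static final long WARN_AFTER_MS = TimeUnit.HOURS.toMillis(2);
  private static final long WARN_INTERVAL_MS = TimeUnit.MINUTES.toMillis(30);

  // Illustrative per-replica tracking record.
  static class TrackedReplica {
    String blockName;
    long firstReportedMs;
    long lastWarnMs;
  }

  void logLongOpenReplicas(
      Map<DatanodeDescriptor, Set<TrackedReplica>> tracked, long nowMs) {
    for (Map.Entry<DatanodeDescriptor, Set<TrackedReplica>> e
        : tracked.entrySet()) {
      for (TrackedReplica r : e.getValue()) {
        long openForMs = nowMs - r.firstReportedMs;
        if (openForMs > WARN_AFTER_MS
            && nowMs - r.lastWarnMs >= WARN_INTERVAL_MS) {
          LOG.warn("Block {} on {} has been Under Construction for {} minutes",
              r.blockName, e.getKey(),
              TimeUnit.MILLISECONDS.toMinutes(openForMs));
          r.lastWarnMs = nowMs;
        }
      }
    }
  }
}
```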
The implementation of UnderConstructionBlocks was borrowed from existing code in PendingDataNodeMessages. PendingDataNodeMessages is already used by the BlockManager to track in-memory the block replicas which have been reported to the standby namenode out-of-order.
How was this patch tested?
TODO - will add detailed test results
For code changes:
Have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?