[CELEBORN-768] Change default config values for batch rpcs and netty memory allocator

### What changes were proposed in this pull request?
Changes the following configs' default values:
| config  | previous value | current value |
| ------------- | ------------- | ------------- |
| celeborn.network.memory.allocator.share  | false | true |
| celeborn.client.shuffle.batchHandleChangePartition.enabled  | false | true |
| celeborn.client.shuffle.batchHandleCommitPartition.enabled | false | true |
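
Users who need the previous behavior can still override the new defaults explicitly; a sketch assuming the standard `celeborn-defaults.conf` key-value format (the same keys can also be set on the client side via the engine's conf prefix):

```properties
# Restore the pre-change defaults (the "previous value" column above).
celeborn.network.memory.allocator.share                      false
celeborn.client.shuffle.batchHandleChangePartition.enabled   false
celeborn.client.shuffle.batchHandleCommitPartition.enabled   false
```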

### Why are the changes needed?
In my test, when graceful shutdown is enabled but `celeborn.client.shuffle.batchHandleChangePartition.enabled` and `celeborn.client.shuffle.batchHandleCommitPartition.enabled` are disabled, the worker takes much longer to stop than when the two configs are enabled.
In another test where the worker is quite small (2 cores, 4 GB) and replication is on, if the shared allocator is disabled, Netty's onTrim fails to release memory, which further causes push data timeouts.
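
The batching these configs enable can be illustrated with a minimal sketch (not Celeborn's actual implementation; all names here are hypothetical): callers enqueue requests immediately, and a periodic flush handles everything accumulated since the last interval as one batch instead of one RPC per request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of batched change-partition handling. A scheduler would
// invoke flush() once per interval (cf.
// celeborn.client.shuffle.batchHandleChangePartition.interval, default 100ms).
class ChangePartitionBatcher {
  private final ConcurrentLinkedQueue<Integer> pending = new ConcurrentLinkedQueue<>();
  private int batchesHandled = 0;

  // Non-blocking: the caller does not pay one RPC round trip per request.
  void submit(int partitionId) {
    pending.add(partitionId);
  }

  // Drain everything queued so far and handle it in a single pass.
  void flush() {
    List<Integer> batch = new ArrayList<>();
    Integer id;
    while ((id = pending.poll()) != null) {
      batch.add(id);
    }
    if (!batch.isEmpty()) {
      batchesHandled++; // one handling pass covers the whole batch
    }
  }

  int batchesHandled() {
    return batchesHandled;
  }

  public static void main(String[] args) {
    ChangePartitionBatcher b = new ChangePartitionBatcher();
    b.submit(1);
    b.submit(2);
    b.submit(3);
    b.flush(); // three requests handled as a single batch
    System.out.println(b.batchesHandled()); // prints 1
  }
}
```

With batching disabled, the same three requests would each be processed one by one, which is the slower path the tests above observed.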

### Does this PR introduce _any_ user-facing change?
No, these configs were introduced in 0.3.0.

### How was this patch tested?
Passes GA.

Closes #1682 from waitinfuture/768.

Authored-by: zky.zhoukeyong <[email protected]>
Signed-off-by: zky.zhoukeyong <[email protected]>
waitinfuture committed Jul 5, 2023
1 parent 95f0830 commit 4300835
Showing 3 changed files with 7 additions and 7 deletions.
@@ -1216,10 +1216,11 @@ object CelebornConf extends Logging {
val NETWORK_MEMORY_ALLOCATOR_SHARE: ConfigEntry[Boolean] =
buildConf("celeborn.network.memory.allocator.share")
.categories("network")
+ .internal
.version("0.3.0")
.doc("Whether to share memory allocator.")
.booleanConf
- .createWithDefault(false)
+ .createWithDefault(true)

val NETWORK_MEMORY_ALLOCATOR_ARENAS: OptionalConfigEntry[Int] =
buildConf("celeborn.network.memory.allocator.numArenas")
@@ -2995,11 +2996,12 @@ object CelebornConf extends Logging {
buildConf("celeborn.client.shuffle.batchHandleChangePartition.enabled")
.withAlternative("celeborn.shuffle.batchHandleChangePartition.enabled")
.categories("client")
+ .internal
.doc("When true, LifecycleManager will handle change partition request in batch. " +
"Otherwise, LifecycleManager will process the requests one by one")
.version("0.3.0")
.booleanConf
- .createWithDefault(false)
+ .createWithDefault(true)

val CLIENT_BATCH_HANDLE_CHANGE_PARTITION_THREADS: ConfigEntry[Int] =
buildConf("celeborn.client.shuffle.batchHandleChangePartition.threads")
@@ -3023,11 +3025,12 @@
buildConf("celeborn.client.shuffle.batchHandleCommitPartition.enabled")
.withAlternative("celeborn.shuffle.batchHandleCommitPartition.enabled")
.categories("client")
+ .internal
.doc("When true, LifecycleManager will handle commit partition request in batch. " +
"Otherwise, LifecycleManager won't commit partition before stage end")
.version("0.3.0")
.booleanConf
- .createWithDefault(false)
+ .createWithDefault(true)

val CLIENT_BATCH_HANDLE_COMMIT_PARTITION_THREADS: ConfigEntry[Int] =
buildConf("celeborn.client.shuffle.batchHandleCommitPartition.threads")
@@ -3050,6 +3053,7 @@
val CLIENT_BATCH_HANDLE_RELEASE_PARTITION_ENABLED: ConfigEntry[Boolean] =
buildConf("celeborn.client.shuffle.batchHandleReleasePartition.enabled")
.categories("client")
+ .internal
.doc("When true, LifecycleManager will handle release partition request in batch. " +
"Otherwise, LifecycleManager will process release partition request immediately")
.version("0.3.0")
3 changes: 0 additions & 3 deletions docs/configuration/client.md
@@ -72,13 +72,10 @@ license: |
| celeborn.client.rpc.registerShuffle.askTimeout | &lt;value of celeborn.&lt;module&gt;.io.connectionTimeout&gt; | Timeout for ask operations during register shuffle. During this process, there are two times for retry opportunities for requesting slots, one request for establishing a connection with Worker and `celeborn.client.reserveSlots.maxRetries` times for retry opportunities for reserving slots. User can customize this value according to your setting. By default, the value is the max timeout value `celeborn.<module>.io.connectionTimeout`. | 0.3.0 |
| celeborn.client.rpc.requestPartition.askTimeout | &lt;value of celeborn.&lt;module&gt;.io.connectionTimeout&gt; | Timeout for ask operations during requesting change partition location, such as reviving or splitting partition. During this process, there are `celeborn.client.reserveSlots.maxRetries` times for retry opportunities for reserving slots. User can customize this value according to your setting. By default, the value is the max timeout value `celeborn.<module>.io.connectionTimeout`. | 0.2.0 |
| celeborn.client.rpc.reserveSlots.askTimeout | &lt;value of celeborn.rpc.askTimeout&gt; | Timeout for LifecycleManager request reserve slots. | 0.3.0 |
- | celeborn.client.shuffle.batchHandleChangePartition.enabled | false | When true, LifecycleManager will handle change partition request in batch. Otherwise, LifecycleManager will process the requests one by one | 0.3.0 |
| celeborn.client.shuffle.batchHandleChangePartition.interval | 100ms | Interval for LifecycleManager to schedule handling change partition requests in batch. | 0.3.0 |
| celeborn.client.shuffle.batchHandleChangePartition.threads | 8 | Threads number for LifecycleManager to handle change partition request in batch. | 0.3.0 |
- | celeborn.client.shuffle.batchHandleCommitPartition.enabled | false | When true, LifecycleManager will handle commit partition request in batch. Otherwise, LifecycleManager won't commit partition before stage end | 0.3.0 |
| celeborn.client.shuffle.batchHandleCommitPartition.interval | 5s | Interval for LifecycleManager to schedule handling commit partition requests in batch. | 0.3.0 |
| celeborn.client.shuffle.batchHandleCommitPartition.threads | 8 | Threads number for LifecycleManager to handle commit partition request in batch. | 0.3.0 |
- | celeborn.client.shuffle.batchHandleReleasePartition.enabled | true | When true, LifecycleManager will handle release partition request in batch. Otherwise, LifecycleManager will process release partition request immediately | 0.3.0 |
| celeborn.client.shuffle.batchHandleReleasePartition.interval | 5s | Interval for LifecycleManager to schedule handling release partition requests in batch. | 0.3.0 |
| celeborn.client.shuffle.batchHandleReleasePartition.threads | 8 | Threads number for LifecycleManager to handle release partition request in batch. | 0.3.0 |
| celeborn.client.shuffle.compression.codec | LZ4 | The codec used to compress shuffle data. By default, Celeborn provides three codecs: `lz4`, `zstd`, `none`. | 0.3.0 |
1 change: 0 additions & 1 deletion docs/configuration/network.md
@@ -41,7 +41,6 @@ license: |
| celeborn.network.bind.preferIpAddress | true | When `true`, prefer to use IP address, otherwise FQDN. This configuration only takes effect when the bind hostname is not set explicitly, in such case, Celeborn will find the first non-loopback address to bind. | 0.3.0 |
| celeborn.network.connect.timeout | 10s | Default socket connect timeout. | 0.2.0 |
| celeborn.network.memory.allocator.numArenas | &lt;undefined&gt; | Number of arenas for pooled memory allocator. Default value is Runtime.getRuntime.availableProcessors, min value is 2. | 0.3.0 |
- | celeborn.network.memory.allocator.share | false | Whether to share memory allocator. | 0.3.0 |
| celeborn.network.memory.allocator.verbose.metric | false | Whether to enable verbose metric for pooled allocator. | 0.3.0 |
| celeborn.network.timeout | 240s | Default timeout for network operations. | 0.2.0 |
| celeborn.port.maxRetries | 1 | When port is occupied, we will retry for max retry times. | 0.2.0 |
