[KYUUBI #5941] Drop Kubernetes Block Cleaner Tool from Kyuubi
# 🔍 Description
## Issue References 🔗

This pull request fixes #5941

## Describe Your Solution 🔧

This tool was originally added to clean up shuffle data for Spark on Kubernetes, but its limitations have become more and more apparent over time, so let's drop it.

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️

#### Behavior With This Pull Request 🎉

#### Related Unit Tests

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #5942 from zwangsheng/KYUUBI#5941.

Closes #5941

23bf14f [Cheng Pan] Update docs/tools/spark_block_cleaner.md
1c33501 [zwangsheng] fix comment
0bdbb11 [zwangsheng] nit
0a5aa2b [zwangsheng] fix comments

Lead-authored-by: zwangsheng <[email protected]>
Co-authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
zwangsheng and pan3793 committed Jan 4, 2024
1 parent 9fefd47 commit c6bba91
Showing 14 changed files with 8 additions and 688 deletions.
6 changes: 2 additions & 4 deletions .github/labeler.yml
@@ -129,8 +129,7 @@
'.dockerignore',
'bin/docker-image-tool.sh',
'docker/**/*',
-      'integration-tests/kyuubi-kubernetes-it/**/*',
-      'tools/spark-block-cleaner/**/*'
+      'integration-tests/kyuubi-kubernetes-it/**/*'
]

"module:metrics":
@@ -164,8 +163,7 @@
- changed-files:
- any-glob-to-any-file: [
'externals/kyuubi-spark-sql-engine/**/*',
-      'extensions/spark/**/*',
-      'tools/spark-block-cleaner/**/*'
+      'extensions/spark/**/*'
]

"module:extensions":
2 changes: 1 addition & 1 deletion .github/workflows/license.yml
@@ -44,7 +44,7 @@ jobs:
check-latest: false
- run: >-
build/mvn org.apache.rat:apache-rat-plugin:check
-        -Ptpcds -Pspark-block-cleaner -Pkubernetes-it
+        -Ptpcds -Pkubernetes-it
-Pspark-3.1 -Pspark-3.2 -Pspark-3.3 -Pspark-3.4 -Pspark-3.5
- name: Upload rat report
if: failure()
4 changes: 2 additions & 2 deletions .github/workflows/style.yml
@@ -34,7 +34,7 @@ jobs:
strategy:
matrix:
profiles:
-        - '-Pflink-provided,hive-provided,spark-provided,spark-block-cleaner,spark-3.5,spark-3.4,spark-3.3,spark-3.2,tpcds,kubernetes-it'
+        - '-Pflink-provided,hive-provided,spark-provided,spark-3.5,spark-3.4,spark-3.3,spark-3.2,tpcds,kubernetes-it'

steps:
- uses: actions/checkout@v4
@@ -65,7 +65,7 @@
if: steps.modules-check.conclusion == 'success' && steps.modules-check.outcome == 'failure'
run: |
MVN_OPT="-DskipTests -Dorg.slf4j.simpleLogger.defaultLogLevel=warn -Dmaven.javadoc.skip=true -Drat.skip=true -Dscalastyle.skip=true -Dspotless.check.skip"
-          build/mvn clean install ${MVN_OPT} -Pflink-provided,hive-provided,spark-provided,spark-block-cleaner,spark-3.2,tpcds
+          build/mvn clean install ${MVN_OPT} -Pflink-provided,hive-provided,spark-provided,spark-3.2,tpcds
build/mvn clean install ${MVN_OPT} -pl extensions/spark/kyuubi-extension-spark-3-1 -Pspark-3.1
build/mvn clean install ${MVN_OPT} -pl extensions/spark/kyuubi-extension-spark-3-3,extensions/spark/kyuubi-spark-connector-hive -Pspark-3.3
build/mvn clean install ${MVN_OPT} -pl extensions/spark/kyuubi-extension-spark-3-4 -Pspark-3.4
8 changes: 0 additions & 8 deletions build/dist
@@ -330,14 +330,6 @@ for jar in $(ls "$DISTDIR/jars/"); do
fi
done

-# Copy kyuubi tools
-if [[ -f "$KYUUBI_HOME/tools/spark-block-cleaner/target/spark-block-cleaner_${SCALA_VERSION}-${VERSION}.jar" ]]; then
-  mkdir -p "$DISTDIR/tools/spark-block-cleaner/kubernetes"
-  mkdir -p "$DISTDIR/tools/spark-block-cleaner/jars"
-  cp -r "$KYUUBI_HOME"/tools/spark-block-cleaner/kubernetes/* "$DISTDIR/tools/spark-block-cleaner/kubernetes/"
-  cp "$KYUUBI_HOME/tools/spark-block-cleaner/target/spark-block-cleaner_${SCALA_VERSION}-${VERSION}.jar" "$DISTDIR/tools/spark-block-cleaner/jars/"
-fi

# Copy Kyuubi Spark extension
SPARK_EXTENSION_VERSIONS=('3-1' '3-2' '3-3' '3-4' '3-5')
# shellcheck disable=SC2068
2 changes: 1 addition & 1 deletion dev/reformat
@@ -20,7 +20,7 @@ set -x

KYUUBI_HOME="$(cd "`dirname "$0"`/.."; pwd)"

-PROFILES="-Pflink-provided,hive-provided,spark-provided,spark-block-cleaner,spark-3.5,spark-3.4,spark-3.3,spark-3.2,spark-3.1,tpcds,kubernetes-it"
+PROFILES="-Pflink-provided,hive-provided,spark-provided,spark-3.5,spark-3.4,spark-3.3,spark-3.2,spark-3.1,tpcds,kubernetes-it"

# python style checks rely on `black` in path
if ! command -v black &> /dev/null
118 changes: 2 additions & 116 deletions docs/tools/spark_block_cleaner.md
@@ -17,119 +17,5 @@

# Kubernetes Tools Spark Block Cleaner

## Requirements

You should be familiar with the following before using spark-block-cleaner:

* Read this article
* An active Kubernetes cluster
* [Kubectl](https://kubernetes.io/docs/reference/kubectl/overview/)
* [Docker](https://www.docker.com/)

## Scenario

When running Spark on Kubernetes in client mode without `emptyDir` volumes for Spark's `local-dir`, executor pods may be deleted without cleaning up all of their block files, which can eventually fill the disk.

Spark Block Cleaner clears the block files that Spark leaves behind.

## Principle

When deploying Spark Block Cleaner, you configure volumes for the target folders, and Spark Block Cleaner discovers those folders through the `CACHE_DIRS` parameter.

Spark Block Cleaner scans the configured folders in a fixed loop (the interval is set by `SCHEDULE_INTERVAL`) and selects folders whose names start with `blockmgr` or `spark` for deletion, mirroring the naming scheme Spark uses when creating them.

Before deleting a file, Spark Block Cleaner checks whether it was recently modified: only files whose last modification is older than `FILE_EXPIRED_TIME` are deleted.

After each cleaning pass, Spark Block Cleaner checks disk utilization; if the free space is below `FREE_SPACE_THRESHOLD`, it triggers a deep clean, which deletes files older than the shorter `DEEP_CLEAN_FILE_EXPIRED_TIME`.
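Put together, the behavior can be sketched as the following shell loop. This is only an illustration of the logic described above using the documented parameters, not the tool's actual implementation (the cleaner itself ships as a JVM application):

```shell
#!/usr/bin/env bash
# Rough sketch of the cleaning loop; illustration only, not the tool's real code.
while true; do
  for dir in ${CACHE_DIRS//,/ }; do
    # Normal clean: remove blockmgr-*/spark-* entries older than FILE_EXPIRED_TIME (seconds)
    find "$dir" -maxdepth 1 \( -name 'blockmgr-*' -o -name 'spark-*' \) \
      -mmin +$((FILE_EXPIRED_TIME / 60)) -exec rm -rf {} +
    # If free space is still below FREE_SPACE_THRESHOLD (%), trigger a deep clean
    free=$((100 - $(df --output=pcent "$dir" | tail -1 | tr -d ' %')))
    if [ "$free" -lt "$FREE_SPACE_THRESHOLD" ]; then
      find "$dir" -maxdepth 1 \( -name 'blockmgr-*' -o -name 'spark-*' \) \
        -mmin +$((DEEP_CLEAN_FILE_EXPIRED_TIME / 60)) -exec rm -rf {} +
    fi
  done
  sleep "$SCHEDULE_INTERVAL"
done
```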

## Usage

Before you start using Spark Block Cleaner, build its Docker image.

### Build Block Cleaner Docker Image

From the `KYUUBI_HOME` directory, build the Docker image with the following command:

```shell
docker build ./tools/spark-block-cleaner/kubernetes/docker
```
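The command above produces an untagged image; in practice you will likely want to tag it and push it to a registry your cluster can pull from. The tag below is only an example:

```shell
# Tag the image so spark-block-cleaner.yml can reference it by name
docker build -t spark-block-cleaner:latest ./tools/spark-block-cleaner/kubernetes/docker
```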

### Modify spark-block-cleaner.yml

You need to modify the `${KYUUBI_HOME}/tools/spark-block-cleaner/kubernetes/spark-block-cleaner.yml` to fit your current environment.

We recommend deploying the tool as a `DaemonSet`, and the default YAML file is written for that deployment mode.

Base file structure:

```yaml
apiVersion
kind
metadata
  name
  namespace
spec
  selector
  template
    metadata
    spec
      containers
        - image
        - volumeMounts
        - env
      volumes
```

You can tune the behavior of Spark Block Cleaner through the parameters in the containers' `env` section of `spark-block-cleaner.yml`.

```yaml
env:
  - name: CACHE_DIRS
    value: /data/data1,/data/data2
  - name: FILE_EXPIRED_TIME
    value: "604800"
  - name: DEEP_CLEAN_FILE_EXPIRED_TIME
    value: "432000"
  - name: FREE_SPACE_THRESHOLD
    value: "60"
  - name: SCHEDULE_INTERVAL
    value: "3600"
```
Most importantly, configure `volumeMounts` and `volumes` to match Spark's local dirs. For example, if Spark uses `/spark/shuffle1` as a local dir, you can configure:
```yaml
volumes:
  - name: block-files-dir-1
    hostPath:
      path: /spark/shuffle1
```
```yaml
volumeMounts:
  - name: block-files-dir-1
    mountPath: /data/data1
```
```yaml
env:
  - name: CACHE_DIRS
    value: /data/data1
```
### Start the DaemonSet
After finishing the modifications above, start the DaemonSet with `kubectl apply -f ${KYUUBI_HOME}/tools/spark-block-cleaner/kubernetes/spark-block-cleaner.yml`.
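To verify that the cleaner is running on every node, you can check the DaemonSet and its pod logs. The DaemonSet name and label below are assumptions based on the default `spark-block-cleaner.yml`; adjust them to match your deployment:

```shell
# Confirm a cleaner pod is scheduled on each node
kubectl get daemonset spark-block-cleaner -o wide

# Tail the logs of the cleaner pods
kubectl logs -l app=spark-block-cleaner --tail=50
```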

## Related parameters

| Name                         | Default                 | Unit    | Meaning                                                                                 |
|------------------------------|-------------------------|---------|-----------------------------------------------------------------------------------------|
| CACHE_DIRS                   | /data/data1,/data/data2 |         | The target directories (container paths) whose block files will be cleaned.             |
| FILE_EXPIRED_TIME            | 604800                  | seconds | A normal clean deletes block files whose last modification is older than this.          |
| DEEP_CLEAN_FILE_EXPIRED_TIME | 432000                  | seconds | A deep clean deletes block files whose last modification is older than this.            |
| FREE_SPACE_THRESHOLD         | 60                      | %       | If free space is below this threshold after a normal clean, a deep clean is triggered.  |
| SCHEDULE_INTERVAL            | 3600                    | seconds | How long the cleaner sleeps between cleaning passes.                                    |

**Note**:
This tool has been removed since Kyuubi 1.9.0.
7 changes: 0 additions & 7 deletions pom.xml
@@ -2376,13 +2376,6 @@
</properties>
</profile>

-    <profile>
-      <id>spark-block-cleaner</id>
-      <modules>
-        <module>tools/spark-block-cleaner</module>
-      </modules>
-    </profile>

<profile>
<id>spotless-python</id>
<properties>
34 changes: 0 additions & 34 deletions tools/spark-block-cleaner/kubernetes/docker/Dockerfile

This file was deleted.

23 changes: 0 additions & 23 deletions tools/spark-block-cleaner/kubernetes/docker/entrypoint.sh

This file was deleted.

75 changes: 0 additions & 75 deletions tools/spark-block-cleaner/kubernetes/spark-block-cleaner.yml

This file was deleted.

53 changes: 0 additions & 53 deletions tools/spark-block-cleaner/pom.xml

This file was deleted.

