# Compacting multiple .parquet files using Ray

Improving read/write performance by reducing the overall number of files.
Code for compaction can be found in Ray-Only.py. Required packages are listed in Requirements.txt.
Ensure pip is up to date, then install the dependencies listed in Requirements.txt:
!pip install --upgrade pip
!pip install -r Requirements.txt
Check that the AWS credentials are configured correctly:
!aws configure list
Import the required packages:
from ray.util import inspect_serializability
import ray
import pyarrow.fs as fs
import pandas as pd
Create a filesystem object that can specify the access key, secret key, and a custom endpoint. This step can be skipped if no custom endpoint is needed and the credentials are already configured for awscli.
fs_pyarrow = fs.S3FileSystem(endpoint_override="<your-custom-endpoint>")
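If you prefer not to rely on awscli or environment configuration, pyarrow's S3FileSystem also accepts the credentials directly. A minimal sketch, with placeholder values for the access key, secret key, and endpoint:

```python
import pyarrow.fs as fs

# Pass the credentials explicitly instead of relying on awscli/environment
# configuration. All three values below are placeholders.
fs_pyarrow = fs.S3FileSystem(
    access_key="<your-access-key>",
    secret_key="<your-secret-key>",
    endpoint_override="<your-custom-endpoint>",
)
```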
Read the parquet files from the remote path:
df = ray.data.read_parquet(paths="path/to/parquet", filesystem=fs_pyarrow)
Note: Specifying paths to multiple directories does not work with read_parquet. Issue and workaround: ray-project/ray#24598. One possible way around this is sketched below.
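A possible approach (not necessarily the workaround described in the linked issue) is to read each directory into its own dataset and then combine them with Dataset.union. The directory paths below are placeholders:

```python
import ray
import pyarrow.fs as fs

fs_pyarrow = fs.S3FileSystem(endpoint_override="<your-custom-endpoint>")

# Read each directory into its own Ray dataset.
datasets = [
    ray.data.read_parquet(paths=path, filesystem=fs_pyarrow)
    for path in ["bucket/dir-a", "bucket/dir-b"]  # placeholder directories
]

# Combine them into a single dataset before repartitioning and writing.
df = datasets[0].union(*datasets[1:])
```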
Write back a single parquet file. The number of output files is controlled by .repartition(<number-of-files>).
df.repartition(1).write_parquet(path="path/to/destination", filesystem=fs_pyarrow)
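To compact into a handful of larger files rather than a single one, repartition to the desired file count before writing. A short sketch, assuming eight output files as an arbitrary target:

```python
# Inspect how many blocks (and hence how many files) the dataset currently has.
print(df.num_blocks())

# Compact into 8 output files; each block is written as one parquet file.
df.repartition(8).write_parquet(path="path/to/destination", filesystem=fs_pyarrow)
```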
Set up a Ray cluster on OpenShift and add a JupyterHub notebook image that can connect to it. Refer to this PR: opendatahub-io/odh-manifests#573
Running the kustomize script sets up the Ray operator and adds the ray-ml-notebook image to ODH JupyterHub.
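From a notebook spawned with that image, the cluster can typically be reached through the Ray client. A minimal sketch, assuming a hypothetical head-node service name and the default Ray client port:

```python
import ray

# Connect to the Ray head node over the Ray client protocol.
# "ray-cluster-head-svc" and port 10001 are assumptions; substitute the
# service name and port exposed by your Ray operator deployment.
ray.init(address="ray://ray-cluster-head-svc:10001")

# Sanity check that the cluster is reachable.
print(ray.cluster_resources())
```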
The Ray dashboard can be accessed from the OpenShift console under Networking -> Routes. The process followed and the benchmarking results can be found here: https://docs.google.com/document/d/1jNP8azr3v3yRjtoT3uV5qKxdG9TNo91hRKqwrzUWpvM/edit?usp=sharing