Fix: Resolve PyArrow nested data conversion error in distributed dataset loading #281
Problem
When training with the deepcoder dataset (or any dataset with nested structures), the following error occurred during dataset loading in distributed Ray contexts:
```
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
```
This prevented the training pipeline from starting, as Ray workers couldn't load the Parquet dataset files.
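For context, this is roughly where the failure surfaces, assuming a plain `pyarrow.parquet` read followed by a pandas conversion; the actual loading code inside the Ray workers lives in verl and may differ, and the file path below is a placeholder:

```python
import pyarrow.parquet as pq

# Placeholder path for one of the Parquet files written during preprocessing.
table = pq.read_table("deepcoder_train.parquet")

# Table columns are ChunkedArrays. When a nested (list/struct) column is split
# across chunks, a conversion step like the one below is the kind of place where
# PyArrow reports "Nested data conversions not implemented for chunked array outputs".
df = table.to_pandas()
```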
Root Cause
The `apply_verl_postprocessing()` method was creating dataset entries with nested structures:

- `prompt`: a list of dicts, e.g. `[{"role": "user", "content": "..."}]`
- `reward_model`: a nested dict, e.g. `{"style": "rule", "ground_truth": None}`
- `extra_info`: a nested dict containing the original dataset entry

When these nested structures were saved to Parquet and later loaded by PyArrow in a distributed context (using chunked arrays), PyArrow failed because it cannot handle nested data conversions for chunked array outputs.
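Concretely, a single post-processed entry has roughly this shape (the content values are hypothetical placeholders, not real deepcoder data):

```python
# Hypothetical example of one entry produced by apply_verl_postprocessing();
# the exact keys inside extra_info depend on the original dataset.
entry = {
    "prompt": [{"role": "user", "content": "Write a function that ..."}],  # list of dicts
    "reward_model": {"style": "rule", "ground_truth": None},               # nested dict
    "extra_info": {"index": 0, "question": "..."},                         # nested dict holding the original entry
}
```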
Additionally, HuggingFace datasets often contain NumPy types (`numpy.int64`, `numpy.ndarray`, etc.) which are not JSON-serializable, causing secondary serialization errors.
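A minimal illustration of that secondary failure, assuming only that NumPy is installed; the dict below is a made-up example rather than real dataset content:

```python
import json

import numpy as np

entry = {"index": np.int64(3), "embedding": np.array([0.1, 0.2])}

try:
    json.dumps(entry)  # raises TypeError: NumPy scalars/arrays are not JSON-serializable
except TypeError as exc:
    print(f"serialization failed: {exc}")

# Converting to builtin Python types first makes the entry serializable.
json.dumps({"index": int(entry["index"]), "embedding": entry["embedding"].tolist()})
```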
Solution
Modified `rllm/rllm/data/dataset.py` to flatten nested structures before saving to Parquet:

1. Added a `_convert_to_json_serializable()` helper method (sketched below) that recursively converts non-serializable types:
   - `numpy.ndarray` → Python `list`
   - `numpy.int64`/`int32`/etc. → Python `int`
   - `numpy.float64`/`float32`/etc. → Python `float`
2. Updated `apply_verl_postprocessing()` to serialize the nested fields (`prompt`, `reward_model`, `extra_info`) to strings.
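A minimal sketch of how the two changes could fit together, based only on the behavior described above; in the repo they live as methods in `rllm/rllm/data/dataset.py`, so the free-function form and signatures here are assumptions, and only the flattening step is shown:

```python
import json

import numpy as np


def _convert_to_json_serializable(obj):
    """Recursively convert NumPy arrays/scalars into builtin Python types."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, dict):
        return {k: _convert_to_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_convert_to_json_serializable(v) for v in obj]
    return obj


def apply_verl_postprocessing(entry):
    """Flatten nested fields into JSON strings so Parquet columns stay non-nested."""
    processed = dict(entry)
    for field in ("prompt", "reward_model", "extra_info"):
        if field in processed:
            processed[field] = json.dumps(_convert_to_json_serializable(processed[field]))
    return processed
```

After loading the Parquet files, consumers would decode the flattened columns back with `json.loads(row["prompt"])` and so on; this is implied by the string serialization rather than stated explicitly in this PR.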
Changes
File: `rllm/rllm/data/dataset.py`

Testing
Trade-offs
Impact