Conversation

@erranlli erranlli commented Oct 31, 2025

Problem

When training with the deepcoder dataset (or any dataset with nested structures), the following error occurred during dataset loading in distributed Ray contexts:
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

This prevented the training pipeline from starting, as Ray workers couldn't load the Parquet dataset files.

Root Cause

The apply_verl_postprocessing() method was creating dataset entries with nested structures:

  • prompt: list of dicts [{"role": "user", "content": "..."}]
  • reward_model: nested dict {"style": "rule", "ground_truth": None}
  • extra_info: nested dict containing the original dataset entry

When these nested structures were saved to Parquet and later loaded by PyArrow in a distributed context (using chunked arrays), PyArrow failed because it cannot handle nested data conversions for chunked array outputs.
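
For illustration, a single entry under the old format looked roughly like this (values are placeholders, not taken from the actual dataset):

entry = {
    "prompt": [{"role": "user", "content": "placeholder"}],      # list of dicts
    "reward_model": {"style": "rule", "ground_truth": None},     # nested dict
    "extra_info": {"question": "...", "index": 0},               # nested dict holding the original entry
}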

Additionally, HuggingFace datasets often contain NumPy types (numpy.int64, numpy.ndarray, etc.) which are not JSON-serializable, causing secondary serialization errors.
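
A minimal, self-contained illustration of that secondary failure (generic Python/NumPy behavior, not code from this repo):

import json
import numpy as np

entry = {"count": np.int64(3), "values": np.array([1.0, 2.0])}

try:
    json.dumps(entry)
except TypeError as e:
    print(f"json.dumps failed: {e}")  # e.g. "Object of type int64 is not JSON serializable"

# After converting to native Python types, serialization succeeds.
native = {"count": int(entry["count"]), "values": entry["values"].tolist()}
print(json.dumps(native))  # {"count": 3, "values": [1.0, 2.0]}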

Solution

Modified rllm/rllm/data/dataset.py to flatten nested structures before saving to Parquet:

1. Added _convert_to_json_serializable() helper method

Recursively converts non-serializable types:

  • numpy.ndarray → Python list
  • numpy.int64/int32/etc. → Python int
  • numpy.float64/float32/etc. → Python float
  • Recursively processes nested dicts and lists

2. Updated apply_verl_postprocessing()

  • Converts numpy types to Python native types first
  • JSON-serializes all nested structures (prompt, reward_model, extra_info) to strings
  • Parquet files now contain only flat string fields, avoiding PyArrow limitations
  • Application code deserializes the JSON strings back into dicts/lists when it needs them (see the sketch below)
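
A sketch of that consumer side; the helper name and column access here are illustrative, not code from this PR:

import json

def decode_verl_row(row: dict) -> dict:
    """Hypothetical helper: decode the flat JSON-string fields written by apply_verl_postprocessing()."""
    return {
        "prompt": json.loads(row["prompt"]),              # back to a list of chat messages
        "reward_model": json.loads(row["reward_model"]),  # back to a dict
        "extra_info": json.loads(row["extra_info"]),      # back to the original (numpy-free) entry
    }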

Changes

File: rllm/rllm/data/dataset.py

+ import numpy as np

+ @staticmethod
+ def _convert_to_json_serializable(obj: Any) -> Any:
+     """Convert numpy arrays and other non-serializable objects to JSON-serializable types."""
+     if isinstance(obj, np.ndarray):
+         return obj.tolist()
+     elif isinstance(obj, (np.integer, np.floating)):
+         return obj.item()
+     elif isinstance(obj, dict):
+         return {key: DatasetRegistry._convert_to_json_serializable(value) for key, value in obj.items()}
+     elif isinstance(obj, (list, tuple)):
+         return [DatasetRegistry._convert_to_json_serializable(item) for item in obj]
+     else:
+         return obj

  @classmethod
  def apply_verl_postprocessing(cls, data: list[dict[str, Any]]) -> list[dict[str, Any]]:
      """Apply Verl postprocessing to the dataset."""
      processed_data = []
      for entry in data:
+         # Convert numpy arrays to lists before JSON serialization
+         serializable_entry = cls._convert_to_json_serializable(entry)
+         
          processed_entry = {
-             "prompt": [{"role": "user", "content": "placeholder"}],
-             "reward_model": {"style": "rule", "ground_truth": None},
-             "extra_info": entry,
+             # Serialize nested structures as JSON strings to avoid PyArrow chunked array issues
+             "prompt": json.dumps([{"role": "user", "content": "placeholder"}]),
+             "reward_model": json.dumps({"style": "rule", "ground_truth": None}),
+             "extra_info": json.dumps(serializable_entry),
          }
          processed_data.append(processed_entry)
      return processed_data
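
A minimal round-trip sketch in the new flat format (pandas/pyarrow are used here only for illustration; the real pipeline goes through the project's dataset tooling):

import json
import pandas as pd

# Every field is a JSON string, so the Parquet schema contains only string
# columns and PyArrow never has to convert nested chunked arrays.
rows = [
    {
        "prompt": json.dumps([{"role": "user", "content": "placeholder"}]),
        "reward_model": json.dumps({"style": "rule", "ground_truth": None}),
        "extra_info": json.dumps({"question": "...", "index": 0}),
    }
]

pd.DataFrame(rows).to_parquet("train.parquet")
loaded = pd.read_parquet("train.parquet")
print(json.loads(loaded.loc[0, "prompt"]))  # [{'role': 'user', 'content': 'placeholder'}]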

Testing

  • ✅ Created comprehensive tests verifying numpy array conversion
  • ✅ Verified nested structures are properly JSON-serialized
  • ✅ Confirmed Parquet save/load works in both single and distributed contexts
  • ✅ Validated data integrity after round-trip serialization
  • ✅ Successfully prepared deepcoder dataset (24,287 train + 687 test examples)
  • ✅ Training pipeline now starts without PyArrow errors

Trade-offs

  • Slight performance overhead from JSON encode/decode operations
  • The JSON round-trip is necessary for PyArrow compatibility and enables reliable distributed training
  • The overhead is minimal compared to the training compute time

Impact

  • Fixes: Blocking issue preventing deepcoder training from starting
  • Enables: Distributed training with complex nested dataset structures
  • Prevents: PyArrow chunked array errors across all datasets with nested structures

@erranlli erranlli changed the title from "fix: Gracefully skip overlong prompts during training to prevent crashes" to "Fix: Resolve PyArrow nested data conversion error in distributed dataset loading" Oct 31, 2025
@jeffreysijuntan jeffreysijuntan merged commit 2956f86 into rllm-org:main Oct 31, 2025
1 check passed