Conversation

@erranlli erranlli commented Oct 31, 2025

Problem

When training with the deepcoder dataset (or any dataset with nested structures), the following error occurred during dataset loading in distributed Ray contexts:
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs

This prevented the training pipeline from starting, as Ray workers couldn't load the Parquet dataset files.

Root Cause

The apply_verl_postprocessing() method was creating dataset entries with nested structures:

  • prompt: list of dicts [{"role": "user", "content": "..."}]
  • reward_model: nested dict {"style": "rule", "ground_truth": None}
  • extra_info: nested dict containing the original dataset entry

When these nested structures were saved to Parquet and later loaded by PyArrow in a distributed context (using chunked arrays), PyArrow failed because it cannot handle nested data conversions for chunked array outputs.
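
For illustration, a single entry under the old format looked roughly like this (values are placeholders, not taken from the actual dataset):

entry = {
    "prompt": [{"role": "user", "content": "placeholder"}],      # list of dicts
    "reward_model": {"style": "rule", "ground_truth": None},     # nested dict
    "extra_info": {"question": "...", "index": 0},               # nested dict holding the original entry
}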

Additionally, HuggingFace datasets often contain NumPy types (numpy.int64, numpy.ndarray, etc.) which are not JSON-serializable, causing secondary serialization errors.
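
A minimal, self-contained illustration of that secondary failure (generic Python/NumPy behavior, not code from this repo):

import json
import numpy as np

entry = {"count": np.int64(3), "values": np.array([1.0, 2.0])}

try:
    json.dumps(entry)
except TypeError as e:
    print(f"json.dumps failed: {e}")  # e.g. "Object of type int64 is not JSON serializable"

# After converting to native Python types, serialization succeeds.
native = {"count": int(entry["count"]), "values": entry["values"].tolist()}
print(json.dumps(native))  # {"count": 3, "values": [1.0, 2.0]}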

Solution

Modified rllm/rllm/data/dataset.py to flatten nested structures before saving to Parquet:

1. Added _convert_to_json_serializable() helper method

Recursively converts non-serializable types:

  • numpy.ndarray → Python list
  • numpy.int64/int32/etc. → Python int
  • numpy.float64/float32/etc. → Python float
  • Recursively processes nested dicts and lists

2. Updated apply_verl_postprocessing()

  • Converts numpy types to Python native types first
  • JSON-serializes all nested structures (prompt, reward_model, extra_info) to strings
  • Parquet files now contain only flat string fields, avoiding PyArrow limitations
  • Application code deserializes the JSON strings back into dicts/lists when it needs them (see the sketch below)
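
A sketch of that consumer side; the helper name and column access here are illustrative, not code from this PR:

import json

def decode_verl_row(row: dict) -> dict:
    """Hypothetical helper: decode the flat JSON-string fields written by apply_verl_postprocessing()."""
    return {
        "prompt": json.loads(row["prompt"]),              # back to a list of chat messages
        "reward_model": json.loads(row["reward_model"]),  # back to a dict
        "extra_info": json.loads(row["extra_info"]),      # back to the original (numpy-free) entry
    }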

Changes

File: rllm/rllm/data/dataset.py

+ import numpy as np

+ @staticmethod
+ def _convert_to_json_serializable(obj: Any) -> Any:
+     """Convert numpy arrays and other non-serializable objects to JSON-serializable types."""
+     if isinstance(obj, np.ndarray):
+         return obj.tolist()
+     elif isinstance(obj, (np.integer, np.floating)):
+         return obj.item()
+     elif isinstance(obj, dict):
+         return {key: DatasetRegistry._convert_to_json_serializable(value) for key, value in obj.items()}
+     elif isinstance(obj, (list, tuple)):
+         return [DatasetRegistry._convert_to_json_serializable(item) for item in obj]
+     else:
+         return obj

  @classmethod
  def apply_verl_postprocessing(cls, data: list[dict[str, Any]]) -> list[dict[str, Any]]:
      """Apply Verl postprocessing to the dataset."""
      processed_data = []
      for entry in data:
+         # Convert numpy arrays to lists before JSON serialization
+         serializable_entry = cls._convert_to_json_serializable(entry)
+         
          processed_entry = {
-             "prompt": [{"role": "user", "content": "placeholder"}],
-             "reward_model": {"style": "rule", "ground_truth": None},
-             "extra_info": entry,
+             # Serialize nested structures as JSON strings to avoid PyArrow chunked array issues
+             "prompt": json.dumps([{"role": "user", "content": "placeholder"}]),
+             "reward_model": json.dumps({"style": "rule", "ground_truth": None}),
+             "extra_info": json.dumps(serializable_entry),
          }
          processed_data.append(processed_entry)
      return processed_data
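
A minimal round-trip sketch in the new flat format (pandas/pyarrow are used here only for illustration; the real pipeline goes through the project's dataset tooling):

import json
import pandas as pd

# Every field is a JSON string, so the Parquet schema contains only string
# columns and PyArrow never has to convert nested chunked arrays.
rows = [
    {
        "prompt": json.dumps([{"role": "user", "content": "placeholder"}]),
        "reward_model": json.dumps({"style": "rule", "ground_truth": None}),
        "extra_info": json.dumps({"question": "...", "index": 0}),
    }
]

pd.DataFrame(rows).to_parquet("train.parquet")
loaded = pd.read_parquet("train.parquet")
print(json.loads(loaded.loc[0, "prompt"]))  # [{'role': 'user', 'content': 'placeholder'}]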

Testing

  • ✅ Created comprehensive tests verifying numpy array conversion
  • ✅ Verified nested structures are properly JSON-serialized
  • ✅ Confirmed Parquet save/load works in both single and distributed contexts
  • ✅ Validated data integrity after round-trip serialization
  • ✅ Successfully prepared deepcoder dataset (24,287 train + 687 test examples)
  • ✅ Training pipeline now starts without PyArrow errors

Trade-offs

  • Slight performance overhead from JSON encode/decode operations
  • The JSON round-trip is necessary for PyArrow compatibility and enables reliable distributed training
  • The overhead is minimal compared to the training compute time

Impact

  • Fixes: Blocking issue preventing deepcoder training from starting
  • Enables: Distributed training with complex nested dataset structures
  • Prevents: PyArrow chunked array errors across all datasets with nested structures

@erranlli erranlli changed the title from "fix: Gracefully skip overlong prompts during training to prevent crashes" to "Fix: Resolve PyArrow nested data conversion error in distributed dataset loading" Oct 31, 2025
@jeffreysijuntan jeffreysijuntan merged commit 2956f86 into rllm-org:main Oct 31, 2025
1 check passed