Failed to load dataset #21

obananas · 2025-02-10T06:45:19Z

Running load_dataset("allenai/OLMoE-mix-0924") gives me the following error.
(I also tried re-downloading the dataset, but it gave me the same error.)

[Error code: DatasetGenerationCastError
Exception: DatasetGenerationCastError
Message: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 3 new columns ({'doc', 'metadata', 'attributes'})

This happened while the json dataset builder was generating data using

gzip://algebraic-stack-train-0000.json::hf://datasets/allenai/OLMoE-mix-0924@1e44595eaffc7491dfab23947ea4d5a62b33aff3/data/algebraic-stack/algebraic-stack-train-0000.json.gz

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
Traceback: Traceback (most recent call last):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1870, in _prepare_split_single
writer.write_table(table)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 622, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2292, in table_cast
return cast_table_to_schema(table, schema)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
raise CastError(
datasets.table.CastError: Couldn't cast
added: string
attributes: struct<paloma_paragraphs: list<item: list<item: int64>>>
child 0, paloma_paragraphs: list<item: list<item: int64>>
child 0, item: list<item: int64>
child 0, item: int64
created: string
doc: struct<arxiv_id: string, language: string, timestamp: timestamp[s], url: string, yymm: string>
child 0, arxiv_id: string
child 1, language: string
child 2, timestamp: timestamp[s]
child 3, url: string
child 4, yymm: string
id: string
metadata: struct<provenance: string>
child 0, provenance: string
text: string
to
{'id': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'added': Value(dtype='string', id=None), 'created': Value(dtype='string', id=None)}
because column names don't match

During handling of the above exception, another exception occurred:

          Traceback (most recent call last):
            File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1412, in compute_config_parquet_and_info_response
              parquet_operations, partial, estimated_dataset_info = stream_convert_to_parquet(
            File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 988, in stream_convert_to_parquet
              builder._prepare_split(
            File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1741, in _prepare_split
              for job_id, done, content in self._prepare_split_single(
            File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1872, in _prepare_split_single
              raise DatasetGenerationCastError.from_cast_error(
          datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset
          
          All the data files must have the same columns, but at some point there are 3 new columns ({'doc', 'metadata', 'attributes'})
          
          This happened while the json dataset builder was generating data using
          
          gzip://algebraic-stack-train-0000.json::hf://datasets/allenai/OLMoE-mix-0924@1e44595eaffc7491dfab23947ea4d5a62b33aff3/data/algebraic-stack/algebraic-stack-train-0000.json.gz
          
          Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)]


Muennighoff · 2025-02-13T03:57:11Z

Is the recommended solution to load each part separately, e.g. via the data_files argument, or is this a bug we should fix? cc @soldni @kyleclo in case you know!
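For reference, loading one part at a time could look roughly like the sketch below. It is not a confirmed fix. It assumes every subset lives under data/<subset>/ as .json.gz shards (the layout seen in the traceback path for algebraic-stack) and that all shards within one subset share the same columns. The helper names subset_pattern and load_subset are made up for illustration.

```python
REPO_ID = "allenai/OLMoE-mix-0924"


def subset_pattern(subset: str) -> str:
    """Glob for one subset's shards.

    Assumption: shards are stored as data/<subset>/*.json.gz, as in the
    traceback path data/algebraic-stack/algebraic-stack-train-0000.json.gz.
    """
    return f"data/{subset}/*.json.gz"


def load_subset(subset: str, streaming: bool = True):
    """Load a single subset so its schema is inferred from its own shards only."""
    # Imported here so subset_pattern() stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID,
        data_files=subset_pattern(subset),  # restrict the builder to one subset
        split="train",                      # a bare data_files pattern maps to "train"
        streaming=streaming,                # avoid materializing the subset on disk
    )


# Example (needs network access):
#   ds = load_subset("algebraic-stack")
#   print(next(iter(ds)).keys())
```

Restricting data_files to one directory keeps the json builder from casting shards with extra columns (doc, metadata, attributes) to a schema inferred from a different subset.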

Muennighoff · 2025-02-19T17:05:10Z

@obananas were you able to resolve this issue?

obananas · 2025-02-20T02:58:46Z

I have switched to using the snapshot_download function from huggingface_hub instead.
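A minimal sketch of that workaround, for anyone landing here. The exact call used above was not posted, so the arguments are assumptions: the allow_patterns glob assumes the data/<subset>/*.json.gz layout from the traceback, and the reader assumes each shard is gzip-compressed JSON Lines (one record per line). The names download_subset and iter_records are hypothetical helpers.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator

REPO_ID = "allenai/OLMoE-mix-0924"


def download_subset(subset: str, local_dir: str) -> Path:
    """Download the raw shards of one subset and return their local directory."""
    # Imported here so iter_records() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    root = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",                          # default is "model"
        allow_patterns=[f"data/{subset}/*.json.gz"],  # assumed layout
        local_dir=local_dir,
    )
    return Path(root) / "data" / subset


def iter_records(shard: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSON Lines shard."""
    with gzip.open(shard, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Example (needs network access and disk space):
#   shard_dir = download_subset("algebraic-stack", "./olmoe-mix")
#   for shard in sorted(shard_dir.glob("*.json.gz")):
#       for record in iter_records(shard):
#           print(record["id"])
#           break
```

Reading the files directly sidesteps the schema cast entirely, since each record is just a dict and subsets with extra columns need no special handling.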

Muennighoff assigned soldni Feb 10, 2025
