Failed to load dataset #21

Open
obananas opened this issue Feb 10, 2025 · 3 comments

obananas commented Feb 10, 2025

Running load_dataset("allenai/OLMoE-mix-0924") produces the following error. I also tried re-downloading the dataset, but it gave me the same error:

[Error code: DatasetGenerationCastError
Exception: DatasetGenerationCastError
Message: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 3 new columns ({'doc', 'metadata', 'attributes'})

This happened while the json dataset builder was generating data using

gzip://algebraic-stack-train-0000.json::hf://datasets/allenai/OLMoE-mix-0924@1e44595eaffc7491dfab23947ea4d5a62b33aff3/data/algebraic-stack/algebraic-stack-train-0000.json.gz

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
Traceback: Traceback (most recent call last):
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1870, in _prepare_split_single
    writer.write_table(table)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 622, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2292, in table_cast
    return cast_table_to_schema(table, schema)
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
    raise CastError(
datasets.table.CastError: Couldn't cast
added: string
attributes: struct<paloma_paragraphs: list<item: list<item: int64>>>
  child 0, paloma_paragraphs: list<item: list<item: int64>>
      child 0, item: list<item: int64>
          child 0, item: int64
created: string
doc: struct<arxiv_id: string, language: string, timestamp: timestamp[s], url: string, yymm: string>
  child 0, arxiv_id: string
  child 1, language: string
  child 2, timestamp: timestamp[s]
  child 3, url: string
  child 4, yymm: string
id: string
metadata: struct<provenance: string>
  child 0, provenance: string
text: string
to
{'id': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'added': Value(dtype='string', id=None), 'created': Value(dtype='string', id=None)}
because column names don't match

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1412, in compute_config_parquet_and_info_response
    parquet_operations, partial, estimated_dataset_info = stream_convert_to_parquet(
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 988, in stream_convert_to_parquet
    builder._prepare_split(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1741, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1872, in _prepare_split_single
    raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 3 new columns ({'doc', 'metadata', 'attributes'})

This happened while the json dataset builder was generating data using

gzip://algebraic-stack-train-0000.json::hf://datasets/allenai/OLMoE-mix-0924@1e44595eaffc7491dfab23947ea4d5a62b33aff3/data/algebraic-stack/algebraic-stack-train-0000.json.gz

Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)]
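
The message says that some files carry three extra top-level columns (doc, metadata, attributes) beyond the id/text/added/created schema the builder inferred first. A minimal sketch for checking which local files disagree is below. It assumes the files are gzip-compressed JSON Lines (one object per line) and are already on disk; first_record_keys and group_files_by_columns are hypothetical helper names, not part of the datasets library.

```python
import gzip
import json
from collections import defaultdict


def first_record_keys(path):
    """Return the top-level keys of the first non-empty JSON line in a .json.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                return frozenset(json.loads(line).keys())
    return frozenset()  # empty file: no columns


def group_files_by_columns(paths):
    """Map each distinct column set to the list of files that use it."""
    groups = defaultdict(list)
    for path in paths:
        groups[first_record_keys(path)].append(str(path))
    return dict(groups)
```

If group_files_by_columns returns more than one key, the files in each group would need their own configuration (or their own load_dataset call) to avoid the cast error. Only the first record of each file is inspected, so this is a quick check, not a full schema validation.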
@Muennighoff
Collaborator

Is the recommended solution to load each part separately, for example via the data_files argument, or is this a bug we should fix? cc @soldni @kyleclo in case you know!
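
Loading one part at a time could look like the sketch below. It assumes the argument meant here is the data_files parameter of load_dataset, that each subdirectory under data/ holds .json.gz files with a consistent schema (the directory layout is inferred from the path in the traceback), and that a single "train" split is acceptable. subset_pattern and load_subset are hypothetical helper names.

```python
def subset_pattern(subset: str) -> str:
    """Glob matching the files of one subdirectory under data/ in the dataset repo."""
    return f"data/{subset}/*.json.gz"


def load_subset(subset: str, repo_id: str = "allenai/OLMoE-mix-0924", streaming: bool = True):
    """Load a single subset so the builder never mixes files with different columns."""
    # Imported lazily so the pattern helper can be used without datasets installed.
    from datasets import load_dataset

    return load_dataset(
        repo_id,
        data_files=subset_pattern(subset),
        split="train",
        streaming=streaming,  # avoids downloading and converting everything up front
    )
```

For example, load_subset("algebraic-stack") would restrict the builder to the directory named in the error message. Whether every file inside one directory shares the same columns is not confirmed in this thread.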

@Muennighoff
Collaborator

@obananas were you able to resolve this issue?

@obananas
Author

I switched to the snapshot_download function from huggingface_hub instead.
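
For reference, a minimal sketch of that workaround follows. The exact call used was not posted, so the arguments here are assumptions: repo_type="dataset" is needed for dataset repos, and allow_patterns is optional and only shown as a way to limit the download to some subdirectories. snapshot_kwargs and download_raw_files are hypothetical helper names.

```python
def snapshot_kwargs(subsets=None, local_dir=None, repo_id="allenai/OLMoE-mix-0924"):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {"repo_id": repo_id, "repo_type": "dataset"}
    if subsets:
        # Restrict the download to the chosen subdirectories under data/.
        kwargs["allow_patterns"] = [f"data/{name}/*" for name in subsets]
    if local_dir:
        kwargs["local_dir"] = local_dir
    return kwargs


def download_raw_files(subsets=None, local_dir=None):
    """Download the raw files without going through the datasets builder."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**snapshot_kwargs(subsets, local_dir))
```

snapshot_download returns the local folder path. Because it copies the files as-is, no schema cast takes place, and the .json.gz files can then be read directly.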
