Traceback: Traceback (most recent call last):
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1870, in _prepare_split_single
writer.write_table(table)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 622, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2292, in table_cast
return cast_table_to_schema(table, schema)
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/table.py", line 2240, in cast_table_to_schema
raise CastError(
datasets.table.CastError: Couldn't cast
added: string
attributes: struct<paloma_paragraphs: list<item: list<item: int64>>>
child 0, paloma_paragraphs: list<item: list<item: int64>>
child 0, item: list<item: int64>
child 0, item: int64
created: string
doc: struct<arxiv_id: string, language: string, timestamp: timestamp[s], url: string, yymm: string>
child 0, arxiv_id: string
child 1, language: string
child 2, timestamp: timestamp[s]
child 3, url: string
child 4, yymm: string
id: string
metadata: struct<provenance: string>
child 0, provenance: string
text: string
to
{'id': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'added': Value(dtype='string', id=None), 'created': Value(dtype='string', id=None)}
because column names don't match
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1412, in compute_config_parquet_and_info_response
parquet_operations, partial, estimated_dataset_info = stream_convert_to_parquet(
File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 988, in stream_convert_to_parquet
builder._prepare_split(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1741, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1872, in _prepare_split_single
raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 3 new columns ({'doc', 'metadata', 'attributes'})
This happened while the json dataset builder was generating data using
gzip://algebraic-stack-train-0000.json::hf://datasets/allenai/OLMoE-mix-0924@1e44595eaffc7491dfab23947ea4d5a62b33aff3/data/algebraic-stack/algebraic-stack-train-0000.json.gz
Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)]
Is the recommended solution to load each part separately via the data_files argument, or is this a bug we should fix? cc @soldni @kyleclo in case you know!
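For reference, loading one part at a time could look roughly like the sketch below. It is only a sketch: the data/<subset>/*.json.gz layout is inferred from the single path in the error message, and the helper names subset_pattern and load_subset are made up for illustration. Whether every file inside one subset shares the same columns is also an assumption.

```python
from fnmatch import fnmatch

REPO_ID = "allenai/OLMoE-mix-0924"


def subset_pattern(subset: str) -> str:
    """Glob for one subset's files (hypothetical helper).

    The data/<subset>/ layout is assumed from the path in the error message.
    """
    return f"data/{subset}/*.json.gz"


def load_subset(subset: str, streaming: bool = True):
    """Load a single subset so files with different columns are never mixed."""
    # Imported lazily so subset_pattern() works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        REPO_ID,
        data_files=subset_pattern(subset),
        split="train",
        streaming=streaming,
    )


# The file named in the error message matches the pattern for its subset.
failing_file = "data/algebraic-stack/algebraic-stack-train-0000.json.gz"
assert fnmatch(failing_file, subset_pattern("algebraic-stack"))
```

Calling load_subset("algebraic-stack") would then stream only that directory. Defining separate configurations in the dataset card, as the error message suggests, is the alternative for making a plain load_dataset call work with a config name.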
By running load_dataset("allenai/OLMoE-mix-0924"), I got the following error (I also tried re-downloading the dataset, but it gave me the same error):
[Error code: DatasetGenerationCastError
Exception: DatasetGenerationCastError
Message: An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 3 new columns ({'doc', 'metadata', 'attributes'})
This happened while the json dataset builder was generating data using
gzip://algebraic-stack-train-0000.json::hf://datasets/allenai/OLMoE-mix-0924@1e44595eaffc7491dfab23947ea4d5a62b33aff3/data/algebraic-stack/algebraic-stack-train-0000.json.gz
Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
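To make the "3 new columns" part of the message concrete, here is a small sketch of the comparison it describes. The expected column names come from the schema in the traceback and the extra ones from the reported file; the record values are placeholders, and new_columns is a hypothetical helper, not part of the datasets library.

```python
# Columns the builder had inferred before it reached the failing file
# (taken from the target schema printed in the traceback).
EXPECTED_COLUMNS = {"id", "text", "added", "created"}


def new_columns(record: dict, expected: set) -> set:
    """Return the top-level keys of `record` that are not in `expected`."""
    return set(record) - expected


# Placeholder record with the top-level fields reported for
# algebraic-stack-train-0000.json.gz; the values are invented.
sample_record = {
    "added": "",
    "attributes": {"paloma_paragraphs": [[0, 1]]},
    "created": "",
    "doc": {"arxiv_id": "", "language": "", "url": "", "yymm": ""},
    "id": "",
    "metadata": {"provenance": ""},
    "text": "",
}

print(sorted(new_columns(sample_record, EXPECTED_COLUMNS)))
# ['attributes', 'doc', 'metadata']
```

Running the same check over the first record of each subset is one way to see which subsets would need their own configuration.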