Can't locate the error in my dataset

I always get the following error when I try to download (or preview) the dataset arnastofnun/test.

Traceback (most recent call last):
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/builder.py", line 1869, in _prepare_split_single
    writer.write_table(table)
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 580, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 2292, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 2245, in cast_table_to_schema
    arrays = [
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 2246, in <listcomp>
    cast_array_to_feature(
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 1795, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 1795, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 2005, in cast_array_to_feature
    arrays = [
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 2006, in <listcomp>
    _c(array.field(name) if name in array_fields else null_array, subfeature)
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 2102, in cast_array_to_feature
    return array_cast(
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "/home/starkadur/.local/lib/python3.8/site-packages/datasets/table.py", line 1948, in array_cast
    raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
TypeError: Couldn't cast array of type string to null

I have tried splitting the dataset into smaller parts to find where in the data the problem lies, but if I upload only the first half of the dataset I don’t get any error, and if I upload only the second half I don’t get any error either. So I can’t tell where in the data the problem lies. All help is appreciated.
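One possible explanation (an assumption, not something confirmed in this thread) is that some field is null in every early row, so its type is inferred as null, and a string only shows up for that field later on. That would fit the symptom that each half loads fine on its own. Here is a minimal sketch, using only the standard library, that scans JSON Lines input and reports fields that are null in some rows and non-null in others. The helper names `flatten` and `find_mixed_null_fields` are made up for illustration, and lists are treated as plain values, so it will not look inside them.

```python
import json


def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects so "meta.author" style paths are reported.
            yield from flatten(value, path)
        else:
            yield path, value


def find_mixed_null_fields(lines):
    """Return {field_path: (first_null_line, first_non_null_line)}.

    `lines` is any iterable of JSON Lines strings, e.g. an open file.
    Only fields that are null in at least one row AND non-null in at
    least one other row are included. Line numbers are 1-based.
    """
    first_null = {}
    first_value = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for path, value in flatten(record):
            if value is None:
                first_null.setdefault(path, lineno)
            else:
                first_value.setdefault(path, lineno)
    return {
        path: (first_null[path], first_value[path])
        for path in first_null
        if path in first_value
    }


# Example usage ("data.jsonl" is a hypothetical file name):
#
# with open("data.jsonl", encoding="utf-8") as f:
#     for path, (null_line, value_line) in find_mixed_null_fields(f).items():
#         print(f"{path}: null at line {null_line}, non-null at line {value_line}")
```

If this reports a field whose first non-null value appears far into the file, that field would be the first place I would look.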

It appears to be a complicated issue that is probably unresolved. A workaround seems to be to pass the dataset as JSON.

Hi,

This is because the author has uploaded only a zip file to the dataset repo (arnastofnun/test at main), so it needs to be unzipped locally. Here’s how to download the file:

from huggingface_hub import hf_hub_download

filepath = hf_hub_download(repo_id="arnastofnun/test", filename="igc_news1_eidfaxi.zip", repo_type="dataset")
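For the unzip step, something like the following sketch should work with the standard library alone. It assumes the archive contains `.jsonl` files, which I have not checked, and `extract_jsonl_files` is just a name I made up for this example.

```python
import zipfile
from pathlib import Path


def extract_jsonl_files(zip_path, dest_dir):
    """Extract the archive into dest_dir and return the .jsonl files found there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Search recursively, since the archive may contain subfolders.
    return sorted(dest.rglob("*.jsonl"))


# Example usage, with `filepath` being the path returned by hf_hub_download
# (the target folder name here is arbitrary):
#
# for path in extract_jsonl_files(filepath, "igc_news1_eidfaxi"):
#     print(path)
```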

Thank you for your answer.
I have tried to upload the file unzipped. I have deleted the zip-file and uploaded the jsonl-file, but the error remains. Actually this dataset is a snippet of a much larger dataset (arnastofnun/IGC-2022-1) with tens of much bigger zip-files, and there I only have this problem with three of the files. So it seems that the zip-files are not the problem.