Error when downloading own dataset with git lfs files

I've made a couple of data loaders for different stance detection datasets. I've stored the .csv files they use with Git LFS, because one of them exceeds the 10 MB maximum, as can be seen here. When I try to download the dataset in my own code using load_dataset("strombergnlp/x-stance"), I get the error below and I'm unsure why. It's worth noting that the one dataset I've uploaded to Hugging Face which doesn't use Git LFS downloads perfectly fine.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Arche\miniconda3\lib\site-packages\datasets\load.py", line 1679, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Arche\miniconda3\lib\site-packages\datasets\builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Arche\miniconda3\lib\site-packages\datasets\builder.py", line 775, in _download_and_prepare
    verify_checksums(
  File "C:\Users\Arche\miniconda3\lib\site-packages\datasets\utils\info_utils.py", line 33, in verify_checksums
    raise ExpectedMoreDownloadedFiles(str(set(expected_checksums) - set(recorded_checksums)))
datasets.utils.info_utils.ExpectedMoreDownloadedFiles: {'x-stance-valid-de.csv', 'x-stance-test-de.csv', 'x-stance-train-de.csv'}

Any pointers as to what could be wrong are greatly appreciated :blush:

Hi! Can you try using load_dataset with the ignore_verifications=True option?
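
For example (a minimal sketch; add a config name if the script defines one):

>>> from datasets import load_dataset
>>> dataset = load_dataset("strombergnlp/x-stance", ignore_verifications=True)

This skips the checksum and split-size checks recorded in dataset_infos.json, so the load won't abort on the ExpectedMoreDownloadedFiles error.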

Hi! Thanks for the response. With ignore_verifications=True I get this new error:

Traceback (most recent call last):
  File "stance_detection.py", line 167, in run
    dataset_de = load_dataset('strombergnlp/x-stance', 'de', ignore_verifications=True)
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/load.py", line 1736, in load_dataset
    use_auth_token=use_auth_token,
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/builder.py", line 614, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/builder.py", line 702, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/builder.py", line 1167, in _prepare_split
    writer.write_table(table)
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/arrow_writer.py", line 523, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/table.py", line 1895, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/table.py", line 1878, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/table.py", line 1878, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/table.py", line 1673, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/table.py", line 1673, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/table.py", line 1843, in cast_array_to_feature
    return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/table.py", line 1675, in wrapper
    return func(array, *args, **kwargs)
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/datasets/table.py", line 1752, in array_cast
    return array.cast(pa_type)
  File "pyarrow/array.pxi", line 915, in pyarrow.lib.Array.cast
  File "/home/mkon/miniconda3/envs/stance/lib/python3.7/site-packages/pyarrow/compute.py", line 376, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 542, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 341, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed to parse string: 'AGAINST' as a scalar of type int64

Here, AGAINST is one of the two class labels for the dataloader, the other being FAVOR. The dataloader passes the test suite, so I'm unsure why it crashes here.
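
For context, the label column in my loading script is declared as a ClassLabel, roughly like this (a minimal sketch; the non-label column names are illustrative placeholders, only the label names are the real ones):

import datasets

# sketch of the features block from the loading script
features = datasets.Features(
    {
        "question": datasets.Value("string"),
        "comment": datasets.Value("string"),
        # ClassLabel values are stored as int64 on disk, which matches
        # the int64 type mentioned in the cast error above
        "label": datasets.ClassLabel(names=["AGAINST", "FAVOR"]),
    }
)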

Hi! The script name should match the repo name, so can you please rename the script to x-stance.py and regenerate the dataset_infos.json file? Because of the name mismatch, the loading machinery currently falls back to the packaged CSV module:

>>> from datasets import load_dataset_builder
>>> load_dataset_builder("strombergnlp/x-stance")
Using custom data configuration de-468d65e3f7ce1e40
<datasets.packaged_modules.csv.csv.Csv object at 0x000001E3A74F1248>

(It first checks whether x-stance.py is present in the repo and, if not, infers the builder from the most common data file type, which here is CSV.)
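
After renaming the script, you can regenerate the metadata from a local clone of the repo with the datasets-cli test command (the relative path is an assumption, adjust it to your checkout):

datasets-cli test . --save_infos --all_configs

--save_infos rewrites dataset_infos.json, and --all_configs reruns the builder for every config the script defines, so the recorded checksums and features match the renamed script.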

Thank you! That seemed to fix the error. I've made two other dataloaders with the exact same problem, which all use Git LFS, so I thought that was where the problem was. But they're all also incorrectly named with an underscore instead of a dash, so I'll go ahead and fix those as well.

Thank you for all the help! :blush: