load_dataset error ('.incomplete/parquet-validation-00000-00000-of-NNNNN.arrow')


I’m following a tutorial to fine-tune a model, but I’ve been stuck on a load_dataset error I can’t solve. For context, the tutorial first uploads a dataset to the Hugging Face Hub, and I managed to upload an identical one of my own.

When I run a script to download the dataset, however, the problem appears. If I download the original dataset, everything goes well and all files are fetched correctly. But when I try downloading mine, only part of the files come through (everything up to the 0.0.0 folder you’ll see in the error message, but nothing after that).

The command I’m running is dataset = load_dataset("FelipeBandeiraPoatek/invoices-donut-data-v2", split="train"), and the error log I’m getting is the following:

Downloading data files: 100%|████████████████████████████████████████████████| 3/3 [00:00<?, ?it/s]
Extracting data files: 100%|████████████████████████████████████████| 3/3 [00:00<00:00, 198.67it/s] 
Traceback (most recent call last):
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1852, in _prepare_split_single
    writer = writer_class(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\arrow_writer.py", line 334, in __init__
    self.stream = self._fs.open(fs_token_paths[2][0], "wb")
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\spec.py", line 1241, in open
    f = self._open(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 184, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 315, in __init__
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\fsspec\implementations\local.py", line 320, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/Felipe Bandeira/.cache/huggingface/datasets/FelipeBandeiraPoatek___parquet/FelipeBandeiraPoatek--invoices-donut-data-v2-ca49e83826870faf/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete/parquet-validation-00000-00000-of-NNNNN.arrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 11, in <module>
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\run_donut_test.py", line 7, in main   
  File "c:\Users\Felipe Bandeira\Desktop\sparrow\sparrow-data\tools\donut\dataset_tester.py", line 10, in test
    dataset = load_dataset(dataset_name, split="train")
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\load.py", line 1782, in load_dataset
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 872, in download_and_prepare
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 967, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1749, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\Felipe Bandeira\Desktop\sparrow\venv\lib\site-packages\datasets\builder.py", line 1892, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

I haven’t found any solutions for this and cannot figure out why the original dataset downloads fine while mine (which is identical) does not. Any clues?

I have a similar issue with the RaphaelOlivier/whisper_adversarial_examples dataset on Windows, but no issues on WSL. So it looks to me like a Windows vs. Linux problem in the datasets library, but I haven’t gotten any further than that.
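In case it helps anyone reproduce or work around this, one thing that seems worth trying (I haven’t confirmed it addresses the root cause) is deleting the partially written cache entry from the error message and then forcing a fresh download. A minimal sketch, assuming the default cache location shown in my traceback; the helper name stale_cache_dirs and the namespace folder are taken from my own error log and may differ on your machine:

```python
# Sketch: remove the stale ".incomplete" cache entry for the failing dataset,
# then retry with a forced re-download. Adjust the namespace to match the
# folder name under your own ~/.cache/huggingface/datasets.
import shutil
from pathlib import Path


def stale_cache_dirs(cache_root: Path,
                     namespace: str = "FelipeBandeiraPoatek___parquet"):
    """Return the cached builder directories for the failing dataset
    (empty list if the namespace folder does not exist)."""
    base = cache_root / namespace
    return sorted(base.glob("*")) if base.exists() else []


cache_root = Path.home() / ".cache" / "huggingface" / "datasets"
for d in stale_cache_dirs(cache_root):
    shutil.rmtree(d)  # drop the partially written 0.0.0/... files

# Then retry, bypassing whatever remains in the cache:
# from datasets import load_dataset
# dataset = load_dataset("FelipeBandeiraPoatek/invoices-donut-data-v2",
#                        split="train", download_mode="force_redownload")
```

This only clears local state; it doesn’t explain why the same dataset works on WSL but not on Windows, so I’d still appreciate pointers on the underlying cause.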