'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Hello there. I downloaded a dataset from the Hub and saved it to a local folder, but I could not reload it. Here is my code:

from datasets import load_dataset
raw_datasets = load_dataset("roneneldan/TinyStories")
raw_datasets = load_dataset('text', data_dir = "Tiny_Stories")

Here is my error.

UnicodeDecodeError                        Traceback (most recent call last)
File c:\Users\Tom W\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py:1925, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1924 _time = time.time()
-> 1925 for _, table in generator:
   1926     if max_shard_size is not None and writer._num_bytes > max_shard_size:

File c:\Users\Tom W\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\packaged_modules\text\text.py:89, in Text._generate_tables(self, files)
     88 while True:
---> 89     batch = f.read(self.config.chunksize)
     90     if not batch:

File c:\Users\Tom W\AppData\Local\Programs\Python\Python310\lib\codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
    323 # keep undecoded input until the next call

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
d:\16ComputerScience\Pattern_Recongnition_and_Machine_Learning\Deeplearning_research_oriented\Final_project\Transfromer_Learn\Tokenizer.ipynb Cell 11 in ()
      1 from datasets import load_dataset
----> 3 raw_datasets = load_dataset('text', data_dir = "Tiny_Stories")
   1957         e = e.__context__
-> 1958     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1960 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Can anyone help? 🙃

Hi! save_to_disk doesn’t save a dataset as text files; it writes Arrow data files plus JSON metadata files. The 'text' loader then tries to decode those binary files as UTF-8, hence the error. Instead, use load_from_disk("Tiny_Stories") to reload the dataset.

Thank you very much!!! Problem solved! Your answer really helped me a lot!

Thank you very much! The problem was indeed the file format. That really helped me a lot!
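For anyone who hits the same message: a quick way to check whether a file in the folder is really text is to look at its first bytes. The sketch below is only a rough diagnostic, not part of the datasets library. It assumes that Arrow stream files begin with the 4-byte continuation marker 0xFFFFFFFF and that Arrow files in the random-access format begin with b"ARROW1", which would be consistent with "byte 0xff in position 0". The helper name sniff_file and the demo file names are hypothetical.

```python
import codecs
import os
import tempfile

# Assumption: leading bytes of Arrow IPC stream files and Arrow random-access files.
ARROW_STREAM_MARKER = b"\xff\xff\xff\xff"
ARROW_FILE_MAGIC = b"ARROW1"


def sniff_file(path, sample_size=4096):
    """Hypothetical helper: classify a file as 'arrow', 'utf-8 text', or 'unknown binary'."""
    with open(path, "rb") as f:
        head = f.read(sample_size)
    if head.startswith(ARROW_STREAM_MARKER) or head.startswith(ARROW_FILE_MAGIC):
        return "arrow"
    # An incremental decoder tolerates a multi-byte character cut off at the sample boundary.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return "unknown binary"
    return "utf-8 text"


# Demo with two throwaway files: one plain text, one that only mimics an Arrow header.
demo_dir = tempfile.mkdtemp()
text_path = os.path.join(demo_dir, "story.txt")
arrow_like_path = os.path.join(demo_dir, "data.arrow")

with open(text_path, "w", encoding="utf-8") as f:
    f.write("Once upon a time...\n")
with open(arrow_like_path, "wb") as f:
    f.write(ARROW_STREAM_MARKER + b"\x00" * 12)

print(sniff_file(text_path))        # utf-8 text
print(sniff_file(arrow_like_path))  # arrow
```

If a file in the saved folder reports "arrow", the folder most likely came from save_to_disk and should be read with load_from_disk rather than the 'text' loader.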