Datasets.load_datasets fails

yashara · October 4, 2024, 10:46pm

I am trying to load a dataset from the openx datasets using a command like this:

dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
)

But it fails with every argument configuration I have tried. The summarized error stack is:

File "~/.cache/huggingface/modules/datasets_modules/datasets/jxu124--OpenX-Embodiment/317e9044a9bb97bb1db9ea5aebf1c15f5cc3e1e071e5da025e97892e96dae22b/OpenX-Embodiment.py", line 29, in decode_image
    data = data.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The above exception was the direct cause of the following exception:
.
.
.
File "…/lib/python3.10/site-packages/datasets/builder.py", line 1642, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

It looks like the API is buggy for this specific dataset. Does anyone know how to successfully load data from this dataset suite? Thanks.

John6666 · October 4, 2024, 11:47pm

Character encoding errors still occur in 2024…
Apparently there are cases where this can be avoided by explicitly specifying the encoding at load time.
If this does not work, there may be another cause.

dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-16",
)

or

dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-8",
)
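
For context on why an encoding option is often suggested for this message: a UTF-16 text file written with a little-endian byte-order mark begins with the bytes 0xFF 0xFE, so reading it as UTF-8 fails on byte 0xff at position 0, just like the traceback in the first post. The sketch below reproduces that with the standard library only. The sample string is made up, and nothing here shows that this particular dataset contains UTF-16 text.

```python
import codecs

# A made-up text payload encoded as UTF-16 little-endian with an explicit BOM.
payload = codecs.BOM_UTF16_LE + "cory_hall".encode("utf-16-le")

try:
    payload.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

# The generic "utf-16" codec consumes the BOM and takes the byte order from it.
decoded = payload.decode("utf-16")

print(utf8_error)
print(decoded)
```

An explicit encoding only helps when the underlying file really is text in that encoding; it does nothing for binary data such as images.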

yashara · October 5, 2024, 12:43am

Thanks for the reply, @John6666.
Looking at datasets.load_dataset(), I don’t see any ‘encoding’ argument. Passing it triggers this error:

ValueError: BuilderConfig OpenXConfig (name='berkeley_gnm_cory_hall', version=0.0.0, data_dir=None, data_files=None, description=None, features=None) doesn't have a 'encoding' key

Another problem: tracing the error stack to OpenX-Embodiment.py, line 30 (data = data.decode()), I can see that the data is not an image but some unexpected value like

b'/home/jonathan/dev/gnm_dataset/cory_hall/cory5_aug17_00_0546'

So this makes me wonder whether the API is downloading the data properly.
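
One way to check what the loading script is actually receiving is to look at the first few bytes of each value before decoding it. The sketch below uses only the standard library. `describe_payload` is a hypothetical helper, not part of `datasets` or of the dataset's loading script, and the JPEG sample is a hand-written header rather than real dataset content. It shows that a path-like value such as the one above decodes cleanly, while bytes starting with the JPEG marker 0xFF 0xD8 raise the same UnicodeDecodeError as in the original traceback.

```python
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def describe_payload(data: bytes) -> str:
    """Hypothetical helper: guess whether bytes hold an image or UTF-8 text."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown-binary"
    return "text"


path_like = b"/home/jonathan/dev/gnm_dataset/cory_hall/cory5_aug17_00_0546"
jpeg_like = JPEG_MAGIC + b"\xe0\x00\x10JFIF\x00"  # hand-written JPEG/JFIF header

print(describe_payload(path_like))
print(describe_payload(jpeg_like))

try:
    jpeg_like.decode()
except UnicodeDecodeError as exc:
    print(exc)
```

If the values turn out to be a mix of paths and raw image bytes, that would point at the assumptions made in the loading script rather than at a failed download, though nothing in this thread confirms it.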

John6666 · October 5, 2024, 12:56am

The referenced article is from 2023, so maybe the argument is obsolete…
Anyway, I agree with you that just because the error message is UTF-8 related does not mean that it is a character code issue.
I hope it’s a problem that can be avoided by modifying the options or code without changing the dataset.
It could even be some kind of bug that has never been fixed because it can only be avoided by changing the dataset.

John6666 · October 5, 2024, 1:03am

I’m sure you tried this long ago, but here is the usual incantation, just in case.

pip uninstall datasets
pip install git+https://github.com/huggingface/datasets.git

This is an article from April of this year, but it seems that some people are using an unusually old datasets library, perhaps pulled in by some library dependency.
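
Before reinstalling, it can help to confirm which version is actually installed in the active environment. A minimal sketch using only the standard library follows. `installed_version` is a hypothetical helper name, and checking `pyarrow` alongside `datasets` is a choice made here because `datasets` depends on it.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("datasets", "pyarrow"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running this inside the same interpreter that raises the error avoids confusion when several environments are present.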

yashara · October 5, 2024, 5:16pm

Thanks for the suggestions, @John6666. I tried changing versions, but that doesn’t seem to work. Maybe the loading code is not well maintained and/or the dataset format has become incompatible.

John6666 · October 5, 2024, 9:06pm

@lhoestq There is an occasional problem with some datasets not loading properly in the datasets library. On the surface this appears as a UTF-8 error, but the actual cause is unknown. Is this fixable?

lhoestq · October 7, 2024, 12:54pm

We’re using PyArrow to load the data, so maybe we can look into improving this with them. Do you have a reproducible example?

John6666 · October 7, 2024, 11:02pm

I would have to ask yashara to know for sure, but this dataset may reproduce it. I think I’ve also seen similar reports in past forum and GitHub posts, so I’ll look for them later.

import datasets
dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
)

lhoestq · October 11, 2024, 1:56pm

Do you have an example that doesn’t use trust_remote_code ? We stopped developing that option because it’s not great for obvious security reasons.

Also note that datasets in WebDataset format are supported out-of-the-box so trust_remote_code / having a loading script is not needed in that case.
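
As a rough illustration of the WebDataset layout mentioned above: a shard is a plain tar archive in which files sharing a basename form one sample and the extension names the field. The sketch below writes such a shard with the standard library only. The keys, field names, and placeholder bytes are made up for illustration, and `write_shard` is a hypothetical helper.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shard(path: Path, samples: dict) -> None:
    """Write {key: {extension: bytes}} as a WebDataset-style tar shard."""
    with tarfile.open(path, "w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


# Placeholder bytes stand in for real images; names and fields are made up.
samples = {
    "000000": {"jpg": b"\xff\xd8\xff placeholder", "json": json.dumps({"step": 0}).encode()},
    "000001": {"jpg": b"\xff\xd8\xff placeholder", "json": json.dumps({"step": 1}).encode()},
}

shard_path = Path(tempfile.mkdtemp()) / "train-000000.tar"
write_shard(shard_path, samples)

with tarfile.open(shard_path) as tar:
    member_names = tar.getnames()
print(member_names)
```

With real image bytes in place of the placeholders, a shard like this would typically be loaded with something like `load_dataset("webdataset", data_files={"train": str(shard_path)})`; treat that call as something to confirm against the current `datasets` documentation.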

John6666 · October 11, 2024, 2:00pm

lhoestq wrote: “Do you have an example that doesn’t use trust_remote_code?”

This one is from 2023, but the code doesn’t seem to use trust_remote_code.

from datasets import load_dataset
raw_datasets = load_dataset("roneneldan/TinyStories")
raw_datasets.save_to_disk("Tiny_Stories")
raw_datasets = load_dataset('text', data_dir = "Tiny_Stories")

lhoestq · October 11, 2024, 2:16pm

You should use load_from_disk after save_to_disk (it’s a serialized format for local disk: not optimized for online sharing, but faster to reload locally):

from datasets import load_dataset, load_from_disk
raw_datasets = load_dataset("roneneldan/TinyStories")
raw_datasets.save_to_disk("Tiny_Stories")
raw_datasets = load_from_disk("Tiny_Stories")

John6666 · October 11, 2024, 2:31pm

Ah. That was a completely different error.
Those are all the errors I’ve seen in my searches. Here is an example that was addressed by an option added in a library upgrade. The other problems no longer seem to occur.

github.com/huggingface/datasets

unicodedecodeerror: 'utf-8' codec can't decode byte 0xac in position 25: invalid start byte

opened 08:49AM - 04 Feb 24 UTC
closed 09:11AM - 06 Feb 24 UTC
Hughhuh

### Describe the bug

unicodedecodeerror: 'utf-8' codec can't decode byte 0xac in position 25: invalid start byte

### Steps to reproduce the bug

```
import sys
sys.getdefaultencoding()
'utf-8'
from datasets import load_dataset
print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
Resolving data files: 100% 159/159 [00:00<00:00, 9909.28it/s]
Using custom data configuration samsum-0b1209637541c9e6
Downloading and preparing dataset json/samsum to C:/Users/Administrator/.cache/huggingface/datasets/json/samsum-0b1209637541c9e6/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data files: 100% 3/3 [00:00<00:00, 119.99it/s]
Extracting data files: 100% 3/3 [00:00<00:00, 9.54it/s]
Generating train split: 88392/0 [00:15<00:00, 86848.17 examples/s]
Generating test split: 0/0 [00:00<?, ? examples/s]
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\packaged_modules\json\json.py:132, in Json._generate_tables(self, files)
    131 try:
--> 132     pa_table = paj.read_json(
    133         io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
    134     )
    135     break

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyarrow\_json.pyx:290, in pyarrow._json.read_json()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyarrow\error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\pyarrow\error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: JSON parse error: Invalid value. in row 0

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py:1819, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1818 _time = time.time()
-> 1819 for _, table in generator:
   1820     if max_shard_size is not None and writer._num_bytes > max_shard_size:

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\packaged_modules\json\json.py:153, in Json._generate_tables(self, files)
    152 with open(file, encoding="utf-8") as f:
--> 153     dataset = json.load(f)
    154 except json.JSONDecodeError:

File ~\AppData\Local\Programs\Python\Python310\lib\json\__init__.py:293, in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    276 """Deserialize ``fp`` (a ``.read()``-supporting file-like object containing
    277 a JSON document) to a Python object.
    278
   (...)
    291 kwarg; otherwise ``JSONDecoder`` is used.
    292 """
--> 293 return loads(fp.read(),
    294     cls=cls, object_hook=object_hook,
    295     parse_float=parse_float, parse_int=parse_int,
    296     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

File ~\AppData\Local\Programs\Python\Python310\lib\codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
    323 # keep undecoded input until the next call

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 25: invalid start byte

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
Cell In[81], line 5
      1 from datasets import load_dataset
      3 # Load dataset from the hub
      4 #dataset = load_dataset("json",data_files="C:/Users/Administrator/Desktop/samsum/samsum/data/corpus/train.json",field="data")
----> 5 dataset = load_dataset('json',"samsum")
      6 #dataset = load_dataset("samsum")
      7 print(f"Train dataset size: {len(dataset['train'])}")

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\load.py:1758, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, **config_kwargs)
   1755 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1757 # Download and prepare data
-> 1758 builder_instance.download_and_prepare(
   1759     download_config=download_config,
   1760     download_mode=download_mode,
   1761     ignore_verifications=ignore_verifications,
   1762     try_from_hf_gcs=try_from_hf_gcs,
   1763     num_proc=num_proc,
   1764 )
   1766 # Build dataset for splits
   1767 keep_in_memory = (
   1768     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1769 )

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py:860, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    858 if num_proc is not None:
    859     prepare_split_kwargs["num_proc"] = num_proc
--> 860 self._download_and_prepare(
    861     dl_manager=dl_manager,
    862     verify_infos=verify_infos,
    863     **prepare_split_kwargs,
    864     **download_and_prepare_kwargs,
    865 )
    866 # Sync info
    867 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py:953, in DatasetBuilder._download_and_prepare(self, dl_manager, verify_infos, **prepare_split_kwargs)
    949 split_dict.add(split_generator.split_info)
    951 try:
    952     # Prepare split will record examples associated to the split
--> 953     self._prepare_split(split_generator, **prepare_split_kwargs)
    954 except OSError as e:
    955     raise OSError(
    956         "Cannot find data file. "
    957         + (self.manual_download_instructions or "")
    958         + "\nOriginal error:\n"
    959         + str(e)
    960     ) from None

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py:1708, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1706 gen_kwargs = split_generator.gen_kwargs
   1707 job_id = 0
-> 1708 for job_id, done, content in self._prepare_split_single(
   1709     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1710 ):
   1711     if done:
   1712         result = content

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\datasets\builder.py:1851, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1849 if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1850     e = e.__context__
-> 1851 raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1853 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset
```

### Expected behavior

can't load dataset

### Environment info

dataset:samsum
system :win10
gpu:m40 24G

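The GitHub issue quoted above fails inside an `open(file, encoding="utf-8")` call, so the same message can be reproduced without `datasets`. The sketch below builds a made-up byte string with 0xac at offset 25 and shows how the standard `errors` argument changes the outcome. Whether the option added to the library corresponds to this argument is an assumption here, not something the thread states.

```python
# 25 ASCII bytes followed by 0xac, which is not a valid UTF-8 start byte.
raw = b"x" * 25 + b"\xac" + b" tail"

try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = str(exc)

# errors="replace" substitutes U+FFFD for each undecodable byte instead of raising.
replaced = raw.decode("utf-8", errors="replace")

print(strict_error)
print(replaced)
```

Replacing undecodable bytes hides the symptom but loses information; when the real encoding of the file is known, decoding with that encoding is the safer choice.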