datasets.load_dataset fails

I am trying to load a dataset from the openx datasets using a command like this:

dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
)

But it fails with every argument configuration I've tried. The summarized error stack is:

File "~/.cache/huggingface/modules/datasets_modules/datasets/jxu124--OpenX-Embodiment/317e9044a9bb97bb1db9ea5aebf1c15f5cc3e1e071e5da025e97892e96dae22b/OpenX-Embodiment.py", line 29, in decode_image
data = data.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The above exception was the direct cause of the following exception:
.
.
.
File "…/lib/python3.10/site-packages/datasets/builder.py", line 1642, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Looks like the API is buggy for this specific dataset. Does anyone know how to successfully load data from this dataset suite? Thanks.


Character-encoding errors still occur in 2024…
Apparently they can sometimes be avoided by explicitly specifying the encoding at load time.
If this does not work, there may be another cause.

dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-16",
)

or

dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-8",
)

Thanks for the reply, @John6666.
Checking datasets.load_dataset(), I don't see any 'encoding' argument option. Setting it triggers this error:

ValueError: BuilderConfig OpenXConfig (name='berkeley_gnm_cory_hall', version=0.0.0, data_dir=None, data_files=None, description=None, features=None) doesn't have a 'encoding' key

Another problem: tracing the error stack to OpenX-Embodiment.py, line 30 (data = data.decode()), I can see the data is not an image but a bogus value like

b'/home/jonathan/dev/gnm_dataset/cory_hall/cory5_aug17_00_0546'

This makes me think the API is not downloading the data properly.
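To double-check what the script hands to decode_image, a small probe like this (a hypothetical helper, not part of the dataset script) can tell real image bytes from stray path strings:

def probe_bytes(data: bytes) -> str:
    """Guess what a raw record contains before trying to decode it."""
    if data[:2] == b"\xff\xd8":   # JPEG SOI marker (matches the 0xff in the traceback)
        return "jpeg image"
    if data[:4] == b"\x89PNG":    # first four bytes of the PNG signature
        return "png image"
    try:
        # e.g. the stray file path shown above decodes cleanly as text
        return "utf-8 text: " + data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown binary"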

The referenced article is from 2023, so maybe the argument is obsolete…
Anyway, I agree with you: just because the error message mentions UTF-8 does not mean it is really a character-encoding issue.
I hope it's a problem that can be avoided by changing options or code, without modifying the dataset itself.
It could even be some long-standing bug that was never fixed because the only workaround is changing the dataset.

I'm sure you've done this long ago, but just in case:

pip uninstall datasets
pip install git+https://github.com/huggingface/datasets.git

There's an article from April of this year suggesting that some people end up with an unusually old datasets library, perhaps pulled in by some dependency.
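After reinstalling, it may be worth confirming which copy of the library is actually imported, since a pinned dependency can silently bring back an old release:

import datasets

print(datasets.__version__)  # should match the freshly installed release
print(datasets.__file__)     # shows which site-packages copy is actually imported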

Thanks for the suggestions, @John6666. I tried changing versions, but it doesn't seem to work. Maybe the loading script is not well maintained and/or the dataset formats have become incompatible.


@lhoestq There is an occasional problem with some datasets not loading properly in the datasets library. On the surface this appears as a UTF-8 error, but the actual cause is unknown. Is this fixable?

We're using PyArrow to load the data; maybe we can check with them how to improve it. Do you have a reproducible example?


I would have to ask yashara to know exactly, but perhaps this dataset can reproduce it. Also, I think I've seen some of these in past forum and GitHub posts, so I'll look for them later.

import datasets
dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
)

Do you have an example that doesn't use trust_remote_code? We stopped developing that option because it's not great for obvious security reasons.

Also note that datasets in WebDataset format are supported out-of-the-box so trust_remote_code / having a loading script is not needed in that case.
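For example, a sketch with a placeholder shard path, assuming a recent datasets release with the built-in webdataset builder:

from datasets import load_dataset

# Shards in WebDataset format (.tar files) load with the built-in builder,
# so no loading script or trust_remote_code is needed.
ds = load_dataset(
    "webdataset",
    data_files={"train": "shards/*.tar"},  # placeholder glob for local shards
    split="train",
    streaming=True,
)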


Do you have an example that doesn't use trust_remote_code?

This is from 2023, but the code doesn't seem to use trust_remote_code.

from datasets import load_dataset
raw_datasets = load_dataset("roneneldan/TinyStories")
raw_datasets.save_to_disk("Tiny_Stories")
raw_datasets = load_dataset('text', data_dir = "Tiny_Stories")

You should use load_from_disk after save_to_disk (it's a serialized form for local disk, not optimized for online sharing but faster to reload locally):

from datasets import load_dataset, load_from_disk
raw_datasets = load_dataset("roneneldan/TinyStories")
raw_datasets.save_to_disk("Tiny_Stories")
raw_datasets = load_from_disk("Tiny_Stories")

Ah. That was a completely different error.
Those are all the errors I've found in my searches. One example was addressed by an option added in a library upgrade; the rest of the problems don't seem to exist anymore.