Character code errors still occur in 2024…
In some cases the error can be avoided by explicitly specifying the encoding at load time.
If that doesn't work, the problem may have another cause.
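For what it's worth, here is a minimal sketch of what "specifying it at load time" can look like. The packaged csv and text builders accept an `encoding` argument (the file path below is hypothetical); whether a script-based dataset honors it depends on the script itself.

```python
import datasets

# The packaged "csv" and "text" loaders forward `encoding` to the
# file reader, so non-UTF-8 local files can be decoded at load time.
dataset = datasets.load_dataset(
    "csv",
    data_files="my_french_corpus.csv",  # hypothetical path
    encoding="utf-16",
)
```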
I've moved on from that project at this point, so unfortunately I can't give a stack trace. But I will say I think it was more a data problem than a datasets problem. I still see the error with some of the data I'm using now, but I've started including chardet in my data pipeline, which seems to fix it (though it's a bit pokey).
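In case it helps anyone else, here is a minimal sketch of that kind of chardet pass (the file names are hypothetical, and detection is a heuristic, so the reported confidence is worth checking):

```python
import chardet

def to_utf8(src_path: str, dst_path: str) -> None:
    """Guess a file's encoding with chardet and rewrite it as UTF-8."""
    with open(src_path, "rb") as f:
        raw = f.read()
    guess = chardet.detect(raw)  # e.g. {'encoding': 'UTF-16', 'confidence': 0.99, ...}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")
    with open(dst_path, "w", encoding="utf-8") as f:
        f.write(text)

to_utf8("train.raw.txt", "train.utf8.txt")  # hypothetical file names
```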
Thank you very much!!! Problem solved! Your answer really helps me a lot!
I'm working on a translator and hoping to fine-tune with a UTF-16 dataset so I can keep all the French accents, etc.
Datasets' load_dataset() doesn't seem to like non-UTF-8 files.
Is there a way to specify the encoding, or does it HAVE to be UTF-8?
If it has to be UTF-8, any suggestions for handling special characters? (A conversion sketch follows the snippets below.)
```python
import datasets

dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-16",
)
```

or

```python
dataset = datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir=ds_root,
    trust_remote_code=True,
    encoding="utf-8",
)
```
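For reference, UTF-8 can encode every Unicode character, French accents included, so one workaround is to re-encode the files to UTF-8 before loading. A minimal sketch, with hypothetical file names:

```python
# Re-encode a UTF-16 file as UTF-8; accented characters like é, è, ç
# survive because UTF-8 covers the full Unicode range.
with open("train.utf16.csv", encoding="utf-16") as src, \
        open("train.utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(src.read())
```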