I’m getting random utf-8 encoding errors from my dataset:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 117020: invalid start byte
These are datasets saved with dataset.save_to_disk('/out/file/dir') from the datasets library, and they are stored on disk as .arrow files.
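For reference, the save/load round trip looks roughly like this (the path and dataset contents below are just placeholders, but save_to_disk and load_from_disk are the calls I'm using):

```python
from datasets import Dataset, load_from_disk

# Build a toy dataset and persist it to disk as Arrow files
ds = Dataset.from_dict({"text": ["hello", "world"]})
ds.save_to_disk("/out/file/dir")  # placeholder path

# Later (e.g. when resuming pretraining) the dataset is reloaded
ds2 = load_from_disk("/out/file/dir")
print(ds2[0]["text"])
```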
I say “random” because they really make no sense. In the current case, I had just run 4 epochs of pretraining and decided to resume for a couple more epochs, and suddenly the error was thrown. This has been happening for a few weeks now: random UTF-8 errors. Why would this be happening???
Sometimes these errors were thrown while preprocessing the data, so I added the following line:
result = bytes(result, 'utf-8').decode('utf-8', 'ignore')
This seemed to work, in that the preprocessing was able to complete, but the error still turns up at arbitrary times.
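For context, that line sits inside the preprocessing function I pass to .map(); the sketch below is a simplified stand-in for my actual preprocessing (the column name and function are just illustrative):

```python
def preprocess(example):
    result = example["text"]
    # The line I added: round-trip through UTF-8, ignoring anything that won't decode
    result = bytes(result, 'utf-8').decode('utf-8', 'ignore')
    return {"text": result}

clean_ds = ds.map(preprocess)
```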
In some cases a reboot of the machine seems to “cure” the error… though it’s clearly not a “cure”, since it’s not at all clear there was ever a real error in the data to begin with… extremely frustrating. Since the problem isn’t consistent, I’m really not sure what I can do, except perhaps stop using datasets.
Is there any way to avoid this?