Random utf-8 errors from dataset

I’m getting random utf-8 encoding errors from my dataset:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 117020: invalid start byte

These are datasets saved with dataset.save_to_disk('/out/file/dir') from the datasets library, so they’re stored as .arrow files.
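
For reference, the round trip looks roughly like this (the path is a placeholder and dataset is the already-built Dataset object):

from datasets import load_from_disk

dataset.save_to_disk('/out/file/dir')      # writes the dataset out as .arrow files
dataset = load_from_disk('/out/file/dir')  # reload it later for training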

I say “random” because they really make no sense. In the current case, I’d just run 4 epochs of pretraining and decided to resume for a couple more epochs, and suddenly the error was thrown. This has been happening for a few weeks now: random utf-8 errors. Why would this be happening?

Sometimes these errors were thrown while preprocessing the data, so I added the following line:

result = bytes(result, 'utf-8').decode('utf-8', 'ignore')

This seemed to work, in that it allowed the preprocessing to complete, but the error still turns up at arbitrary times.
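
Roughly, that cleanup sits in the preprocessing like this (the column name 'text' and the function name are just illustrative, not my exact code):

def clean_example(example):
    # Round-trip the string through UTF-8, dropping anything that fails to decode
    result = example['text']
    result = bytes(result, 'utf-8').decode('utf-8', 'ignore')
    return {'text': result}

dataset = dataset.map(clean_example)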

In some cases a reboot of the machine seems to “cure” the error, though it’s hard to call it a cure when I can’t even tell what the actual problem is. Extremely frustrating. Since the problem isn’t consistent, I’m really not sure what I can do, except perhaps stop using datasets.

Is there any way to avoid this?

…Just updated to 2.4.0 and so far, so good. Mind you, the previous version also worked correctly some of the time. :crossed_fingers:

What dataset is this?

Assuming the training examples are made up of text strings, are you indexing these strings at arbitrary (i.e. randomly chosen) positions?

Well, I am doing some random selection while I’m preprocessing the data, but that’s done on a list of words/strings. Other than that, no random positions, and certainly no random positions/offsets within a string. Also, in the most recent case I’d already successfully preprocessed and saved the data, and run 4 epochs of training. I hit the error when reloading the data for another couple of epochs (on a one-liner I’d added just to print an entry for verification purposes). Very strange.

Oh, sorry. It’s a custom dataset, not (natural) language. The only non-alphanumeric characters are [, ], <, >, and the (custom) tokenizer adds that funky Unicode underscore (but the arrow file doesn’t contain the underscore).

I’m wondering if there is any backprop-through-time happening during training that splits the strings into smaller subsequences.

Since you say it doesn’t always happen, it must be something that happens randomly, and splitting the training data into smaller sequences is often done with some amount of randomness. If a split happens to fall in the middle of a UTF-8 sequence, you’d get exactly that error, since a single UTF-8 character can take up multiple bytes in the encoded string.
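
For example, slicing the encoded bytes in the middle of a multi-byte character reproduces exactly that kind of error:

text = "20µm"                 # 'µ' is two bytes in UTF-8 (0xc2 0xb5)
data = text.encode('utf-8')   # b'20\xc2\xb5m'
data[3:].decode('utf-8')      # UnicodeDecodeError: invalid start byte (0xb5)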

It’s just a guess on my part but this would be the first place I’d look.

Thanks for the thought, Matthijs.

But this is actually before training begins: the error turns up when simply loading the dataset from disk (and sometimes when saving it to disk). To be clear: I last ran into it when just printing an entry from the dataset after loading. Previously I’ve also hit it when saving with dataset.save_to_disk(out_path).

It’s unpredictable enough that I’m thinking I might swap out the drive today, just in case some sort of file corruption is to blame.

Hey, I’m stuck with the same error. Did you find a fix?

Hi! Can you please share the entire error stack trace so we can fix this bug (if it’s related to datasets)?


I’ve moved on from that project, at this point, so unfortunately I can’t give a stack trace. But I will say that I think it was more a data problem than a datasets problem. I still see the error with some of the data I’m using now, but I’ve started including chardet in my data pipeline, which seems to fix it (though it’s a bit pokey).
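
The detection step is roughly this (the file name and the fallback handling are just illustrative):

import chardet

with open('data.txt', 'rb') as f:   # placeholder file name
    raw = f.read()
guess = chardet.detect(raw)          # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
text = raw.decode(guess['encoding'] or 'utf-8', errors='replace')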

Found a fix here: UTF-16 for datasets? - #2 by mariosasko
I used encoding='UTF-8' in my load_dataset call and it worked.
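
Something like this, where the loader type ('csv') and the file name are just placeholders:

from datasets import load_dataset

# Tell the loader explicitly how the source files are encoded
dataset = load_dataset('csv', data_files='my_data.csv', encoding='UTF-8')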

Thank you @mariosasko. I was manually removing the offending characters from the dataset before I found your link.