Random utf-8 errors from dataset

I’m getting random utf-8 encoding errors from my dataset:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 117020: invalid start byte

These datasets were saved with dataset.save_to_disk('/out/file/dir') from the datasets library, and are stored as .arrow files.

I say “random” because they really make no sense. In the current case, I’ve just run 4 epochs of pretraining, I decided to resume for a couple more epochs, and suddenly the error is thrown. This has been the case for a few weeks—random utf-8 errors. Why would this be happening???

Sometimes these errors were thrown while preprocessing the data, so I added the following line:

result = bytes(result, 'utf-8').decode('utf-8', 'ignore')

This seemed to work, as it appeared to allow the preprocessing to complete. But the error turns up at arbitrary times nevertheless.
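A minimal sketch of what that line does, assuming result is already a Python str (the helper name clean and the sample values are made up for illustration). Encoding a valid str always yields valid UTF-8, so the 'ignore' handler on the decode side never has anything to drop, and the round trip returns the string unchanged. The handler only has an effect when decoding raw bytes that were never valid UTF-8, which may be why the line did not stop the error from coming back.

```python
def clean(result: str) -> str:
    # Same round trip as the line above: str -> UTF-8 bytes -> str.
    return bytes(result, 'utf-8').decode('utf-8', 'ignore')

# Hypothetical sample text: brackets plus the tokenizer's underscore (U+2581).
sample = "abc\u2581[<>]"
round_tripped = clean(sample)  # identical to sample

# Hypothetical raw bytes containing a stray 0xbc, as in the traceback.
raw = b"abc\xbcdef"

try:
    raw.decode('utf-8')
    strict_reason = None
except UnicodeDecodeError as exc:
    strict_reason = exc.reason  # why strict decoding failed

# 'ignore' silently drops the undecodable byte when decoding bytes.
lenient = raw.decode('utf-8', 'ignore')
```

If the bad bytes are on disk or appear while reading, a str-level cleanup during preprocessing would not be expected to catch them.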

In some cases a reboot of the machine seems to “cure” the error… though clearly it’s not a “cure” since it isn’t at all “clear” that there’s actually an error… extremely frustrating… Since the problem isn’t consistent I’m really not sure what I can do, except to perhaps stop using datasets.

Is there any way to avoid this?

…Just updated to 2.4.0 and so far, so good. Mind you, the previous version also worked correctly some of the time. :crossed_fingers:

What dataset is this?

Assuming the training examples are made up of text strings, are you indexing these strings at arbitrary (i.e. randomly chosen) positions?

Well, I am doing some random selection while I’m preprocessing the data, but that’s done on a list of words/strings. Other than that, no random positions, and certainly no random positions/offsets within a string. Also, in the most recent case I’d already successfully preprocessed and saved the data, and run 4 epochs of training. I hit the error when reloading the data for another couple of epochs (on a one-liner I’d added just to print an entry for verification purposes). Very strange.

Oh, sorry. It’s a custom dataset, non (natural) language. The only non-alphanumeric characters are [, ], <, > and the (custom) tokenizer adds that funky unicode underscore (but the arrow file doesn’t contain the underscore).

I’m wondering if there is any backprop-through-time happening during training that splits the strings into smaller subsequences.

Since you say it doesn’t always happen, it must be something that happens randomly and splitting the training data into smaller sequences might be done using a certain amount of randomness. If the split happens to fall in the middle of a UTF-8 sequence, you’d get that error (since UTF-8 characters can take up multiple bytes in a string).

It’s just a guess on my part but this would be the first place I’d look.
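To illustrate the mechanism described above, here is a small sketch that cuts a UTF-8 byte string in the middle of a multi-byte character. The sample text and the helper try_decode are made up for illustration; nothing here is taken from the datasets library or from the actual training code.

```python
# U+2581 (the tokenizer-style underscore) takes three bytes in UTF-8.
text = "a\u2581b"
data = text.encode("utf-8")  # b'a\xe2\x96\x81b'

def try_decode(chunk: bytes) -> str:
    """Decode chunk as UTF-8, or report why decoding failed."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"

whole = try_decode(data)     # decodes cleanly
head = try_decode(data[:2])  # stops after the lead byte 0xe2
tail = try_decode(data[2:])  # starts on the continuation byte 0x96
```

The tail case produces the same "invalid start byte" reason as the traceback, since a continuation byte cannot begin a character. Note that 0xbc also falls in the continuation-byte range, which is consistent with (but does not prove) a read that began partway through a character.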

Thanks for the thought, Matthijs.

But this is actually before training begins: the error turns up when simply loading the dataset from disk (and sometimes when saving it to disk). To be clear, I last ran into it when simply printing an entry from the dataset on load. Previously I’ve also hit it when saving with dataset.save_to_disk(out_path).

It’s unpredictable enough that I’m thinking I might swap out the drive today, just in case some sort of file corruption is to blame.
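One way to test the corruption theory before swapping hardware is to record checksums of the saved files right after a successful save and compare them again when the error shows up. This is a sketch using only the Python standard library; the function names checksum_files and changed_files are made up for illustration, and the directory is assumed to be whatever was passed to save_to_disk.

```python
import hashlib
from pathlib import Path

def checksum_files(directory, chunk_size=1 << 20):
    """Return {relative_path: sha256_hex} for every file under directory."""
    root = Path(directory)
    digests = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h = hashlib.sha256()
        with path.open("rb") as f:
            # Read in chunks so large .arrow files need not fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests

def changed_files(before, after):
    """Names whose digest differs, or that exist in only one snapshot."""
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

For example, store the result of checksum_files('/out/file/dir') after saving, then call changed_files with a fresh snapshot when the error appears. If a file's digest has changed without the dataset being rewritten, that points to the storage layer. If the digests still match but the error comes and goes, the bytes on disk are probably intact and the fault is more likely elsewhere, for instance in memory or in the reading code.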