Random utf-8 errors from dataset

I’m getting random utf-8 encoding errors from my dataset:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 117020: invalid start byte

These are datasets saved with dataset.save_to_disk('/out/file/dir') from the datasets library, so they are stored on disk as .arrow files.
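
For reference, the save/load pattern is essentially this (the path and the toy contents are just placeholders, not my real data):

```python
from datasets import Dataset, load_from_disk

# Placeholder dataset standing in for the real preprocessed corpus.
ds = Dataset.from_dict({"text": ["hello", "world"]})

# Writes .arrow files (plus metadata) under the given directory.
ds.save_to_disk('/out/file/dir')

# Later, e.g. when resuming pretraining, the dataset is read back like this.
ds = load_from_disk('/out/file/dir')
```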

I say “random” because they really make no sense. In the current case, I had just run 4 epochs of pretraining and decided to resume for a couple more epochs, and suddenly the error was thrown. This has been going on for a few weeks now: random utf-8 errors with no obvious trigger. Why would this be happening?

Sometimes these errors were thrown while preprocessing the data, so I added the following line:

result = bytes(result, 'utf-8').decode('utf-8', 'ignore')

This seemed to work, in that preprocessing was able to complete, but the error still turns up at arbitrary times.
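
For context, that line sits inside the preprocessing function I pass to map(); a simplified sketch is below (the function and column names here are made up for illustration, not my actual code):

```python
from datasets import load_from_disk

def clean_text(example):
    text = example["text"]
    # Round-trip the string through utf-8, dropping anything
    # that can't be decoded (errors='ignore').
    text = bytes(text, 'utf-8').decode('utf-8', 'ignore')
    return {"text": text}

ds = load_from_disk('/out/file/dir')
ds = ds.map(clean_text)
```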

In some cases a reboot of the machine seems to “cure” the error, though it’s hard to call that a cure when I can’t even tell what the underlying error is. It’s extremely frustrating. Since the problem isn’t reproducible, I’m really not sure what I can do, other than perhaps stop using datasets.

Is there any way to avoid this?