I’m getting random utf-8 encoding errors from my dataset:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 117020: invalid start byte
These are datasets saved with dataset.save_to_disk('/out/file/dir') from the datasets library, and they are stored on disk as .arrow files.
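For reference, the save/load round trip looks roughly like this (the path and dataset contents below are just placeholders, but save_to_disk and load_from_disk are the calls I'm using):

```python
from datasets import Dataset, load_from_disk

# Build a toy dataset and persist it to disk as Arrow files
ds = Dataset.from_dict({"text": ["hello", "world"]})
ds.save_to_disk("/out/file/dir")  # placeholder path

# Later (e.g. when resuming pretraining) the dataset is reloaded
ds2 = load_from_disk("/out/file/dir")
print(ds2[0]["text"])
```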
I say “random” because they really make no sense. In the current case, I had just run 4 epochs of pretraining and decided to resume for a couple more epochs, and suddenly the error was thrown. This has been happening for a few weeks now: random UTF-8 errors. Why would this be happening???
Sometimes these errors were thrown while preprocessing the data, so I added the following line:
result = bytes(result, 'utf-8').decode('utf-8', 'ignore')
This seemed to work, in that the preprocessing was able to complete, but the error still turns up at arbitrary times.
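For context, that line sits inside the preprocessing function I pass to .map(); the sketch below is a simplified stand-in for my actual preprocessing (the column name and function are just illustrative):

```python
def preprocess(example):
    result = example["text"]
    # The line I added: round-trip through UTF-8, ignoring anything that won't decode
    result = bytes(result, 'utf-8').decode('utf-8', 'ignore')
    return {"text": result}

clean_ds = ds.map(preprocess)
```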
In some cases a reboot of the machine seems to “cure” the error… though it’s clearly not a “cure”, since it’s not at all clear there was ever a real error in the data to begin with… extremely frustrating. Since the problem isn’t consistent, I’m really not sure what I can do, except perhaps stop using datasets.
Is there any way to avoid this?