Hi and thanks for the reply!
So now I no longer care so much about the efficiency of the process: there are probably better ways (maybe loading bigger chunks of the JSONL in the generator?), and if someone has something in mind it could be useful for other users, but in my case I think I am fine even with this 3-4 hour process.
The problem is that I still cannot solve the UnicodeDecodeError, and it is driving me mad. I tried on a toy dataset I made, in which I manually included the \udb9c code, and added this line to the generator:
data['text'] = str(bytes(data['text'], 'utf-8', 'backslashreplace'), 'utf-8')
By converting the text to bytes and then back to UTF-8, it worked perfectly and the exception disappeared.
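For context, here is a minimal sketch of what the generator is doing (the function name, the file path, and the line-by-line JSONL reading are just how I'd simplify it here; the real script handles more fields):

import json

def jsonl_generator(jsonl_path):
    # Stream the JSONL file line by line to keep memory usage low.
    with open(jsonl_path, encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            # Escape lone surrogates (e.g. \udb9c) so that the later UTF-8
            # encoding step does not raise "surrogates not allowed".
            data['text'] = str(bytes(data['text'], 'utf-8', 'backslashreplace'), 'utf-8')
            yield data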
Then, and I cannot explain why, I tried again with the SAME code on the big dataset and, after a few hours, I got:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 7: surrogates not allowed
I don't know, it doesn't make any sense to me, because I should have already escaped the problematic characters with 'backslashreplace'. I also cannot understand how valid UTF-8 data produced by another Python library (bs4 in this case) can cause all these problems with the Hugging Face library. Interoperability should work by default.
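Just to double-check what the message means: '\ud83d' is a lone high surrogate (the lead half of an emoji surrogate pair), and encoding it with the strict handler reproduces exactly this error, while 'backslashreplace' escapes it. So a surrogate must still be reaching the encoder unescaped somewhere; my guess is that it comes from a record or field that the line above never touches, but that is only a guess.

>>> '\ud83d'.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>> '\ud83d'.encode('utf-8', 'backslashreplace')
b'\\ud83d'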
Now I will keep tinkering with it, but I am losing time on a very stupid issue…
I would prefer not to skip lines because I don't know whether there are only a few of them (probably) or a lot, and in the latter case it would be a pity, because this would be the first big dataset for Italian, where data is generally scarce.
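As a first step, I might simply count how many lines are actually affected before deciding; a rough sketch (the path and the 'text' field are from my setup, everything else is illustrative):

import json

def count_problem_lines(jsonl_path):
    # Count records whose text cannot be UTF-8 encoded (i.e. contains
    # lone surrogates like \udb9c or \ud83d), without dropping anything.
    bad = total = 0
    with open(jsonl_path, encoding='utf-8') as f:
        for line in f:
            total += 1
            data = json.loads(line)
            try:
                data['text'].encode('utf-8')
            except UnicodeEncodeError:
                bad += 1
    return bad, total

If it turns out to be only a handful of lines I could inspect them by hand; if it is a lot, the 'backslashreplace' escaping seems the only reasonable option.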