Hi and thanks for the reply!
So now I no longer care so much about the efficiency of the process: there are probably better ways (maybe loading bigger chunks of the JSONL in the generator?), and if someone has something in mind it could be useful for other users, but in my case I think I am fine even with this 3-4 hour process.
The problem is that I still cannot solve the UnicodeDecodeError, and it is driving me mad. I tried on a toy dataset I made, in which I manually included the \udb9c code, and added this line to the generator:
data['text'] = str(bytes(data['text'], 'utf-8', 'backslashreplace'), 'utf-8')
By converting the text to bytes and then back to UTF-8, it worked perfectly and the exception disappeared.
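For context, here is a minimal sketch of what the generator is doing (the function name, the file path, and the line-by-line JSONL reading are just how I'd simplify it here; the real script handles more fields):

import json

def jsonl_generator(jsonl_path):
    # Stream the JSONL file line by line to keep memory usage low.
    with open(jsonl_path, encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            # Escape lone surrogates (e.g. \udb9c) so that the later UTF-8
            # encoding step does not raise "surrogates not allowed".
            data['text'] = str(bytes(data['text'], 'utf-8', 'backslashreplace'), 'utf-8')
            yield data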
Then, and I cannot explain why, I tried again with the SAME code on the big dataset and, after a few hours, I got:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 7: surrogates not allowed
I don't know, it doesn't make any sense to me, because I should have already escaped the problematic characters with 'backslashreplace'. I also cannot understand how valid UTF-8 data produced by another Python library (bs4 in this case) can cause all these problems with the Hugging Face library. Interoperability should work by default.
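Just to double-check what the message means: '\ud83d' is a lone high surrogate (the lead half of an emoji surrogate pair), and encoding it with the strict handler reproduces exactly this error, while 'backslashreplace' escapes it. So a surrogate must still be reaching the encoder unescaped somewhere; my guess is that it comes from a record or field that the line above never touches, but that is only a guess.

>>> '\ud83d'.encode('utf-8')
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>> '\ud83d'.encode('utf-8', 'backslashreplace')
b'\\ud83d'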
Now I will keep tinkering with it, but I am losing time on a very stupid issue…
I would prefer not to skip lines because I don't know whether there are only a few of them (probably) or a lot, and in the latter case it would be a pity, because this would be the first big dataset for Italian, where data is generally scarce.
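As a first step, I might simply count how many lines are actually affected before deciding; a rough sketch (the path and the 'text' field are from my setup, everything else is illustrative):

import json

def count_problem_lines(jsonl_path):
    # Count records whose text cannot be UTF-8 encoded (i.e. contains
    # lone surrogates like \udb9c or \ud83d), without dropping anything.
    bad = total = 0
    with open(jsonl_path, encoding='utf-8') as f:
        for line in f:
            total += 1
            data = json.loads(line)
            try:
                data['text'].encode('utf-8')
            except UnicodeEncodeError:
                bad += 1
    return bad, total

If it turns out to be only a handful of lines I could inspect them by hand; if it is a lot, the 'backslashreplace' escaping seems the only reasonable option.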