UnicodeEncodeError: surrogates not allowed

Hi everyone,
I usually try not to ask about things I could manage to solve on my own, but I just finished a very big and complex project, I am kind of exhausted, and I am just one small step away from publishing it on Hugging Face.
I am working with an NLP dataset of approx. 500 GB, so it’s big, and the section that is giving me problems (I am separating sections of the dataset using the “config_name” parameter in Hugging Face) is a single 300+ GB JSONL file, so I cannot do much trial and error, because every attempt is very time-consuming.

First, I was trying to load the JSONL with the usual load_dataset(‘json’), but even though I have 512 GB of RAM at my disposal on my university’s HPC server, it runs out of memory. I worked around it by using a generator function (this is my first experience with HF and I haven’t gone through all the documentation yet), but it is quite inefficient. Is there an elegant way to pass bigger chunks of data (for example, 100 GB) to the calling function instead of single lines? Would that speed up the process? I was getting more than 300,000 rows per second with load_dataset but only 15,000 rps with the generator function. My generator function was just a simple for loop iterating over the JSONL and giving back the results via yield(data).
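For context, the generator is roughly this (a simplified sketch, not my real code: the path and field names are placeholders, and Dataset.from_generator is just how the sketch wires it into datasets):

import json
from datasets import Dataset

def gen(path):
    # simple line-by-line loop: one Python dict per JSONL line
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

# 'corpus.jsonl' is just a placeholder name
ds = Dataset.from_generator(gen, gen_kwargs={'path': 'corpus.jsonl'})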

Even though the dataset is big, the computer managed to process almost all of it in a few hours, but close to the end I unfortunately got an exception:

UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9c' in position 8: surrogates not allowed

Texts come from very different sources; many of them are old (90s, early 00s), so it’s plausible that in some conversion from Latin-1 to UTF-8 some characters were replaced with surrogates. It’s not a big issue: I could substitute them with ? and it would be fine; I just wonder what the best way to do it is on such a big dataset. I could do the substitution in the generator function, but first, I don’t know where to find a list of code points that are not allowed by pyarrow, and second, I need something efficient because it’s a lot of data. The text is in Italian, a language full of characters such as àèéìòù. More than 99% of the text is probably already valid Unicode.
Is a UTF-8 encode + decode round trip enough? Is it efficient?
Moreover, could someone suggest how to implement a robust generator function that is both fast and very unlikely to crash? The JSON objects contained in the lines are all valid, as they were already checked by another script beforehand.
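For the substitution itself, what I have in mind is something along these lines (just a sketch, not yet tested at full scale):

def sanitize(s):
    # lone surrogates (U+D800-U+DFFF) cannot be encoded as UTF-8;
    # errors='replace' turns each of them into '?' in the byte string,
    # so the decoded result is valid Unicode again
    return s.encode('utf-8', 'replace').decode('utf-8')

Would calling this on every text inside the generator be fast enough?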

Thank you a lot :slight_smile:
Cheers,
Matteo

Hi ! The speed difference comes from the fact that the json loader reads the data as Arrow data without instantiating Python objects. The generator function, on the other hand, yields Python objects that then need to be converted to Arrow to instantiate a Dataset.

Maybe you can try/except the JSON decoding and ignore the samples that raise a UnicodeEncodeError? To be extra robust, you can even ignore all kinds of errors.
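Something like this, for example (just a sketch, assuming each line has a "text" field; adapt it to your data):

import json

def gen(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            try:
                data = json.loads(line)
                # force the encoding error to happen here (if any)
                # instead of later, when the sample is written to Arrow
                data['text'].encode('utf-8')
            except (json.JSONDecodeError, UnicodeEncodeError, KeyError):
                continue  # skip problematic samples
            yield data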

Hi and thanks for the reply!
So now I no longer care so much about the efficiency of the process: there are probably better ways (maybe loading bigger chunks of the JSONL in the generator? see the sketch just below), and if someone has something in mind it could be useful for other users, but in my case I think I am fine even with this 3-4 hour process.
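For other users reading this: one thing I may try later, if I understood the from_generator docs correctly (this is an assumption, I have not tested it on the real data), is splitting the file into shards and passing the list of shards via gen_kwargs together with num_proc, so that several generator processes run in parallel. Shard names below are hypothetical:

import json
from datasets import Dataset

def gen(shards):
    # with num_proc > 1, each process should receive a subset of `shards`
    for path in shards:
        with open(path, encoding='utf-8') as f:
            for line in f:
                yield json.loads(line)

shards = [f'corpus-{i:05d}.jsonl' for i in range(64)]  # hypothetical shard names
ds = Dataset.from_generator(gen, gen_kwargs={'shards': shards}, num_proc=8)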
The problem is that I still cannot solve the UnicodeEncodeError, and it is driving me crazy. I tried on a toy dataset that I made, in which I manually included the \udb9c code point, by adding this line to the generator:

data['text'] = data['text'].encode('utf-8', 'backslashreplace').decode('utf-8')

So by converting the text to bytes and back to UTF-8, it worked perfectly and the exception disappeared.
Then, and I cannot explain why, I tried again on the big dataset with the SAME code and, after some hours, I got:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 7: surrogates not allowed
I don’t know, it doesn’t make any sense to me, because I should have already escaped the problematic characters with ‘backslashreplace’. I also cannot understand how valid UTF-8 data produced by another Python library (bs4 in this case) can cause all these problems with the Hugging Face library. Interoperability should work by default.
Now I will keep tinkering with it, but I am losing time over a very silly issue…
I prefer not to skip lines, because I don’t know whether only a few of them are affected (probably) or a lot, and in the latter case it would be a pity, because this would be the first big dataset for Italian, a language for which data is generally scarce.
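If it helps anyone who hits the same wall: my current guess (and it is only a guess) is that the round trip above only touches data['text'], so a lone surrogate hiding in some other string field of the JSON could still reach Arrow untouched. A sketch of a sweep over every string value instead of just the text field:

def sanitize(value):
    # recursively replace lone surrogates in every string of the sample
    if isinstance(value, str):
        return value.encode('utf-8', 'backslashreplace').decode('utf-8')
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value

# inside the generator: yield sanitize(json.loads(line))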
