Hi everyone,
I usually try not to ask about things I could solve on my own, but I have just finished a very large and complex project, I am kind of exhausted, and I only need one small step to get it published on Hugging Face.
I am working with an NLP dataset of approx. 500 GB, so it's big, and the section that is giving me problems (I am separating the sections of the dataset using the `config_name` parameter on Hugging Face) is a single 300+ GB jsonl file, so I cannot do many trial-and-error runs because every attempt is very time consuming.
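For context, this is roughly how I am organizing the sections (the file path, repository id and config name below are just placeholders, not the real ones):

```python
from datasets import load_dataset

# Placeholder names: each section of the corpus is pushed as its own config.
section = load_dataset("json", data_files="sections/small_section.jsonl", split="train")
section.push_to_hub("matteo/italian-corpus", config_name="small_section")
```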
First, I tried to load the jsonl with the usual load_dataset("json", data_files=...), but even though I have 512 GB of RAM at my disposal on my university's HPC server, it runs out of memory. I worked around it with a generator function (this is my first experience with HF and I haven't gone through all the documentation yet), but it is quite inefficient. Is there an elegant way to pass bigger chunks of data (for example, 100 GB) to the calling function instead of single lines? Would that speed up the process? I was getting more than 300,000 rows per second with load_dataset, but only 15,000 rows per second with the generator function. My generator function is just a simple for loop iterating over the jsonl and returning each parsed line via yield, roughly like the sketch below.
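This is more or less what I currently have (the file path is a placeholder, and the real fields are different):

```python
import json
from datasets import Dataset

def jsonl_generator(path):
    # Plain line-by-line iteration: one json.loads() and one yield per row.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# "big_section.jsonl" stands in for the real 300+ GB file.
dataset = Dataset.from_generator(jsonl_generator, gen_kwargs={"path": "big_section.jsonl"})
```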
Even though the file is big, the machine managed to process almost the entire dataset in a few hours, but close to the end I unfortunately got an exception:
```
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9c' in position 8: surrogates not allowed
```
The texts come from very different sources, and many of them are old (the '90s and early '00s), so it is plausible that in some conversion from Latin-1 to UTF-8 a few characters were replaced with surrogates. It's not a big issue: I could substitute them with ? and it would be fine, I just wonder what the best way to do that is on such a big dataset. I could do the substitution in the generator function, but first I don't know where to find a list of code points that PyArrow does not allow, and second I need something efficient, because it's a lot of data. The text is in Italian, a language full of characters such as àèéìòù, and probably more than 99% of it is already valid Unicode.
Is a UTF-8 encode + decode round trip enough? Is it efficient?
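To be concrete, something like this is what I have in mind inside the generator (the "text" field name is a placeholder; errors="replace" turns the lone surrogates into ? when encoding):

```python
import json

def clean_text(text):
    # Lone surrogates (e.g. '\udb9c') cannot be encoded to UTF-8; a round trip
    # with errors="replace" substitutes them with '?', which is fine for me.
    return text.encode("utf-8", errors="replace").decode("utf-8")

def jsonl_generator(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            # "text" is a placeholder: the real field names are different.
            row["text"] = clean_text(row["text"])
            yield row
```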
Moreover, could someone suggest how to implement a robust generator function that is both fast and very unlikely to crash? The JSON objects on each line are all valid, as they were already checked by another script beforehand.
Thanks a lot
Cheers,
Matteo