Hi everyone,
I usually try not to ask about things I could solve on my own, but I have just finished a very large and complex project, I am kind of exhausted, and I only need one small step to get it published on Hugging Face.
I am working with an NLP dataset of approx. 500 GB, so it's big, and the section that is giving me problems (I am separating the sections of the dataset using the `config_name` parameter on Hugging Face) is a single 300+ GB jsonl file, so I cannot do many trial-and-error runs because every attempt is very time consuming.
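For context, this is roughly how I am organizing the sections (the file path, repository id and config name below are just placeholders, not the real ones):

```python
from datasets import load_dataset

# Placeholder names: each section of the corpus is pushed as its own config.
section = load_dataset("json", data_files="sections/small_section.jsonl", split="train")
section.push_to_hub("matteo/italian-corpus", config_name="small_section")
```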
First, I tried to load the jsonl with the usual load_dataset("json", data_files=...), but even though I have 512 GB of RAM at my disposal on my university's HPC server, it runs out of memory. I worked around it with a generator function (this is my first experience with HF and I haven't gone through all the documentation yet), but it is quite inefficient. Is there an elegant way to pass bigger chunks of data (for example, 100 GB) to the calling function instead of single lines? Would that speed up the process? I was getting more than 300,000 rows per second with load_dataset, but only 15,000 rows per second with the generator function. My generator function is just a simple for loop iterating over the jsonl and returning each parsed line via yield, roughly like the sketch below.
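This is more or less what I currently have (the file path is a placeholder, and the real fields are different):

```python
import json
from datasets import Dataset

def jsonl_generator(path):
    # Plain line-by-line iteration: one json.loads() and one yield per row.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# "big_section.jsonl" stands in for the real 300+ GB file.
dataset = Dataset.from_generator(jsonl_generator, gen_kwargs={"path": "big_section.jsonl"})
```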
Even though the file is big, the machine managed to process almost the entire dataset in a few hours, but close to the end I unfortunately got an exception:
```
UnicodeEncodeError: 'utf-8' codec can't encode character '\udb9c' in position 8: surrogates not allowed
```
The texts come from very different sources, and many of them are old (the '90s and early '00s), so it is plausible that in some conversion from Latin-1 to UTF-8 a few characters were replaced with surrogates. It's not a big issue: I could substitute them with ? and it would be fine, I just wonder what the best way to do that is on such a big dataset. I could do the substitution in the generator function, but first I don't know where to find a list of code points that PyArrow does not allow, and second I need something efficient, because it's a lot of data. The text is in Italian, a language full of characters such as àèéìòù, and probably more than 99% of it is already valid Unicode.
Is a UTF-8 encode + decode round trip enough? Is it efficient?
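To be concrete, something like this is what I have in mind inside the generator (the "text" field name is a placeholder; errors="replace" turns the lone surrogates into ? when encoding):

```python
import json

def clean_text(text):
    # Lone surrogates (e.g. '\udb9c') cannot be encoded to UTF-8; a round trip
    # with errors="replace" substitutes them with '?', which is fine for me.
    return text.encode("utf-8", errors="replace").decode("utf-8")

def jsonl_generator(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            # "text" is a placeholder: the real field names are different.
            row["text"] = clean_text(row["text"])
            yield row
```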
Moreover, could someone suggest how to implement a robust generator function that is both fast and very unlikely to crash? The JSON objects on each line are all valid, as they were already checked by another script beforehand.
Thanks a lot
Cheers,
Matteo