[SOLVED] Dataset.map() is frozen on ELI5

Jung · August 10, 2020, 9:17am

I have tried to prepare ELI5 for training with T5, based on this wonderful notebook by Suraj Patil.

However, when I run dataset.map() on ELI5 to prepare input_text and target_text, dataset.map() freezes within the first few hundred examples. In contrast, the same code works totally fine on SQUAD (80,000 examples). Both nlp 0.3.0 and 0.4.0 freeze, and I also tried pyarrow 0.16.0, 0.17.0, and 1.0.0 with the same result.

Reproducible code can be found in this colab notebook, where I also show that the same mapping function works fine on SQUAD, so the problem is likely specific to ELI5.

More info: if I run a for loop and apply the function myself instead of using map, there’s no error and it finishes within 10 seconds. However, an nlp dataset is immutable, so I could not create a new key-value pair within the dataset directly.
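For readers following along, here is a minimal sketch of the for-loop workaround described above. It is not the code from the colab notebook: the function name add_text_fields, the "question: " prefix, and the field layout (a "title" string plus an "answers" dict holding a "text" list) are assumptions for illustration, and the two records are made up rather than taken from ELI5.

```python
def add_text_fields(example):
    """Return a copy of the example with input_text and target_text added.

    Hypothetical helper: the field names "title" and "answers" -> "text"
    are assumed for illustration, not taken from the original notebook.
    """
    question = example["title"].strip()
    answers = example["answers"]["text"]
    # Use the first answer as the target, or an empty string if there is none.
    target = answers[0].strip() if answers else ""
    return {**example, "input_text": f"question: {question}", "target_text": target}


# Made-up records standing in for dataset rows.
examples = [
    {"title": " Why is the sky blue? ",
     "answers": {"text": ["Air scatters blue light more than red."]}},
    {"title": "How do magnets work?", "answers": {"text": []}},
]

# Plain loop: build new records instead of adding keys to the dataset in place.
processed = [add_text_fields(ex) for ex in examples]

for row in processed:
    print(row["input_text"], "->", repr(row["target_text"]))
```

Because the function takes one example and returns a dict, the same function could in principle be passed to dataset.map() once the freeze is resolved.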

I also notice that SQUAD texts are quite clean while ELI5 texts contain many special characters; I'm not sure if this is the cause.

Jung · August 11, 2020, 11:57pm

Fixed by amazing Quentin here:
https://github.com/huggingface/nlp/issues/482

Thanks very much again!

lhoestq · August 24, 2020, 7:50pm

You’re welcome
