[SOLVED] Dataset.map() is frozen on ELI5

I have been trying to prepare ELI5 for training with T5, based on this wonderful notebook by Suraj Patil.

However, when I run dataset.map() on ELI5 to create input_text and target_text, it freezes within the first few hundred examples. By contrast, the same call works fine on SQUAD (80,000 examples). Both nlp 0.3.0 and 0.4.0 freeze, and trying pyarrow 0.16.0, 0.17.0, and 1.0.0 gives the same frozen process.

Reproducible code can be found in this Colab notebook, where I also show that the same mapping function works fine on SQUAD, so the problem is likely specific to ELI5.


More info: if I run a plain for loop and apply the function myself instead of using map, there is no error and it finishes within 10 seconds. However, an nlp dataset is immutable, so I cannot add a new key-value pair to the dataset directly.
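
For anyone trying the same workaround, here is a rough sketch of the manual loop. It uses a plain list of dicts as a stand-in for the dataset, and the field names (title, answer), the "question: " prefix, and the function name add_text_fields are hypothetical placeholders rather than the actual ELI5 schema or the notebook's code. Since the rows cannot be modified in place, the results go into a new list.

```python
# Hypothetical stand-in rows; the real ELI5 fields may be named differently.
examples = [
    {"title": "Why is the sky blue?", "answer": "Sunlight scatters off air molecules."},
    {"title": "How do magnets work?", "answer": "Aligned electron spins create a field."},
]


def add_text_fields(example):
    """Return a new dict with input_text and target_text added (original untouched)."""
    return {
        **example,
        "input_text": "question: " + example["title"],  # assumed prompt format
        "target_text": example["answer"],
    }


# Plain loop instead of dataset.map(): collect the results in a new list,
# because the source rows are treated as immutable.
processed = [add_text_fields(ex) for ex in examples]

print(processed[0]["input_text"])
```

The downside is that the result is an ordinary Python list, not an nlp dataset, so it does not get the caching and Arrow-backed storage that map provides.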

I also noticed that SQUAD texts are quite clean, while ELI5 texts contain many special characters. I am not sure whether this is the cause.

Fixed by the amazing Quentin here:
https://github.com/huggingface/nlp/issues/482

Thanks very much again!


You’re welcome :slight_smile:
