I have tried to prepare ELI5 for training with T5, based on this wonderful notebook by Suraj Patil.
However, when I run dataset.map() on ELI5 to prepare input_text and target_text, dataset.map freezes within the first few hundred examples. By contrast, the same call works fine on SQUAD (80,000 examples). Both nlp versions 0.3.0 and 0.4.0 freeze in the same way. I also tried pyarrow versions 0.16.0, 0.17.0 and 1.0.0, and the process freezes with each of them.
Reproducible code can be found in this colab notebook, where I also show that the same mapping function works fine on SQUAD, so the problem is likely specific to ELI5.
More info: if, instead of map, I run a for loop and apply the function myself, there is no error and it finishes within 10 seconds. However, an nlp dataset is immutable, so I could not create a new key-value pair within the dataset directly.
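For reference, here is a minimal sketch of the for-loop approach described above. It runs on a plain Python list of dicts rather than on the nlp dataset itself. The field names (title, answers with a text list), the "question: " prefix, and the helper names add_text_fields and apply_in_loop are assumptions for illustration, not necessarily what the notebook uses.

```python
from typing import Dict, List


def add_text_fields(example: Dict) -> Dict:
    """Return a copy of `example` with input_text and target_text added.

    Assumes an ELI5-like record with a `title` string and an
    `answers` dict holding a `text` list (hypothetical layout).
    """
    answers = example["answers"]["text"]
    new_example = dict(example)  # shallow copy, the original record is left untouched
    new_example["input_text"] = "question: " + example["title"]
    new_example["target_text"] = answers[0] if answers else ""
    return new_example


def apply_in_loop(examples: List[Dict]) -> List[Dict]:
    """Apply the mapping function with a plain for loop instead of map."""
    processed = []
    for example in examples:
        processed.append(add_text_fields(example))
    return processed


# Toy records standing in for ELI5 examples.
toy_examples = [
    {"title": "Why is the sky blue?", "answers": {"text": ["Rayleigh scattering."]}},
    {"title": "Why do cats purr?", "answers": {"text": []}},
]
processed = apply_in_loop(toy_examples)
```

Because the loop builds new dicts, the result lives outside the dataset object, which is exactly the limitation mentioned above.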
I also notice that SQUAD texts are quite clean while ELI5 texts contain many special characters; I am not sure whether this is the cause.
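One rough way to check the special-character hypothesis is to count non-ASCII and control characters per text and look at the examples around where the freeze happens. This is only a diagnostic sketch; the helper name count_unusual_chars and the definition of "unusual" are my own assumptions.

```python
import unicodedata


def count_unusual_chars(text: str) -> int:
    """Count non-ASCII characters and control characters (except common whitespace)."""
    count = 0
    for ch in text:
        is_non_ascii = ord(ch) > 127
        # Unicode category "Cc" covers control characters such as \x00.
        is_control = unicodedata.category(ch) == "Cc" and ch not in "\n\t\r"
        if is_non_ascii or is_control:
            count += 1
    return count


# Toy strings standing in for SQUAD-like and ELI5-like texts.
clean_text = "plain text\nwith a newline"
noisy_text = "caf\u00e9 \x00"
clean_count = count_unusual_chars(clean_text)
noisy_count = count_unusual_chars(noisy_text)
```

A high count on ELI5 would not prove that these characters cause the freeze, but it would help narrow down which examples to test in isolation.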