I have tried to prepare ELI5 for training with T5, based on this wonderful notebook by Suraj Patil.
However, when I run dataset.map() on ELI5 to prepare it, dataset.map freezes within the first few hundred examples. In contrast, the same call works fine on SQUAD (80,000 examples). Both nlp 0.3.0 and 0.4.0 freeze in the same way, and so do the pyarrow versions I tried (0.16.0, 0.17.0 and 1.0.0).
Reproducible code can be found in this Colab notebook, where I also show that the same mapping function works fine on SQUAD, so the problem is likely specific to ELI5 somehow.
More info: if, instead of map, I run a for loop and apply the function myself, there is no error and it finishes within 10 seconds. However, an nlp dataset is immutable, so I cannot create a new key-value pair within the dataset directly.
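
For reference, the manual loop amounts to something like the sketch below. It is only an illustration: to_text_pair and apply_manually are hypothetical names, the field names title and answer are placeholders rather than the real ELI5 schema, and plain dicts stand in for dataset rows. The results are collected into ordinary Python lists instead of being written back into the dataset.

```python
# Hypothetical stand-in for the mapping function; the real one lives in the notebook.
def to_text_pair(example):
    return {
        "input_text": "question: " + example["title"],
        "target_text": example["answer"],
    }


def apply_manually(rows, fn):
    """Apply fn to every row and collect the merged fields column by column."""
    columns = {}
    for row in rows:
        merged = dict(row)       # copy, so the original row is left untouched
        merged.update(fn(row))   # add the new key-value pairs
        for key, value in merged.items():
            columns.setdefault(key, []).append(value)
    return columns


# Placeholder rows, not real ELI5 data.
rows = [
    {"title": "Why is the sky blue?", "answer": "Rayleigh scattering."},
    {"title": "What is T5?", "answer": "A text-to-text model."},
]
columns = apply_manually(rows, to_text_pair)
print(sorted(columns))
```

This gives me the new columns, but only as a plain dict of lists outside the dataset object, which is why I would still prefer map to work.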
I also noticed that SQUAD texts are quite clean while ELI5 texts contain many special characters. I am not sure whether this is the cause.
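
A quick way to test this guess would be to scan the texts for unusual characters and see whether the first hit lines up with where map stalls. The sketch below uses only the standard library; unusual_chars and first_unusual_index are hypothetical helper names, the sample strings are made up, and "unusual" is assumed to mean non-ASCII or control characters, which may not be the right criterion.

```python
import unicodedata


def unusual_chars(text):
    """Return the set of non-ASCII or control characters found in text."""
    found = set()
    for ch in text:
        # Unicode categories starting with "C" cover control and format characters.
        is_control = unicodedata.category(ch).startswith("C") and ch not in "\n\t\r"
        if ord(ch) > 127 or is_control:
            found.add(ch)
    return found


def first_unusual_index(texts):
    """Return the index of the first text containing an unusual character, or None."""
    for i, text in enumerate(texts):
        if unusual_chars(text):
            return i
    return None


# Made-up samples, not real ELI5 data.
samples = ["plain ascii text", "caf\u00e9 with an accent", "null byte \x00 here"]
print(first_unusual_index(samples))
```

If the index found this way is nowhere near the point where map freezes, the special characters are probably not the cause.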