Hi, I'm trying to follow this simple article on applying BERT to an NLP regression problem via Hugging Face Transformers. My dataset, however, comprises around 1.2 million sentences/sequences, of which ~20% are held out as the test set. The sentences themselves are fairly long as well. My model of choice is distilbert-base-uncased, which is around 66 million parameters.
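For reference, here is roughly how I'm setting up the model for regression, following the article (a sketch, not my exact notebook: the config-only instantiation below just avoids the pretrained-weights download; in the real notebook I'd call `from_pretrained("distilbert-base-uncased", num_labels=1, problem_type="regression")` instead):

```python
# Minimal sketch of a DistilBERT regression setup.
# num_labels=1 + problem_type="regression" makes the sequence-classification
# head output a single score and use an MSE loss.
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Config-only instantiation (random weights) so the sketch runs standalone;
# the actual notebook loads pretrained weights with from_pretrained(...).
config = DistilBertConfig(num_labels=1, problem_type="regression")
model = DistilBertForSequenceClassification(config)

dummy_ids = torch.randint(0, config.vocab_size, (2, 8))  # 2 toy sequences
out = model(input_ids=dummy_ids)
print(out.logits.shape)  # torch.Size([2, 1]) -- one score per sequence
```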
I understand that this scale would obviously hinder training and inference speeds, especially since I am relying on free public compute platforms.
So far, I have been switching between the TPU and GPU accelerator options on Kaggle but have not been able to get the TPU working properly. I'm a beginner, and the few notebooks available online on this topic look complicated, so I would like to stick to the article. I understand that some modifications are needed for XLA and such, since I'm using Hugging Face/PyTorch components, and that Trainer doesn't support TPUs out of the box, but I do not know the right changes to make or where to make them.
Currently on the GPU it shows ~22 hours for 1 epoch, which is painful, and I am aiming for at least 2 epochs. How do I bring this down using the TPU? Is that even possible? I would appreciate it if anyone could point out the changes with respect to the article. Thanks.
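For context, here's my rough back-of-envelope on the epoch length and what 8 TPU cores could buy in the best case (the batch size of 32 and the linear scaling are my assumptions, not measurements):

```python
# Back-of-envelope: steps per epoch and a best-case TPU speedup ceiling.
# Assumptions (mine, not measured): batch size 32, 8 TPU cores,
# perfectly linear scaling -- real-world scaling will be worse.
total_sentences = 1_200_000
train_fraction = 0.8                 # ~20% held out as test
batch_size = 32

train_sentences = int(total_sentences * train_fraction)
steps_per_epoch = train_sentences // batch_size
print(steps_per_epoch)               # 30000 optimizer steps per epoch

gpu_hours_per_epoch = 22
tpu_cores = 8
best_case_tpu_hours = gpu_hours_per_epoch / tpu_cores
print(best_case_tpu_hours)           # 2.75 -- so 2 epochs in ~5.5h, at best
```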
Edit/P.S.: After getting predicted scores on the test sentences, I need to get predicted scores on the train sentences as well.
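For the P.S., my understanding is that scoring the train sentences is just a second inference pass with the same model, so one helper can serve both splits. A sketch (toy-sized random config and random token IDs so it runs standalone; the real notebook would use the fine-tuned distilbert-base-uncased model and its tokenizer):

```python
# Sketch: one batched predict helper reused for both test and train sentences.
# Toy config (random weights, tiny dims) so this runs without downloads.
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

config = DistilBertConfig(num_labels=1, problem_type="regression",
                          vocab_size=100, dim=32, n_layers=2, n_heads=2,
                          hidden_dim=64)
model = DistilBertForSequenceClassification(config)

def predict_scores(model, input_ids, batch_size=2):
    """Return one regression score per sequence, computed in batches."""
    model.eval()
    scores = []
    with torch.no_grad():                         # no gradients at inference
        for i in range(0, input_ids.size(0), batch_size):
            logits = model(input_ids=input_ids[i:i + batch_size]).logits
            scores.append(logits.squeeze(-1))     # (batch, 1) -> (batch,)
    return torch.cat(scores)

# Same helper for both splits: test first, then train.
test_ids = torch.randint(0, 100, (4, 8))   # stand-ins for tokenized sentences
train_ids = torch.randint(0, 100, (6, 8))
print(predict_scores(model, test_ids).shape)   # torch.Size([4])
print(predict_scores(model, train_ids).shape)  # torch.Size([6])
```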