How to fine-tune BERT on 1 million+ sentences on Kaggle? (Sequence Regression)

Hi, I’m trying to follow this simple article on applying BERT via HuggingFace Transformers to an NLP regression problem. My dataset, however, comprises around 1.2 million sentences/sequences, of which ~20% are test. The sentences themselves are also fairly long. My model of choice is distilbert-base-uncased, which has around 66 million parameters.
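For reference, this is roughly the setup I'm following, as I understand it from the article: a single-output regression head on DistilBERT. This is my own minimal sketch, not code copied from the article, and the `"text"` column and `max_length` are placeholders for my own data:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# DistilBERT with a single-output head; problem_type="regression" makes the
# model use an MSE loss on the one logit instead of a classification loss.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=1,
    problem_type="regression",
)

def tokenize(batch):
    # "text" is a placeholder for my actual text column.
    return tokenizer(batch["text"], truncation=True, max_length=256)
```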

I understand that a dataset this large will obviously slow down training and inference, especially since I am relying on public compute platforms.

So far I have been switching between the TPU and GPU accelerator options on Kaggle, but I have not been able to get the TPU working properly. I'm a beginner, and the few notebooks available online on this topic look complicated, so I would like to stick to the article. I understand that some modifications involving XLA are needed since I'm using HuggingFace/PyTorch components, and I'm not sure how far Trainer's TPU support goes, but I do not know the right changes to make or where to make them.
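To make the question concrete, here is my rough understanding of what a bare torch_xla training loop would look like on the Kaggle TPU. This is a sketch of what I *think* is needed, not code I have working; `train_dataset` stands in for my tokenized train split, padded to a fixed length since I've read that fixed shapes avoid XLA recompilation:

```python
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import AutoModelForSequenceClassification

def _mp_fn(index):
    # Each of the 8 TPU cores runs this function with its own model replica.
    device = xm.xla_device()
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=1, problem_type="regression"
    ).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # train_dataset: placeholder for my tokenized train split.
    # The sampler splits the data so each core sees 1/8 of the batches.
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=xm.xrt_world_size(),
        rank=xm.get_ordinal(),
        shuffle=True,
    )
    loader = DataLoader(train_dataset, batch_size=32, sampler=sampler, drop_last=True)
    loader = pl.MpDeviceLoader(loader, device)  # feeds batches to the XLA device

    model.train()
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        xm.optimizer_step(optimizer)  # steps and syncs gradients across cores

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method="fork")
```

Is this the right shape of changes, and where exactly do they slot into the article's code?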

Currently on the GPU it shows ~22 hours for 1 epoch, which is painful, and I am aiming for at least 2 epochs. How do I bring this down using the TPU? Is that possible? I would appreciate it if anyone can point out the changes with respect to the article. Thanks.
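For context, my GPU run follows the article's Trainer setup, roughly along these lines. The batch size, fp16 flag, and dynamic padding are things I'm guessing at to speed it up, not settings from the article, and `model`, `tokenizer`, and `train_dataset` are from the sketch above:

```python
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="distilbert-regression",
    num_train_epochs=2,
    per_device_train_batch_size=32,
    fp16=True,          # mixed precision on the GPU
    learning_rate=2e-5,
    save_strategy="epoch",
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # the tokenized train split
    data_collator=DataCollatorWithPadding(tokenizer),  # pad per batch, not to the model max
)
trainer.train()
```

If there are obvious speed-ups I'm missing in that kind of setup, pointing those out would help too.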

Edit/P.S.: After getting predicted scores for the test sentences, I need predicted scores for the train sentences too. :frowning:
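I'm assuming that last part is just a second `predict()` call on the tokenized train split, something like the following (`test_dataset` and `train_dataset` are my tokenized splits):

```python
# With num_labels=1, .predictions has shape (n_examples, 1), hence the squeeze.
test_preds = trainer.predict(test_dataset).predictions.squeeze(-1)
train_preds = trainer.predict(train_dataset).predictions.squeeze(-1)
```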