[Suggestions and Guidance] Finetuning BERT models for next word prediction

Hi Sumanth! I believe you are already on the right track by finetuning GPT-2. The key difference is that GPT-2 was trained with causal (autoregressive) attention: it learns to predict the next word using only the tokens to its left, whereas BERT's masked-language-modeling objective lets it see the context on both sides of the masked token.
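
If it helps, here is a quick sketch (just an illustration, not from the original thread; the model name and prompt are placeholders) of how GPT-2's causal LM head scores candidates for the next token using only the left context:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Plain pretrained GPT-2; swap in your finetuned checkpoint path if you have one.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"  # example prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The logits at the last position score every vocabulary item as the *next*
# word; causal attention guarantees no position ever attended to tokens on
# its right.
next_token_logits = logits[0, -1]
top5 = torch.topk(next_token_logits, k=5).indices
print([tokenizer.decode(idx) for idx in top5.tolist()])
```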

The different models and their architectures are depicted in this chart:

Long story short: you should see better results with GPT-2. Let us know how it goes.
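
In case it is useful, a rough finetuning sketch with the transformers Trainer might look something like this (dataset, hyperparameters, and output path are only placeholders; adjust them for your own data):

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder corpus; replace with your own text dataset.
raw = load_dataset("wikitext", "wikitext-2-raw-v1")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda x: len(x["input_ids"]) > 0)  # drop empty lines

# mlm=False gives the standard next-token (causal) objective instead of masked LM.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2-next-word",       # placeholder output path
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```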

Cheers
Heiko
