Problem Statement: Build a next-word prediction model on legal text. The aim is an autocomplete model that uses the text typed so far, and possibly a concatenation of vectors from prior clauses/paragraphs.
Current Approach: Because BERT-based models are trained with a masked language modelling objective, pretrained models such as LegalBERT did not give good accuracy when the word to be predicted was marked as [MASK]. For example, in the sentence "use of [MASK]", the next word should be predicted in place of the [MASK] token. (Note that there are no words after the mask token, only before it.)
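For context, the masked-LM attempt was queried along these lines (a minimal sketch; it assumes the nlpaueb/legal-bert-base-uncased checkpoint for LegalBERT). With the mask at the very end, the model only has left-hand context to work with:

```python
from transformers import pipeline

# Fill-mask with a BERT-style model; the mask sits at the end of the typed text,
# so there is no right-hand context for the model to attend to.
fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")

for pred in fill_mask("use of [MASK]"):
    print(pred["token_str"], round(pred["score"], 3))
```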
I am currently approaching the problem as a sequence-classification problem, where the labels are the token ids of the words to be predicted next. I will also attempt to fine-tune GPT-2 on the legal text using run_clm.py from the Hugging Face examples directory (see the sketch below).
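For reference, a minimal sketch of what the causal-LM fine-tuning could look like, roughly what run_clm.py does under the hood. The file name legal_corpus.txt and the hyperparameters are placeholders, not values from the original setup:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Plain-text corpus of legal clauses, one or more paragraphs per line (placeholder path)
dataset = load_dataset("text", data_files={"train": "legal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False -> causal (next-token) objective; labels are created from the inputs
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(output_dir="gpt2-legal",
                         num_train_epochs=3,
                         per_device_train_batch_size=4)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  data_collator=collator)
trainer.train()
```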
Is there a better way to approach this problem of next word prediction?
Any suggestions and guidance would be welcome.
Thank you in advance
Hi Sumanth! I believe you are already on the right track by fine-tuning GPT-2. The key difference is that GPT was trained with causal/autoregressive attention, which means it is specifically trained to predict the next word without access to any words to the right of the current position (unlike BERT, which sees context on both sides).
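As a quick illustration (a sketch using the stock gpt2 checkpoint; a fine-tuned legal model would be used the same way), the next-word candidates can be read off the logits at the last position:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Causal LM: the prediction for each position depends only on tokens to its left
inputs = tokenizer("use of", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]            # distribution over the next token
top = torch.topk(next_token_logits, k=5)
print([tokenizer.decode(i) for i in top.indices])
```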
The different models and their architectures are depicted in this chart:
@marshmellow77 a question: is there a way to fine-tune and use T5 or BigBird for this next-word prediction task? I am unable to find tutorials for using these models for next-word prediction.