Problem Statement : To produce a next word prediction model on legal text. The aim is to build an autocomplete model which will make use of existing typed text as well as a possible concatenation of vectors from prior clauses/paragraphs.
Current Approach: Because Bert based model are based on masked language, pretrained models such as LegalBert did not produce good accuracy for prediction of next word when the word to be predicted was marked as [MASK]. Here is an example sentence, “use of [MASK]” where “marked” is the next word to be predicted in place of “[MASK]” token. (Note that there would not be words present after the mask token, only before the token).
Currently approaching the problem as a SequenceClassification problem where labels are the token ids of the words that are to be predicted next. Will also attempt to finetune gpt2 on the legal text using run_clm.py from huggingface examples directory
Is there a better way to approach this problem of next word prediction?
Any suggestions and guidance would be welcome.
Thank you in advance