Hi @vikasRajashekar,
I assume that by `lm_labels` you mean `labels`, and by `-1` you mean `-100` (see docs here).
Yes
The model learns to predict the last four tokens from the context. The context is all of the input tokens, including the last four: even when those four are masked or replaced, they still contribute correct position information to the context. Either way, all input tokens are used to predict the last four tokens.
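To make the role of `-100` concrete, here is a minimal pure-Python sketch of a cross-entropy loss that skips positions labeled `-100`. The function name `masked_cross_entropy`, the toy logits, and the label values are all made up for illustration; this is not the library's code, just the same idea as the `ignore_index=-100` default of PyTorch's `CrossEntropyLoss`. It only shows which positions are scored by the loss. Every position is still fed to the model as input.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, counted = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        # Numerically stable log-sum-exp of the scores at this position.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[label]  # negative log-probability of the label
        counted += 1
    return total / counted if counted else 0.0


# Toy example (hypothetical values): 6 positions, vocabulary of 3 tokens.
logits = [
    [2.0, 0.5, 0.1],
    [0.3, 1.5, 0.2],
    [0.1, 0.2, 3.0],
    [1.0, 1.0, 1.0],
    [0.0, 2.0, 0.5],
    [0.4, 0.4, 2.5],
]
# Only the last four positions are supervised; the first two are ignored.
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 0, 1, 2]

loss = masked_cross_entropy(logits, labels)

# Changing the logits at ignored positions leaves the loss unchanged.
altered = [[9.0, -9.0, 0.0], [-5.0, 5.0, 5.0]] + logits[2:]
loss_altered = masked_cross_entropy(altered, labels)
print(loss, loss_altered)
```

In a real model the outputs at the last four positions depend on the whole input through attention, which is why the unlabeled tokens still matter even though they carry no loss term.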