why -100? what does this achieve?
The value
-100
is a special token ID used by HuggingFace’s Transformers library to indicate that a particular token should be ignored when computing the loss.
why not mask = 0 for the indices you want to not train on?