We fine-tune the llama3_instruct model with AutoModelForCausalLM.
The interface basically involves creating a train dataset and an eval dataset, each of which contains
{input_ids, labels, attention_mask}.
In our case we are training on input like the following:
what is the capital of the united states? it's Washington DC.
The whole sentence is tokenized and fed in as "input_ids" in the train dataset;
then the whole sentence is also tokenized and fed in as "labels", but we overwrite the labels at every position of the first sentence with IGNORE_INDEX (-100, the default ignore_index of torch.nn.CrossEntropyLoss), so the trainer does not train on the first sentence.
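For concreteness, here is a minimal sketch of the preprocessing for a single example (simplified; the checkpoint name, the prompt/answer split, and the padding handling are only illustrative and may differ from our actual code):

```python
import torch
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Example checkpoint name; substitute the actual llama3_instruct checkpoint.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

prompt = "what is the capital of the united states? "
answer = "it's Washington DC."

# Tokenize the full sequence once, and the prompt alone to find its length.
input_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids[0]
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]

# Labels are a copy of input_ids, with the prompt positions ignored by the loss.
labels = input_ids.clone()
labels[:prompt_len] = IGNORE_INDEX

example = {
    "input_ids": input_ids,
    "labels": labels,
    # All real tokens get 1; any trailing padding added later would get 0.
    "attention_mask": torch.ones_like(input_ids),
}
```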
My question is about the attention_mask part. Here we do not set any attention_mask on the input, except for setting the mask to 0 for the trailing padding. So doesn't the trainer already see the second sentence ("it's Washington DC") when predicting that sentence itself? But then I realized that a CausalLM should have the leftward attention mask built into the transformer implementation (from the original "Attention Is All You Need" paper: "We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections."). So I just want to double-check: for AutoModelForCausalLM, does the HF Transformers framework automatically mask out the input to the right of each token?

It's a bit suspicious because in our training code, when I masked out the attention corresponding to the second sentence explicitly, the results were noticeably different.
Models like Llama-3 do indeed use a causal mask, which means that they can only look at the previous tokens when predicting the next token. This is done here in modeling_llama.py.
The attention_mask additionally removes other tokens (typically padding tokens) from the attention computation. Let's see an example.
If we have the sentence "hello my name is Niels <pad> <pad> <pad> <pad>", then the attention_mask would typically look like [1, 1, 1, 1, 1, 0, 0, 0, 0] - assuming each word is turned into a single token for simplicity.
The attention matrix internally will then look like this:
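Here is a rough PyTorch sketch (not the actual Hugging Face implementation) of how the causal mask and the padding attention_mask combine for this example; 1 means the position can be attended to, 0 means it is masked out (internally set to −∞ before the softmax):

```python
import torch

# "hello my name is Niels" (5 tokens) followed by 4 <pad> tokens.
attention_mask = torch.tensor([1, 1, 1, 1, 1, 0, 0, 0, 0])
seq_len = attention_mask.shape[0]

# Causal part: token i may only attend to positions <= i (lower triangular).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Padding part: no token may attend to a padding position (columns where mask == 0).
padding = attention_mask.bool().unsqueeze(0).expand(seq_len, seq_len)

# Combined mask that the attention computation effectively uses.
print((causal & padding).int())
```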
So as can be seen, each word can only look at itself and previous tokens (e.g. "hello" can only see "hello", then "my" can only see "hello" and "my", etc.), and padding tokens are completely removed from the attention computation.