Does the transformers Trainer.train() automatically apply a causal attention mask?

We are fine-tuning the Llama-3 Instruct model, loaded via AutoModelForCausalLM.

The interface basically consists of creating a train dataset and an eval dataset, each of which contains

{input_ids, labels, attention_mask}

In our case we are training on input like the following:

what is the capital of the united states? it's Washington DC.

The whole sentence is tokenized and fed as "input_ids" in the train dataset; the whole sentence is also tokenized and fed in as the labels, but we overwrite the labels at every position belonging to the first sentence with IGNORE_INDEX (-100, the ignore index used by torch.nn.CrossEntropyLoss), so the trainer does not compute a loss on the first sentence.
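Concretely, each training example is built roughly like the sketch below (the checkpoint name and the IGNORE_INDEX variable are just how we happen to set it up; any tokenizer works the same way):

```python
import torch
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # ignore index of torch.nn.CrossEntropyLoss

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

prompt = "what is the capital of the united states? "
answer = "it's Washington DC."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

input_ids = torch.tensor(prompt_ids + answer_ids)
labels = input_ids.clone()
# Mask out the prompt positions so no loss is computed on them
labels[: len(prompt_ids)] = IGNORE_INDEX

example = {
    "input_ids": input_ids,
    "labels": labels,
    # 1 for real tokens; any trailing padding would get 0
    "attention_mask": torch.ones_like(input_ids),
}
```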

My question is about the attention_mask part. We do not set any attention_mask on the input, except for setting the mask of the trailing padding to 0. So doesn't the trainer already see the second sentence ("it's Washington DC") when predicting that sentence itself?

But then I realized that a causal LM should have a leftward attention mask built into the transformer implementation (original "Attention Is All You Need" paper: "We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.").

So I just want to double-check: for AutoModelForCausalLM, the HF transformers framework does automatically mask out the input to the right of each token, right? It's a bit suspicious, because in our training code, when I masked out the attention corresponding to the second sentence explicitly, the results were noticeably different.

Models like Llama-3 indeed use a causal mask, which means that they can only look at previous tokens when predicting the next one. This is done here in modeling_llama.py.
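You can sanity-check this yourself without reading the modeling code: with a causal mask, changing a later token must not change the logits at earlier positions. A quick sketch (gpt2 is just a small causal checkpoint used for illustration; any AutoModelForCausalLM behaves the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small causal LM, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

a = tokenizer("the capital of the US is Washington", return_tensors="pt")
b = tokenizer("the capital of the US is Paris", return_tensors="pt")

with torch.no_grad():
    logits_a = model(**a).logits
    logits_b = model(**b).logits

# Compare logits over the shared prefix (everything before the differing word):
# they are identical, because no position can attend to tokens on its right.
prefix_len = min(a["input_ids"].shape[1], b["input_ids"].shape[1]) - 1
print(torch.allclose(logits_a[:, :prefix_len], logits_b[:, :prefix_len], atol=1e-5))  # True
```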

The attention_mask is there to additionally exclude certain tokens (typically padding tokens) from the attention computation. Let's look at an example.

If we have the sentence "hello my name is Niels <pad> <pad> <pad> <pad>", then the attention_mask would typically look like [1, 1, 1, 1, 1, 0, 0, 0, 0] (one entry per token, assuming each word is turned into a single token for simplicity).

The attention matrix internally will then look like this:

|       | hello | my | name | is | Niels | <pad> | <pad> | <pad> | <pad> |
|-------|-------|----|------|----|-------|-------|-------|-------|-------|
| hello | 1     | 0  | 0    | 0  | 0     | 0     | 0     | 0     | 0     |
| my    | 1     | 1  | 0    | 0  | 0     | 0     | 0     | 0     | 0     |
| name  | 1     | 1  | 1    | 0  | 0     | 0     | 0     | 0     | 0     |
| is    | 1     | 1  | 1    | 1  | 0     | 0     | 0     | 0     | 0     |
| Niels | 1     | 1  | 1    | 1  | 1     | 0     | 0     | 0     | 0     |
| <pad> | 0     | 0  | 0    | 0  | 0     | 0     | 0     | 0     | 0     |
| <pad> | 0     | 0  | 0    | 0  | 0     | 0     | 0     | 0     | 0     |
| <pad> | 0     | 0  | 0    | 0  | 0     | 0     | 0     | 0     | 0     |
| <pad> | 0     | 0  | 0    | 0  | 0     | 0     | 0     | 0     | 0     |

So, as can be seen, each token can only look at itself and the previous tokens (e.g. "hello" can only see "hello", then "my" can only see "hello" and "my", etc.), and padding tokens are completely excluded from the attention computation.
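For reference, here is a small PyTorch sketch of how the causal part and the padding part combine into the matrix above (this mirrors the idea, not the exact code in modeling_llama.py):

```python
import torch

# 1 = real token, 0 = padding (same example as above: 5 real tokens, 4 pads)
attention_mask = torch.tensor([1, 1, 1, 1, 1, 0, 0, 0, 0])
n = attention_mask.shape[0]

# Causal part: position i may attend to positions <= i
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))

# Padding part: nobody attends to padding keys; padding queries are also
# zeroed out here just to match the table (their outputs are unused anyway)
keys_ok = attention_mask.bool().unsqueeze(0)     # shape (1, n)
queries_ok = attention_mask.bool().unsqueeze(1)  # shape (n, 1)

combined = causal & keys_ok & queries_ok
print(combined.int())
# Inside the model, this boolean mask is turned into additive -inf biases
# applied before the softmax in each attention layer.
```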

Hey nielsr, thanks for your explanation. I have another question: why do some GitHub repos use -100 for padding instead of pad_token_id?

-100 is only used for the labels; this is because -100 is the default ignore_index of PyTorch's CrossEntropyLoss: CrossEntropyLoss — PyTorch 2.3 documentation.

It means that each label with a value of -100 will be ignored.
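For example, a quick check with made-up logits, just to illustrate ignore_index:

```python
import torch
import torch.nn as nn

loss_fct = nn.CrossEntropyLoss()  # ignore_index defaults to -100

logits = torch.randn(4, 10)              # 4 positions, vocab size 10
labels = torch.tensor([3, -100, 7, -100])

# Only positions 0 and 2 contribute; the -100 positions are skipped entirely.
loss = loss_fct(logits, labels)
loss_manual = loss_fct(logits[[0, 2]], labels[[0, 2]])
print(torch.allclose(loss, loss_manual))  # True
```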

thanks a lot.