Batch with Language Model


A general question about language models, not about a particular model.

When we work with a batch size > 1, a language model processes multiple sentences at the same time. If I understood correctly, sentences in the same batch must have the same length, which means that for a given batch, padding (special tokens) is appended to each sentence until it reaches the length of the longest sentence in the batch.
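The padding step described above can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical `PAD_ID` of 0 for the padding token; real tokenizers expose their own pad id.

```python
# Minimal sketch of batching with padding (PAD_ID = 0 is an assumption).
PAD_ID = 0

def pad_batch(batch):
    """Pad every token-id sequence to the longest length in the batch.

    Returns (padded, pad_mask) where pad_mask[i][j] is True when
    padded[i][j] is a padding position (to be ignored by attention).
    """
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
    pad_mask = [[False] * len(seq) + [True] * (max_len - len(seq))
                for seq in batch]
    return padded, pad_mask

padded, mask = pad_batch([[5, 8, 2], [7, 3], [9]])
# padded -> [[5, 8, 2], [7, 3, 0], [9, 0, 0]]
# mask   -> [[False, False, False],
#            [False, False, True],
#            [False, True,  True]]
```

The boolean mask is what lets the model know which positions are real tokens and which are filler.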

However, with this kind of architecture, it would mean that special tokens interact with the “real” tokens (for example through the attention mechanism) and modify the information carried by the embeddings of those tokens. Moreover, these special tokens also have an embedding, which is updated during backpropagation like any other. Why are the real tokens modified by padding? How does this not have an impact on the overall representation of the sentence?

Thank you for your help

Update: Padding tokens are masked out with `src_key_padding_mask`, so they do not participate in attention.
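To make the update concrete, here is a short sketch of how such a mask is passed to PyTorch's `nn.TransformerEncoder`. Positions where `src_key_padding_mask` is `True` are excluded as attention keys, so padding tokens cannot alter the outputs at real-token positions. The sizes below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

d_model = 16
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)

# (batch=2, seq_len=5, d_model) dummy embeddings.
src = torch.randn(2, 5, d_model)

# True marks padding positions: sentence 0 has 2 pad tokens, sentence 1 none.
pad_mask = torch.tensor([[False, False, False, True, True],
                         [False, False, False, False, False]])

out = encoder(src, src_key_padding_mask=pad_mask)
print(out.shape)  # torch.Size([2, 5, 16])
```

Note that outputs are still produced at the padded positions; they are simply ignored downstream (e.g. in the loss or in pooling).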

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.