Batch with Language Model

Hello,

A general question about language models, not about any particular model.

When we work with a batch size > 1, a language model processes multiple sentences at the same time. If I have understood correctly, the sentences within a batch must have the same length, which means that padding (special tokens) is appended to each sentence until it reaches the length of the longest sentence in the batch.
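To make this concrete, here is a minimal sketch of what I mean by padding, assuming PyTorch, a pad token id of 0, and made-up token ids:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token id sequences of different lengths (0 is assumed to be the pad id)
sentences = [
    torch.tensor([12, 7, 431, 9]),      # 4 tokens
    torch.tensor([55, 2]),              # 2 tokens
    torch.tensor([8, 301, 44, 19, 6]),  # 5 tokens
]

# Every sequence is right-padded with the pad id up to the longest one in the batch (5 here)
batch = pad_sequence(sentences, batch_first=True, padding_value=0)
print(batch)
# tensor([[ 12,   7, 431,   9,   0],
#         [ 55,   2,   0,   0,   0],
#         [  8, 301,  44,  19,   6]])
```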

However, with this kind of architecture, that would mean the padding tokens interact with the "real" tokens (through the attention mechanism, for example) and alter the information carried by those tokens' embeddings. Moreover, the padding tokens have embeddings of their own, which are updated during training. Why are these tokens modified? How do they not affect the overall representation of the sentence?
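Here is a rough illustration of the concern, using plain scaled dot-product attention with random embeddings and no mask (sizes chosen arbitrarily): the real tokens end up putting non-zero attention weight on the padding positions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
seq_len, real_len = 5, 2            # 2 real tokens padded up to length 5
x = torch.randn(seq_len, d)         # token embeddings; rows 2..4 stand in for pad embeddings

# Plain scaled dot-product attention, no mask
scores = x @ x.T / d ** 0.5
weights = F.softmax(scores, dim=-1)

# The first real token assigns non-zero attention weight to the pad positions
print(weights[0, real_len:].sum())  # > 0
```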

Thank you for your help

Update: Padding tokens are masked with src_key_padding_mask.
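For reference, a minimal sketch of how this looks with nn.TransformerEncoder, assuming the padded batch from above, a pad id of 0, and arbitrary model sizes. The key padding mask (True at pad positions) makes the attention ignore those columns, and padding_idx keeps the pad embedding itself at zero with a zero gradient:

```python
import torch
import torch.nn as nn

d_model, pad_id = 16, 0
embed = nn.Embedding(10_000, d_model, padding_idx=pad_id)  # pad embedding stays zero, never updated
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Padded batch of token ids, shape (batch, seq_len), 0 = pad
batch = torch.tensor([[12,   7, 431,   9,   0],
                      [55,   2,   0,   0,   0],
                      [ 8, 301,  44,  19,   6]])

# True marks the positions the attention must ignore
padding_mask = batch.eq(pad_id)  # shape (batch, seq_len)

out = encoder(embed(batch), src_key_padding_mask=padding_mask)
print(out.shape)  # torch.Size([3, 5, 16])
```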
