Need for beginning and end of sequence tokens in causal language modeling

Hello everyone, I’m trying to learn more about language modeling with Hugging Face, and how some problems can be framed as a language so that a model can be trained to predict the next token in an arbitrary sequence. In this Hugging Face tutorial, they mention the use of a BOS token. Why is this needed? Does a causal language model need tokens to mark where a sequence begins and ends? What might happen if such tokens are not included in the training dataset? Would that significantly affect the model’s ability to generate sequences that begin and end properly?

Transformer models build embeddings from context. If the model doesn’t know where a sentence or document begins and ends, it has a harder time establishing that context, and the resulting embeddings suffer. If information from an irrelevant context bleeds into the embedding of a word, that can only hurt. As for how significant the effect is, it’s a big “it depends”, but unless things happen to line up just right, the model will always be a little bit worse.
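
As a concrete illustration, here is a minimal sketch (assuming a GPT-2-style tokenizer; the texts and variable names are just placeholders) of how you might wrap each training example with explicit BOS/EOS tokens before tokenizing, so the model sees where sequences start and stop:

```python
# Minimal sketch: wrap each training example with BOS/EOS tokens
# (assumes a GPT-2-style tokenizer; example texts are made up).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # for GPT-2, bos and eos are both "<|endoftext|>"

texts = ["first training document", "second training document"]

# Prepend BOS and append EOS to every example by hand.
wrapped = [f"{tokenizer.bos_token}{t}{tokenizer.eos_token}" for t in texts]

# Tokenize; special tokens were already added above.
encodings = tokenizer(wrapped, add_special_tokens=False)

for ids in encodings["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(ids))
```

With boundaries like these in the training data, documents concatenated into one long stream stay separable, and the model gets a consistent signal for when a sequence should end during generation.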
