Need for beginning and end of sequence tokens in causal language modeling

Hello everyone, I’m trying to learn more about language modeling with Hugging Face, and how some problems can be framed as a language so that a model can be trained to predict the next token in an arbitrary sequence. In this Hugging Face tutorial, they mention the use of a BOS token. Why is this needed? Does a causal language model need tokens to mark where a sequence begins and ends? What might happen if such tokens are not included in the training dataset? Would that significantly affect the model’s ability to generate sequences that begin and end properly?

Transformer models build embeddings from context. If the model doesn’t know where a sentence or document begins and ends, it has a harder time establishing that context, and the resulting embeddings suffer. If information from an irrelevant context bleeds into the embedding of a word, that can only hurt. As for how significant the effect is, it’s a big “it depends”, but unless things happen to line up just right, the model will always be a little bit worse.
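
As a concrete illustration, here is a minimal sketch (assuming a GPT-2-style tokenizer; the texts and variable names are just placeholders) of how you might wrap each training example with explicit BOS/EOS tokens before tokenizing, so the model sees where sequences start and stop:

```python
# Minimal sketch: wrap each training example with BOS/EOS tokens
# (assumes a GPT-2-style tokenizer; example texts are made up).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # for GPT-2, bos and eos are both "<|endoftext|>"

texts = ["first training document", "second training document"]

# Prepend BOS and append EOS to every example by hand.
wrapped = [f"{tokenizer.bos_token}{t}{tokenizer.eos_token}" for t in texts]

# Tokenize; special tokens were already added above.
encodings = tokenizer(wrapped, add_special_tokens=False)

for ids in encodings["input_ids"]:
    print(tokenizer.convert_ids_to_tokens(ids))
```

With boundaries like these in the training data, documents concatenated into one long stream stay separable, and the model gets a consistent signal for when a sequence should end during generation.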
