Output dimension of AutoModelForCausalLM

cyrilvallez · July 18, 2023, 3:48pm

The output dimension of models for causal LM is (batch_size, sequence_length, config.vocab_size). I don’t understand why this is the case. I would expect the outputs to be (batch_size, config.vocab_size), i.e. the logits for next token prediction.
Indeed, the generate method always uses next_token_logits = outputs.logits[:, -1, :], i.e. only the last token's logits, for next token prediction. However, the logits for every token of the sequence are computed, which very quickly blows up the memory for large sequence_length or max_new_tokens.

For example, for the Bloom family, the vocab_size is 250880 (one of the largest vocabulary sizes). This means that even with bloom-560M (which is a small model), inference on a batch size of 64 with a prompt of, say, 500 tokens (or a small prompt with a large max_new_tokens = 500) will take AT LEAST 64 * 500 * 250880 * 2 / 1024**3 = 14.95 GiB for the logits alone (the factor 2 is the number of bytes per value if the model is loaded in float16; for float32, multiply this number by 2 again). And this does not even take into account the memory needed for intermediate results in the forward method.
So we are basically blowing up the memory very quickly for large sequences or large max_new_tokens, despite the fact that we only need the logits for the next token predictions and not the tokens we already have!
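The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the helper name logits_memory_gib is made up for illustration, and it counts the logits tensor alone, not the weights or activations.

```python
def logits_memory_gib(batch_size, sequence_length, vocab_size, bytes_per_value):
    """Size in GiB of a logits tensor of shape
    (batch_size, sequence_length, vocab_size)."""
    n_values = batch_size * sequence_length * vocab_size
    return n_values * bytes_per_value / 1024**3

# Numbers from the post: Bloom vocabulary, batch of 64, 500 tokens.
full_fp16 = logits_memory_gib(64, 500, 250880, 2)   # float16: 2 bytes per value
full_fp32 = logits_memory_gib(64, 500, 250880, 4)   # float32: 4 bytes per value
# Same batch, but keeping only the last position's logits.
last_only_fp16 = logits_memory_gib(64, 1, 250880, 2)

print(f"all positions, float16: {full_fp16:.2f} GiB")
print(f"all positions, float32: {full_fp32:.2f} GiB")
print(f"last position, float16: {last_only_fp16:.4f} GiB")
```

With these inputs, the last-position tensor is 500 times smaller than the full one, since the size scales linearly with sequence_length.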

Is there any way to modify the forward method of models for causal LM to only output logits of dimension (batch_size, config.vocab_size)? Or is this a property of the underlying models that could only be changed by retraining from scratch?

matthewclso · July 2, 2024, 6:52am

A causal LM predicts the next token at every position. This means outputs.logits[:, 0, :] is computed given token 0, outputs.logits[:, 1, :] is computed given tokens 0 and 1, and so on… and finally outputs.logits[:, -1, :] is computed given the entire sequence (which is why it gives the prediction for the next token). We do this by using masked attention: at each time step, the model can only compute outputs given past tokens.
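A minimal sketch of the masking idea, using NumPy and a toy single-head attention with no learned weights (so it is not how any particular model is implemented; all names are illustrative). It checks that the output at a given position does not change when later tokens are replaced.

```python
import numpy as np

def causal_self_attention(x):
    """Toy single-head self-attention with a causal mask.
    x has shape (seq_len, d); queries, keys and values are all x itself."""
    seq_len, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # True above the diagonal: position i must not look at positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))            # 5 "tokens" with 8 features each
x_changed = x.copy()
x_changed[3:] = rng.normal(size=(2, 8))  # replace tokens 3 and 4 only

out = causal_self_attention(x)
out_changed = causal_self_attention(x_changed)

# Positions 0..2 only see tokens 0..2, so they are unaffected.
prefix_unchanged = np.allclose(out[:3], out_changed[:3])
# Positions 3..4 see the replaced tokens, so they differ.
suffix_changed = not np.allclose(out[3:], out_changed[3:])
print(prefix_unchanged, suffix_changed)
```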

As far as I know, there isn’t a way to change this without training a model from scratch. I’m not even sure if it would result in that much compute savings (aside from the size of the output layer). Remember we still need to attend to all previous tokens when we calculate the next likely token, so it’s not like we can throw those calculations away.
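On the output layer specifically, here is a small NumPy sketch with toy stand-ins (a random hidden_states array and a random lm_head_weight matrix, both hypothetical, not taken from any real model or from the transformers API). It assumes the output layer is a plain linear projection applied to each position independently. Under that assumption, projecting only the last position's hidden state gives the same values as slicing the full logits, while all the attention computations before it are still needed.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size, vocab_size = 2, 6, 16, 100

# Toy stand-ins for the final hidden states and the output projection.
hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))
lm_head_weight = rng.normal(size=(vocab_size, hidden_size))

# Project every position, then slice: shape (batch, seq_len, vocab).
full_logits = hidden_states @ lm_head_weight.T
next_token_logits = full_logits[:, -1, :]

# Slice first, then project only the last position: shape (batch, vocab).
last_only_logits = hidden_states[:, -1, :] @ lm_head_weight.T

same = np.allclose(next_token_logits, last_only_logits)
print(full_logits.shape, last_only_logits.shape, same)
```

Whether a given model class exposes an option to do this is not shown here; the sketch only illustrates where the (batch_size, sequence_length, vocab_size) tensor comes from.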
