Why do models (LLaMA in particular) upcast softmax to fp32?
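
For reference, this is roughly the pattern I'm asking about. It's a minimal self-contained sketch, not the actual transformers source, but the real LlamaAttention does the same `softmax(..., dtype=torch.float32)` followed by a cast back:

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v, upcast_softmax=True):
    # q, k, v: [batch, heads, seq, head_dim], e.g. in bf16
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if upcast_softmax:
        # The upcast in question: softmax runs in fp32, which materializes an
        # fp32 copy of the [batch, heads, seq, seq] attention weights before
        # casting back to the input dtype.
        weights = F.softmax(scores, dim=-1, dtype=torch.float32).to(q.dtype)
    else:
        weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```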

Consider the following:

On my laptop's 3080 Ti, this upcast is the difference between getting an OOM at ~1K context on open_llama_3b loaded in bf16 and not getting one, sitting at ~14 GB of VRAM used.
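
Back-of-the-envelope for just the attention-weight tensor (assuming batch 1 and 32 heads; I haven't double-checked open_llama_3b's exact config, so treat the numbers as illustrative):

```python
batch, heads, seq = 1, 32, 1024          # assumed shapes, ~1K context
elems = batch * heads * seq * seq        # attention weights per layer
print(elems * 2 / 2**20, "MiB in bf16")  # ~64 MiB per layer
print(elems * 4 / 2**20, "MiB in fp32")  # ~128 MiB per layer
```

With the upcast, the fp32 softmax output and the bf16 tensor it's cast back to coexist briefly, and if autograd is recording, the fp32 intermediates get saved for backward in every layer, which is presumably where the extra gigabytes come from at longer contexts.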

Is bf16 really that unstable and prone to returning NaNs/infs? I removed the upcast and didn't notice any difference (other than not getting OOM).
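
One quick way to put a number on what the upcast buys — a throwaway sketch comparing bf16 softmax against an fp32 reference on random logits (the scale of 10 is an arbitrary assumption, real attention logits may look different):

```python
import torch

torch.manual_seed(0)
# Random "attention logits" cast to bf16.
scores = (torch.randn(1, 32, 256, 256) * 10).to(torch.bfloat16)

ref = torch.softmax(scores.float(), dim=-1)   # fp32 reference
low = torch.softmax(scores, dim=-1).float()   # pure bf16 softmax
print("max abs diff:", (ref - low).abs().max().item())
print("any NaN/inf:", (~torch.isfinite(low)).any().item())
```

My understanding is that bf16 has the same exponent range as fp32, so overflow to inf isn't really the worry; what you lose is mantissa precision, which is why I'm surprised the upcast seems to matter so much.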

(see this related GH issue)