Gpt-oss training on A100 - OOM error

Dear community,

I am following the OpenAI/HF cookbook to fine-tune gpt-oss:

Fine-tuning with gpt-oss and Hugging Face Transformers

I tried training gpt-oss on an A100 GPU with 80 GB of VRAM, but got the OOM error below.

I have already set my TRL config as shown below to try to reduce memory pressure.

I appreciate your feedback.

from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    learning_rate=2e-4,
    gradient_checkpointing=True,
    num_train_epochs=1,
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=1024,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    output_dir="gpt-oss-20b-multilingual-reasoner",
    push_to_hub=True,
)

trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)

trainer.train()

0%| | 0/63 [00:43<?, ?it/s]
Traceback (most recent call last):
  File "", line 1, in
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/trainer.py", line 2316, in train
    return inner_training_loop(
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/trainer.py", line 2674, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 1190, in training_step
    return super().training_step(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/trainer.py", line 4020, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 1103, in compute_loss
    (loss, outputs) = super().compute_loss(
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/trainer.py", line 4110, in compute_loss
    outputs = model(**inputs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/peft/peft_model.py", line 921, in forward
    return self.get_base_model()(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/utils/generic.py", line 918, in wrapper
    output = func(self, *args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 668, in forward
    outputs: MoeModelOutputWithPast = self.model(
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/utils/generic.py", line 1064, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 507, in forward
    hidden_states = decoder_layer(
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/modeling_layers.py", line 93, in __call__
    return self._gradient_checkpointing_func(partial(super().__call__, **kwargs), *args)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/_compile.py", line 53, in inner
    return disable_fn(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
    return fn(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/autograd/function.py", line 581, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 262, in forward
    outputs = run_function(*args)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 371, in forward
    hidden_states, _ = self.self_attn(
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 328, in forward
    attn_output, attn_weights = attention_interface(
  File "/home/user/.pyenv/versions/3.10.19/lib/python3.10/site-packages/transformers/models/gpt_oss/modeling_gpt_oss.py", line 253, in eager_attention_forward
    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacity of 79.25 GiB of which 401.88 MiB is free. Process 213371 has 78.85 GiB memory in use. Of the allocated memory 77.36 GiB is allocated by PyTorch, and 1011.53 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (CUDA semantics, PyTorch 2.9 documentation)

Perhaps this is a real OOM?

Do you need to train in fp32 precision? If not, try bf16. I suggest this because convert_to_fp32 appears in the stack trace.

Also, make sure the training data shapes are as expected.
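
One way to judge whether this is a real OOM rather than fragmentation is to redo the arithmetic on the figures in the error message. The sketch below is only that arithmetic: the numbers are copied from the traceback above, and the variable names are illustrative.

```python
# Figures copied from the torch.OutOfMemoryError message above.
total_gib = 79.25                    # total capacity of GPU 0
free_mib = 401.88                    # free when the allocation failed
allocated_gib = 77.36                # live tensors allocated by PyTorch
reserved_unallocated_mib = 1011.53   # cached by PyTorch but not in use
requested_mib = 512.00               # the allocation that failed

# Upper bound on what a perfectly defragmented allocator could still hand out.
best_case_headroom_mib = free_mib + reserved_unallocated_mib

# Share of the card already held by live tensors.
allocated_fraction = allocated_gib / total_gib

print(f"best-case headroom: {best_case_headroom_mib:.2f} MiB")
print(f"live tensors: {allocated_fraction:.1%} of the card")
print(f"requests of {requested_mib:.0f} MiB that would fit: "
      f"{int(best_case_headroom_mib // requested_mib)}")
```

Live tensors already hold nearly all of the card, and at most about 1.4 GiB could be recovered even without fragmentation. On these numbers the expandable_segments flag might delay the failure, but it seems unlikely to remove it.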

Dear DhineshR, thanks for your reply.

The convert_to_fp32 you mentioned appears in the error message, but I did not set any such parameter.

The torch dtype in the model config is the one recommended in the OpenAI cookbook:

import torch
from transformers import AutoModelForCausalLM, Mxfp4Config

quantization_config = Mxfp4Config(dequantize=True)

model_kwargs = dict(
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    use_cache=False,
    device_map="auto",
)

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", **model_kwargs)

I appreciate your feedback.

Although it is not set explicitly, the stack trace shows that the tensors are converted to fp32 at some point.

You could start with a batch size of 1 and increase it by 1 at a time to find the breaking point.

You could also calculate the size of the activations for your batch_size, max_length, and training precision, and check whether enough GPU memory is left for them after the memory required for fine-tuning has been allocated.
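
As a minimal sketch of such a calculation, the snippet below sizes the attention score tensor that eager attention materializes in one layer, which is the tensor the traceback fails on. The batch size and sequence length come from the SFTConfig in this thread; the head count of 64 is an assumed value for gpt-oss-20b, and the function name is illustrative.

```python
def attention_scores_bytes(batch_size: int, num_heads: int, seq_len: int,
                           bytes_per_element: int) -> int:
    """Size of one layer's eager attention score tensor.

    Eager attention builds a (batch, heads, seq_len, seq_len) tensor from
    matmul(query, key^T), so memory grows quadratically with seq_len.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


MIB = 1024 ** 2

# batch_size=4 and seq_len=1024 are from the posted config;
# num_heads=64 is an assumption; 2 bytes per element corresponds to bf16.
size = attention_scores_bytes(batch_size=4, num_heads=64, seq_len=1024,
                              bytes_per_element=2)
print(size / MIB)  # 512.0
```

Under these assumptions the result equals the 512.00 MiB allocation reported in the error message. The same tensor shrinks to 128 MiB at a batch size of 1, and to a quarter of its size each time the sequence length is halved.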

Dear DhineshR, thanks for your feedback.

I tried downsizing to:

per_device_train_batch_size=1, gradient_accumulation_steps=1,

But this still results in an OOM error.

I also tried commenting out device_map:

model_kwargs = dict(
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    use_cache=False,
    # device_map="auto",
)

because I noticed a warning about the model being on multiple devices (although I am running on a single A100 GPU) after:

trainer = SFTTrainer(
...     model=peft_model,
...     args=training_args,
...     train_dataset=dataset,
...     processing_class=tokenizer,
... )
The model is already on multiple devices. Skipping the move to device specified in `args`.

But this still results in an OOM error.

Do you have any other thoughts?

I am now considering testing on an H100, but I am not sure this will help, as it also has 80 GB of VRAM.

Kind regards

I’d suggest running nvidia-smi to check how many GPUs you have and to make sure the number is what you expect.

If you want to use only one GPU, try setting the CUDA_VISIBLE_DEVICES=0 environment variable when running the training script.
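
A minimal sketch of doing this from inside Python rather than from the shell is shown below. The variable has to be set before CUDA is initialized, so it is set before torch is imported. The fallback branch is only there so the snippet also runs on a machine without torch or without a GPU.

```python
import os

# Must be set before CUDA is initialized, so do it before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

try:
    import torch
except ImportError:  # torch is not installed in this environment
    torch = None

if torch is not None and torch.cuda.is_available():
    # With the variable above, only one device should be reported.
    print("visible GPUs:", torch.cuda.device_count())
    print("device 0:", torch.cuda.get_device_name(0))
else:
    print("CUDA not available; CUDA_VISIBLE_DEVICES =",
          os.environ["CUDA_VISIBLE_DEVICES"])
```

From a shell, the equivalent is to prefix the command, for example CUDA_VISIBLE_DEVICES=0 python train.py, where train.py is a hypothetical script name.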

However, I believe 80 GB is not enough to fine-tune gpt-oss 20B at bf16 precision. Many guides available online can help you estimate the memory required for fine-tuning. I’d suggest performing those calculations for your use case.
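
A rough sketch of such an estimate follows, covering everything except activations. It rests on stated assumptions rather than measurements: 20e9 base parameters (the nominal size in the model name), frozen base weights held in bf16, and LoRA parameters that each cost 16 bytes (fp32 weight, fp32 gradient, and two fp32 AdamW moments). The LoRA parameter count in the example is hypothetical.

```python
def static_training_memory_gib(n_base_params: float, n_lora_params: float,
                               weight_bytes: int = 2) -> dict:
    """Rough memory for weights and optimizer state, in GiB (no activations).

    Assumes frozen base weights stored in `weight_bytes` each, and trainable
    LoRA parameters costing 16 bytes each (fp32 weight, fp32 gradient,
    two fp32 AdamW moments).
    """
    gib = 1024 ** 3
    base = n_base_params * weight_bytes / gib
    lora = n_lora_params * 16 / gib
    return {"base_weights": base, "lora_and_optimizer": lora, "total": base + lora}


# 20e9 is the nominal parameter count; 15e6 LoRA parameters is a made-up figure.
estimate = static_training_memory_gib(n_base_params=20e9, n_lora_params=15e6)
for name, value in estimate.items():
    print(f"{name}: {value:.2f} GiB")
```

Under these assumptions the bf16 weights alone come to about 37 GiB, and the LoRA state is small by comparison. Whatever remains on an 80 GB card has to hold activations, temporary buffers, and the CUDA context.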

Dear DhineshR, thanks for your feedback.
The OpenAI cookbook documentation states:

Note: This notebook is designed to be run on a single H100 GPU with 80GB of memory. If you have access to a smaller GPU, you can reduce the batch size and sequence length in the hyperparameters below.

I ran additional tests on other hardware: an Nvidia H100 and Nvidia 4xA10G large (which comes with 92 GB VRAM). In both cases I still get an OOM error.

Any other ideas? Thanks.

When considering the difference compared to the H100 GPU environment shown in OpenAI’s Notebook, the overhead of executing torch.compile might be the culprit. Next in line are differences in PyTorch, CUDA, or Transformers versions.

Either way, I think VRAM is quite tight and failures are likely to occur from even minor differences, so personally, I recommend using QLoRA.


Main causes of OOM on a single H100

  1. You’re really doing BF16 LoRA on a 20B MoE, not FP4 training

    • The cookbook loads with Mxfp4Config(dequantize=True) and torch_dtype=torch.bfloat16, which dequantizes MXFP4 weights to BF16 on GPU. (OpenAI Cookbook)
    • Unsloth’s analysis: any non-Unsloth training stack must upcast to BF16, and then “all other training methods will require a minimum of 65GB VRAM to train the 20B model.” (Unsloth)
  2. Attention uses full L×L “eager” kernels

    • The fine-tuning notebook sets attn_implementation="eager", which in current Transformers uses full sequence × sequence attention instead of the more memory-efficient sliding-window behavior GPT-OSS was designed for. (OpenAI Cookbook)
    • At typical settings (e.g. batch 4, seq 1024–2048), this creates attention tensors on the order of hundreds of MiB per layer, which is exactly what fails in your OOM trace.
  3. 80GB H100 is “just enough”, so any extra overhead breaks it

    • Officially, the notebook is only guaranteed for one H100-80GB:

      “This notebook is designed to be run on a single H100 GPU with 80GB of memory. If you have access to a smaller GPU, you can reduce the batch size and sequence length…” (OpenAI Cookbook)

    • In reality, things like:

      • torch.compile/Dynamo buffers,
      • mixed-precision upcasts to FP32,
      • logging overhead,
      • and allocator fragmentation
        can easily push peak usage from “~75–78GB” to “>80GB”, causing OOM even on H100.
  4. This is a known pattern, not just your setup

    • Independent Japanese notes on GPT-OSS fine-tuning say for gpt-oss-20B LoRA in BF16 you realistically need ≥60GB VRAM (e.g. A100×2, H100), and that this is why they strongly recommend QLoRA. (Zenn)
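
To illustrate point 2 above, the sketch below counts how many score entries per head a dense L×L matrix holds compared with a causal sliding window. It is an idealized count, not a measurement of any particular kernel. The window of 128 tokens is an assumed value, and the function names are illustrative.

```python
def score_entries_full(seq_len: int) -> int:
    """Entries per head in a dense seq_len x seq_len score matrix."""
    return seq_len * seq_len


def score_entries_windowed(seq_len: int, window: int) -> int:
    """Entries per head if each query only scores its last `window` keys."""
    return sum(min(pos + 1, window) for pos in range(seq_len))


seq_len = 1024   # max_length from the posted config
window = 128     # assumed sliding-window size

full = score_entries_full(seq_len)
windowed = score_entries_windowed(seq_len, window)
print(full, windowed, round(full / windowed, 1))
```

The gap widens as the sequence grows, because the dense count is quadratic in the sequence length while the windowed count is close to linear.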

Practical solutions / workarounds on a single H100

A. Best fix: switch to an optimized QLoRA stack (Unsloth)

  • Unsloth’s GPT-OSS guide: gpt-oss-20B fine-tuning fits in ~14GB VRAM; they recommend ≥16GB for stable runs. (Unsloth)
  • Their key point: Unsloth keeps GPT-OSS in a quantized / optimized form and uses Flex Attention, cutting VRAM by ≈70–80% compared to BF16 LoRA.

If you can change stack, this is by far the cleanest fix: your H100 then has massive headroom instead of being at the cliff edge.

B. If you must stay on HF/TRL BF16 LoRA notebook

  1. Shrink “tokens per step” hard

    • Set e.g.:

      • per_device_train_batch_size = 1
      • max_length = 512 (start small)
      • gradient_accumulation_steps = 8–16 to recover effective batch size.
    • Then gradually increase max_length and/or gradient_accumulation_steps while watching nvidia-smi.

  2. Remove avoidable overhead

    • Do not use torch.compile on this model until a plain run is stable.

    • Turn off heavy logging (report_to="none").

    • Set allocator flags to reduce fragmentation, e.g.:

      export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,max_split_size_mb:512"
      
  3. Consider better attention kernels on H100

    • Where supported, replacing attn_implementation="eager" with a Hopper-optimized backend (e.g. kernels-community/vllm-flash-attn3) reduces how much L×L attention is materialized in HBM. (OpenAI Cookbook)

Even with all of this, BF16 LoRA on gpt-oss-20B will always be near the 80GB limit; QLoRA / Unsloth is the only way to make it “comfortable”.
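
The settings from B.1 and B.2 can be gathered in one place, as in the sketch below. It uses only the standard library so it can be run anywhere; the dictionary is meant to be passed on to SFTConfig, and the helper names are illustrative. The accumulation value of 16 is chosen so the effective batch size matches the original 4 × 4 configuration.

```python
import os

# Allocator flag from B.2; set it before torch is imported so it takes effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Keyword arguments intended for trl.SFTConfig, following B.1 and B.2.
reduced_sft_kwargs = dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    max_length=512,
    gradient_checkpointing=True,
    report_to="none",
)


def effective_batch_size(kwargs: dict, num_gpus: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"] * num_gpus)


def tokens_per_micro_batch(kwargs: dict) -> int:
    """Upper bound on tokens in one forward/backward pass on one GPU."""
    return kwargs["per_device_train_batch_size"] * kwargs["max_length"]


print(effective_batch_size(reduced_sft_kwargs))    # 16, same as the original 4 x 4
print(tokens_per_micro_batch(reduced_sft_kwargs))  # 512, down from 4 x 1024 = 4096

# Usage with TRL (not executed here):
# training_args = SFTConfig(**reduced_sft_kwargs,
#                           output_dir="gpt-oss-20b-multilingual-reasoner")
```

Peak activation memory follows the tokens in one micro-batch, not the effective batch size, which is why accumulation can restore the batch size without raising memory use.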


Key links

  • OpenAI Cookbook – Fine-tuning gpt-oss with Transformers (original notebook, H100-80GB target) (OpenAI Cookbook)
  • Unsloth guide – “gpt-oss: How to run and fine-tune” (explains 65GB vs 14GB VRAM for 20B) (Unsloth Docs)
  • Zenn hardware outline for GPT-OSS fine-tuning (20B LoRA BF16 ≈ 60GB+ VRAM, QLoRA ≈ 14GB+) (Zenn)

Dear John6666,

thanks for the heads-up and for the analysis you provided.

I was reading the Unsloth documentation yesterday, so this is probably where I am going to look next.

I’ll update the thread once I am able to run new tests.

Kind regards
