Fine-tune OPT 13B: CUDA out of memory error (720gb vram, batch size 1, fp16)!

Dear HF,

I have been trying to fine-tune the facebook/opt-13b model using the script in transformers/examples/pytorch/language-modeling. I am using 8 x 80 GB A100s on Paperspace.

The script works well for finetuning the smaller models.

I keep running into this error: RuntimeError: CUDA out of memory. Tried to allocate ... MiB (GPU 0; 78.0 GiB total capacity; ... GiB already allocated; ... MiB free; ... cached)

This happens before any training has begun.

I have tried a batch size of 1, enabling fp16, a high number of gradient accumulation steps, and a very small block size, yet training still refuses to start.

I believe I should have enough VRAM to fine-tune these models. Is there anything else that I should look into? Would integrating DeepSpeed into the run_clm script help?

Thank you!!!

You won’t be able to fine-tune such a large model without sharding the optimizer state and gradients across GPUs. You should look into the DeepSpeed integration and use at least ZeRO-2.
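
For a rough sense of why sharding is needed here, the sketch below applies the per-parameter accounting from the ZeRO paper for mixed-precision Adam (2 bytes for fp16 weights, 2 bytes for fp16 gradients, 12 bytes for fp32 optimizer state). It ignores activations and temporary buffers, so the numbers are lower bounds, not what the script actually allocates. The function name is made up for illustration.

```python
GIB = 2**30

def model_state_gib(n_params, n_gpus, zero_stage=0):
    """Approximate per-GPU memory for model states only (no activations), in GiB."""
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if zero_stage >= 1:
        optim /= n_gpus   # ZeRO-1 shards the optimizer state
    if zero_stage >= 2:
        grads /= n_gpus   # ZeRO-2 also shards the gradients
    return n_params * (weights + grads + optim) / GIB

n_params = 13e9  # OPT-13B, rounded
for stage in (0, 2):
    print(f"ZeRO stage {stage}: {model_state_gib(n_params, 8, stage):.1f} GiB per GPU")
```

Under these assumptions, plain data parallelism needs roughly 194 GiB per GPU for model states alone, well over an 80 GB card, while ZeRO-2 across 8 GPUs brings that down to roughly 45 GiB.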

Awesome! Thanks so much for replying Sylvain!!

Would you be able to have a look at this setup to see if there is anything you would improve? Training is very expensive, and I want to fix any obvious errors before starting!

DeepSpeed Config JSON

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
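
Since a failed launch on 8 GPUs is costly, it may be worth parsing the config locally before starting. The sketch below is one possible check, not part of run_clm or DeepSpeed: it loads the JSON (which fails loudly on a missing brace) and reports the ZeRO stage, the offload target, and the effective batch size, using the relation micro batch size x gradient accumulation steps x number of GPUs. The function name and the abbreviated sample config are hypothetical.

```python
import json

def summarize_ds_config(text, per_device_batch=1, grad_accum=1, n_gpus=8):
    """Parse a DeepSpeed config string and return the settings that matter for memory."""
    cfg = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    zero = cfg.get("zero_optimization", {})
    return {
        "zero_stage": zero.get("stage", 0),
        "optimizer_offload": zero.get("offload_optimizer", {}).get("device", "none"),
        "fp16": cfg.get("fp16", {}).get("enabled", False),
        # The "auto" batch values are filled in from the Trainer arguments at launch.
        "effective_batch_size": per_device_batch * grad_accum * n_gpus,
    }

# Abbreviated stand-in for the config file, for illustration only.
sample = """
{
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": true}
    },
    "train_batch_size": "auto"
}
"""
summary = summarize_ds_config(sample)
print(summary)
```

In practice you would read the text from the real file, for example with `open("/notebooks/ds_config.json").read()`.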

Run CLM Script

deepspeed run_clm.py \
    --deepspeed /notebooks/ds_config.json \
    --fp16 \
    --model_name_or_path facebook/opt-13b \
    --use_fast_tokenizer False \
    --train_file /notebooks/paddedForOPT3.csv \
    --per_device_train_batch_size 1 \
    --do_train \
    --per_device_eval_batch_size 1 \
    --do_eval \
    --block_size 2048 \
    --overwrite_cache true \
    --output_dir /notebooks/finetune/test-clm

Looks good at first glance!

Yay! Will give it a try and report back!!

Worked like a charm! Thank you @sgugger you rock!!!

Hi @anujn, may I ask how much CPU RAM you used? According to DeepSpeed's memory estimator, this setup needs 581.15GB per CPU.

from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_cold
estimate_zero2_model_states_mem_needs_all_cold(total_params=13e9, num_gpus_per_node=8, num_nodes=1)

Here is the result. It seems a little crazy if I want to train a bigger OPT model.

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 13000M total params.
per CPU | per GPU | Options
581.15GB | 24.21GB | offload_optimizer=cpu
581.15GB | 72.64GB | offload_optimizer=none
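
The table can be reproduced by hand. The sketch below uses formulas that match my reading of DeepSpeed's ZeRO-2 estimator; that is an assumption, so check the source of your installed version. They do reproduce the three figures above for 13e9 parameters on 1 node with 8 GPUs. As far as I can tell, "per CPU" refers to host RAM on the node, not to a single core, and the "GB" values are bytes divided by 2**30. The function name here is hypothetical.

```python
GIB = 2**30

def zero2_estimate(total_params, gpus_per_node=1, num_nodes=1,
                   cpu_offload=True, buffer_factor=1.5):
    """Return (host_mem, per_gpu_mem) in GiB for ZeRO-2 model states."""
    total_gpus = gpus_per_node * num_nodes
    if cpu_offload:
        # Only the fp16 weights stay on each GPU; optimizer state lives in host RAM.
        gpu_bytes = 2 * total_params
        cpu_bytes = total_params * max(4 * total_gpus, 16) * buffer_factor
    else:
        # Weights plus gradients on every GPU, optimizer state sharded across GPUs.
        gpu_bytes = 4 * total_params + 16 * total_params / total_gpus
        cpu_bytes = total_params * 4 * gpus_per_node * buffer_factor
    return cpu_bytes / GIB, gpu_bytes / GIB

for offload in (True, False):
    cpu, gpu = zero2_estimate(13e9, gpus_per_node=8, cpu_offload=offload)
    print(f"{cpu:.2f}GB | {gpu:.2f}GB | offload_optimizer={'cpu' if offload else 'none'}")
```

Under these formulas the host RAM figure grows with both the parameter count and the number of GPUs on the node, which is why it looks so large for an 8-GPU machine and gets worse for bigger OPT models.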