Fine-tune OPT 13B: CUDA out of memory error (720gb vram, batch size 1, fp16)!

Dear HF,

I have been trying to fine-tune the facebook/opt-13b model using the script in transformers/examples/pytorch/language-modeling. I am using 8 x 80 GB A100s on Paperspace.

The script works well for fine-tuning the smaller models.

I keep running into a RuntimeError: CUDA out of memory. Tried to allocate … MiB (GPU 0; 78.0 GiB total capacity; … GiB already allocated; … MiB free; … cached)

This happens before any training has begun.

I have tried setting the batch size to 1, enabling fp16, using a high number of gradient accumulation steps, and using a very low block size. Training still refuses to start.

I believe I should have enough VRAM to fine-tune these models. Is there anything else that I should look into? Would integrating DeepSpeed into the run_clm script help?

Thank you!!!

You won’t be able to fine-tune such a large model without sharding the optimizer state and gradients across GPUs. You should look into the DeepSpeed integration and use ZeRO-2 at least.
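To see why 8 x 80 GB is still not enough without sharding: with plain data parallelism every GPU holds its own full copy of the weights, gradients and optimizer state, so the memory does not add up across cards. Here is a back-of-the-envelope sketch, assuming roughly 13e9 parameters and the usual accounting for mixed-precision Adam (2 bytes of fp16 weights, 2 bytes of fp16 gradients and 12 bytes of fp32 optimizer state per parameter). The function name is made up for illustration, and activations, buffers and fragmentation are ignored, so real usage will be higher.

```python
# Rough per-GPU memory for model states only (no activations or buffers).
# Assumption: mixed-precision Adam, 16 bytes per parameter in total.

def per_gpu_model_state_gb(n_params, n_gpus, zero_stage):
    """Estimate model-state memory per GPU in GB (1 GB = 1e9 bytes)."""
    fp16_params = 2 * n_params    # fp16 copy of the weights
    fp16_grads = 2 * n_params     # fp16 gradients
    optim_states = 12 * n_params  # fp32 weights + Adam momentum + Adam variance
    if zero_stage >= 1:           # ZeRO-1 shards the optimizer state
        optim_states /= n_gpus
    if zero_stage >= 2:           # ZeRO-2 also shards the gradients
        fp16_grads /= n_gpus
    if zero_stage >= 3:           # ZeRO-3 also shards the weights
        fp16_params /= n_gpus
    return (fp16_params + fp16_grads + optim_states) / 1e9


N_PARAMS = 13e9  # assumption: "13B" taken as exactly 13e9 parameters
N_GPUS = 8

for stage in (0, 1, 2, 3):
    gb = per_gpu_model_state_gb(N_PARAMS, N_GPUS, stage)
    print(f"ZeRO stage {stage}: ~{gb:.2f} GB per GPU")
```

Under these assumptions stage 0 (no sharding) needs about 208 GB per GPU, which explains the OOM before training even starts, while stage 2 comes to about 48.75 GB per GPU before activations.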


Awesome! Thanks so much for replying Sylvain!!

Would you be able to have a look at this setup to see if there is anything you would improve? Training is very expensive and I want to fix any obvious errors before starting!

DeepSpeed Config JSON

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
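
Since a failed launch on 8 GPUs is expensive, it may be worth parsing the config locally before starting. The following is only a sketch: `check_ds_config` is a hypothetical helper, not part of DeepSpeed or transformers, and the config text is a trimmed copy of the one above, inlined so the snippet runs on its own. In practice you would read /notebooks/ds_config.json instead.

```python
import json

# Trimmed copy of the config above (assumption: the same keys as the real file).
CONFIG_TEXT = """
{
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": true}
    },
    "train_micro_batch_size_per_gpu": "auto"
}
"""


def check_ds_config(text):
    """Parse a DeepSpeed config string and return a list of problems found."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError as err:
        # Catches missing braces, trailing commas and similar syntax slips.
        return [f"invalid JSON: {err}"]
    if not isinstance(cfg, dict):
        return ["top level of the config must be a JSON object"]

    problems = []
    zero = cfg.get("zero_optimization", {})
    if zero.get("stage", 0) < 2:
        problems.append("ZeRO stage is below 2, gradients will not be sharded")
    if "fp16" not in cfg:
        problems.append("no fp16 section found")
    return problems


print(check_ds_config(CONFIG_TEXT))
```

An empty list means the file parses and uses at least ZeRO stage 2. It does not prove the run will fit in memory.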

Run CLM Script

deepspeed run_clm.py \
    --deepspeed /notebooks/ds_config.json \
    --fp16 \
    --model_name_or_path facebook/opt-13b \
    --use_fast_tokenizer False \
    --train_file /notebooks/paddedForOPT3.csv \
    --per_device_train_batch_size 1 \
    --do_train \
    --per_device_eval_batch_size 1 \
    --do_eval \
    --block_size 2048 \
    --overwrite_cache true \
    --output_dir /notebooks/finetune/test-clm

Looks good at first glance!


:smiley: Yay! Will give it a try and report back!!

Worked like a charm! Thank you @sgugger you rock!!!
