Storage full while fine-tuning with 8 GPUs, 1 TB disk, and an S3 bucket

deepspeed --num_gpus=8 run_clm.py \
  --deepspeed ds_config_stage3.json \
  --model_name_or_path EleutherAI/pythia-12b-deduped \
  --dataset_name wikipedia --dataset_config_name 20220301.en \
  --do_train --do_eval --fp16 --overwrite_cache \
  --evaluation_strategy=steps --num_train_epochs 10 --eval_steps 20 \
  --gradient_accumulation_steps 32 --per_device_train_batch_size 1 \
  --use_fast_tokenizer False --learning_rate 0.001 --warmup_steps 10 \
  --save_total_limit 1 --save_steps 20 --save_strategy steps \
  --tokenizer_name gpt2 --load_best_model_at_end=True --block_size=2048 \
  --cache_dir s3://prd-stk-tsr-clearml/ \
  --output_dir s3://prd-stk-tsr-clearml/finetune12B/
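For context, here is my back-of-envelope on checkpoint size (my own assumptions: fp16 weights, and Adam state stored as fp32 master weights plus two fp32 moments, which is what ZeRO checkpoints typically contain):

# Rough checkpoint-size estimate for a 12B-parameter model.
# Assumptions (mine, not measured): fp16 model weights, plus Adam state
# saved as fp32 master weights + fp32 momentum + fp32 variance.
params = 12e9

fp16_weights_gb = params * 2 / 1e9         # ~24 GB model weights
optimizer_state_gb = params * 3 * 4 / 1e9  # ~144 GB fp32 master + 2 moments

per_checkpoint_gb = fp16_weights_gb + optimizer_state_gb
print(f"one full checkpoint: ~{per_checkpoint_gb:.0f} GB")   # ~168 GB

# --load_best_model_at_end=True makes the Trainer retain the best
# checkpoint in addition to the save_total_limit ones, so two of these
# (~336 GB) can sit on disk at once, before the new one finishes writing.
print(f"two retained checkpoints: ~{2 * per_checkpoint_gb:.0f} GB")

So with --save_steps 20, checkpoints alone eat a large fraction of the 1 TB, before counting the Wikipedia dataset cache.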

I am using the run_clm.py script and want to use my S3 bucket for fine-tuning, since my local disk is only 1 TB and training fails with a storage-full error. Even after pointing the cache directory and the output directory at the bucket, it still crashes due to running out of storage.
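As far as I can tell, run_clm.py treats --cache_dir and --output_dir as plain local paths and does not stream to s3:// URIs, so the workaround I am considering is to save to a small local directory, push each checkpoint to S3 myself, and delete it locally. A minimal sketch (the callback approach, bucket name, and prefix are my assumptions; on_save and add_callback are the standard transformers TrainerCallback API):

import os
import shutil
import boto3
from transformers import TrainerCallback

class S3CheckpointCallback(TrainerCallback):
    """After each save, upload the checkpoint to S3, then delete it locally."""

    def __init__(self, bucket, prefix):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix

    def on_save(self, args, state, control, **kwargs):
        # Trainer writes checkpoints as <output_dir>/checkpoint-<global_step>.
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        if not os.path.isdir(ckpt_dir):
            return
        for root, _, files in os.walk(ckpt_dir):
            for name in files:
                local_path = os.path.join(root, name)
                key = f"{self.prefix}/{os.path.relpath(local_path, args.output_dir)}"
                self.s3.upload_file(local_path, self.bucket, key)
        # Free local disk immediately. Note: deleting checkpoints locally
        # conflicts with --load_best_model_at_end, which reloads from disk.
        shutil.rmtree(ckpt_dir)

The idea would be to run with a local --output_dir and register the callback in run_clm.py after the Trainer is built, e.g. trainer.add_callback(S3CheckpointCallback("prd-stk-tsr-clearml", "finetune12B")). Is something like this the intended way to do it, or is there built-in S3 support I am missing?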

OSError: [Errno 28] No space left on device
 83%|████████████████████████████████████▎ | 5063/6136 [4:59:39<1:03:30, 3.55s/ba]

Setup: 8 GPUs and 1 TB of local disk.
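For what it's worth, this is how I am checking what actually fills the disk while training runs (a small sketch; the cache paths are the default Hugging Face locations, and ./output is a placeholder for a local output directory):

import shutil
from pathlib import Path

def dir_size_gb(path):
    """Total size of all files under path, in GB."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e9

total, used, free = shutil.disk_usage("/")
print(f"disk: {used / 1e9:.0f} GB used, {free / 1e9:.0f} GB free of {total / 1e9:.0f} GB")

# Default Hugging Face caches plus a hypothetical local output dir.
for p in ["~/.cache/huggingface/datasets", "~/.cache/huggingface/hub", "./output"]:
    p = Path(p).expanduser()
    if p.exists():
        print(f"{p}: {dir_size_gb(p):.1f} GB")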