Finetuning Llama 13B with my own dataset

I want to finetune Llama 13B with my own dataset (about 200k items following the Alpaca dataset format). I use the code and the training script from Alpaca. Training works fine with 1, 2, or 4 GPUs, but it fails when I use 8 GPUs. All GPUs are 80 GB A100s.
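For reference, the Alpaca format is a JSON list of instruction/input/output records. A minimal sketch of how such a data file can be laid out (field values and the file name are invented for illustration):

# A minimal sketch of the Alpaca-style dataset format (values are illustrative).
import json

records = [
    {
        "instruction": "Summarize the following paragraph.",
        "input": "Large language models are trained on ...",
        "output": "The paragraph describes how large language models are trained.",
    },
    # ... about 200k such items in my dataset
]

# The resulting JSON file is what --data_path points to.
with open("my_dataset.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)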
The training script is as follows:

torchrun --nproc_per_node=8  train.py \
        --model_name_or_path llama-13b-hf \
        --data_path my_path \
        --bf16 True \
        --output_dir debug/ \
        --model_max_length 1024 \
        --num_train_epochs 3 \
        --per_device_train_batch_size 1 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 2000 \
        --save_total_limit 1 \
        --learning_rate 1e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --deepspeed "./configs/default_offload_opt_param.json"

The error looks like this:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4253 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4254 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4255 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4256 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4257 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4258 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4259 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 4260) of binary: /miniconda3/envs/alpaca/bin/python
Traceback (most recent call last):
  File "/miniconda3/envs/alpaca/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/envs/alpaca/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/miniconda3/envs/alpaca/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/miniconda3/envs/alpaca/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/miniconda3/envs/alpaca/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/miniconda3/envs/alpaca_lmc/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-25_02:50:29
  host      : 
  rank      : 7 (local_rank: 7)
  exitcode  : -9 (pid: 4260)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 4260
=======================================================

Additionally, when I choose Llama 7B, 8 GPUs work well. It’s so weird!
I am sure my CPU RAM is enough (about 500 GB), so that shouldn't be the reason.

Here is some information about my machine and my virtual environment.

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

pip freeze > requirements.txt

absl-py==1.4.0
accelerate==0.20.3
aiohttp==3.8.4
aiosignal==1.3.1
appdirs==1.4.4
async-timeout==4.0.2
attrs==23.1.0
certifi==2023.5.7
charset-normalizer==3.1.0
click==8.1.3
cmake==3.26.4
deepspeed==0.9.4
docker-pycreds==0.4.0
filelock==3.12.2
fire==0.5.0
frozenlist==1.3.3
fsspec==2023.6.0
gitdb==4.0.10
GitPython==3.1.31
hjson==3.1.0
huggingface-hub==0.15.1
idna==3.4
Jinja2==3.1.2
joblib==1.2.0
lit==16.0.6
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
networkx==3.1
ninja==1.11.1
nltk==3.8.1
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
openai==0.27.8
packaging==23.1
pathtools==0.1.2
protobuf==4.23.3
psutil==5.9.5
py-cpuinfo==9.0.0
pydantic==1.10.9
PyYAML==6.0
regex==2023.6.3
requests==2.31.0
rouge-score==0.1.2
safetensors==0.3.1
sentencepiece==0.1.99
sentry-sdk==1.25.1
setproctitle==1.3.2
six==1.16.0
smmap==5.0.0
sympy==1.12
termcolor==2.3.0
tokenizers==0.13.3
torch==2.0.1
tqdm==4.65.0
transformers==4.30.2
triton==2.0.0
typing_extensions==4.6.3
urllib3==2.0.3
wandb==0.15.4
yarl==1.9.2

The error occurs while constructing the model (sometimes) or while loading the pretrained weights (mostly). I have made some modifications to the model, so I need to construct the model and load the pretrained weights manually.
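For context, the loading path on each rank is roughly the following sketch (the stock LlamaForCausalLM and the single-file checkpoint path stand in for my modified class and real paths). Under torchrun, all 8 ranks execute this at the same time:

# Simplified sketch of what each rank does (placeholders for my modified class and paths).
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Build the model on every rank (my modified Llama class in practice).
config = LlamaConfig.from_pretrained("llama-13b-hf")
model = LlamaForCausalLM(config)

# Load the full checkpoint into host memory on every rank. A single-file checkpoint
# is assumed for brevity; sharded checkpoints behave the same way, shard by shard.
# With 8 processes, the host ends up holding roughly 8 full copies of the weights.
state_dict = torch.load("llama-13b-hf/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # strict=False because of my architecture changes (illustrative)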

It seems like the error is related to the number of GPUs? Hoping for help, thanks!


My fault! It really was CPU RAM running out of memory :hot_face: Using the transformers save_pretrained and from_pretrained APIs solves this problem when running with multiple processes.
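For anyone hitting the same thing: 13B parameters in fp32 are roughly 52 GB, so 8 ranks each constructing a full model and holding its state dict can easily exceed 400 GB of host memory before DeepSpeed's CPU offload takes any, which is what triggers the OOM killer (SIGKILL, exitcode -9). A rough sketch of the workaround, with an illustrative directory name and the stock class standing in for the modified one:

from transformers import LlamaForCausalLM

# Step 1 (run once, as a single process): build the modified model and load the original
# weights as before, then persist it in the standard Hugging Face layout:
#     model.save_pretrained("llama-13b-modified")
#
# Step 2 (inside train.py, executed by every rank): load it back with from_pretrained
# instead of a manual torch.load of the full state dict.
model = LlamaForCausalLM.from_pretrained("llama-13b-modified")

As far as I understand, from_pretrained also cooperates with the Trainer's ZeRO-3 DeepSpeed config, so the weights do not need to be fully replicated in every rank's host memory.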

