I see, the dataset could also be a possible cause…
Well, the best practices for datasets are probably available in this forum or on GitHub if you search for them…
Also, depending on the model, gradient checkpointing may not be available (I think it should be available for Llama 3.2 1B, though…), and there may still be some lingering bugs in multi-GPU environments.
When trying to isolate the issue, it’s usually faster to temporarily switch to a smaller, simpler model or dataset.
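For instance, something along these lines is usually enough to tell whether the problem follows the model/dataset or the training setup. This is only a rough sketch: the tiny model, the IMDB slice, and the exact SFTConfig fields are assumptions to swap out for your own setup (field names vary a bit between trl versions).

```python
# Quick isolation run: tiny model, tiny dataset slice, short training.
# Model name, dataset, and config fields are assumptions; swap in your own.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("imdb", split="train[:64]")  # any small text dataset with a "text" column

args = SFTConfig(
    output_dir="debug-run",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,   # drop this if the model doesn't support it
    max_steps=10,                  # just enough steps to see whether the error reproduces
    logging_steps=1,
    dataset_text_field="text",
    max_seq_length=512,            # renamed to max_length in newer trl versions
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B",  # or any small causal LM you have access to
    args=args,
    train_dataset=train_ds,
)
trainer.train()
```

If this tiny run is clean, the original model or dataset is the more likely culprit; if it still fails, the training setup itself (launcher, device mapping, collator) is worth a closer look.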
Quoted GitHub issue (opened 25 Nov 2023, closed 7 Jan 2024):
Hi,
I'm trying to supervised fine-tune a phi-1.5 model on a custom dataset with the SFTTrainer; my script closely follows [sft_llama2.py](https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama_2/scripts/sft_llama2.py).
I'm training the model on 4x 2080 Ti (11 GB each); its 1.3B parameters should comfortably fit in the combined VRAM of the GPUs, but I see CUDA OOM errors as soon as I start training.
My hyperparameters are as follows:
```python
per_device_train_batch_size: Optional[int] = field(default=1, metadata={"help": "The batch size per GPU."})
per_device_eval_batch_size: Optional[int] = field(default=1, metadata={"help": "The batch size per GPU for evaluation."})
gradient_accumulation_steps: Optional[int] = field(default=8, metadata={"help": "The number of gradient accumulation steps."})
gradient_checkpointing: Optional[bool] = field(default=False, metadata={"help": "Whether to use gradient checkpointing."})
```
Furthermore, this is how I'm instantiating my model:
```python
self.base_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=self.script_args.model_name,
    quantization_config=self.bnb_config,
    device_map="auto",  # {"": Accelerator().local_process_index}
    trust_remote_code=True,
    # torch_dtype=torch.float16,
    # use_flash_attention_2=False
)
```
I cannot use PEFT or Gradient Checkpointing as Phi models are not supported.
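(Side note for anyone hitting the same OOM: the commented-out `device_map` above is the per-process mapping used in sft_llama2.py. Here is a minimal sketch of that variant, assuming the script is launched with `accelerate launch` so each process owns one GPU; the argument names mirror the script and are not a verified fix for this exact setup.)

```python
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Pin one full copy of the (quantized) model to this process's GPU instead of letting
# device_map="auto" shard a single copy across every visible GPU in every process.
base_model = AutoModelForCausalLM.from_pretrained(
    script_args.model_name,            # placeholder for the model name used in the script
    quantization_config=bnb_config,    # the BitsAndBytesConfig already defined in the script
    device_map={"": Accelerator().local_process_index},
    trust_remote_code=True,
)
```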
Hardware:
- CPU: Xeon® E5-2630 v2, limited to 16 GB of RAM as this is what the vast.ai instance has
- GPU: 4x A40 → 180 GB total

OS: Linux
Python: 3.10
CUDA: 12.2

Packages:
```
torch==2.3.1
transformers==4.41.2
peft==0.11.1
datasets==2.20.0
accelerate==0.31.0
evaluate==0.4.1
bitsandbytes==0.43.1
huggingface_hub==0.23.4
trl==0.9.4
```
Issue
Introduction
Hi!
I’m trying to fine-tune Llama 3 8B on a summarization dataset of about 1,500 instances. The dataset contains long documents, often over 8K tokens. I…
Trying to SFT Qwen2.5-VL-3B-Instruct, but I get this same error over and over again. I’ve looked at all the past threads and tried their solutions, but it’s just not working. I don’t think downgrading to a smaller model will do any good, because the error comes during attention, which is quadratic in the sequence length N, not the model size.
Maybe it’s an issue with my collate_fn, but I can’t find anything; I’ve even chopped max_token_length down to 1024 and it’s the same error, so I feel like there’s something els…
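For what it's worth, here is a minimal sketch of a length-capping collate_fn for a text-only sanity check. The processor name and the `"text"` field are assumptions, it needs a recent transformers release, and a real Qwen2.5-VL batch would also need the image inputs (`pixel_values`, etc.):

```python
from transformers import AutoProcessor

# Processor name is an assumption; any processor exposing .tokenizer works the same way here.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

def collate_fn(examples, max_token_length=1024):
    # Text-only sanity check: hard-truncate at max_token_length, pad to the longest
    # sequence in the batch, and mask the padded positions out of the loss.
    texts = [ex["text"] for ex in examples]  # the "text" field is an assumption
    batch = processor.tokenizer(
        texts,
        truncation=True,
        max_length=max_token_length,
        padding=True,
        return_tensors="pt",
    )
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    batch["labels"] = labels
    return batch
```

If a capped, text-only collator like this still OOMs at batch size 1, the sequence length probably isn't the culprit and the memory is going somewhere else (optimizer states, image features, or duplicated model copies across GPUs).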