I’m trying to perform domain adaptation on Llama 2 in AWS SageMaker using the Hugging Face estimator.
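For context, this is roughly how I launch the job; the entry point, instance type, and framework versions below are placeholders, not my exact configuration:

from sagemaker.huggingface import HuggingFace

# Sketch of my launcher; entry_point, instance_type, and versions are approximate
huggingface_estimator = HuggingFace(
    entry_point="train.py",     # script containing the Trainer code below
    source_dir="scripts",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    role=role,                  # SageMaker execution role
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={"epochs": 1},
)
huggingface_estimator.fit({"train": training_input_path})

Inside train.py, the relevant training code is: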
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=fault_tolerance_data_collator,
)
train_result = trainer.train()
I’m getting the following error:
train_result = trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1526, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1796, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2641, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2666, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 1091, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 160, in forward
    return self.model.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 756, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 603, in forward
    attention_mask = self._prepare_decoder_attention_mask(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 531, in _prepare_decoder_attention_mask
    combined_attention_mask = _make_causal_mask(
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 49, in _make_causal_mask
    mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
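As the error message suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 should make the stack trace point at the kernel that actually failed. A minimal sketch of how I would set it, assuming the train.py entry point above; the variable must be set before torch initializes CUDA:

import os

# Force synchronous CUDA kernel launches so the assert surfaces at the real call site.
# This must run before any CUDA work, so it goes at the very top of train.py.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Alternatively, it could be passed to the SageMaker estimator instead, e.g.:
# HuggingFace(..., environment={"CUDA_LAUNCH_BLOCKING": "1"})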
The same code worked perfectly in Google Colab with a smaller model, but on AWS I get this error with both Llama 2 and the smaller model. I also checked that the model's token embedding size and the tokenizer vocabulary size match:
model_vocab_size = model.get_output_embeddings().weight.size(0)
print(model_vocab_size)       # 32000
tokenizer_vocab_size = len(tokenizer)
print(tokenizer_vocab_size)   # 32000
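Since the assert fires inside the forward pass, my understanding is that this kind of failure is often caused by a token ID outside the embedding range (for example from added special tokens or a padding ID). Here is a sketch of a pre-training check I could run, assuming train_dataset is the tokenized dataset passed to the Trainer and each example has an input_ids field:

import torch

vocab_size = model.get_input_embeddings().weight.size(0)

# Look for any token ID outside [0, vocab_size) before launching training.
bad = 0
for i, example in enumerate(train_dataset):
    ids = torch.as_tensor(example["input_ids"])
    if ids.min().item() < 0 or ids.max().item() >= vocab_size:
        print(f"Example {i}: ids span [{ids.min().item()}, {ids.max().item()}], vocab_size={vocab_size}")
        bad += 1
if bad == 0:
    print("All input_ids are within the vocabulary range.")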
Any help in solving this issue would be appreciated. Thanks in advance!