run_clm.py stops after some % with an error

Hey, I'm trying to fine-tune a German model. Fine-tuning worked previously with gpt-medium and a small input.txt (around 100 KB).

Now I'm trying to fine-tune dbmdz/german with a ~3 MB dataset (some fiction books I pasted into the .txt file).

After some %, I get a wall of text and the following error:

```
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: block: [8,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: block: [8,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: block: [8,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: block: [8,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: block: [8,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: block: [8,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: block: [8,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:/w/b/windows/pytorch/aten/src/ATen/native/cuda/Indexing.cu:662: block: [8,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "run_clm.py", line 407, in <module>
    main()
  File "run_clm.py", line 376, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "E:\anaconda\lib\site-packages\transformers\trainer.py", line 940, in train
    tr_loss += self.training_step(model, inputs)
  File "E:\anaconda\lib\site-packages\transformers\trainer.py", line 1302, in training_step
    loss = self.compute_loss(model, inputs)
  File "E:\anaconda\lib\site-packages\transformers\trainer.py", line 1334, in compute_loss
    outputs = model(**inputs)
  File "E:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "E:\anaconda\lib\site-packages\transformers\models\gpt2\modeling_gpt2.py", line 899, in forward
    transformer_outputs = self.transformer(
  File "E:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "E:\anaconda\lib\site-packages\transformers\models\gpt2\modeling_gpt2.py", line 689, in forward
    inputs_embeds = self.wte(input_ids)
  File "E:\anaconda\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "E:\anaconda\lib\site-packages\torch\nn\modules\sparse.py", line 145, in forward
    return F.embedding(
  File "E:\anaconda\lib\site-packages\torch\nn\functional.py", line 1913, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
```

I already tried changing --block_size and setting --per_device_train_batch_size 1, but nothing seems to help.
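For what it's worth, this particular assert (`srcIndex < srcSelectDimSize` firing inside `F.embedding`) usually means some input token id is equal to or larger than the model's embedding table, i.e. the tokenizer produced an id the checkpoint doesn't know. A minimal CPU-side sanity check you can run before training; `find_oov_ids` is a hypothetical helper and the numbers are illustrative, not read from the actual model:

```python
# Sketch: the CUDA assert fires when an input id falls outside the
# embedding table. Checking on CPU gives a readable answer instead
# of an opaque device-side assert.
def find_oov_ids(token_ids, vocab_size):
    """Return (position, id) pairs whose id is outside [0, vocab_size)."""
    return [(i, t) for i, t in enumerate(token_ids) if not 0 <= t < vocab_size]

# Illustrative values: 50256 is GPT-2's <|endoftext|> id; if the target
# model's vocab is smaller, that id is out of range and triggers the assert.
vocab_size = 50000                      # in practice: model.config.vocab_size
ids = [15, 27, 50256, 102]
print(find_oov_ids(ids, vocab_size))    # -> [(2, 50256)]
```

In practice, compare your tokenized batches against `model.config.vocab_size` after loading the checkpoint; the same out-of-range lookup on CPU raises a plain `IndexError` with the offending index.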

OK, it seems to be a problem with the text in the train.txt file. If I swap the text out, everything is fine. What can it be?

OK, found a solution. It seems the script doesn't like it when I separate the texts with <|endoftext|>.
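One way to keep separators without tripping the assert is to swap the hard-coded GPT-2 marker for whatever EOS string the target tokenizer actually defines (`tokenizer.eos_token` on Hugging Face tokenizers). A minimal sketch; `rewrite_separators` and the `</s>` value are illustrative assumptions, not taken from the dbmdz model:

```python
# Sketch: replace a hard-coded GPT-2 separator with the EOS string the
# target tokenizer defines, so the resulting id stays inside the vocab.
def rewrite_separators(text, eos_token, old_sep="<|endoftext|>"):
    """Swap the GPT-2 separator for the target model's EOS string."""
    return text.replace(old_sep, eos_token)

sample = "Erstes Buch.<|endoftext|>Zweites Buch."
print(rewrite_separators(sample, "</s>"))   # -> Erstes Buch.</s>Zweites Buch.
```

Alternatively, register the marker as a special token with `tokenizer.add_special_tokens(...)` and call `model.resize_token_embeddings(len(tokenizer))` so the embedding table grows to match before training.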

@Stefan, I am having the same issue. What did you use to separate the texts?