Trainer.train() is stuck

Hi,
I’m training roberta-base with the HF Trainer, but it gets stuck right at the start. Here’s my code:

train_dataset[0]
{'input_ids': tensor([  0, 100, 657,  ...,   1,   1,   1]),
 'attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0]),
 'labels': tensor(0)}

val_dataset[0]
{'input_ids': tensor([    0, 11094,    14,  ...,     1,     1,     1]),
 'attention_mask': tensor([1, 1, 1,  ..., 0, 0, 0]),
 'labels': tensor(0)}

## simple test
model(train_dataset[:2]['input_ids'], attention_mask=train_dataset[:2]['attention_mask'], labels=train_dataset[:2]['labels'])
SequenceClassifierOutput(loss=tensor(0.6995, grad_fn=<NllLossBackward>), logits=tensor([[ 0.0438, -0.1893],
        [ 0.0530, -0.1786]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

train_args = transformers.TrainingArguments(
             output_dir='test_1',
             overwrite_output_dir=True,
             evaluation_strategy="epoch",
             per_device_train_batch_size=8,
             per_device_eval_batch_size=8,
             learning_rate=3e-5,
             weight_decay=0.01,
             num_train_epochs=2,
             load_best_model_at_end=True,
             )

trainer = transformers.Trainer(
             model=model,
             args=train_args,
             train_dataset=train_dataset,
             eval_dataset=val_dataset,
             tokenizer=tok,
             )

trainer.train()

I looked at GPU memory consumption, and it is stuck at:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   49C    P0    60W / 300W |   1756MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   50C    P0    61W / 300W |   1376MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Please let me know how to proceed further.

Any progress since then?

Check whether it works with num_workers=0 if you are using Windows.
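
For the Trainer itself, the corresponding knob is the dataloader_num_workers argument of TrainingArguments. A minimal sketch (only dataloader_num_workers matters here; the other arguments mirror the original post):

import transformers

train_args = transformers.TrainingArguments(
             output_dir='test_1',
             per_device_train_batch_size=8,
             dataloader_num_workers=0,  # disable multiprocess data loading, which can hang on Windows
             )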

It worked for me; here is a reference.

I am also having the same problem when training with the Trainer.

I have used the Urdu subset of mozilla-foundation/common_voice_7.0. The model I used is “facebook/wav2vec2-xls-r-300m”, and the library versions are:
!pip install datasets==2.4.0
!pip install transformers==4.21.2
!pip install torchaudio==0.11.0
!pip install jiwer
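
For reference, a minimal sketch of loading that dataset and model, assuming the Urdu config of Common Voice is named “ur” and that you have accepted the dataset’s terms on the Hub (the dataset is gated, hence use_auth_token=True):

from datasets import load_dataset
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

common_voice_train = load_dataset("mozilla-foundation/common_voice_7.0", "ur", split="train", use_auth_token=True)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-300m")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xls-r-300m")  # CTC head is freshly initialized; size it to your vocab for real fine-tuning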

If anyone could help me, I would be very thankful.

I ran into the same problem. I am using GCP, and when I create a notebook with the PyTorch image and launch the trainer, there is no progress. I managed to solve the problem by choosing to create an environment without torch and then installing torch with the following line:

pip install torch==1.13.1+cu117  -f https://download.pytorch.org/whl/torch_stable.html
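
If the trainer still shows no progress with 0% GPU utilization, it is worth confirming that the freshly installed torch actually sees CUDA before launching training. A quick check:

import torch

print(torch.__version__)          # e.g. 1.13.1+cu117 after the install above
print(torch.cuda.is_available())  # should print True
print(torch.cuda.device_count())  # number of visible GPUs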

I hope it helps.


I ran into a similar problem while I was using GCP. To resolve it, I edited my instance to have more memory, from 4 GB to 30 GB. I had run out of system memory, and the only reason I realized it was that other (non-torch/GPU-related) tasks I was running at the time threw errors. Trainer.train() returned no error and just hung for hours (at first I thought I had set up my GPU instance wrong and didn’t have the correct driver).
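
If you suspect the same cause, watching system RAM while train() starts up can confirm it. A minimal sketch using psutil (assuming it is installed; running free -h in a separate shell works just as well):

import time
import psutil

for _ in range(10):
    mem = psutil.virtual_memory()
    print(f"RAM used: {mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB")
    time.sleep(5)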
