Trainer.train() runs for a long time and appears to be stuck. How do I know it's making progress and not stuck in a loop?

Hello all, I am trying to learn how to fine-tune a pretrained LLM. The pretrained model is mistralai/Mistral-7B-v0.1, and I used a PEFT config to wrap it in a PeftModel. For practice I used a simple PDF that is only two pages long. The tokenized dataset contains input_ids, attention_mask, and labels, where labels is a copy of input_ids.
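
For context, here is roughly how the tokenized dataset can be built (a simplified sketch; pdf_texts is a hypothetical list of text chunks extracted from the PDF, and max_length=512 is just an illustrative value):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral-7B has no pad token by default

def tokenize(batch):
    # Tokenize the raw text; for causal LM training the labels are a copy of input_ids
    tokens = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# pdf_texts: hypothetical list of strings extracted from the 2-page PDF
raw = Dataset.from_dict({"text": pdf_texts})
tokenized_datasets = raw.train_test_split(test_size=0.2).map(
    tokenize, batched=True, remove_columns=["text"]
)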

Here is the code snippet.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xxxxxxxxx/test-mistral-peft",
    learning_rate=1e-3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    remove_unused_columns=False,
    auto_find_batch_size=True,
)

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)
trainer.train()

The train() method has been running for 3 hours and 25 minutes as of this writing. The Google Colab status bar shows <cell line: 1> > train() > decorator() > _inner_training_loop() > training_step() > backward() > backward() > backward() > _engine_run_backward(), and it has shown this for a long time. When I mouse over these function names, it shows the file and line number in some library code, and those don't change either.

I understand that fine-tuning is usually a long process and may take a couple of hours. The Colab runtime uses a CPU instead of a GPU/TPU.
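
A quick sanity check (assuming a standard PyTorch setup) is to confirm in a separate cell that the runtime really is CPU-only and where the model weights actually live:

import torch

print(torch.cuda.is_available())        # False on a CPU-only Colab runtime
print(next(model.parameters()).device)  # device the model weights are actually on

On CPU, a single forward/backward pass through a 7B-parameter model can easily take minutes, so a multi-hour run with no visible output is plausible rather than a sure sign of a hang.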

Please let me know how I can tell that the code is actually making progress and not stuck in an infinite loop.
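
One option (a sketch, not something I have verified) is to enable per-step logging so the loss prints whenever an optimizer step completes; this is the same TrainingArguments as above, trimmed for brevity, with logging added:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xxxxxxxxx/test-mistral-peft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    logging_strategy="steps",
    logging_steps=1,  # report the training loss after every optimizer step
)

With gradient_accumulation_steps=4, each logged step corresponds to four batches, so on a CPU even a single log line may take a long time to appear.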

Training created a file events.out.tfevents.1725835604.d56bc1eb2b0b.1874.0 (4.83 KB), and it does not seem to grow in size.

System RAM and disk usage are mostly normal and flat.

Please advise. Thanks.


OK, it has now been running for 5 hours and 22 minutes. It appears to be running fine (i.e., not stuck or looping), as I now see different function names in the status bar. Nevertheless, please let me know if any of you have suggestions. Thanks in advance.


Hi Rishi,
What was the conclusion? Were you able to fix it?
