Hello all, I am trying to learn how to fine-tune a pretrained LLM. The pretrained LLM is mistralai/Mistral-7B-v0.1, and I used a PEFT config to create a PEFT model from it. I used a simple PDF of only 2 pages just for practice. The tokenized dataset has input_ids, attention_mask, and labels, where the labels are the same as the input_ids.
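For context, the PEFT setup and tokenization look roughly like this (simplified from memory; the LoRA values, max_length, and the way the dataset was built from the PDF are just illustrative, not exactly what I ran):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"
base_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default

# illustrative LoRA settings, not tuned
peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base_model, peft_config)

def tokenize(batch):
    tokens = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)
    tokens["labels"] = [ids.copy() for ids in tokens["input_ids"]]  # labels same as input_ids
    return tokens

# dataset is a DatasetDict with "train"/"test" splits and a "text" column from the PDF
tokenized_datasets = dataset.map(tokenize, batched=True, remove_columns=["text"])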
Here is the training code snippet.
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="xxxxxxxxx/test-mistral-peft",
    learning_rate=1e-3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    remove_unused_columns=False,
    auto_find_batch_size=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)
trainer.train()
The train() method has been running for 3 hours and 25 minutes as of this writing. The Google Colab status bar has shown <cell line: 1> train() > decorator() > _inner_training_loop() > training_step() > backward() > backward() > backward() > _engine_run_backward() for a long time. When I hover over these function names, it shows the file and line numbers in the library code, and those don't change either.
I understand that fine-tuning is usually a long process and may run for a couple of hours. The Colab runtime is using a CPU rather than a GPU/TPU.
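For what it's worth, I believe this is how one can double-check that (a minimal sketch, assuming the model is a regular PyTorch module):

import torch
print(torch.cuda.is_available())        # False on a CPU-only Colab runtime
print(next(model.parameters()).device)  # prints "cpu" when nothing was moved to a GPU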
Please let me know how I can tell whether the code is actually running or is stuck in an infinite loop.
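Would adding step-level logging to the TrainingArguments above be a reliable way to see that steps are completing? Something like this is what I have in mind (the logging values are just a guess on my part):

# same arguments as above, plus logging options
training_args = TrainingArguments(
    output_dir="xxxxxxxxx/test-mistral-peft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_strategy="steps",
    logging_steps=1,          # print the training loss after every optimizer step
    report_to="tensorboard",  # also write scalars to the events.out.tfevents file
)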
Training created a file events.out.tfevents.1725835604.d56bc1eb2b0b.1874.0 (4.83K), and it does not seem to grow in size.
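Is that event file something I could open in TensorBoard to confirm the run is advancing? I assume something like this in a Colab cell would show whatever scalars have been written so far (the Trainer puts its logs under the output directory):

%load_ext tensorboard
%tensorboard --logdir xxxxxxxxx/test-mistral-peft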
System RAM and disk usage are normal and mostly flat.
Please advise. Thanks.