Hi!
I’m relatively new to training large language models (LLMs), but I have some experience with basic machine learning. My understanding has been that a very low (though not exactly zero) cross-entropy loss is desirable. However, I’ve heard that this expectation differs when fine-tuning LLMs.
Currently, I’m fine-tuning a DeepSeek R1 Distill Llama 8B model on basic worded math problems sourced from Hugging Face datasets. I’m mainly seeking guidance on the following aspects of fine-tuning:
- Optimal Loss Values: What training and evaluation loss values should I aim for during fine-tuning? Are there specific target ranges?
- Indicators of Ideal Training: Beyond loss metrics, what other indicators should I monitor to ensure effective training? Are there signs that suggest overfitting or underfitting in the context of LLMs?
- Best Practices: Any general advice or best practices for fine-tuning LLMs on specialized tasks like worded math problems?
I appreciate any feedback or insights. Thanks!
Config:
import torch
from trl import SFTConfig

sft_config = SFTConfig(
    output_dir="outputs",
    dataset_text_field="text",
    max_seq_length=512,
    num_train_epochs=1,
    per_device_train_batch_size=12,  # originally 2
    per_device_eval_batch_size=12,  # originally 2
    gradient_accumulation_steps=4,
    optim="adamw_8bit",
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=10,
    logging_steps=10,
    disable_tqdm=False,
    learning_rate=1e-4,  # originally 2e-4
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    save_strategy="steps",
    save_total_limit=2,
    lr_scheduler_type="linear",
    report_to="tensorboard",
    save_safetensors=True,
    dataset_kwargs={"add_special_tokens": False, "append_concat_token": False},
    dataloader_num_workers=8,
    dataloader_pin_memory=True,
)
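For reference, here is a rough sketch of how I wire this config into TRL's SFTTrainer. The dataset name is just an example word-problem dataset, and the exact SFTTrainer keyword names can vary between TRL versions (e.g. newer releases use processing_class instead of tokenizer):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example word-problem dataset; my actual data is prepared similarly
dataset = load_dataset("openai/gsm8k", "main")

trainer = SFTTrainer(
    model=model,
    args=sft_config,          # the SFTConfig defined above
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()
```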