Hi!
I’m relatively new to training large language models (LLMs), but I have some experience with basic machine learning. My understanding has been that a very low (though not exactly zero) cross-entropy loss is desirable. However, I’ve heard that this expectation differs when fine-tuning LLMs.
Currently, I’m fine-tuning a DeepSeek R1 Distill Llama 8B model on basic worded math problems sourced from Hugging Face datasets. I’m mainly seeking guidance on the following aspects of fine-tuning:
- Optimal Loss Values: What training and evaluation loss values should I aim for during fine-tuning? Are there specific target ranges?
- Indicators of Ideal Training: Beyond loss metrics, what other indicators should I monitor to ensure effective training? Are there signs that suggest overfitting or underfitting in the context of LLMs?
- Best Practices: Any general advice or best practices for fine-tuning LLMs on specialized tasks like worded math problems?
I appreciate any feedback or insights. Thanks!
Config:
import torch
from trl import SFTConfig

sft_config = SFTConfig(
    output_dir="outputs",
    dataset_text_field="text",
    max_seq_length=512,
    num_train_epochs=1,
    per_device_train_batch_size=12,  # originally 2
    per_device_eval_batch_size=12,  # originally 2
    gradient_accumulation_steps=4,
    optim="adamw_8bit",
    evaluation_strategy="steps",
    eval_steps=10,
    save_steps=10,
    logging_steps=10,
    disable_tqdm=False,
    learning_rate=1e-4,  # originally 2e-4
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    save_strategy="steps",
    save_total_limit=2,
    lr_scheduler_type="linear",
    report_to="tensorboard",
    save_safetensors=True,
    dataset_kwargs={"add_special_tokens": False, "append_concat_token": False},
    dataloader_num_workers=8,
    dataloader_pin_memory=True,
)
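For reference, here is a rough sketch of how I wire this config into TRL's SFTTrainer. The dataset name is just an example word-problem dataset, and the exact SFTTrainer keyword names can vary between TRL versions (e.g. newer releases use processing_class instead of tokenizer):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example word-problem dataset; my actual data is prepared similarly
dataset = load_dataset("openai/gsm8k", "main")

trainer = SFTTrainer(
    model=model,
    args=sft_config,          # the SFTConfig defined above
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()
```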