How to get out of stagnant loss

Hi all,

I have been trying to improve the performance of my model but have no idea what else I can change to do so.

I am doing pretraining using the transformers Trainer. The dataset is over 10 GB, so not that small, and I have done multiple runs of 5000 steps each, but the loss seems to be stuck between 0.6 and 0.5.

Here are the training arguments I am using:


    run_name=f"{model_id}-wikipedia-{datetime.now().strftime('%Y-%m-%d-%H-%M')}",     
    dataset_name: str = field(default="wikimedia-wikipedia-flattened-en_core_web_sm-qwen-tokens")
    num_proc: int = field(default=4)  # Use multiple processes for speedup
    max_seq_length: int = field(default=4096)

    # Core training configurations
    seed: int = field(default=0)
    optim: str = field(default="adafactor") #adafactor, adamw_torch
    num_train_epochs: int = field(default=1)
    per_device_train_batch_size: int = field(default=4)
    # per_device_eval_batch_size: int = field(default=32)
    max_steps: int = field(default=15000)
    save_steps: int = field(default=200)

    # Other training configurations
    learning_rate: float = field(default=2e-8)
    weight_decay: float = field(default=0.1)
    warmup_steps: int = field(default=50)
    lr_scheduler_type: str = field(default="cosine") #cosine, linear
    gradient_checkpointing: bool = field(default=True)
    dataloader_num_workers: int = field(default=1)
    bf16: bool = field(default=True)
    gradient_accumulation_steps: int = field(default=24)

    # Logging configuration
    logging_steps: int = field(default=5)
    report_to: str = field(default="wandb")
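
For reference, a quick back-of-envelope of what these settings imply per optimizer step (a sketch assuming a single GPU and full-length sequences):

    # Effective batch size implied by the arguments above (assuming 1 GPU).
    per_device_train_batch_size = 4
    gradient_accumulation_steps = 24
    max_seq_length = 4096

    sequences_per_update = per_device_train_batch_size * gradient_accumulation_steps  # 96
    tokens_per_update = sequences_per_update * max_seq_length  # 393,216

    print(f"{sequences_per_update} sequences (~{tokens_per_update:,} tokens) per optimizer step")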

The model of my choice is Qwen-2.5-0.5B. I have tried various learning rates and switched between several lr_schedulers, but no luck. Due to memory constraints, I cannot increase the batch size. I would really appreciate an expert opinion here. Many thanks in advance!!!

Br,
Wasif


I had no idea, so I asked Hugging Chat. It seems like tweaking the learning rate and weight decay is a good idea. I often hear that training small models like a 0.5B model is difficult, but if your RAM is insufficient, there is not much you can change there…


Improving the performance of your model when the loss seems to be stuck can be challenging, especially with the constraints you mentioned. Here are several strategies you can consider based on your current setup:

  1. Learning Rate and Scheduler Tuning: You’ve experimented with different learning rates and schedulers, but it’s worth revisiting these. A learning rate that’s too high can cause the model to overshoot the optimal weights, while one that’s too low can lead to slow convergence or getting stuck in local minima. Consider using a learning rate finder tool to help determine an optimal learning rate. Additionally, try different scheduler types and configurations, as each has its own impact on the learning process (see the TrainingArguments sketch at the end of this answer) [2].

  2. Batch Size and Gradient Accumulation: While you mentioned memory constraints, it’s crucial to note that a very small batch size can lead to noisy gradient estimates, which can prevent the model from learning effectively. You’ve already used gradient accumulation to simulate a larger batch size, but you might want to experiment with slightly different accumulation steps or even try mixed precision training (FP16) if your hardware supports it, which can sometimes help with memory usage and performance [3].

  3. Model Architecture and Parameters: Since you’re using Qwen-2.5-0.5B, ensure that the model is the best fit for your task. If possible, try a larger model variant if memory allows, as it might capture more complex patterns in the data. Also, consider fine-tuning the model on a smaller dataset before scaling up to see if it improves performance [4].

  4. Data Quality and Preprocessing: The quality and preprocessing of your data can significantly impact model performance. Ensure that your data is clean, properly tokenized, and representative of the task. You might also want to experiment with different preprocessing techniques or even augment your dataset if possible [5].

  5. Regularization Techniques: High weight decay values can sometimes cause the model to underfit. Try reducing the weight decay and see if that improves performance. Additionally, consider other regularization techniques such as dropout or layer normalization [6].

  6. Evaluation and Monitoring: Regularly evaluate your model on a validation set to ensure it’s learning effectively. Use metrics beyond just loss, such as perplexity or accuracy, to get a better understanding of performance. Monitor the training process closely to identify any irregularities or signs of overfitting or underfitting [7].

  7. Gradient Clipping: If your gradients are exploding, it can lead to issues with training. Implement gradient clipping to prevent this (see the TrainingArguments sketch at the end of this answer). This can help stabilize the training process and might lead to better convergence [8].

  8. Hyperparameter Search: Use automated hyperparameter optimization techniques to find the best combination of parameters for your model. Tools like Optuna, Hyperopt, or Ray Tune can help automate this process (a minimal sketch follows right below) [9].
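
To make item 8 concrete, here is a minimal sketch of plugging Optuna into the Trainer via Trainer.hyperparameter_search. It assumes an already-tokenized train_dataset plus a small held-out eval_dataset, and the search ranges below are only placeholders:

    from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

    def model_init():
        # A fresh model per trial so trials do not share weights.
        return AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

    def optuna_hp_space(trial):
        # Search space; adjust the ranges to your setup.
        return {
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True),
            "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.3),
            "warmup_steps": trial.suggest_int("warmup_steps", 0, 500),
        }

    args = TrainingArguments(
        output_dir="hp_search",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=24,
        max_steps=500,          # keep each trial short and cheap
        bf16=True,
        logging_steps=50,
        report_to="none",
    )

    # train_dataset / eval_dataset: assumed pre-tokenized splits prepared elsewhere.
    trainer = Trainer(
        model_init=model_init,      # note: model_init instead of model
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,  # used to compute the objective (eval loss)
    )

    best_run = trainer.hyperparameter_search(
        hp_space=optuna_hp_space,
        backend="optuna",
        direction="minimize",       # minimize eval loss
        n_trials=10,
    )
    print(best_run.hyperparameters)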

By systematically experimenting with these strategies, you should be able to identify areas for improvement in your model training process. Remember that model training is often an iterative process, and it may take several rounds of experimentation to achieve the desired results.
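
And to put items 1, 2, 5 and 7 at the knob level, here is a minimal TrainingArguments sketch. The values are illustrative starting points to sweep over, not tuned recommendations for your data:

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="qwen-pretrain",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=24,   # item 2: simulate a larger effective batch
        bf16=True,                        # mixed precision (already enabled in your setup)
        learning_rate=1e-4,               # item 1: sweep, e.g. 1e-5 .. 5e-4
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,                # warm up ~3% of total steps instead of a fixed 50
        weight_decay=0.01,                # item 5: try a lower weight decay than 0.1
        max_grad_norm=1.0,                # item 7: gradient clipping (Trainer's default)
        max_steps=15000,
        logging_steps=5,
        report_to="wandb",
    )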

Thanks for the hints, I am trying them out anyway.
Have you ever tried any hyper-param library like Optuna with Hugging Face Trainer?


Have you ever tried any hyper-param library like Optuna with Hugging Face Trainer?

No. I don’t have that experience. The best way to ask questions about training is on the HF Discord.


Could be for many reasons. But essentially your model isn’t learning.

That could be: bad data, wrong model architecture, wrong parameters.

If you’ve tweaked the parameters a fair bit and still nothing then it’s likely the first two that are the issue.


In your experience, does the loss really approach 0, e.g., 0.000x?
As a side note, I see the plateau effect right after 10 to 20 iterations. What could that mean?


It’s helpful, I think, to stop thinking of loss as a destination and more as a journey. You’ll go mad trying to get your model down to 0.00035 loss, when maybe your current setup can only ever saturate at 0.00054.

It’s a fairly arbitrary measure, so focus on its behaviour instead. If loss is going down, the model is learning. If it’s going up, that’s bad. If it’s plateaued, learning is beginning to stop (or already has stopped).
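
That said, if you do want to put the absolute number in some context (item 6 above also mentions perplexity), for a causal LM trained with cross-entropy the loss converts to perplexity via exp(loss); a tiny sketch:

    import math

    # For a causal LM trained with cross-entropy loss, perplexity = exp(loss).
    for loss in (0.6, 0.5):
        print(f"loss {loss:.2f} -> perplexity {math.exp(loss):.2f}")
    # loss 0.60 -> perplexity 1.82
    # loss 0.50 -> perplexity 1.65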

What is an iteration here? An epoch or a step?


It is a good hint, thanks for sharing.
The iterations are steps, and that is why it is bothering me so much. I have around 2 GB of data, and one epoch comprises several thousand steps, which is why I am struggling with these results.

As for the data, I have used Skywork-Reward-Gemma-2-27B to remove the bad examples from my question-answer pair data.
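
For what it's worth, the filtering was roughly along these lines (a sketch only; the repo id, chat-template usage and acceptance threshold here are assumptions, not my exact script):

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumed repo id for the reward model; the threshold below is a placeholder.
    rm_name = "Skywork/Skywork-Reward-Gemma-2-27B"
    rm = AutoModelForSequenceClassification.from_pretrained(
        rm_name, torch_dtype=torch.bfloat16, device_map="auto", num_labels=1
    )
    tokenizer = AutoTokenizer.from_pretrained(rm_name)

    def score_pair(question: str, answer: str) -> float:
        # Format the QA pair with the model's chat template and read the scalar reward.
        conv = [{"role": "user", "content": question},
                {"role": "assistant", "content": answer}]
        input_ids = tokenizer.apply_chat_template(
            conv, tokenize=True, return_tensors="pt"
        ).to(rm.device)
        with torch.no_grad():
            return rm(input_ids).logits[0][0].item()

    # qa_pairs: assumed list of {"question": ..., "answer": ...} dicts.
    keep = [p for p in qa_pairs if score_pair(p["question"], p["answer"]) > 0.0]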
