Training loss does not go down during fine-tuning

I am doing the same thing as described in this article, but my training loss does not go down. Things I have tried (a rough sketch of my setup follows the list):

  • use a smaller model (gpt2, ~124M parameters)
  • use LoRA instead of QLoRA
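
For context, this is roughly what my training script looks like, written with PEFT and the Hugging Face Trainer. It is a minimal sketch rather than the exact script from the article; the dataset file ("train.txt"), the LoRA target modules, and the hyperparameters are placeholders I picked to illustrate the setup:

# Rough sketch of the kind of fine-tuning run I am doing; dataset file,
# LoRA settings, and hyperparameters are placeholders, not my exact script.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA applied to gpt2's fused attention projection.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Placeholder dataset: a plain-text file with one sample per line.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    learning_rate=1e-5,   # matches the LR visible in the logs below
    warmup_steps=2,
    logging_steps=1,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()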

A general question I have: when fine-tuning a pretrained model, is there some kind of minimum training dataset size? E.g., if I pick a model with 10 billion parameters and run LoRA with r=1 and 0.03% trainable parameters (3M), should the training dataset have at least X samples for the fine-tuning to be meaningful? I would expect that with a small X, the model should easily fit the training data and the training loss should drop to almost zero (overfitting). I understand overfitting is bad, but right now I am just trying to figure out why the training loss is not going down at all.
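
To make the overfitting point concrete, this is the kind of sanity check I have in mind (a hypothetical sketch continuing from the setup above, reusing model, tokenizer, and tokenized; not something I have run yet):

# Hypothetical overfit check: train on a tiny fixed subset with a much larger
# learning rate and confirm the loss collapses toward zero; if it stays flat
# even here, something in the pipeline itself is broken.
tiny = tokenized.select(range(16))
overfit_args = TrainingArguments(
    output_dir="overfit_check",
    num_train_epochs=50,
    per_device_train_batch_size=4,
    learning_rate=2e-4,   # much larger than the 1e-05 in my logs
    logging_steps=10,
    report_to="none",
)
Trainer(
    model=model,
    args=overfit_args,
    train_dataset=tiny,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()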

This is the kind of output I see with gpt2:

{'loss': 4.1504, 'learning_rate': 5e-06, 'epoch': 0.04}
{'loss': 4.2796, 'learning_rate': 1e-05, 'epoch': 0.08}
{'loss': 4.3042, 'learning_rate': 9.89795918367347e-06, 'epoch': 0.12}
{'loss': 4.1418, 'learning_rate': 9.795918367346939e-06, 'epoch': 0.16}
{'loss': 4.185, 'learning_rate': 9.693877551020408e-06, 'epoch': 0.2}
{'loss': 4.2192, 'learning_rate': 9.591836734693878e-06, 'epoch': 0.24}
{'loss': 4.2606, 'learning_rate': 9.489795918367348e-06, 'epoch': 0.28}
{'loss': 4.2326, 'learning_rate': 9.387755102040818e-06, 'epoch': 0.32}
{'loss': 4.2455, 'learning_rate': 9.285714285714288e-06, 'epoch': 0.36}
{'loss': 4.1771, 'learning_rate': 9.183673469387756e-06, 'epoch': 0.4}
{'loss': 4.3056, 'learning_rate': 9.081632653061225e-06, 'epoch': 0.44}
{'loss': 4.189, 'learning_rate': 8.979591836734695e-06, 'epoch': 0.48}
{'loss': 4.3775, 'learning_rate': 8.877551020408163e-06, 'epoch': 0.52}
{'loss': 4.2044, 'learning_rate': 8.775510204081633e-06, 'epoch': 0.56}
{'loss': 4.1507, 'learning_rate': 8.673469387755103e-06, 'epoch': 0.6}
{'loss': 4.2339, 'learning_rate': 8.571428571428571e-06, 'epoch': 0.64}
{'loss': 4.2225, 'learning_rate': 8.469387755102042e-06, 'epoch': 0.68}
{'loss': 4.2847, 'learning_rate': 8.36734693877551e-06, 'epoch': 0.72}
{'loss': 4.1156, 'learning_rate': 8.26530612244898e-06, 'epoch': 0.76}
{'loss': 3.9821, 'learning_rate': 8.16326530612245e-06, 'epoch': 0.8}
{'loss': 4.1366, 'learning_rate': 8.06122448979592e-06, 'epoch': 0.84}
{'loss': 4.2637, 'learning_rate': 7.959183673469388e-06, 'epoch': 0.88}
{'loss': 4.2937, 'learning_rate': 7.857142857142858e-06, 'epoch': 0.92}
{'loss': 4.3701, 'learning_rate': 7.755102040816327e-06, 'epoch': 0.96}
{'loss': 4.2257, 'learning_rate': 7.653061224489796e-06, 'epoch': 1.0}
{'loss': 4.1468, 'learning_rate': 7.551020408163265e-06, 'epoch': 1.04}
{'loss': 4.1625, 'learning_rate': 7.448979591836736e-06, 'epoch': 1.08}
{'loss': 4.0747, 'learning_rate': 7.346938775510205e-06, 'epoch': 1.12}
{'loss': 4.2389, 'learning_rate': 7.244897959183675e-06, 'epoch': 1.16}
{'loss': 4.2411, 'learning_rate': 7.1428571428571436e-06, 'epoch': 1.2}
{'loss': 4.3883, 'learning_rate': 7.0408163265306125e-06, 'epoch': 1.24}
{'loss': 4.4342, 'learning_rate': 6.938775510204082e-06, 'epoch': 1.27}
{'loss': 4.173, 'learning_rate': 6.836734693877551e-06, 'epoch': 1.31}
{'loss': 4.0996, 'learning_rate': 6.734693877551021e-06, 'epoch': 1.35}
{'loss': 4.206, 'learning_rate': 6.63265306122449e-06, 'epoch': 1.39}
{'loss': 4.1147, 'learning_rate': 6.530612244897959e-06, 'epoch': 1.43}
{'loss': 4.1747, 'learning_rate': 6.4285714285714295e-06, 'epoch': 1.47}
{'loss': 4.291, 'learning_rate': 6.326530612244899e-06, 'epoch': 1.51}
{'loss': 4.1898, 'learning_rate': 6.224489795918368e-06, 'epoch': 1.55}
{'loss': 4.1269, 'learning_rate': 6.122448979591837e-06, 'epoch': 1.59}
{'loss': 4.0671, 'learning_rate': 6.020408163265307e-06, 'epoch': 1.63}
{'loss': 4.1764, 'learning_rate': 5.918367346938776e-06, 'epoch': 1.67}
{'loss': 4.1331, 'learning_rate': 5.816326530612246e-06, 'epoch': 1.71}
{'loss': 4.1901, 'learning_rate': 5.7142857142857145e-06, 'epoch': 1.75}
{'loss': 4.1222, 'learning_rate': 5.6122448979591834e-06, 'epoch': 1.79}
{'loss': 4.3659, 'learning_rate': 5.510204081632653e-06, 'epoch': 1.83}
{'loss': 4.2379, 'learning_rate': 5.408163265306123e-06, 'epoch': 1.87}
{'loss': 4.1825, 'learning_rate': 5.306122448979593e-06, 'epoch': 1.91}
{'loss': 4.4365, 'learning_rate': 5.204081632653062e-06, 'epoch': 1.95}
{'loss': 4.4354, 'learning_rate': 5.1020408163265315e-06, 'epoch': 1.99}
{'loss': 4.1642, 'learning_rate': 5e-06, 'epoch': 2.03}
{'loss': 4.321, 'learning_rate': 4.897959183673469e-06, 'epoch': 2.07}
{'loss': 4.1739, 'learning_rate': 4.795918367346939e-06, 'epoch': 2.11}
{'loss': 4.2315, 'learning_rate': 4.693877551020409e-06, 'epoch': 2.15}
{'loss': 4.1402, 'learning_rate': 4.591836734693878e-06, 'epoch': 2.19}
{'loss': 4.2628, 'learning_rate': 4.489795918367348e-06, 'epoch': 2.23}
{'loss': 4.4122, 'learning_rate': 4.3877551020408165e-06, 'epoch': 2.27}
{'loss': 4.1045, 'learning_rate': 4.2857142857142855e-06, 'epoch': 2.31}
{'loss': 4.2417, 'learning_rate': 4.183673469387755e-06, 'epoch': 2.35}
{'loss': 4.2333, 'learning_rate': 4.081632653061225e-06, 'epoch': 2.39}
{'loss': 4.2976, 'learning_rate': 3.979591836734694e-06, 'epoch': 2.43}
{'loss': 4.137, 'learning_rate': 3.877551020408164e-06, 'epoch': 2.47}
{'loss': 4.0835, 'learning_rate': 3.7755102040816327e-06, 'epoch': 2.51}
{'loss': 4.2336, 'learning_rate': 3.6734693877551024e-06, 'epoch': 2.55}
{'loss': 4.23, 'learning_rate': 3.5714285714285718e-06, 'epoch': 2.59}
{'loss': 4.4269, 'learning_rate': 3.469387755102041e-06, 'epoch': 2.63}
{'loss': 4.2625, 'learning_rate': 3.3673469387755105e-06, 'epoch': 2.67}
{'loss': 4.0693, 'learning_rate': 3.2653061224489794e-06, 'epoch': 2.71}
{'loss': 4.1026, 'learning_rate': 3.1632653061224496e-06, 'epoch': 2.75}
{'loss': 4.2704, 'learning_rate': 3.0612244897959185e-06, 'epoch': 2.79}
{'loss': 4.3004, 'learning_rate': 2.959183673469388e-06, 'epoch': 2.83}
{'loss': 4.2444, 'learning_rate': 2.8571428571428573e-06, 'epoch': 2.87}
{'loss': 4.2163, 'learning_rate': 2.7551020408163266e-06, 'epoch': 2.91}
{'loss': 4.2658, 'learning_rate': 2.6530612244897964e-06, 'epoch': 2.95}
{'loss': 4.2456, 'learning_rate': 2.5510204081632657e-06, 'epoch': 2.99}
{'loss': 4.186, 'learning_rate': 2.4489795918367347e-06, 'epoch': 3.03}
{'loss': 4.1649, 'learning_rate': 2.3469387755102044e-06, 'epoch': 3.07}
{'loss': 4.252, 'learning_rate': 2.244897959183674e-06, 'epoch': 3.11}
{'loss': 4.2411, 'learning_rate': 2.1428571428571427e-06, 'epoch': 3.15}
{'loss': 4.3696, 'learning_rate': 2.0408163265306125e-06, 'epoch': 3.19}
{'loss': 4.1857, 'learning_rate': 1.938775510204082e-06, 'epoch': 3.23}
{'loss': 4.5284, 'learning_rate': 1.8367346938775512e-06, 'epoch': 3.27}
{'loss': 4.2942, 'learning_rate': 1.7346938775510206e-06, 'epoch': 3.31}
{'loss': 4.2246, 'learning_rate': 1.6326530612244897e-06, 'epoch': 3.35}
{'loss': 4.3536, 'learning_rate': 1.5306122448979593e-06, 'epoch': 3.39}
{'loss': 4.0673, 'learning_rate': 1.4285714285714286e-06, 'epoch': 3.43}
{'loss': 4.2783, 'learning_rate': 1.3265306122448982e-06, 'epoch': 3.47}
{'loss': 4.1887, 'learning_rate': 1.2244897959183673e-06, 'epoch': 3.51}
{'loss': 4.0983, 'learning_rate': 1.122448979591837e-06, 'epoch': 3.55}
{'loss': 4.1175, 'learning_rate': 1.0204081632653063e-06, 'epoch': 3.59}
{'loss': 4.1917, 'learning_rate': 9.183673469387756e-07, 'epoch': 3.63}
{'loss': 4.0768, 'learning_rate': 8.163265306122449e-07, 'epoch': 3.67}
{'loss': 4.1162, 'learning_rate': 7.142857142857143e-07, 'epoch': 3.71}
{'loss': 4.1685, 'learning_rate': 6.122448979591837e-07, 'epoch': 3.75}
{'loss': 4.269, 'learning_rate': 5.102040816326531e-07, 'epoch': 3.78}
{'loss': 4.2444, 'learning_rate': 4.0816326530612243e-07, 'epoch': 3.82}
{'loss': 4.2409, 'learning_rate': 3.0612244897959183e-07, 'epoch': 3.86}
{'loss': 4.4265, 'learning_rate': 2.0408163265306121e-07, 'epoch': 3.9}
{'loss': 4.1445, 'learning_rate': 1.0204081632653061e-07, 'epoch': 3.94}
{'loss': 4.1979, 'learning_rate': 0.0, 'epoch': 3.98}
{'train_runtime': 78.0494, 'train_samples_per_second': 128.124, 'train_steps_per_second': 1.281, 'train_loss': 4.222590186595917, 'epoch': 3.98}