Hello, I am fine-tuning GPT-J-6B, and each input example is 2048 tokens long. My loss is decreasing, but very slowly. Could the sequence length of 2048 be the problem? Is there a correlation between the maximum sequence length and the loss? Should I reduce the maximum sequence length (e.g., to 512) and try again?
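
For context, if I do try 512, this is a minimal sketch of how I would re-chunk the data. It is not my actual training code: `rechunk` and `block_size` are hypothetical names I made up for illustration, and the token IDs are dummy integers rather than real tokenizer output.

```python
from typing import List


def rechunk(token_ids: List[int], block_size: int = 512) -> List[List[int]]:
    """Split one long token sequence into consecutive blocks of block_size tokens.

    Hypothetical helper for illustration only. A trailing remainder shorter
    than block_size is dropped.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Dummy stand-in for one 2048-token training example (assumption, not real data).
example = list(range(2048))
blocks = rechunk(example, block_size=512)
print(len(blocks), len(blocks[0]))  # prints "4 512"
```

One thing I am unsure about with this approach: at the same batch size, each step would then see a quarter as many tokens, so I assume the per-step loss curves of the two runs would not be directly comparable.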