I’ve been trying to use run_clm.py from the language modeling examples to fine-tune GPT-Neo 125M on a small dataset I put together (a text file of around 10 MB). To speed things up, I’ve been running it on SageMaker with the example script given under the GPT-neo-125 train section. The problem is that I get the following error during training: “ValueError: expected sequence of length 1024 at dim 1 (got 507).” From what I can tell, the group_texts function in run_clm.py is supposed to drop any leftover text that is shorter than the required sequence length, so a 507-token sequence should never reach the model.
I started writing this post at my wit’s end, but I managed to find a solution before finishing it. On lines 417-418 of run_clm.py there was a check that let inputs shorter than the required length through. I’ve fixed it on my own branch. Since I came across this board through my Google searches, I’m posting this in the hope that someone else out there will find it useful.
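For anyone who wants to see the failure mode in isolation, here is a minimal sketch. It assumes the check in question was a guard of the form `if total_length >= block_size:` (the exact lines may differ between versions of run_clm.py). The `strict` parameter and the toy batch are made up for illustration, so treat this as a reconstruction of the idea and not the actual patch.

```python
from itertools import chain


def group_texts(examples, block_size, strict=True):
    """Concatenate tokenized texts and split them into block_size chunks.

    strict=False mimics a guard like `if total_length >= block_size:`,
    which leaves a too-short batch untouched (assumed original behaviour).
    strict=True always rounds down, so short leftovers are dropped.
    """
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated["input_ids"])
    if strict or total_length >= block_size:
        # Round down to a multiple of block_size (0 if the batch is too short).
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Hypothetical batch: two tokenized texts totalling 507 tokens.
batch = {"input_ids": [list(range(300)), list(range(207))]}

lenient = group_texts(batch, block_size=1024, strict=False)
strict = group_texts(batch, block_size=1024, strict=True)

print([len(c) for c in lenient["input_ids"]])  # [507]
print([len(c) for c in strict["input_ids"]])   # []
```

With the guard in place, a batch whose total length is below the block size comes out as a single 507-token chunk, which would explain the error when it is later stacked with 1024-token chunks. Without the guard, that batch simply produces no chunks. The trade-off is that those tokens are discarded; padding them would be an alternative if you can't afford to lose data.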