I'm trying to learn a bit about finetuning LLMs in practice.
I'm using this Google Colab notebook in its default configuration.
It works fine if I don't change anything, but as soon as I swap in a different dataset I get this error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-23-5cbcc1f7f806> in <cell line: 23>()
21 )
22 model.config.use_cache = False # silence the warnings. Please re-enable for inference!
---> 23 trainer.train()
3 frames
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1537 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1538 )
-> 1539 return inner_training_loop(
1540 args=args,
1541 resume_from_checkpoint=resume_from_checkpoint,
/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1886 optimizer_was_run = scale_before <= scale_after
1887 else:
-> 1888 self.optimizer.step()
1889 optimizer_was_run = not self.accelerator.optimizer_step_was_skipped
1890
/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py in step(self, closure)
131 self._last_scale = self.scaler.get_scale()
132 new_scale = True
--> 133 self.scaler.step(self.optimizer, closure)
134 self.scaler.update()
135 scale_after = self.scaler.get_scale()
/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py in step(self, optimizer, *args, **kwargs)
370 self.unscale_(optimizer)
371
--> 372 assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
373
374 retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
AssertionError: No inf checks were recorded for this optimizer.
How can I get the Colab to work with custom datasets, either uploaded directly or hosted on Hugging Face?
So something must be wrong with the dataset, right?
That's what I thought as well, but even if I literally take the jsonl file from the Hugging Face dataset Abirate/english_quotes and re-upload it to my own Hugging Face account, I still get the same error. The data is literally identical. I don't know how to debug this further. Please help.
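To rule out a malformed file, I also sanity-checked the jsonl format myself. Here is a minimal sketch of that check, assuming the same field names as Abirate/english_quotes (`quote`, `author`, `tags`); `quotes.jsonl` is just a placeholder file name:

```python
import json

# Write a tiny jsonl file in the same shape as Abirate/english_quotes:
# one JSON object per line (field names assumed from that dataset).
rows = [
    {"quote": "Be yourself; everyone else is already taken.",
     "author": "Oscar Wilde", "tags": ["misattributed"]},
    {"quote": "So many books, so little time.",
     "author": "Frank Zappa", "tags": ["books"]},
]
with open("quotes.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back and verify that every line parses and has the expected keys.
with open("quotes.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]

assert all({"quote", "author", "tags"} <= set(r) for r in parsed)
print(len(parsed))  # prints: 2
```

Loading the same file with `load_dataset("json", data_files="quotes.jsonl")` also succeeds without complaint, so the file format itself looks fine; the error only appears once training starts.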