NaN in tensors and evaluation when fine-tuning GPT-2 (CLM)

When I train on my data, evaluation reports "nan" for perplexity. Trying to use the model afterwards fails with an error saying that one or more tensors contain "nan" or "inf".

Training logs

2022-08-16T22:25:29.840Z	100%|█████████▉| 3169/3174 [02:51<00:00, 19.23it/s]
2022-08-16T22:25:29.840Z	100%|█████████▉| 3171/3174 [02:51<00:00, 19.23it/s]
2022-08-16T22:25:29.840Z	100%|█████████▉| 3173/3174 [02:51<00:00, 19.25it/s]
2022-08-16T22:25:29.840Z	100%|██████████| 3174/3174 [02:51<00:00, 18.50it/s]
2022-08-16T22:25:29.840Z	***** eval metrics ***** epoch = 3.0 eval_loss = nan eval_runtime = 0:02:51.67 eval_samples = 3174 eval_samples_per_second = 18.488 eval_steps_per_second = 18.488 perplexity = nan
2022-08-16T22:25:29.840Z	[INFO|modelcard.py:460] 2022-08-16 22:25:29,274 >> Dropping the following result as it does not have all the necessary fields:
2022-08-16T22:25:29.840Z	{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
2022-08-16T22:25:29.840Z	[INFO|modelcard.py:460] 2022-08-16 22:25:29,274 >> Dropping the following result as it does not have all the necessary fields:
2022-08-16T22:25:29.840Z	{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
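
For context, run_clm.py reports perplexity as the exponential of the evaluation loss, so a nan eval_loss always produces a nan perplexity. The loss is therefore the value to investigate. A minimal sketch of that relationship follows; the helper name is only for illustration.

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Perplexity as exp(loss), the usual formula for causal language modeling."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        # Very large losses overflow a float; treat them as infinite perplexity.
        return float("inf")


print(perplexity_from_loss(3.2))           # finite loss gives a finite perplexity
print(perplexity_from_loss(float("nan")))  # nan simply propagates
```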

Loading the model for text generation then gives the following error:

{
  "code": 400,
  "type": "InternalServerException",
  "message": "probability tensor contains either `inf`, `nan` or element \u003c 0"
}
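
As far as I can tell, this message is raised at the sampling step, when the probabilities computed from the model's logits are not a valid distribution. That would be consistent with nan values already sitting in the trained weights. The plain-Python sketch below only illustrates how a single nan logit contaminates every probability; it is not the library's own code, and both helper names are made up for the example.

```python
import math


def softmax(logits):
    """Plain-Python softmax, for illustration only."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_valid_distribution(probs):
    """True when every probability is finite and non-negative."""
    return all(math.isfinite(p) and p >= 0 for p in probs)


healthy = softmax([1.0, 2.0, 3.0])
broken = softmax([1.0, float("nan"), 3.0])  # one nan logit

print(is_valid_distribution(healthy))
print(is_valid_distribution(broken))
```

Because the normalizing sum includes the nan term, every entry of the second distribution becomes nan, not just the one that came from the bad logit.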

Configuration of trainer

hyperparameters = {
    'model_name_or_path': 'distilgpt2',
    'cache_dir': '/opt/ml/cache',
    'output_dir': '/opt/ml/model/skribenter',
    'per_device_train_batch_size': 3,
    'per_device_eval_batch_size': 1,
    'evaluation_strategy': 'epoch',
    'logging_strategy': 'epoch',
    'num_train_epochs': 3,
    'save_strategy': 'epoch',
    'train_file': '/opt/ml/input/data/train/skribenter.txt',
    'do_train': True,
    'do_eval': True,
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.17.0/examples/pytorch/language-modeling
}
# Raw strings keep the regex escape \D from being read as a string escape.
metric_definitions = [
    {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
    {"Name": "eval_accuracy", "Regex": r"eval_accuracy.*=\D*(.*?)$"},
    {"Name": "eval_loss", "Regex": r"eval_loss.*=\D*(.*?)$"},
]
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# `role` is the SageMaker execution role, defined earlier in the notebook.
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./transformers/examples/pytorch/language-modeling',
    instance_type='ml.g4dn.16xlarge',
    instance_count=1,
    learning_rate=0.0005,
    # max_seq_length=1024,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    padding=True,
    role=role,
    preprocessing_num_workers=1,
    disable_tqdm=True,  # to disable progress bars
    # save_total_limit=1,
    volume_size=900,
    block_size=1024,
    save_steps=200,
    compiler_config=TrainingCompilerConfig(),
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
)
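
One thing that may be worth checking: the logs above still show progress bars even though disable_tqdm=True is set on the estimator, which suggests that script-level options given as estimator keyword arguments might not reach run_clm.py at all. The sketch below assumes (I have not confirmed this) that the script only receives what is in the hyperparameters dict, and moves the options I recognise as run_clm.py or Trainer arguments into it. The dict names are illustrative, and padding is left out because I do not know of a matching script argument.

```python
# Assumption: run_clm.py only sees values passed through `hyperparameters`.
base_hyperparameters = {
    "model_name_or_path": "distilgpt2",
    "do_train": True,
    "do_eval": True,
}

# Script-level options moved out of the estimator call (values from the post).
script_arguments = {
    "learning_rate": 0.0005,
    "gradient_accumulation_steps": 2,
    "gradient_checkpointing": True,
    "block_size": 1024,
    "preprocessing_num_workers": 1,
    "disable_tqdm": True,
    "save_steps": 200,
}

# Merge both dicts; this combined dict is what would go to the estimator.
hyperparameters = {**base_hyperparameters, **script_arguments}
print(sorted(hyperparameters))
```

If the options were being dropped, training has been running with the script defaults, so the learning rate and block size in effect would differ from the ones listed in the estimator call.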

A couple of notes

  • I have tried training with fp16 both enabled and disabled; the result is the same.
  • run_clm.py is slightly modified to drop training examples that are too short.
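
The actual modification is not shown here, so the following is only a hypothetical sketch of what such a filter could look like. It reports how many examples it drops, which makes it easy to confirm that the filter is not silently removing most of the data. The function name and the min_tokens threshold are assumptions for illustration.

```python
def drop_short_examples(tokenized_examples, min_tokens):
    """Keep examples with at least min_tokens tokens (hypothetical helper).

    Returns the kept examples and the number of dropped ones.
    """
    kept = [ids for ids in tokenized_examples if len(ids) >= min_tokens]
    dropped = len(tokenized_examples) - len(kept)
    return kept, dropped


# Toy token-id lists standing in for tokenizer output.
examples = [[1, 2, 3, 4, 5], [6], [], [7, 8, 9]]
kept, dropped = drop_short_examples(examples, min_tokens=3)
print(f"kept {len(kept)} examples, dropped {dropped}")
```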

I would like to verify that training actually runs on real data. Could something else in the configuration be causing the computations to produce inf or nan?
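
A quick way to sanity-check the input before training would be to summarize the raw text file. This is a minimal sketch that assumes the training file is plain UTF-8 text with one passage per line; the helper name and the returned keys are made up for the example.

```python
from pathlib import Path


def summarize_text_file(path):
    """Return basic counts for a line-based training file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    non_empty = [line for line in lines if line.strip()]
    return {
        "total_lines": len(lines),
        "empty_lines": len(lines) - len(non_empty),
        "total_characters": sum(len(line) for line in non_empty),
        "longest_line": max((len(line) for line in non_empty), default=0),
    }
```

Calling summarize_text_file('/opt/ml/input/data/train/skribenter.txt') inside the training container would show whether the file is mostly empty lines or very short passages, either of which would leave little real data after filtering.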