Nan in tensors and evaluation for GPT2 finetuning (clm)

nittonfemton · August 17, 2022, 8:19am

When I fine-tune on my data, the evaluation reports “nan” for both eval_loss and perplexity. Trying to use the resulting model then fails with an error saying that one or more tensors contain “nan” or “inf”.

Training logs

2022-08-16T22:25:29.840Z	100%|█████████▉| 3169/3174 [02:51<00:00, 19.23it/s]
2022-08-16T22:25:29.840Z	100%|█████████▉| 3171/3174 [02:51<00:00, 19.23it/s]
2022-08-16T22:25:29.840Z	100%|█████████▉| 3173/3174 [02:51<00:00, 19.25it/s]
2022-08-16T22:25:29.840Z	100%|██████████| 3174/3174 [02:51<00:00, 18.50it/s]
2022-08-16T22:25:29.840Z	***** eval metrics ***** epoch = 3.0 eval_loss = nan eval_runtime = 0:02:51.67 eval_samples = 3174 eval_samples_per_second = 18.488 eval_steps_per_second = 18.488 perplexity = nan
2022-08-16T22:25:29.840Z	[INFO|modelcard.py:460] 2022-08-16 22:25:29,274 >> Dropping the following result as it does not have all the necessary fields:
2022-08-16T22:25:29.840Z	{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
2022-08-16T22:25:29.840Z	[INFO|modelcard.py:460] 2022-08-16 22:25:29,274 >> Dropping the following result as it does not have all the necessary fields:
2022-08-16T22:25:29.840Z	{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
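
For context, here is a minimal sketch of why perplexity is nan whenever eval_loss is nan. It assumes perplexity is derived as exp(eval_loss), which is how the stock run_clm.py example computes it as far as I can tell. The function name perplexity_from_loss is hypothetical and only used for illustration.

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Perplexity as exp(loss); an overflowing loss is reported as inf.

    Assumption: this mirrors the usual exp(eval_loss) definition.
    """
    try:
        return math.exp(eval_loss)
    except OverflowError:
        return float("inf")


# A finite loss gives a finite perplexity.
print(perplexity_from_loss(3.2))
# A nan loss propagates straight through exp(), so perplexity is nan too.
print(perplexity_from_loss(float("nan")))
```

So the nan perplexity is not a separate problem: the thing to track down is the nan eval_loss.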

Loading the model for text generation then gives the following error:

{
  "code": 400,
  "type": "InternalServerException",
  "message": "probability tensor contains either `inf`, `nan` or element \u003c 0"
}
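
The message names three conditions: inf, nan, or a negative element. The wording looks like the validity check PyTorch applies before sampling from a probability distribution, but where exactly the serving stack raises it is an assumption on my part. The sketch below shows the same three conditions in plain Python; find_invalid_probabilities is a hypothetical helper, not a library function.

```python
import math


def find_invalid_probabilities(probs):
    """Return (index, value) pairs that are nan, inf, or negative."""
    return [
        (i, p)
        for i, p in enumerate(probs)
        if math.isnan(p) or math.isinf(p) or p < 0
    ]


healthy = [0.7, 0.2, 0.1]
broken = [float("nan"), 0.5, -0.1, float("inf")]

# No entries are flagged for a well-formed distribution.
print(find_invalid_probabilities(healthy))
# The nan, the negative value, and the inf are all flagged.
print(find_invalid_probabilities(broken))
```

If the fine-tuned weights already contain nan, every probability the model produces will be nan, which would be consistent with the nan eval_loss in the training logs.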

Configuration of trainer

from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# Arguments forwarded to run_clm.py
hyperparameters = {
    'model_name_or_path': 'distilgpt2',
    'cache_dir': '/opt/ml/cache',
    'output_dir': '/opt/ml/model/skribenter',
    'per_device_train_batch_size': 3,
    'per_device_eval_batch_size': 1,
    'evaluation_strategy': 'epoch',
    'logging_strategy': 'epoch',
    'num_train_epochs': 3,
    'save_strategy': 'epoch',
    'train_file': '/opt/ml/input/data/train/skribenter.txt',
    'do_train': True,
    'do_eval': True,
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.17.0/examples/pytorch/language-modeling
}
# Raw strings keep the regex escapes (\D) intact
metric_definitions = [
    {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
    {"Name": "eval_accuracy", "Regex": r"eval_accuracy.*=\D*(.*?)$"},
    {"Name": "eval_loss", "Regex": r"eval_loss.*=\D*(.*?)$"},
]
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./transformers/examples/pytorch/language-modeling',
    instance_type='ml.g4dn.16xlarge',
    instance_count=1,
    learning_rate=0.0005,
    # max_seq_length=1024,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    padding=True,
    role=role,
    preprocessing_num_workers=1,
    disable_tqdm=True,  # to disable progress bars
    # save_total_limit=1,
    volume_size=900,
    block_size=1024,
    save_steps=200,
    compiler_config=TrainingCompilerConfig(),
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
)

A couple of notes

Training has been tried with fp16 both enabled and disabled, with no change.
run_clm.py is slightly modified to drop training samples that are too short.

I would like to verify that the training is actually being done on real data. Could something else in the configuration be causing the calculations to produce inf or nan?
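
One rough way to check the first point is to summarize the training file before it goes into the job. This is only a sketch: the helper summarize_text_file and the min_chars threshold are hypothetical, and it counts characters per line rather than tokens, so it does not reproduce what run_clm.py does with block_size.

```python
from pathlib import Path


def summarize_text_file(path, min_chars=1):
    """Count total, empty, and too-short lines in a plain-text file.

    min_chars is an illustrative threshold, not a run_clm.py option.
    """
    total = empty = short = chars = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            text = line.strip()
            total += 1
            if not text:
                empty += 1
            elif len(text) < min_chars:
                short += 1
            chars += len(text)
    return {"lines": total, "empty": empty, "short": short, "chars": chars}
```

Calling it on a local copy of skribenter.txt and comparing the counts with what the modified script keeps after dropping short samples should show whether a meaningful amount of text actually reaches training and evaluation.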

codelion · April 2, 2023, 4:01am

I am having the same issue with the official fine-tuning example scripts. Did anyone figure out how to solve it?
