While training on my data, evaluation reports `nan` for perplexity. Trying to use the trained model afterwards fails with an error saying a tensor contains `nan` or `inf` values.
Training logs
2022-08-16T22:25:29.840Z 100%|██████████| 3169/3174 [02:51<00:00, 19.23it/s]
2022-08-16T22:25:29.840Z 100%|██████████| 3171/3174 [02:51<00:00, 19.23it/s]
2022-08-16T22:25:29.840Z 100%|██████████| 3173/3174 [02:51<00:00, 19.25it/s]
2022-08-16T22:25:29.840Z 100%|██████████| 3174/3174 [02:51<00:00, 18.50it/s]
2022-08-16T22:25:29.840Z ***** eval metrics *****
  epoch                   = 3.0
  eval_loss               = nan
  eval_runtime            = 0:02:51.67
  eval_samples            = 3174
  eval_samples_per_second = 18.488
  eval_steps_per_second   = 18.488
  perplexity              = nan
2022-08-16T22:25:29.840Z [INFO|modelcard.py:460] 2022-08-16 22:25:29,274 >> Dropping the following result as it does not have all the necessary fields:
2022-08-16T22:25:29.840Z {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
2022-08-16T22:25:29.840Z [INFO|modelcard.py:460] 2022-08-16 22:25:29,274 >> Dropping the following result as it does not have all the necessary fields:
2022-08-16T22:25:29.840Z {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
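For context, the `run_clm.py` example computes perplexity as the exponential of `eval_loss`, so a `nan` evaluation loss propagates directly into a `nan` perplexity; the problem originates in the loss itself, not in the perplexity calculation. A minimal illustration:

```python
import math

# run_clm.py derives perplexity from the evaluation loss via exp().
eval_loss = float("nan")  # the value reported in the eval metrics
perplexity = math.exp(eval_loss)
print(perplexity)  # nan -- exp() of nan is still nan
```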
Loading the model for text generation then gives the following error:
{
"code": 400,
"type": "InternalServerException",
"message": "probability tensor contains either `inf`, `nan` or element \u003c 0"
}
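One way to narrow this down is to pull the trained checkpoint and scan its weights for non-finite values before involving the inference endpoint at all. Below is a sketch of the per-tensor check; in practice you would iterate over `model.state_dict()` loaded with torch and use `torch.isfinite(t).all()`, but plain floats are used here to keep the example self-contained:

```python
import math

def find_nonfinite(tensors):
    """Return names of entries whose values contain nan or inf."""
    return [name for name, values in tensors.items()
            if not all(math.isfinite(v) for v in values)]

# Toy stand-in for a state dict: one healthy tensor, one corrupted one.
weights = {
    "wte.weight": [0.1, -0.2, 0.3],
    "lm_head.weight": [1.0, float("nan"), float("inf")],
}
print(find_nonfinite(weights))  # ['lm_head.weight']
```

If every parameter is finite, the `nan` is being produced at inference time rather than baked into the saved weights.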
Configuration of trainer
hyperparameters = {
    'model_name_or_path': 'distilgpt2',
    'cache_dir': '/opt/ml/cache',
    'output_dir': '/opt/ml/model/skribenter',
    'per_device_train_batch_size': 3,
    'per_device_eval_batch_size': 1,
    'evaluation_strategy': 'epoch',
    'logging_strategy': 'epoch',
    'num_train_epochs': 3,
    'save_strategy': 'epoch',
    'train_file': '/opt/ml/input/data/train/skribenter.txt',
    'do_train': True,
    'do_eval': True
    # add your remaining hyperparameters
    # more info here https://github.com/huggingface/transformers/tree/v4.17.0/examples/pytorch/language-modeling
}
metric_definitions = [
    {"Name": "train_runtime", "Regex": r"train_runtime.*=\D*(.*?)$"},
    {"Name": "eval_accuracy", "Regex": r"eval_accuracy.*=\D*(.*?)$"},
    {"Name": "eval_loss", "Regex": r"eval_loss.*=\D*(.*?)$"},
]
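For what it's worth, these regexes can be exercised against sample log lines directly. Because `\D*` greedily consumes non-digit characters after the `=`, a numeric value is captured as expected, but a literal `nan` is swallowed, so a `nan` run reports an empty metric value:

```python
import re

regex = r"eval_loss.*=\D*(.*?)$"

# A numeric value is extracted as expected...
print(re.search(regex, "eval_loss = 3.1415").group(1))  # '3.1415'

# ...but a literal 'nan' is consumed by \D*, leaving an empty capture.
print(re.search(regex, "eval_loss = nan").group(1))  # ''
```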
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./transformers/examples/pytorch/language-modeling',
    instance_type='ml.g4dn.16xlarge',
    instance_count=1,
    learning_rate=0.0005,
    # max_seq_length=1024,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    padding=True,
    role=role,
    preprocessing_num_workers=1,
    disable_tqdm=True,  # to disable progress bars
    # save_total_limit=1,
    volume_size=900,
    block_size=1024,
    save_steps=200,
    compiler_config=TrainingCompilerConfig(),
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters
)
A couple of notes:
- Training has been tried with fp16 both enabled and disabled, with no change.
- run_clm.py is slightly modified to drop training samples that are too short.
I would like to verify that training is actually running on real data. Could something else in the configuration be causing the calculations to produce inf or nan values?
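To sanity-check the data itself, a quick pre-flight scan of the train file can count empty and suspiciously short lines before the tokenizer ever sees them. This is a sketch: `min_chars` is an arbitrary threshold, and the path comes from the hyperparameters above:

```python
def scan_train_file(path, min_chars=10):
    """Count total, empty, and suspiciously short lines in a text file."""
    stats = {"total": 0, "empty": 0, "short": 0}
    with open(path, encoding="utf-8") as f:
        for line in f:
            stats["total"] += 1
            stripped = line.strip()
            if not stripped:
                stats["empty"] += 1
            elif len(stripped) < min_chars:
                stats["short"] += 1
    return stats

# Example usage with the path from the hyperparameters:
# print(scan_train_file('/opt/ml/input/data/train/skribenter.txt'))
```

Opening the file with an explicit `encoding="utf-8"` will also surface any encoding problems as a `UnicodeDecodeError` rather than silently producing garbage tokens.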