Fine-tuning BART for text summarization gives NaN loss

Hi,

I'm using SageMaker to fine-tune a text-summarization model on amazon_reviews_multi (Chinese) with mt5 as the backbone, but the training loss is NaN and I'm wondering why.

training code:


import sagemaker
import boto3
from sagemaker.huggingface import HuggingFace

# gets role for executing training job
#iam_client = boto3.client('iam')

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

print(f"IAM role arn used for running training: {role}")
print(f"S3 bucket used for storing artifacts: {sess.default_bucket()}")

hyperparameters = {
    'model_name_or_path': 'google/mt5-small',
    'dataset_name': 'amazon_reviews_multi',
    'dataset_config_name': 'zh',
    'output_dir': '/opt/ml/model',
    'do_train': True,
    'do_eval': True,
    'do_predict': True,
    'predict_with_generate': True,
    'num_train_epochs': 5,
    'learning_rate': 5e-5,
    'seed': 7,
    'fp16': True,

    # add your remaining hyperparameters
    # more info here: https://github.com/huggingface/transformers/tree/v4.6.1/examples/pytorch/seq2seq
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',
    source_dir='/home/ec2-user/SageMaker/transformers/examples/seq2seq',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters=hyperparameters
)

# starting the train job
huggingface_estimator.fit()

log:

IAM role arn used for running training: arn:aws:iam::847380964353:role/spot-bot-SpotSageMakerExecutionRole-917OYJPI7O18
S3 bucket used for storing artifacts: sagemaker-us-west-2-847380964353
2021-09-16 05:19:41 Starting - Starting the training job...
2021-09-16 05:20:05 Starting - Launching requested ML instancesProfilerReport-1631769575: InProgress
...
2021-09-16 05:20:39 Starting - Preparing the instances for training............
2021-09-16 05:22:29 Downloading - Downloading input data
2021-09-16 05:22:29 Training - Downloading the training image.................bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-09-16 05:25:28,430 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-09-16 05:25:28,454 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-09-16 05:25:31,492 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-09-16 05:25:31,980 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda/bin/python3.6 -m pip install -r requirements.txt

2021-09-16 05:25:27 Training - Training image download completed. Training in progress.Requirement already satisfied: datasets>=1.1.3 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (1.6.2)
Requirement already satisfied: sentencepiece!=0.1.92 in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (0.1.91)
Requirement already satisfied: protobuf in /opt/conda/lib/python3.6/site-packages (from -r requirements.txt (line 3)) (3.17.1)
Collecting sacrebleu>=1.4.12
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
Collecting rouge-score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Collecting nltk
  Downloading nltk-3.6.2-py3-none-any.whl (1.5 MB)
Requirement already satisfied: pandas in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.1.5)
Requirement already satisfied: requests>=2.19.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.25.1)
Requirement already satisfied: fsspec in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2021.5.0)
Requirement already satisfied: huggingface-hub<0.1.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.0.8)
Requirement already satisfied: xxhash in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (2.0.2)
Requirement already satisfied: tqdm<4.50.0,>=4.27 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.49.0)
Requirement already satisfied: dill in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.3.3)
Requirement already satisfied: multiprocess in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.70.11.1)
Requirement already satisfied: dataclasses in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (0.8)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.1)
Requirement already satisfied: packaging in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (20.9)
Requirement already satisfied: pyarrow>=1.0.0<4.0.0 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (4.0.0)
Requirement already satisfied: numpy>=1.17 in /opt/conda/lib/python3.6/site-packages (from datasets>=1.1.3->-r requirements.txt (line 1)) (1.19.1)
Requirement already satisfied: regex in /opt/conda/lib/python3.6/site-packages (from sacrebleu>=1.4.12->-r requirements.txt (line 4)) (2021.4.4)
Collecting portalocker
  Downloading portalocker-2.3.2-py2.py3-none-any.whl (15 kB)
Requirement already satisfied: colorama in /opt/conda/lib/python3.6/site-packages (from sacrebleu>=1.4.12->-r requirements.txt (line 4)) (0.4.3)
Collecting tabulate>=0.8.9
  Downloading tabulate-0.8.9-py3-none-any.whl (25 kB)
Requirement already satisfied: filelock in /opt/conda/lib/python3.6/site-packages (from huggingface-hub<0.1.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.12)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2020.12.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (1.25.11)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (2.10)
Requirement already satisfied: chardet<5,>=3.0.2 in /opt/conda/lib/python3.6/site-packages (from requests>=2.19.0->datasets>=1.1.3->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.6/site-packages (from protobuf->-r requirements.txt (line 3)) (1.16.0)
Collecting absl-py
  Downloading absl_py-0.13.0-py3-none-any.whl (132 kB)
Requirement already satisfied: joblib in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 6)) (1.0.1)
Requirement already satisfied: click in /opt/conda/lib/python3.6/site-packages (from nltk->-r requirements.txt (line 6)) (7.1.2)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata->datasets>=1.1.3->-r requirements.txt (line 1)) (3.10.0.0)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.6/site-packages (from importlib-metadata->datasets>=1.1.3->-r requirements.txt (line 1)) (3.4.1)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.6/site-packages (from packaging->datasets>=1.1.3->-r requirements.txt (line 1)) (2.4.7)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas->datasets>=1.1.3->-r requirements.txt (line 1)) (2021.1)
Installing collected packages: tabulate, portalocker, nltk, absl-py, sacrebleu, rouge-score
  Attempting uninstall: tabulate
    Found existing installation: tabulate 0.8.7
    Uninstalling tabulate-0.8.7:
      Successfully uninstalled tabulate-0.8.7
Successfully installed absl-py-0.13.0 nltk-3.6.2 portalocker-2.3.2 rouge-score-0.0.4 sacrebleu-2.0.0 tabulate-0.8.9
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aws-parallelcluster 2.10.4 requires tabulate<=0.8.7,>=0.8.2, but you have tabulate 0.8.9 which is incompatible.
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv

2021-09-16 05:25:39,039 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {},
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "predict_with_generate": true,
        "seed": 7,
        "do_predict": true,
        "do_train": true,
        "dataset_name": "amazon_reviews_multi",
        "num_train_epochs": 5,
        "do_eval": true,
        "dataset_config_name": "zh",
        "output_dir": "/opt/ml/model",
        "learning_rate": 5e-05,
        "model_name_or_path": "google/mt5-small",
        "fp16": true
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {},
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "huggingface-pytorch-training-2021-09-16-05-19-35-810",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-847380964353/huggingface-pytorch-training-2021-09-16-05-19-35-810/source/sourcedir.tar.gz",
    "module_name": "run_summarization",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "run_summarization.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"dataset_config_name":"zh","dataset_name":"amazon_reviews_multi","do_eval":true,"do_predict":true,"do_train":true,"fp16":true,"learning_rate":5e-05,"model_name_or_path":"google/mt5-small","num_train_epochs":5,"output_dir":"/opt/ml/model","predict_with_generate":true,"seed":7}
SM_USER_ENTRY_POINT=run_summarization.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=[]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=run_summarization
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-847380964353/huggingface-pytorch-training-2021-09-16-05-19-35-810/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"dataset_config_name":"zh","dataset_name":"amazon_reviews_multi","do_eval":true,"do_predict":true,"do_train":true,"fp16":true,"learning_rate":5e-05,"model_name_or_path":"google/mt5-small","num_train_epochs":5,"output_dir":"/opt/ml/model","predict_with_generate":true,"seed":7},"input_config_dir":"/opt/ml/input/config","input_data_config":{},"input_dir":"/opt/ml/input","is_master":true,"job_name":"huggingface-pytorch-training-2021-09-16-05-19-35-810","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-847380964353/huggingface-pytorch-training-2021-09-16-05-19-35-810/source/sourcedir.tar.gz","module_name":"run_summarization","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"run_summarization.py"}
SM_USER_ARGS=["--dataset_config_name","zh","--dataset_name","amazon_reviews_multi","--do_eval","True","--do_predict","True","--do_train","True","--fp16","True","--learning_rate","5e-05","--model_name_or_path","google/mt5-small","--num_train_epochs","5","--output_dir","/opt/ml/model","--predict_with_generate","True","--seed","7"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_HP_PREDICT_WITH_GENERATE=true
SM_HP_SEED=7
SM_HP_DO_PREDICT=true
SM_HP_DO_TRAIN=true
SM_HP_DATASET_NAME=amazon_reviews_multi
SM_HP_NUM_TRAIN_EPOCHS=5
SM_HP_DO_EVAL=true
SM_HP_DATASET_CONFIG_NAME=zh
SM_HP_OUTPUT_DIR=/opt/ml/model
SM_HP_LEARNING_RATE=5e-05
SM_HP_MODEL_NAME_OR_PATH=google/mt5-small
SM_HP_FP16=true
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

Invoking script with the following command:

/opt/conda/bin/python3.6 run_summarization.py --dataset_config_name zh --dataset_name amazon_reviews_multi --do_eval True --do_predict True --do_train True --fp16 True --learning_rate 5e-05 --model_name_or_path google/mt5-small --num_train_epochs 5 --output_dir /opt/ml/model --predict_with_generate True --seed 7


09/16/2021 05:25:45 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: True
09/16/2021 05:25:45 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/opt/ml/model', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=True, evaluation_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_ratio=0.0, warmup_steps=0, logging_dir='runs/Sep16_05-25-45_algo-1', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=500, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=None, no_cuda=False, seed=7, fp16=True, fp16_opt_level='O1', fp16_backend='auto', fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/opt/ml/model', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name='length', report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, mp_parameters='', sortish_sampler=False, predict_with_generate=True)
Downloading and preparing dataset amazon_reviews_multi/zh (download: 109.09 MiB, generated: 52.01 MiB, post-processed: Unknown size, total: 161.10 MiB) to /root/.cache/huggingface/datasets/amazon_reviews_multi/zh/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609...
Dataset amazon_reviews_multi downloaded and prepared to /root/.cache/huggingface/datasets/amazon_reviews_multi/zh/1.0.0/724e94f4b0c6c405ce7e476a6c5ef4f87db30799ad49f765094cf9770e0f7609. Subsequent calls will reuse this data.
https://huggingface.co/google/mt5-small/resolve/main/config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpsal7nwx8
storing https://huggingface.co/google/mt5-small/resolve/main/config.json in cache at /root/.cache/huggingface/transformers/97693496c1a0cae463bd18428187f9e9924d2dfbadaa46e4d468634a0fc95a41.dadce13f8f85f4825168354a04675d4b177749f8f11b167e87676777695d4fe4
creating metadata file for /root/.cache/huggingface/transformers/97693496c1a0cae463bd18428187f9e9924d2dfbadaa46e4d468634a0fc95a41.dadce13f8f85f4825168354a04675d4b177749f8f11b167e87676777695d4fe4
loading configuration file https://huggingface.co/google/mt5-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/97693496c1a0cae463bd18428187f9e9924d2dfbadaa46e4d468634a0fc95a41.dadce13f8f85f4825168354a04675d4b177749f8f11b167e87676777695d4fe4
Model config MT5Config {
  "architectures": [
    "MT5ForConditionalGeneration"
  ],
  "d_ff": 1024,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 8,
  "num_heads": 6,
  "num_layers": 8,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.6.1",
  "use_cache": true,
  "vocab_size": 250112
}

loading configuration file https://huggingface.co/google/mt5-small/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/97693496c1a0cae463bd18428187f9e9924d2dfbadaa46e4d468634a0fc95a41.dadce13f8f85f4825168354a04675d4b177749f8f11b167e87676777695d4fe4
Model config MT5Config {
  "architectures": [
    "MT5ForConditionalGeneration"
  ],
  "d_ff": 1024,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "mt5",
  "num_decoder_layers": 8,
  "num_heads": 6,
  "num_layers": 8,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.6.1",
  "use_cache": true,
  "vocab_size": 250112
}

https://huggingface.co/google/mt5-small/resolve/main/spiece.model not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmp1dhqa6qf
storing https://huggingface.co/google/mt5-small/resolve/main/spiece.model in cache at /root/.cache/huggingface/transformers/37d0f67f084f8c5fc5589e0bba5ff3c6307af833bb0b7f4eb33fbfd8d4038a9d.84ea7af2df68dc8db434d3160aab65cce8ac63ce5b6f7743f8c9a4a14b4f77e2
creating metadata file for /root/.cache/huggingface/transformers/37d0f67f084f8c5fc5589e0bba5ff3c6307af833bb0b7f4eb33fbfd8d4038a9d.84ea7af2df68dc8db434d3160aab65cce8ac63ce5b6f7743f8c9a4a14b4f77e2
https://huggingface.co/google/mt5-small/resolve/main/special_tokens_map.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpqgr1lhnz
storing https://huggingface.co/google/mt5-small/resolve/main/special_tokens_map.json in cache at /root/.cache/huggingface/transformers/685ac0ca8568ec593a48b61b0a3c272beee9bc194a3c7241d15dcadb5f875e53.f76030f3ec1b96a8199b2593390c610e76ca8028ef3d24680000619ffb646276
creating metadata file for /root/.cache/huggingface/transformers/685ac0ca8568ec593a48b61b0a3c272beee9bc194a3c7241d15dcadb5f875e53.f76030f3ec1b96a8199b2593390c610e76ca8028ef3d24680000619ffb646276
https://huggingface.co/google/mt5-small/resolve/main/tokenizer_config.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpvibtracc
storing https://huggingface.co/google/mt5-small/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/6a9e52d6dd21568e37b65fc180ada927968e8f7124f0acd6efcaf90cd2e0f4bb.4b81e5d952ad810ca1de2b3e362b9a26a5cc77b4b75daf20caf69fb838751c32
creating metadata file for /root/.cache/huggingface/transformers/6a9e52d6dd21568e37b65fc180ada927968e8f7124f0acd6efcaf90cd2e0f4bb.4b81e5d952ad810ca1de2b3e362b9a26a5cc77b4b75daf20caf69fb838751c32
loading file https://huggingface.co/google/mt5-small/resolve/main/spiece.model from cache at /root/.cache/huggingface/transformers/37d0f67f084f8c5fc5589e0bba5ff3c6307af833bb0b7f4eb33fbfd8d4038a9d.84ea7af2df68dc8db434d3160aab65cce8ac63ce5b6f7743f8c9a4a14b4f77e2
loading file https://huggingface.co/google/mt5-small/resolve/main/tokenizer.json from cache at None
loading file https://huggingface.co/google/mt5-small/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/google/mt5-small/resolve/main/special_tokens_map.json from cache at /root/.cache/huggingface/transformers/685ac0ca8568ec593a48b61b0a3c272beee9bc194a3c7241d15dcadb5f875e53.f76030f3ec1b96a8199b2593390c610e76ca8028ef3d24680000619ffb646276
loading file https://huggingface.co/google/mt5-small/resolve/main/tokenizer_config.json from cache at /root/.cache/huggingface/transformers/6a9e52d6dd21568e37b65fc180ada927968e8f7124f0acd6efcaf90cd2e0f4bb.4b81e5d952ad810ca1de2b3e362b9a26a5cc77b4b75daf20caf69fb838751c32
https://huggingface.co/google/mt5-small/resolve/main/pytorch_model.bin not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmplpu5vm63
storing https://huggingface.co/google/mt5-small/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/8e7b2a80ddcb5611b27d8c89e1e8e33a947e105415051402a22b9c8d7d1caeb0.e22331f3a065b885b30ae3dd1ff11ccaf7fbc444485f6eb07ef5e0138bca8b70
creating metadata file for /root/.cache/huggingface/transformers/8e7b2a80ddcb5611b27d8c89e1e8e33a947e105415051402a22b9c8d7d1caeb0.e22331f3a065b885b30ae3dd1ff11ccaf7fbc444485f6eb07ef5e0138bca8b70
loading weights file https://huggingface.co/google/mt5-small/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/8e7b2a80ddcb5611b27d8c89e1e8e33a947e105415051402a22b9c8d7d1caeb0.e22331f3a065b885b30ae3dd1ff11ccaf7fbc444485f6eb07ef5e0138bca8b70
All model checkpoint weights were used when initializing MT5ForConditionalGeneration.

All the weights of MT5ForConditionalGeneration were initialized from the model checkpoint at google/mt5-small.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MT5ForConditionalGeneration for predictions without further training.
Using amp fp16 backend
***** Running training *****
  Num examples = 200000
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 125000
[2021-09-16 05:26:59.501 algo-1:31 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2021-09-16 05:26:59.621 algo-1:31 INFO profiler_config_parser.py:102] User has disabled profiler.
[2021-09-16 05:26:59.621 algo-1:31 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2021-09-16 05:26:59.622 algo-1:31 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2021-09-16 05:26:59.624 algo-1:31 INFO hook.py:255] Saving to /opt/ml/output/tensors
[2021-09-16 05:26:59.624 algo-1:31 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2021-09-16 05:26:59.881 algo-1:31 INFO hook.py:591] name:shared.weight count_params:128057344
[2021-09-16 05:26:59.881 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.0.SelfAttention.q.weight count_params:196608
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.0.SelfAttention.k.weight count_params:196608
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.0.SelfAttention.v.weight count_params:196608
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.0.SelfAttention.o.weight count_params:196608
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight count_params:192
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.0.layer_norm.weight count_params:512
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.1.DenseReluDense.wi_0.weight count_params:524288
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.1.DenseReluDense.wi_1.weight count_params:524288
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.1.DenseReluDense.wo.weight count_params:524288
[2021-09-16 05:26:59.882 algo-1:31 INFO hook.py:591] name:encoder.block.0.layer.1.layer_norm.weight count_params:512
[2021-09-16 05:26:59.883 algo-1:31 INFO hook.py:591] name:encoder.block.1.layer.0.SelfAttention.q.weight count_params:196608
[2021-09-16 05:26:59.883 algo-1:31 INFO hook.py:591] name:encoder.block.1.layer.0.SelfAttention.k.weight count_params:196608
[2021-09-16 05:26:59.883 algo-1:31 INFO hook.py:591] name:encoder.block.1.layer.0.SelfAttention.v.weight count_params:196608
[2021-09-16 05:26:59.883 algo-1:31 INFO hook.py:591] name:encoder.block.1.layer.0.SelfAttention.o.weight count_params:196608
[2021-09-16 05:26:59.883 algo-1:31 INFO hook.py:591] name:encoder.block.1.layer.0.layer_norm.weight count_params:512
[2021-09-16 05:26:59.883 algo-1:31 INFO hook.py:591] name:encoder.block.1.layer.1.DenseReluDense.wi_0.weight count_params:524288
name:encoder.block.5.layer.0.SelfAttention.q.weight count_params:196608
[2021-09-16 05:26:59.887 algo-1:31 INFO hook.py:591] name:encoder.block.5.layer.0.SelfAttention.k.weight count_params:196608
[2021-09-16 05:26:59.887 algo-1:31 INFO hook.py:591] name:encoder.block.5.layer.0.SelfAttention.v.weight count_params:196608
[2021-09-16 05:26:59.887 algo-1:31 INFO hook.py:591] name:encoder.block.5.layer.0.SelfAttention.o.weight count_params:196608
[2021-09-16 05:26:59.887 algo-1:31 INFO hook.py:591] ....arams:512
[2021-09-16 05:26:59.902 algo-1:31 INFO hook.py:591] name:decoder.block.7.layer.2.DenseReluDense.wi_0.weight count_params:524288
[2021-09-16 05:26:59.902 algo-1:31 INFO hook.py:591] name:decoder.block.7.layer.2.DenseReluDense.wi_1.weight count_params:524288
[2021-09-16 05:26:59.902 algo-1:31 INFO hook.py:591] name:decoder.block.7.layer.2.DenseReluDense.wo.weight count_params:524288
[2021-09-16 05:26:59.903 algo-1:31 INFO hook.py:591] name:decoder.block.7.layer.2.layer_norm.weight count_params:512
[2021-09-16 05:26:59.903 algo-1:31 INFO hook.py:591] name:decoder.final_layer_norm.weight count_params:512
[2021-09-16 05:26:59.903 algo-1:31 INFO hook.py:591] name:lm_head.weight count_params:128057344
[2021-09-16 05:26:59.903 algo-1:31 INFO hook.py:593] Total Trainable Params: 300176768
[2021-09-16 05:26:59.903 algo-1:31 INFO hook.py:425] Monitoring the collections: losses
[2021-09-16 05:26:59.906 algo-1:31 INFO hook.py:488] Hook is writing from the hook with pid: 31

{'loss': nan, 'learning_rate': 4.98664e-05, 'epoch': 0.02}
Saving model checkpoint to /opt/ml/model/checkpoint-500
Configuration saved in /opt/ml/model/checkpoint-500/config.json
Model weights saved in /opt/ml/model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /opt/ml/model/checkpoint-500/tokenizer_config.json
Special tokens file saved in /opt/ml/model/checkpoint-500/special_tokens_map.json
Copy vocab file to /opt/ml/model/checkpoint-500/spiece.model
{'loss': nan, 'learning_rate': 4.96664e-05, 'epoch': 0.04}
Saving model checkpoint to /opt/ml/model/checkpoint-1000
Configuration saved in /opt/ml/model/checkpoint-1000/config.json
Model weights saved in /opt/ml/model/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /opt/ml/model/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /opt/ml/model/checkpoint-1000/special_tokens_map.json
Copy vocab file to /opt/ml/model/checkpoint-1000/spiece.model
{'loss': nan, 'learning_rate': 4.9466400000000005e-05, 'epoch': 0.06}
Saving model checkpoint to /opt/ml/model/checkpoint-1500
Configuration saved in /opt/ml/model/checkpoint-1500/config.json
Model weights saved in /opt/ml/model/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in /opt/ml/model/checkpoint-1500/tokenizer_config.json
Special tokens file saved in /opt/ml/model/checkpoint-1500/special_tokens_map.json
Copy vocab file to /opt/ml/model/checkpoint-1500/spiece.model
{'loss': nan, 'learning_rate': 4.92664e-05, 'epoch': 0.08}
Saving model checkpoint to /opt/ml/model/checkpoint-2000
Configuration saved in /opt/ml/model/checkpoint-2000/config.json
Model weights saved in /opt/ml/model/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in /opt/ml/model/checkpoint-2000/tokenizer_config.json
Special tokens file saved in /opt/ml/model/checkpoint-2000/special_tokens_map.json
Copy vocab file to /opt/ml/model/checkpoint-2000/spiece.model


Hey @jackieliu930,

When using run_summarization.py with a T5-like model, you need to add one more hyperparameter, source_prefix: "summarize: " (see the sketch below).

Only the T5 models t5-small, t5-base, t5-large, t5-3b and t5-11b must use an additional argument: --source_prefix "summarize: ".

You can find more information about run_summarization.py here: https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization#with-trainer
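
For example, a minimal sketch of your hyperparameters dict with the prefix added (everything else unchanged):

hyperparameters = {
    'model_name_or_path': 'google/mt5-small',
    'dataset_name': 'amazon_reviews_multi',
    'dataset_config_name': 'zh',
    'source_prefix': 'summarize: ',  # prefix suggested above for T5-style checkpoints
    'output_dir': '/opt/ml/model',
    'do_train': True,
    'do_eval': True,
    'do_predict': True,
    'predict_with_generate': True,
    'num_train_epochs': 5,
    'learning_rate': 5e-5,
    'seed': 7,
    'fp16': True,
}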

Thanks! Actually, I found that the problem seems to come from fp16 support: when I update the hyperparameters and set fp16 to False, I get a finite loss value. May I know why?
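
A minimal sketch of how this can be checked outside SageMaker (assuming a local CUDA GPU with torch and transformers installed, and that autocast roughly matches what fp16=True enables in the Trainer):

import torch
from transformers import MT5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small").cuda()

# a single toy example in the style of a Chinese review
inputs = tokenizer("summarize: 这个产品很好用，物流也很快。", return_tensors="pt").to("cuda")
labels = tokenizer("很好用", return_tensors="pt").input_ids.to("cuda")

# full-precision forward pass -> finite loss
loss_fp32 = model(input_ids=inputs.input_ids, labels=labels).loss

# mixed-precision forward pass, roughly what fp16=True does during training
with torch.cuda.amp.autocast():
    loss_fp16 = model(input_ids=inputs.input_ids, labels=labels).loss

print("fp32 loss:", loss_fp32.item())
print("fp16 loss:", loss_fp16.item())  # can come out as inf/nan if fp16 overflows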

pinging @sgugger

Actually, @patrickvonplaten or @valhalla might know better.

Hi, any update on this one?