Seq2SeqTrainer: enabled must be a bool (got NoneType)

Hi! I ran into this bug when running Seq2SeqTrainer and don’t know how to tackle this. Can someone help me look into it a bit? Thank you so much!

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="/home/sivan/whisper_base_fl_ch",
    per_device_train_batch_size=128,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    group_by_length=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    disable_tqdm=True,
)

#%%
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=fleurs_ch["train"],
    eval_dataset=fleurs_ch["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
#%%
trainer.train()

And the error output goes:

***** Running training *****
  Num examples = 3246
  Num Epochs = 1334
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 1024
  Gradient Accumulation steps = 8
  Total optimization steps = 4000
  Number of trainable parameters = 72593920

TypeError                                 Traceback (most recent call last)
Cell In [49], line 1
----> 1 trainer.train()

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:1515, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1510     self.model_wrapped = self.model
   1512 inner_training_loop = find_executable_batch_size(
   1513     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1514 )
-> 1515 return inner_training_loop(
   1516     args=args,
   1517     resume_from_checkpoint=resume_from_checkpoint,
   1518     trial=trial,
   1519     ignore_keys_for_eval=ignore_keys_for_eval,
   1520 )

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:1763, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1761         tr_loss_step = self.training_step(model, inputs)
   1762 else:
-> 1763     tr_loss_step = self.training_step(model, inputs)
   1765 if (
   1766     args.logging_nan_inf_filter
   1767     and not is_torch_tpu_available()
   1768     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1769 ):
   1770     # if loss is nan or inf simply add the average of previous logged losses
   1771     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.local/lib/python3.9/site-packages/transformers/trainer.py:2521, in Trainer.training_step(self, model, inputs)
   2518     loss_mb = smp_forward_backward(model, inputs, self.args.gradient_accumulation_steps)
   2519     return loss_mb.reduce_mean().detach().to(self.args.device)
-> 2521 with self.compute_loss_context_manager():
   2522     loss = self.compute_loss(model, inputs)
   2524 if self.args.n_gpu > 1:

File ~/.local/lib/python3.9/site-packages/transformers/utils/generic.py:333, in ContextManagers.__enter__(self)
    331 def __enter__(self):
    332     for context_manager in self.context_managers:
--> 333         self.stack.enter_context(context_manager)

File /opt/conda/envs/pytorch_env/lib/python3.9/contextlib.py:448, in _BaseExitStack.enter_context(self, cm)
    446 _cm_type = type(cm)
    447 _exit = _cm_type.__exit__
--> 448 result = _cm_type.__enter__(cm)
    449 self._push_cm_exit(cm, _exit)
    450 return result

File ~/.local/lib/python3.9/site-packages/torch/autocast_mode.py:177, in autocast.__enter__(self)
    175     torch.set_autocast_enabled(self._enabled)
    176     torch.autocast_increment_nesting()
--> 177 torch.set_autocast_cache_enabled(self._cache_enabled)

TypeError: enabled must be a bool (got NoneType)

Hey @navissivan! Could you run the command:

transformers-cli env

(or !transformers-cli env in a notebook)

and copy and paste the output?

My thinking is that you might be running on a device that does not support automatic mixed precision (AMP) training (see the Automatic Mixed Precision package - torch.amp page in the PyTorch 1.13 documentation).
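
For a quick sanity check (this snippet is only an illustration, not something from the original post), you can also verify that CUDA is visible to PyTorch and that the AMP autocast context manager can be entered on your setup:

import torch

print(torch.__version__)
print(torch.cuda.is_available())

# entering and leaving the autocast context is usually enough to surface AMP problems
with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
    pass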


Hi @sanchit-gandhi , thanks for the reply!
The output of the command is:

sivan@t4:~$ transformers-cli env
2022-11-08 22:08:02.037641: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-08 22:08:03.539861: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-08 22:08:05.610842: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2022-11-08 22:08:05.611030: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2022-11-08 22:08:05.611058: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
WARNING:tensorflow:From /home/sivan/.local/lib/python3.9/site-packages/transformers/commands/env.py:52: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2022-11-08 22:08:12.661689: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-08 22:08:12.671034: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-11-08 22:08:12.773751: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2022-11-08 22:08:12.783037: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 4.25.0.dev0
- Platform: Linux-4.19.0-22-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.13
- Huggingface_hub version: 0.10.1
- PyTorch version (GPU?): 1.10.0+cu102 (True)
- Tensorflow version (GPU?): 2.10.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Could you try with fp16=False in the Training Arguments?
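
In other words, keep the same training arguments as before and just flip the flag (a minimal sketch, everything else unchanged):

training_args = Seq2SeqTrainingArguments(
    output_dir="/home/sivan/whisper_base_fl_ch",
    # ... all other arguments as in the original snippet ...
    fp16=False,  # disable mixed precision to work around the autocast TypeError
)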

Thanks! This works! Still got a few questions regarding the training process:

  1. The training proceeds with a lot of warnings like: use_cache = True is incompatible with gradient checkpointing. Setting use_cache = False.
  2. After running trainer.train(), the GPU and output stay idle for about 5 minutes before training starts. Is this normal?
  3. The output shows things like {'loss': 1.7758, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.12}, but since this training takes a long time, is there any way to see a progress bar indicating how much time is left? Also, is there any way to run the notebook in the background (Linux), in case exiting the notebook terminates the training?
  4. And how can I generate a printout table like the one at the end of the post?
  5. I forgot to add a validation set to the trainer. Is there any way I can run all the saved checkpoint models on the validation set once?

I also get the same warning as in 1.
Anyway, I have a problem in the evaluation step with a tensor mismatch, with this message:
“The size of tensor a (xxx) must match the size of tensor b (448) at non-singleton dimension 1”

So, @sanchit-gandhi, how can I fix this problem?

Hey @navissivan and @ksoky,

  1. Don’t worry about the use_cache warning, it just means that we cannot use the k,v cache for the attention mechanism with gradient checkpointing. If you want to disable the warning, load the model and then set use_cache to False:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.use_cache = False

The operation of the model is the same with and without the cache - we just use the cache to speed up decoding. The cache isn’t compatible with gradient checkpointing, so it’s disabled by the Trainer and a warning is shown instead.

  2. It shouldn’t stay idle for that long - usually this happens when we set group_by_length=True but haven’t specified input_lengths in our prepare_dataset function. Have you modified the prepare_dataset function? Could you make sure the dataset that you pass to the trainer has the input_lengths column?

  3. A progress bar should show - you need to set disable_tqdm=False in your training args.

You have a couple of options for running it in the background:

  • tmux: call tmux and then run Jupyter notebooks from the tmux shell:
tmux new -s mysession
jupyter lab

Then run your shell as normal. The process will continue running even when you close your shell. When you re-open your shell, you can reattach through:

tmux a -t mysession

Check out the docs for more info.

  • The other option is to export the ipynb notebook as a python script, and then run it using tmux or nohup:
    From File → Export Notebook As… in the Jupyter Lab menu select ‘Export Notebook to Executable Script’. This will give you a Python script to download. Then run it using tmux (as above) or nohup:
nohup python fine-tuning-whisper.py

You can open a new window to view the output:

vim nohup.out
  4. The table is generated automatically by the Trainer if you perform evaluation over the course of training.

  5. It’s possible. The model checkpoint saved at step 1000 is stored in the output directory under /home/sivan/whisper_base_fl_ch/checkpoint-1000.
    You can load the model from this checkpoint as follows:

model = WhisperForConditionalGeneration.from_pretrained("/home/sivan/whisper_base_fl_ch/checkpoint-1000")

You can then run a validation step:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="/home/sivan/whisper_base_fl_ch/validation_step",
    do_train=False,
    do_eval=True,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_strategy="no",
    report_to=["tensorboard"],
    push_to_hub=False,
    disable_tqdm=False,
)

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    eval_dataset=fleurs_ch["validation"],  # set to your val set
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

trainer.evaluate()

You can then repeat this for the checkpoints in directories checkpoint-2000, checkpoint-3000 and so on.
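
If it helps, here’s a rough sketch (reusing the objects defined earlier in this thread, and assuming checkpoints were saved every 1000 steps up to 4000) that loops over the saved checkpoints and collects the validation metrics for each one:

from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer

all_results = {}
for step in [1000, 2000, 3000, 4000]:
    checkpoint_dir = f"/home/sivan/whisper_base_fl_ch/checkpoint-{step}"
    # load the model weights saved at this training step
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint_dir)
    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        eval_dataset=fleurs_ch["validation"],  # set to your val set
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        tokenizer=processor.feature_extractor,
    )
    all_results[step] = trainer.evaluate()

print(all_results)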


Sorry again @sanchit-gandhi,

I usually get the error at: transformers/modeling_whisper.py at main · huggingface/transformers · GitHub

The error message is “RuntimeError: The size of tensor a (674) must match the size of tensor b (448) at non-singleton dimension 1”

Do you have a clue about this error?

Hey @ksoky - it’s very hard to help without the full traceback or a reproducible code snippet! Would you mind opening a new post on the forum and we can move the discussion there? If you include a code snippet to reproduce I can certainly look into it!

Hi @sanchit-gandhi, thanks sooooo much for the detailed answers!!!

Got some follow-up questions regarding:
2. I mostly used the code you posted, and the original prepare_dataset doesn’t have an input_length column. How do I add that column, and why do I need group_by_length=True?
4. I don’t think I got the table output; I just got stuff like:

Loading best model from /home/sivan/whisper_base_fl_ch/checkpoint-3000 (score: 93.11493614658522).

TrainOutput(global_step=4000, training_loss=0.10508605905435979, metrics={'train_runtime': 19400.6337, 'train_samples_per_second': 3.299, 'train_steps_per_second': 0.206, 'train_loss': 0.10508605905435979, 'epoch': 19.7})
  5. If I want to do train/val/eval (train on the train set, choose the model on the val set, then run a final evaluation on the eval set), what I need is to run trainer.train() with train_dataset=train, eval_dataset=val first, and then run trainer.evaluate() with eval_dataset=eval, right? Is this the correct way to evaluate the fine-tuned model?

And some new questions:

  1. Does the normalizer matter during training? I didn’t see any normalizer in your code, but my fine-tuned model’s results on the test set are really bad, so I checked the pretrained model on the test set again and found out the performance drop was due to the normalizer, which means I have to make a custom normalizer to apply to the decoded predictions. How do I include this normalizer function during train/val/eval for the trainer? I assume I need to specify it in compute_metrics:
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # normalizer
    pred_str = custom_normalizer(pred_str, "zh")

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

And here tokenizer.batch_decode is the same as processor.batch_decode, right? Could you help me check if this is the right way to add the normalizer in inference?

  2. How do I continue training from a checkpoint?
  3. When I run trainer.evaluate() as you mentioned, I didn’t get a correct progress bar, and I don’t get the metric results printed out either:

Hey @navissivan

  1. Sorry, I thought about it and we don’t need group_by_length: group by length sorts together samples of roughly the same length in the training dataset (to minimize padding applied and be more efficient). But since all of our samples are padded/truncated to 30s by the Whisper feature extractor, the padding is the same for all samples. Long story short, set group_by_length=False. This will mean training starts immediately! I’ve updated the template Colab to reflect this.

  2. Oh that’s strange - is Trainer definitely performing evaluation? Do you see the message RUNNING EVALUATION pop up on the traceback? The progress bar will definitely show if you run it as a python script - might be something to do with the notebook environment. Could you also check the “README.md” in your output directory? The table should have saved there as well :slight_smile:

  3. You set the Trainer as follows with your train and val set:

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=fleurs_ch["train"],
    eval_dataset=fleurs_ch["validation"],  # validation set
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

And then after training, you run an extra ‘prediction’ step on your test set:

predict_results = trainer.predict(fleurs_ch["test"], metric_key_prefix="test")
metrics = predict_results.metrics
trainer.log_metrics("test", metrics)
trainer.save_metrics("test", metrics)

New q’s:

  1. I didn’t include a normaliser to keep the example streamlined. You can certainly include one if you don’t care about casing or punctuation in your transcriptions. Yep, this is the way to do it, but make sure you apply the normaliser to your label string and prediction string:
    # normalizer
    pred_str = custom_normalizer(pred_str, "zh")
    label_str = custom_normalizer(label_str, "zh")

And here the tokenizer.batch_decode is the same as processor.batch_decode right?

Correct!

How do I continue training from a checkpoint?

This has already been asked before :wink:
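
(For reference, a minimal sketch: resume_from_checkpoint is a standard argument of trainer.train, so resuming looks roughly like this.)

# resume from a specific checkpoint directory
trainer.train(resume_from_checkpoint="/home/sivan/whisper_base_fl_ch/checkpoint-1000")

# or resume from the most recent checkpoint found in output_dir
trainer.train(resume_from_checkpoint=True)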


Thanks for your patience and all the help!

Oh that’s strange - is Trainer definitely performing evaluation? Do you see the message RUNNING EVALUATION pop up on the traceback? The progress bar will definitely show if you run it as a python script - might be something to do with the notebook environment. Could you also check the “README.md” in your output directory? The table should have saved there as well

Regarding the trainer output: I’m using a Jupyter notebook, and the evaluation is definitely running (I saw the message), but the progress bar is not moving, and there’s no README.md file in the output directory (which is output_dir="/home/sivan/whisper_base_fl_ch/validations").
Anyway, this is a trivial issue for me now; my workaround is to manually collect and save the returned metrics results:

results.append(trainer.evaluate())
results = pd.DataFrame(results)
results.to_csv(os.path.join(os.getcwd(), 'wer_fl_ch.csv'))

And one more question about the normalizer: since it’s only used for calculating the WER, I assume it shouldn’t affect the training process at all, which means the punctuation would be part of the training task via the input features and label ids?

Hi @sanchit-gandhi, sorry to bother you again, but I started fine-tuning on a new dataset, followed everything just like before, and ran into a trainer issue I don’t know what to do about:

RuntimeError: The size of tensor a (462) must match the size of tensor b (448) at non-singleton dimension 1

I opened a new post here.

I’d appreciate it a lot if you could help me look a bit. Thanks!

Hey @navissivan

You’re right, the punctuation will be part of the training task. This makes the task of speech recognition harder, but does mean your model will predict casing and punctuation. You can return two WER metrics if you want, one without normalisation and one with:

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    pred_str_norm = custom_normalizer(pred_str, "zh")
    label_str_norm = custom_normalizer(label_str, "zh")

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    wer_norm = 100 * metric.compute(predictions=pred_str_norm, references=label_str_norm)

    return {"wer": wer, "wer_norm": wer_norm}

This will give you both WER metrics, unnormalised and normalised. So you can see the effect normalisation has on the WER.

If you want to remove punctuation and casing for training, just apply the normaliser in your prepare_dataset function:

def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute log-Mel input features from input audio array 
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids 
    norm_text = normaliser(batch["sentence"])
    batch["labels"] = tokenizer(norm_text).input_ids
    return batch
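
As a usage sketch (assuming fleurs_ch is the DatasetDict prepared earlier and normaliser is your custom normaliser), you would then map this over the dataset as usual:

fleurs_ch = fleurs_ch.map(
    prepare_dataset,
    remove_columns=fleurs_ch.column_names["train"],
)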

Also, this is how you should check whether to use fp16 or not, by checking if you have APEX for mixed-precision training:

from transformers import is_apex_available
import torch

# the device the script will run on
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

def is_cuda_and_apex_available():
    is_using_cuda = torch.cuda.is_available() and torch_device == "cuda"
    return is_using_cuda and is_apex_available()

use_fp16 = is_cuda_and_apex_available()

print(use_fp16)

If use_fp16 is True, you can set fp16=True in your training args for mixed-precision training. Otherwise, you cannot use fp16.
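
You could then feed the result straight into the training arguments, for example (a sketch, with the other arguments as before):

training_args = Seq2SeqTrainingArguments(
    output_dir="/home/sivan/whisper_base_fl_ch",
    # ... other arguments as before ...
    fp16=use_fp16,
)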


I also came here with the same issue, model.config.use_cache = False seems to have solved it.

As a philosophical point:

  1. caching downloaded models, datasets = good
  2. caching intermediate results of dataset.map/model = not-good, causes lots of confusion & makes disk IO the bottleneck

In particular, enabling caching in (2) BY DEFAULT is not a good choice; I really wish HuggingFace would change this behavior for both datasets and transformers.

Working code >>> fast code. If I want to speed something up, I will try to enable caching.