Issues saving and loading wav2vec2 models fine-tuned using DeepSpeed

Hi,

First off, thanks for all your hard work and publicly released code!

I’ve implemented the DeepSpeed integration for fine-tuning a wav2vec2 model here. I had to make a few small changes for it to run in my environment, but otherwise my code is identical to the example.

After training some toy models, I realized that I couldn’t load from the checkpoints, or save and reload the model, the way other fine-tuned wav2vec2 models can be saved and loaded via the *.from_pretrained() commands.

To reproduce the problem, add the following at the end of the run_asr.py script in the example repo mentioned above and run it on 2 GPUs, using the repo-provided ZeRO-3 config and the corresponding command:

tuned_model_dir = './my_tuned_model'
print('--> Saving fine tuned model with processor in', tuned_model_dir)
model.save_pretrained(tuned_model_dir)
processor.save_pretrained(tuned_model_dir)
print('--> Checking if model and processor were saved properly..')
model = Wav2Vec2ForCTC.from_pretrained(tuned_model_dir).to('cuda')
print('--> Worker {} Got Model!'.format(training_args.local_rank))
processor = Wav2Vec2Processor.from_pretrained(tuned_model_dir)
print('--> Worker {} Got Processor!'.format(training_args.local_rank))
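
A possible first check, sketched under assumptions rather than taken from the example repo: under ZeRO stage 3 the parameters are partitioned across workers, and a DeepSpeed config can carry a flag asking for the full weights to be gathered when the model is saved. The helper below (the name `needs_gather_flag` is hypothetical) reads a config dict and reports whether stage 3 is enabled without such a flag. The key spellings are assumptions that depend on the installed DeepSpeed release: older releases are assumed to use `stage3_gather_fp16_weights_on_model_save`, newer ones `stage3_gather_16bit_weights_on_model_save`.

```python
# Assumed key spellings; which one applies depends on the DeepSpeed release.
GATHER_KEYS = (
    "stage3_gather_fp16_weights_on_model_save",
    "stage3_gather_16bit_weights_on_model_save",
)


def needs_gather_flag(ds_config: dict) -> bool:
    """Return True if the config enables ZeRO stage 3 but sets no gather-on-save flag."""
    zero = ds_config.get("zero_optimization", {})
    if zero.get("stage") != 3:
        # Parameters are not partitioned below stage 3, so no flag is needed.
        return False
    # Only an explicit boolean True counts; strings such as "auto" do not.
    return not any(zero.get(key) is True for key in GATHER_KEYS)


if __name__ == "__main__":
    # Illustrative config, not the contents of the repo-provided file.
    example = {"zero_optimization": {"stage": 3}}
    print(needs_gather_flag(example))
```

To run it against the real file, load `ds_config_wav2vec2_zero3.json` with `json.load` and pass the resulting dict to the helper. A `True` result would only be a hint worth following up, not a confirmed diagnosis.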

I’m including the log below, starting right after training finishes successfully. Interestingly, the print statements in the log show that one worker (worker 1) seems to load the model successfully while the other does not. How can this be?

I’ve also tried several other ways of saving and loading the fine-tuned model, and all of them failed. These include saving the model at the end of training and then attempting to load it offline (which again fails with a size mismatch, reporting that the checkpointed layers all curiously have shape [1]), as well as the more complicated save and load procedures outlined here and here (most of which fail for the same reason).
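
One way to confirm that the saved file itself holds shape-[1] tensors, independent of from_pretrained(), is to scan the checkpoint's state dict. This is a minimal sketch: the helper name `find_placeholder_params` is hypothetical, and treating shape `(1,)` as a marker for an unsaved weight is a heuristic based only on the error messages in this post. The demo uses stand-in objects so it runs without PyTorch.

```python
from types import SimpleNamespace
from typing import Any, List, Mapping, Sequence


def find_placeholder_params(
    state_dict: Mapping[str, Any],
    placeholder_shape: Sequence[int] = (1,),
) -> List[str]:
    """Return names of entries whose shape equals the suspected placeholder shape."""
    target = tuple(placeholder_shape)
    return [
        name
        for name, tensor in state_dict.items()
        if tuple(tensor.shape) == target
    ]


# Stand-ins for tensors (illustrative values, not read from a real checkpoint).
fake_state_dict = {
    "lm_head.weight": SimpleNamespace(shape=(1,)),
    "lm_head.bias": SimpleNamespace(shape=(32,)),
}
print(find_placeholder_params(fake_state_dict))  # → ['lm_head.weight']
```

With PyTorch installed, the mapping returned by `torch.load("./my_tuned_model/pytorch_model.bin", map_location="cpu")` can be passed in directly, since `torch.Size` converts to a tuple.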

Any help would be greatly appreciated! The model trains, I’m most of the way there… I just can’t use it haha. Thanks in advance!

Log:

Training completed. Do not forget to share your model on huggingface.co/models =)

--> Saving fine tuned model with processor in ./my_tuned_model
{'train_runtime': 97.5809, 'train_samples_per_second': 5.124, 'train_steps_per_second': 1.281, 'train_loss': 304.0380625, 'epoch': 1.0}
100%|██████████| 125/125 [01:37<00:00, 1.28it/s]
--> Saving fine tuned model with processor in ./my_tuned_model
Configuration saved in ./my_tuned_model/config.json
Model weights saved in ./my_tuned_model/pytorch_model.bin
Configuration saved in ./my_tuned_model/preprocessor_config.json
tokenizer config file saved in ./my_tuned_model/tokenizer_config.json
Special tokens file saved in ./my_tuned_model/special_tokens_map.json
--> Checking if model and processor were saved properly..
loading configuration file ./my_tuned_model/config.json
Model config Wav2Vec2Config {
  "_name_or_path": "./",
  "activation_dropout": 0.0,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 256,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": false,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "sum",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": false,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_norm": "group",
  "feat_proj_dropout": 0.1,
  "feat_quantizer_dropout": 0.0,
  "final_dropout": 0.0,
  "freeze_feat_extract_train": true,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "layerdrop": 0.05,
  "mask_channel_length": 10,
  "mask_channel_min_space": 1,
  "mask_channel_other": 0.0,
  "mask_channel_prob": 0.0,
  "mask_channel_selection": "static",
  "mask_feature_length": 10,
  "mask_feature_prob": 0.0,
  "mask_time_length": 10,
  "mask_time_min_space": 1,
  "mask_time_other": 0.0,
  "mask_time_prob": 0.05,
  "mask_time_selection": "static",
  "model_type": "wav2vec2",
  "no_mask_channel_overlap": false,
  "no_mask_time_overlap": false,
  "num_attention_heads": 12,
  "num_codevector_groups": 2,
  "num_codevectors_per_group": 320,
  "num_conv_pos_embedding_groups": 16,
  "num_conv_pos_embeddings": 128,
  "num_feat_extract_layers": 7,
  "num_hidden_layers": 12,
  "num_negatives": 100,
  "pad_token_id": 0,
  "proj_codevector_dim": 256,
  "torch_dtype": "float32",
  "transformers_version": "4.12.0.dev0",
  "use_weighted_layer_sum": false,
  "vocab_size": 32
}

loading weights file ./my_tuned_model/pytorch_model.bin
--> Checking if model and processor were saved properly..
Detected DeepSpeed ZeRO-3: activating zero.init() for this model
Traceback (most recent call last):
File "run_asr.py", line 576, in <module>
main()
File "run_asr.py", line 566, in main
model = Wav2Vec2ForCTC.from_pretrained(tuned_model_dir).to('cuda')
File "/home/users/aerdmann/repos/others/transformers/src/transformers/modeling_utils.py", line 1442, in from_pretrained
\n
Worker 1 Got Model!

model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_state_dict_into_model(

File "/home/users/aerdmann/repos/others/transformers/src/transformers/modeling_utils.py", line 1594, in _load_state_dict_into_model
raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for Wav2Vec2ForCTC:
size mismatch for wav2vec2.feature_extractor.conv_layers.1.conv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3]).
size mismatch for wav2vec2.feature_extractor.conv_layers.2.conv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3]).
size mismatch for wav2vec2.feature_extractor.conv_layers.3.conv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3]).
size mismatch for wav2vec2.feature_extractor.conv_layers.4.conv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([512, 512, 3]).
size mismatch for wav2vec2.feature_extractor.conv_layers.5.conv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([512, 512, 2]).
size mismatch for wav2vec2.feature_extractor.conv_layers.6.conv.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([512, 512, 2]).
size mismatch for wav2vec2.feature_projection.projection.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 512]).
size mismatch for wav2vec2.encoder.pos_conv_embed.conv.weight_v: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 48, 128]).
size mismatch for wav2vec2.encoder.layers.0.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.0.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.0.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.0.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.0.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.0.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.1.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.1.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.1.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.1.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.1.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.1.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.2.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.2.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.2.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.2.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.2.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.2.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.3.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.3.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.3.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.3.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.3.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.3.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.4.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.4.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.4.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.4.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.4.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.4.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.5.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.5.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.5.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.5.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.5.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.5.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.6.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.6.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.6.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.6.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.6.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.6.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.7.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.7.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.7.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.7.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.7.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.7.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.8.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.8.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.8.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.8.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.8.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.8.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.9.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.9.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.9.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.9.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.9.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.9.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.10.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.10.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.10.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.10.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.10.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.10.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for wav2vec2.encoder.layers.11.attention.k_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.11.attention.v_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.11.attention.q_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.11.attention.out_proj.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 768]).
size mismatch for wav2vec2.encoder.layers.11.feed_forward.intermediate_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([3072, 768]).
size mismatch for wav2vec2.encoder.layers.11.feed_forward.output_dense.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([768, 3072]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([32, 768]).
\n
Worker 1 Got Processor!

Killing subprocess 2105751
Killing subprocess 2105752
Traceback (most recent call last):
File "/home/software/kdd/Anaconda3.8-2021.05/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/software/kdd/Anaconda3.8-2021.05/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/software/kdd/Anaconda3.8-2021.05/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
main()
File "/home/software/kdd/Anaconda3.8-2021.05/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/software/kdd/Anaconda3.8-2021.05/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/software/kdd/Anaconda3.8-2021.05/bin/python', '-u', 'run_asr.py', '--local_rank=1', '--output_dir=./wav2vec2-base-timit-asr', '--num_train_epochs=1', '--per_device_train_batch_size=2', '--per_device_eval_batch_size=2', '--evaluation_strategy=steps', '--save_steps=500', '--eval_steps=100', '--logging_steps=25', '--learning_rate=5e-4', '--warmup_steps=3000', '--model_name_or_path=facebook/wav2vec2-base', '--dataset_name=timit_asr', '--dataset_config_name=clean', '--train_split_name=train', '--validation_split_name=test', '--orthography=timit', '--preprocessing_num_workers=1', '--group_by_length', '--freeze_feature_extractor', '--deepspeed', 'ds_config_wav2vec2_zero3.json']' returned non-zero exit status 1.
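
Because the traceback repeats the same message for dozens of parameters, a small script can condense it. The sketch below assumes the "size mismatch for ..." line format shown in the log above; `summarize_mismatches` is a hypothetical helper that counts how many parameters were stored with each checkpoint shape.

```python
import re
from collections import Counter

# Matches one "size mismatch" line as it appears in the RuntimeError message.
MISMATCH_RE = re.compile(
    r"size mismatch for (?P<name>\S+): copying a param with shape "
    r"torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\) from checkpoint, "
    r"the shape in current model is torch\.Size\(\[(?P<model>[\d, ]*)\]\)"
)


def parse_shape(text: str) -> tuple:
    """Turn a string such as '768, 3072' into the tuple (768, 3072)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def summarize_mismatches(log_text: str) -> Counter:
    """Count mismatched parameters, grouped by the shape found in the checkpoint."""
    counts = Counter()
    for match in MISMATCH_RE.finditer(log_text):
        counts[parse_shape(match.group("ckpt"))] += 1
    return counts


# Two lines copied from the log above.
sample = (
    "size mismatch for wav2vec2.feature_projection.projection.weight: copying a param "
    "with shape torch.Size([1]) from checkpoint, the shape in current model is "
    "torch.Size([768, 512]).\n"
    "size mismatch for lm_head.weight: copying a param with shape torch.Size([1]) "
    "from checkpoint, the shape in current model is torch.Size([32, 768]).\n"
)
print(summarize_mismatches(sample))  # → Counter({(1,): 2})
```

If every mismatch lands in the `(1,)` bucket, as it does for these two lines, that is consistent with the checkpoint holding placeholder tensors rather than trained weights.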