Inconsistent SpeechT5 sinusoidal positional embedding weight tensor shape across fine-tuning sessions

I am fine-tuning SpeechT5 on my own dataset, using Accelerate for two-GPU parallelism, and I ran into an issue with SpeechT5SinusoidalPositionalEmbedding in SpeechT5TextDecoderPrenet. According to the SpeechT5 configuration, embed_positions should have a 604x768 weight tensor. However, the weight in the saved state_dict had a different shape: with the same code, one fine-tuning session saved a 691x768 tensor and another saved 607x768. This did not happen when I ran a quick check with 1 epoch of 2 iterations; it only happens in a full-dataset, full-length fine-tuning session. It puzzles me. Could anyone shed some light on what the cause could be?
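For reference, this is roughly how I compare the saved weight against what the configuration implies. The checkpoint name, file path, and state_dict key below are placeholders from my setup, and the expected-size formula is just my reading of modeling_speecht5.py, so please correct me if it is off:

```python
import torch
from transformers import SpeechT5Config, SpeechT5ForSpeechToText

# Placeholder checkpoint name; I substitute the one I actually fine-tune from.
checkpoint = "microsoft/speecht5_asr"
config = SpeechT5Config.from_pretrained(checkpoint)

# My reading of modeling_speecht5.py: the sinusoidal table is created with
# max_text_positions + pad_token_id + 1 entries plus an internal offset of 2,
# which matches the 604 rows I see with my config.
expected_rows = config.max_text_positions + config.pad_token_id + 1 + 2
print("expected :", (expected_rows, config.hidden_size))

# Shape right after loading the pretrained model.
model = SpeechT5ForSpeechToText.from_pretrained(checkpoint)
print("loaded   :", tuple(model.speecht5.decoder.prenet.embed_positions.weights.shape))

# Shape inside a checkpoint saved by my training script (path and key are placeholders).
state_dict = torch.load("checkpoint/pytorch_model.bin", map_location="cpu")
print("saved    :", tuple(state_dict["speecht5.decoder.prenet.embed_positions.weights"].shape))
```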


The issue you’re facing with SpeechT5SinusoidalPositionalEmbedding seems related to how the positional embeddings are handled when fine-tuning on multiple GPUs.

Here are some tips for your fine-tuning:

  1. Multiple GPUs: When using two GPUs, each process may see batches with different sequence lengths, which could lead to mismatched tensor sizes. Make sure the model and optimizer states are properly synchronized across the GPUs.
  2. Saving and loading state: The problem could come from how the state is saved and loaded between sessions. Ensure the state_dict is being saved and loaded with the right settings, and that the positional embedding layer isn’t being replaced or resized unexpectedly.
  3. Positional embedding shape: The size of the positional embedding table can change with the input sequence length, since the sinusoidal table is regenerated on the fly whenever a sequence longer than the current table comes through (see the sketch after this list). If sequence lengths vary in your longer fine-tuning sessions, that would explain the mismatched shapes.
  4. Batch size and DataLoader: When fine-tuning on the full dataset, the batch sizes or sequence lengths may differ from those in your quick check, causing inconsistencies in the embeddings. Make sure the sequence lengths are bounded consistently across epochs.
  5. Training configuration: If you’re using techniques like gradient accumulation or layer freezing, check that the embeddings are being updated consistently.
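To make point 3 concrete, here is a simplified sketch of the behavior as I understand it from the Hugging Face implementation (this is not the library code, just the gist): the table starts at the size implied by the config and is rebuilt at a larger size whenever a longer sequence arrives, and because it is stored as a frozen parameter, the state_dict records whatever size it has grown to.

```python
import math
import torch
import torch.nn as nn


class ToySinusoidalPositionalEmbedding(nn.Module):
    """Simplified sketch of the 'extend on demand' behavior (not the real SpeechT5 code)."""

    def __init__(self, num_positions: int, embedding_dim: int):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.make_weights(num_positions)

    def make_weights(self, num_embeddings: int):
        # Rebuild the whole sinusoidal table at the requested size.
        half_dim = self.embedding_dim // 2
        freqs = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000.0) / (half_dim - 1)))
        angles = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * freqs.unsqueeze(0)
        table = torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)
        # Stored as a frozen parameter, so the state_dict records its *current* size.
        self.weights = nn.Parameter(table, requires_grad=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        seq_len = input_ids.size(1)
        if seq_len > self.weights.size(0):
            # A longer sequence than the table covers: regrow the table.
            self.make_weights(seq_len)
        return self.weights[:seq_len].detach()


emb = ToySinusoidalPositionalEmbedding(num_positions=604, embedding_dim=768)
print(emb.state_dict()["weights"].shape)    # torch.Size([604, 768])

emb(torch.zeros(1, 691, dtype=torch.long))  # one long batch goes through
print(emb.state_dict()["weights"].shape)    # torch.Size([691, 768]) -- a checkpoint saved now differs
```

If that matches the real implementation, a single longer-than-expected sequence in a full run (but not in a 2-iteration smoke test) is enough to change the shape of the saved weight, independently of GPUs, Accelerate, or batch size.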

Thank you, Alan, for your insightful response. If I understood correctly, intermediate (transient) tensor shapes, including the positional encodings added to the inputs, may change with the input shape. But here it is a model parameter weight that changed. I looked into the source code, and I would not expect the weight to change once it is initialized and running, unless the weight node is dynamically replaced in the running graph, which would all be happening under the hood. Otherwise it should not depend on the number of GPUs, the input shape, or the batch size. As for the remaining points, I loaded the pretrained model with Hugging Face’s from_pretrained API, which hides all the configuration details, so that leaves the saving and loading mechanism as the only suspect. I relied on the Accelerate implementation for that. Could Accelerate somehow get it wrong? Then again, I have used the same source code with other models, e.g., Whisper, without this problem. It is puzzling. Could you please correct me if I misunderstood something here? Thank you.
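One way I am thinking of narrowing this down (a rough sketch; the attribute path is my assumption for SpeechT5ForSpeechToText and may need adjusting for another model class): log the shape of that parameter on every rank at the end of each epoch and right before saving. If it already grows during training, Accelerate’s saving is off the hook; if it only differs in the file on disk, the saving path is the suspect.

```python
from accelerate import Accelerator


def log_embed_positions_shape(accelerator: Accelerator, model, tag: str) -> None:
    """Print the current size of the decoder prenet's sinusoidal table on every rank.

    The attribute path below (speecht5.decoder.prenet.embed_positions) is my
    assumption for SpeechT5ForSpeechToText; adjust it to your model class.
    """
    unwrapped = accelerator.unwrap_model(model)
    weights = unwrapped.speecht5.decoder.prenet.embed_positions.weights
    # Print from every process so a rank-specific resize would show up.
    print(f"[rank {accelerator.process_index}] {tag}: embed_positions.weights = {tuple(weights.shape)}")


# Intended usage inside the existing training loop (placeholders for my script):
#   log_embed_positions_shape(accelerator, model, f"end of epoch {epoch}")
#   log_embed_positions_shape(accelerator, model, "right before saving")
```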
