Inconsistent SpeechT5 sinusoidal positional embedding weight tensor shape across fine-tuning sessions

I am fine-tuning SpeechT5 on my own dataset, using Accelerate for two-GPU parallelism, and I ran into an issue with SpeechT5SinusoidalPositionalEmbedding in SpeechT5TextDecoderPrenet. According to the SpeechT5 configuration, embed_positions should have a 604x768 weight tensor. However, the weight in the saved state_dict had a different shape: with the same code, one fine-tuning session saved a 691x768 tensor and another saved 607x768. This did not happen when I ran a quick check with 1 epoch of 2 iterations; it only happens in a full-dataset, full-length fine-tuning session. It puzzles me. Could anyone shed some light on what the cause could be?
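For reference, this is roughly how I compare the saved weight against what the configuration implies. The checkpoint name, file path, and state_dict key below are placeholders from my setup, and the expected-size formula is just my reading of modeling_speecht5.py, so please correct me if it is off:

```python
import torch
from transformers import SpeechT5Config, SpeechT5ForSpeechToText

# Placeholder checkpoint name; I substitute the one I actually fine-tune from.
checkpoint = "microsoft/speecht5_asr"
config = SpeechT5Config.from_pretrained(checkpoint)

# My reading of modeling_speecht5.py: the sinusoidal table is created with
# max_text_positions + pad_token_id + 1 entries plus an internal offset of 2,
# which matches the 604 rows I see with my config.
expected_rows = config.max_text_positions + config.pad_token_id + 1 + 2
print("expected :", (expected_rows, config.hidden_size))

# Shape right after loading the pretrained model.
model = SpeechT5ForSpeechToText.from_pretrained(checkpoint)
print("loaded   :", tuple(model.speecht5.decoder.prenet.embed_positions.weights.shape))

# Shape inside a checkpoint saved by my training script (path and key are placeholders).
state_dict = torch.load("checkpoint/pytorch_model.bin", map_location="cpu")
print("saved    :", tuple(state_dict["speecht5.decoder.prenet.embed_positions.weights"].shape))
```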


The issue you’re facing with SpeechT5SinusoidalPositionalEmbedding seems related to how the positional embeddings are handled when fine-tuning on multiple GPUs.

Here are some tips for your fine-tuning:

  1. Multiple GPUs: When using two GPUs, each process may see batches with different sequence lengths, which could lead to mismatched tensor sizes. Make sure the model and optimizer states are properly synchronized across the GPUs.
  2. Saving and loading state: The problem could come from how the state is saved and loaded between sessions. Ensure the state_dict is being saved and loaded with the right settings, and that the positional embedding layer isn’t being replaced or resized unexpectedly.
  3. Positional embedding shape: The size of the positional embedding table can change with the input sequence length, since the sinusoidal table is regenerated on the fly whenever a sequence longer than the current table comes through (see the sketch after this list). If sequence lengths vary in your longer fine-tuning sessions, that would explain the mismatched shapes.
  4. Batch size and DataLoader: When fine-tuning on the full dataset, the batch sizes or sequence lengths may differ from those in your quick check, causing inconsistencies in the embeddings. Make sure the sequence lengths are bounded consistently across epochs.
  5. Training configuration: If you’re using techniques like gradient accumulation or layer freezing, check that the embeddings are being updated consistently.
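To make point 3 concrete, here is a simplified sketch of the behavior as I understand it from the Hugging Face implementation (this is not the library code, just the gist): the table starts at the size implied by the config and is rebuilt at a larger size whenever a longer sequence arrives, and because it is stored as a frozen parameter, the state_dict records whatever size it has grown to.

```python
import math
import torch
import torch.nn as nn


class ToySinusoidalPositionalEmbedding(nn.Module):
    """Simplified sketch of the 'extend on demand' behavior (not the real SpeechT5 code)."""

    def __init__(self, num_positions: int, embedding_dim: int):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.make_weights(num_positions)

    def make_weights(self, num_embeddings: int):
        # Rebuild the whole sinusoidal table at the requested size.
        half_dim = self.embedding_dim // 2
        freqs = torch.exp(torch.arange(half_dim, dtype=torch.float) * -(math.log(10000.0) / (half_dim - 1)))
        angles = torch.arange(num_embeddings, dtype=torch.float).unsqueeze(1) * freqs.unsqueeze(0)
        table = torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)
        # Stored as a frozen parameter, so the state_dict records its *current* size.
        self.weights = nn.Parameter(table, requires_grad=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        seq_len = input_ids.size(1)
        if seq_len > self.weights.size(0):
            # A longer sequence than the table covers: regrow the table.
            self.make_weights(seq_len)
        return self.weights[:seq_len].detach()


emb = ToySinusoidalPositionalEmbedding(num_positions=604, embedding_dim=768)
print(emb.state_dict()["weights"].shape)    # torch.Size([604, 768])

emb(torch.zeros(1, 691, dtype=torch.long))  # one long batch goes through
print(emb.state_dict()["weights"].shape)    # torch.Size([691, 768]) -- a checkpoint saved now differs
```

If that matches the real implementation, a single longer-than-expected sequence in a full run (but not in a 2-iteration smoke test) is enough to change the shape of the saved weight, independently of GPUs, Accelerate, or batch size.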

Thank you, Alan, for your insightful response. If I understood correctly, intermediate (transient) tensor shapes, including the positional encodings added to the inputs, may change with the input shape. But here it is a model parameter weight that changed. I looked into the source code, and I would not expect the weight to change once it is initialized and running, unless the weight node is dynamically replaced in the running graph, which would all be happening under the hood. Otherwise it should not depend on the number of GPUs, the input shape, or the batch size. As for the remaining points, I loaded the pretrained model with Hugging Face’s from_pretrained API, which hides all the configuration details, so that leaves the saving and loading mechanism as the only suspect. I relied on the Accelerate implementation for that. Could Accelerate somehow get it wrong? Then again, I have used the same source code with other models, e.g., Whisper, without this problem. It is puzzling. Could you please correct me if I misunderstood something here? Thank you.
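One way I am thinking of narrowing this down (a rough sketch; the attribute path is my assumption for SpeechT5ForSpeechToText and may need adjusting for another model class): log the shape of that parameter on every rank at the end of each epoch and right before saving. If it already grows during training, Accelerate’s saving is off the hook; if it only differs in the file on disk, the saving path is the suspect.

```python
from accelerate import Accelerator


def log_embed_positions_shape(accelerator: Accelerator, model, tag: str) -> None:
    """Print the current size of the decoder prenet's sinusoidal table on every rank.

    The attribute path below (speecht5.decoder.prenet.embed_positions) is my
    assumption for SpeechT5ForSpeechToText; adjust it to your model class.
    """
    unwrapped = accelerator.unwrap_model(model)
    weights = unwrapped.speecht5.decoder.prenet.embed_positions.weights
    # Print from every process so a rank-specific resize would show up.
    print(f"[rank {accelerator.process_index}] {tag}: embed_positions.weights = {tuple(weights.shape)}")


# Intended usage inside the existing training loop (placeholders for my script):
#   log_embed_positions_shape(accelerator, model, f"end of epoch {epoch}")
#   log_embed_positions_shape(accelerator, model, "right before saving")
```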
