Text to Speech Alignment with Transformers

Hi there,

I have a large dataset of transcripts (without timestamps) and corresponding audio files (average length around one hour). My goal is to temporally align each transcript with its corresponding audio file.

Can anyone point me to resources, e.g., tutorials or Hugging Face models, that may help with this task? Are there any best practices for approaching it (without building an entire system from scratch)?

My initial naive idea was to use an STT model to transcribe the audio (while recording timestamps) and then perform some kind of similarity search against the transcript to align the two. However, I suspect this approach might be quite error-prone.
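Concretely, I was imagining something like the sketch below (just a rough idea, not tested; the Whisper checkpoint, file names, and chunk length are placeholders I picked arbitrarily):

```python
# Naive STT-then-match sketch: transcribe with word timestamps, then
# copy timestamps onto the reference transcript via sequence matching.
import difflib
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",   # placeholder checkpoint
    chunk_length_s=30,              # chunked inference for hour-long files
)
result = asr("audio.wav", return_timestamps="word")

hyp_words = [c["text"].strip().lower() for c in result["chunks"]]
ref_words = open("transcript.txt").read().lower().split()

# Align recognized words against the reference transcript and transfer
# the (start, end) timestamps of exactly matching stretches.
matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
word_times = {}
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        for i, j in zip(range(i1, i2), range(j1, j2)):
            word_times[i] = result["chunks"][j]["timestamp"]  # (start, end)
```

My worry is exactly the gaps this leaves: every recognition error produces a stretch of reference words with no timestamp at all.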

Any help or pointers would be much appreciated. :)

Simon

This task is called Forced Alignment, and there are reasonably mature tools to do it with classical approaches. I’d suggest perusing the forced-alignment topic on GitHub: https://github.com/topics/forced-alignment.

If the accuracy of the classical methods isn’t good enough for you, you can look through recent research papers on, say, the Speech section of Papers With Code.
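If you’d rather stay close to the transformer ecosystem without building everything yourself, torchaudio also ships CTC forced-alignment utilities. A minimal sketch, assuming torchaudio >= 2.1, a short clip (for hour-long files you’d align chunk by chunk), and placeholder file names and transcript:

```python
# CTC forced alignment with a pretrained Wav2Vec2 model (torchaudio >= 2.1).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()                # ('-', '|', 'E', 'T', ...); blank at index 0
dictionary = {c: i for i, c in enumerate(labels)}

waveform, sr = torchaudio.load("clip.wav")  # placeholder file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

# The transcript must use the model's label set: uppercase, '|' as word separator
transcript = "HELLO WORLD".upper().replace(" ", "|")
targets = torch.tensor([[dictionary[c] for c in transcript]], dtype=torch.int32)

with torch.inference_mode():
    emissions, _ = model(waveform)          # (1, frames, vocab) logits
    log_probs = torch.log_softmax(emissions, dim=-1)

# Frame-level alignment of the transcript tokens against the emission matrix
tokens, scores = torchaudio.functional.forced_align(log_probs, targets, blank=0)

sec_per_frame = waveform.size(1) / bundle.sample_rate / emissions.size(1)
for span in torchaudio.functional.merge_tokens(tokens[0], scores[0], blank=0):
    print(labels[span.token],
          round(span.start * sec_per_frame, 2),
          round(span.end * sec_per_frame, 2))
```

This prints character-level spans; grouping them into words at the '|' separators is straightforward.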


Thank you so much for the reply! I’m starting to experiment with aeneas; however, I realize that the quality of my sound files is indeed very poor. Is it generally worthwhile to try to improve the sound quality first, or would it be more fruitful to directly train/fine-tune a model to handle the poorer sound quality end-to-end?
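For reference, this is roughly how I’m driving aeneas at the moment (paths are placeholders; the config string follows its documentation):

```python
# Align a plain-text transcript to an audio file with aeneas,
# writing the resulting sync map as JSON.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

config_string = "task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "/path/to/audio.mp3"        # placeholder
task.text_file_path_absolute = "/path/to/transcript.txt"    # placeholder
task.sync_map_file_path_absolute = "/path/to/syncmap.json"  # placeholder

ExecuteTask(task).execute()
task.output_sync_map_file()
```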