Text to Speech Alignment with Transformers

Hi there,

I have a large dataset of transcripts (without timestamps) and corresponding audio files (averaging about one hour each). My goal is to temporally align each transcript with its audio file.

Can anyone point me to resources, e.g., tutorials or Hugging Face models, that may help with this task? Are there any best practices for how to do it (without building an entire system from scratch)?

My initial naive idea was to use an STT model to transcribe the audio (recording timestamps along the way) and then perform some kind of similarity search against the transcript to align the two. However, I feel this approach might be quite error-prone.
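For what it's worth, that similarity-search idea can work reasonably well if you anchor on exact word matches between the transcript and the STT output. A rough, stdlib-only sketch (the word lists, timestamps, and `align_transcript` helper here are invented for illustration):

```python
from difflib import SequenceMatcher

def align_transcript(transcript_words, stt_words):
    """transcript_words: ground-truth words (no timestamps).
    stt_words: (word, start_sec, end_sec) tuples from an STT model.
    Returns one (word, start, end) tuple per transcript word; words the
    STT output failed to match keep (word, None, None)."""
    hyp = [w for w, _, _ in stt_words]
    matcher = SequenceMatcher(a=transcript_words, b=hyp, autojunk=False)
    aligned = [(w, None, None) for w in transcript_words]
    for i, j, n in matcher.get_matching_blocks():
        # copy timestamps from each run of exactly-matching words
        for k in range(n):
            _, start, end = stt_words[j + k]
            aligned[i + k] = (transcript_words[i + k], start, end)
    return aligned
```

Misrecognized words are left without timestamps, but their times can be interpolated from the matched neighbors, so recognition errors degrade the alignment only locally rather than breaking it.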

I am happy for any kind of help/pointer. :slight_smile:


This task is called Forced Alignment, and there are reasonably mature tools for it based on classical approaches. I'd suggest perusing the forced-alignment topic on GitHub.
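To make the idea concrete: at its core, forced alignment is a monotonic Viterbi search that assigns each audio frame to the next expected label, given per-frame label probabilities from an acoustic model. A toy, self-contained sketch (the emission matrix and the `forced_align` helper are invented for illustration; real toolkits also handle blanks, pronunciation, and the acoustic model itself):

```python
def forced_align(log_probs, labels):
    """Monotonic Viterbi alignment: assign each of T frames to one of L
    labels, in order, every label getting at least one frame, maximizing
    total log-probability. log_probs is a T x V matrix of per-frame
    log-probabilities over a vocabulary; labels are vocab indices."""
    T, L = len(log_probs), len(labels)
    NEG = float("-inf")
    dp = [[NEG] * L for _ in range(T)]    # dp[t][j]: best score at frame t, label j
    back = [[0] * L for _ in range(T)]    # back[t][j]: label index at frame t-1
    dp[0][0] = log_probs[0][labels[0]]
    for t in range(1, T):
        for j in range(L):
            stay = dp[t - 1][j]                        # keep emitting label j
            move = dp[t - 1][j - 1] if j > 0 else NEG  # advance from label j-1
            best, back[t][j] = (move, j - 1) if move > stay else (stay, j)
            dp[t][j] = best + log_probs[t][labels[j]]
    # backtrace from the last label at the last frame
    path, j = [0] * T, L - 1
    for t in range(T - 1, -1, -1):
        path[t] = j
        if t > 0:
            j = back[t][j]
    return path
```

The returned path says which label "owns" each frame; multiplying frame indices by the model's frame duration gives start/end timestamps per label.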

If the accuracy of the classical methods isn't good enough for you, you can browse research papers, e.g., the Speech section on Papers With Code.

Thank you so much for the reply! I'm starting to experiment with aeneas; however, I realize that the quality of my sound files is indeed very poor. Is it generally worthwhile to try to improve the sound quality, or would it be more fruitful to directly train/fine-tune a model end-to-end to cope with the poorer sound quality?
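In the meantime, a cheap first step before any fine-tuning is simple preprocessing: downmix to mono, resample to the rate the aligner expects, and normalize levels. A stdlib-only sketch of the last step, peak normalization (the `peak_normalize_wav` helper is invented here; in practice ffmpeg or sox would do all of this better):

```python
import io
import struct
import wave

def peak_normalize_wav(raw_wav: bytes, target_peak: float = 0.9) -> bytes:
    """Scale a 16-bit PCM WAV (given as bytes) so its loudest sample
    reaches target_peak of full scale. Sketch only handles 16-bit PCM."""
    with wave.open(io.BytesIO(raw_wav), "rb") as wf:
        assert wf.getsampwidth() == 2, "sketch only handles 16-bit PCM"
        params = wf.getparams()
        frames = wf.readframes(wf.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    peak = max(abs(s) for s in samples) or 1  # avoid divide-by-zero on silence
    gain = target_peak * 32767 / peak
    scaled = [max(-32768, min(32767, int(s * gain))) for s in samples]
    out = io.BytesIO()
    with wave.open(out, "wb") as wf:
        wf.setparams(params)
        wf.writeframes(struct.pack("<%dh" % len(scaled), *scaled))
    return out.getvalue()
```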