I have a large dataset of transcripts (without timestamps) and the corresponding audio files (averaging about one hour each). My goal is to temporally align the transcripts with the corresponding audio files.
Can anyone point me to resources, e.g., tutorials or Hugging Face models, that may help with this task? Are there any best practices for how to do it (without building an entire system from scratch)?
My initial, naive idea was to use an STT model to transcribe the audio (keeping the word-level timestamps) and then perform some kind of similarity search against the transcript to align the two. However, I suspect this approach might be quite error-prone.
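For what it's worth, the second half of that naive idea does not need a similarity search: assuming the STT model emits (word, start, end) tuples, matching its (possibly noisy) hypothesis against the clean transcript is a classic edit-distance dynamic program. A minimal sketch of that step, with all function names made up for illustration:

```python
# Sketch of the "transcribe, then align" idea: given timestamped STT output
# and the clean reference transcript, align the two word sequences with
# edit-distance dynamic programming and copy timestamps onto the reference
# words that match exactly. Hypothetical names; not any particular library.

def align_words(ref_words, stt_words):
    """Needleman-Wunsch-style alignment. Returns (ref_idx, stt_idx) pairs
    for positions where the two sequences agree exactly; substitutions,
    insertions, and deletions produce no pair."""
    n, m = len(ref_words), len(stt_words)
    # dp[i][j] = minimal edit cost of aligning ref[:i] with stt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref_words[i - 1].lower() == stt_words[j - 1].lower() else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # match / substitute
                           dp[i - 1][j] + 1,        # ref word missed by STT
                           dp[i][j - 1] + 1)        # spurious STT word
    # Backtrace, preferring the diagonal (match/substitute) move.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        sub = 0 if ref_words[i - 1].lower() == stt_words[j - 1].lower() else 1
        if dp[i][j] == dp[i - 1][j - 1] + sub:
            if sub == 0:  # only exact matches receive timestamps
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def timestamp_transcript(ref_words, stt_output):
    """stt_output: list of (word, start_sec, end_sec) from a timestamped
    STT run. Returns (word, (start, end) or None) per reference word."""
    stt_words = [w for w, _, _ in stt_output]
    stamped = {}
    for ri, si in align_words(ref_words, stt_words):
        stamped[ri] = (stt_output[si][1], stt_output[si][2])
    return [(w, stamped.get(i)) for i, w in enumerate(ref_words)]

ref = "hello there my friend".split()
stt = [("hello", 0.0, 0.4), ("their", 0.5, 0.9), ("friend", 1.2, 1.6)]
print(timestamp_transcript(ref, stt))
# "hello" and "friend" get timestamps; "there" (misrecognized as "their")
# and "my" (dropped by the STT model) stay None.
```

Unmatched words could then be filled by interpolating between neighboring anchors. For hour-long files this O(n*m) table would be too large in one shot, so the same idea is usually applied per chunk.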
I'd be grateful for any kind of help or pointers.