Hello my friends!
Over the past year, I’ve been developing a beginner-friendly toolkit for audio engineers looking to create datasets of TV and movie actors. The toolkit is free and available on my GitHub: SID Toolkit.
You can give the toolkit a massive folder of video files, or just one file if it’s a movie, and it will extract the ENGLISH audio to a mono WAV file. After you isolate vocals with UVR or similar, the script diarizes the audio using your HF_Token and outputs a JSON file describing the speakers in that audio file.
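For anyone curious what the extraction step could look like, here is a minimal sketch. It is not the toolkit's actual code: the function name `build_extract_command` and the output naming are my own assumptions, and it only builds an ffmpeg command that picks the audio stream tagged as English and downmixes it to mono WAV.

```python
from pathlib import Path


def build_extract_command(video_path, out_dir, language="eng"):
    """Build an ffmpeg command that extracts one language's audio as mono WAV.

    Hypothetical helper for illustration; the toolkit may do this differently.
    """
    video = Path(video_path)
    wav = Path(out_dir) / (video.stem + ".wav")
    return [
        "ffmpeg",
        "-y",                                  # overwrite the output if it exists
        "-i", str(video),
        "-map", f"0:a:m:language:{language}",  # audio streams tagged with this language
        "-ac", "1",                            # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",                   # 16-bit PCM, the usual WAV encoding
        str(wav),
    ]


cmd = build_extract_command("episodes/S01E01.mkv", "wav_out")
print(" ".join(cmd))
```

To actually run it, you would pass the list to `subprocess.run(cmd, check=True)` with ffmpeg installed. If a file has more than one English audio track, the `-map` selector matches all of them and would need narrowing, since a WAV file holds only one stream.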
Once you manually identify your target speaker in about 5-20 files (more files are needed if there are a lot of different speakers), you can use the cross-reference script to isolate the same speaker across ALL files in your working directory.
Finally, the isolate script cuts up the audio files, keeping only the speaker you ID’ed and clipping out all silences and non-speaker data, so a dataset can be created.
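To give a rough idea of the cutting logic, here is a small sketch under stated assumptions. The segment layout (`speaker`, `start`, `end` keys, times in seconds) and the function name `speaker_clips` are hypothetical and may not match the JSON the toolkit writes; the 1 to 5 second limits are taken from the clip lengths in my own results.

```python
def speaker_clips(segments, speaker, min_len=1.0, max_len=5.0):
    """Return (start, end) times for one speaker, each min_len..max_len seconds.

    Illustrative only: assumes each segment is a dict with
    'speaker', 'start', and 'end' keys.
    """
    clips = []
    for seg in segments:
        if seg["speaker"] != speaker:
            continue  # drop everyone except the target speaker
        start, end = float(seg["start"]), float(seg["end"])
        # Split long turns into max_len-sized pieces.
        while end - start > max_len:
            clips.append((start, start + max_len))
            start += max_len
        # Keep the remainder only if it is long enough to be useful.
        if end - start >= min_len:
            clips.append((start, end))
    return clips


segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5},
    {"speaker": "SPEAKER_01", "start": 2.5, "end": 4.0},
    {"speaker": "SPEAKER_00", "start": 4.0, "end": 16.5},
    {"speaker": "SPEAKER_00", "start": 17.0, "end": 17.4},
]
print(speaker_clips(segments, "SPEAKER_00"))
# [(0.0, 2.5), (4.0, 9.0), (9.0, 14.0), (14.0, 16.5)]
```

Silences never show up in the output because only the time ranges labeled with the target speaker are kept; everything between them is simply not cut.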
In my own example, I used the TV series House. There’s a total of 187 episodes released for this show, and after I brought it down to mono WAV audio, I identified “House” 7 times. The toolkit created a dataset of 817 WAV files, all 1 to 5 seconds long, trimmed, isolated, and truncated so only House is speaking.
Please post bug reports and whatnot on the GitHub so I can keep working on it.