Speech recognition processing max_length

Hello. I am working on an automatic speech recognition project, but most of the models state that the maximum input they can process is 30 seconds. If the original audio is longer than 30 seconds, I would like to know how to handle it. For example, is there a way to handle it in code other than manually splitting the original data?

This might have something to do with it.
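If what is meant here is the chunked long-form inference that the transformers ASR pipeline provides, a minimal sketch would look like this; the checkpoint name and audio path are only placeholders, not something from this thread:

```python
from transformers import pipeline

# Chunked long-form transcription: the pipeline splits the audio into
# overlapping 30 s windows and stitches the partial results back together.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",   # placeholder checkpoint
    chunk_length_s=30,
)

# "long_audio.wav" is a placeholder path for your own recording.
result = asr("long_audio.wav", return_timestamps=True)
print(result["text"])
```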

Thanks for pointing out a good approach. But I was referring to preprocessing the dataset before fine-tuning, so I was wondering whether there is a way to split the paired audio and text into matching chunks.

Would it be a pre-processing thing?

Thanks for the great resources. They are correct, but they are things I have already referenced. According to the tutorial, 30 seconds is the maximum: the length filter returns false for anything longer, so I can’t use that data at all. It says to simply discard anything over 30 seconds.
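For context, the kind of length filter I mean looks roughly like this; the dummy dataset and the function name below follow the usual fine-tuning tutorial convention and are only illustrative:

```python
from datasets import load_dataset, Audio

# Tiny public dataset used only as an illustration.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

MAX_INPUT_LENGTH = 30.0  # seconds

def is_audio_in_length_range(audio):
    # Duration in seconds = number of samples / sampling rate.
    return len(audio["array"]) / audio["sampling_rate"] < MAX_INPUT_LENGTH

# Anything longer than 30 s is dropped entirely, which is exactly the
# problem: the audio past the cutoff never contributes to training.
ds = ds.filter(is_audio_in_length_range, input_columns=["audio"])
```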

Would you simply split the file?
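For example, assuming a splitting utility such as pydub (which may or may not be the package meant here), a silence-based split could look like this:

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

# "long_audio.wav" is a placeholder for your own recording.
audio = AudioSegment.from_file("long_audio.wav")

# Cut at pauses; note this does not guarantee every chunk is under 30 s,
# so unusually long chunks may need a further split.
chunks = split_on_silence(
    audio,
    min_silence_len=500,              # a pause of at least 0.5 s counts as silence
    silence_thresh=audio.dBFS - 16,   # threshold relative to the average loudness
    keep_silence=200,                 # keep a little silence at the edges
)

for i, chunk in enumerate(chunks):
    chunk.export(f"chunk_{i:03d}.wav", format="wav")
```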

That’s a nice package. So I guess the only way to divide text into chunks is to do it myself?

An integrated dataset-processing library would be ideal, but the current ones don’t seem to support this…
Even though the audio and the text are bundled together, from the program’s point of view they are completely different kinds of information, so you need something that ties the two together and finds the split points.
In practice that means running speech recognition (or forced alignment) at the dataset-creation stage, which is hard to do by hand.
I wonder if anyone has made such a tool… (a rough sketch of the idea is below)
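I have not found an off-the-shelf tool, but a rough sketch of the idea, using word-level timestamps from the transformers pipeline to pick split points, might look like the following. Note that the text here comes from the model’s own transcription, not from an existing reference transcript; aligning existing transcripts would need a forced-alignment tool instead. The checkpoint and file path are placeholders.

```python
from transformers import pipeline

# Get word-level timestamps, then group words into segments of at most
# 30 s so the audio and the text can be cut at the same points.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    return_timestamps="word",
)
words = asr("long_audio.wav")["chunks"]  # [{"text": ..., "timestamp": (start, end)}, ...]

MAX_SEC = 30.0
segments, current, seg_start = [], [], None

for w in words:
    start, end = w["timestamp"]
    if seg_start is None:
        seg_start = start
    # Close the current segment once this word would push it past the limit.
    if end is not None and end - seg_start > MAX_SEC and current:
        segments.append((seg_start, current[-1]["timestamp"][1],
                         "".join(x["text"] for x in current)))
        current, seg_start = [], start
    current.append(w)

if current:
    segments.append((seg_start, current[-1]["timestamp"][1],
                     "".join(x["text"] for x in current)))

for start, end, text in segments:
    print(start, end, text)
```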

Edit:
It’s close, but there’s no function for text splitting.

Other well-known libraries for doing it manually:

Isn’t this one close?
