Creating a dataset with many utterances per audio file?
I have an audio+transcript corpus consisting of long audio files. Each audio file has its own metadata file that defines the time codes of every segment along with its transcript.
Is there a way to generate an HF dataset from this structure without having to split the original audio files into single-utterance audio files?
Datasets like AMI (edinburghcstr/ami · Datasets at Hugging Face) and others do have “begin_time” and “end_time” columns, but it doesn’t look like those fields are actually used in the dataset script either…
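For context, here is a minimal sketch of the kind of structure I mean — one record per utterance, each pointing at the shared long audio file plus its time window (file names, field names, and the metadata layout are just illustrative assumptions, not the AMI schema):

```python
import json

# Hypothetical layout: each long recording has metadata listing its
# segments as (begin_time, end_time, transcript).
segments = {
    "recording_01.wav": [
        {"begin_time": 0.0, "end_time": 2.4, "text": "first utterance"},
        {"begin_time": 2.4, "end_time": 5.1, "text": "second utterance"},
    ],
}

# Flatten into one record per utterance; every record references the
# same long audio file instead of a pre-cut single-utterance clip.
records = []
for audio_path, segs in segments.items():
    for seg in segs:
        records.append({
            "audio": audio_path,
            "begin_time": seg["begin_time"],
            "end_time": seg["end_time"],
            "text": seg["text"],
        })

# Write JSON Lines, which load_dataset("json", data_files=...) can read.
with open("metadata.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

From there, I imagine one could build a `Dataset` from these records and cast the audio column with the `Audio` feature, then slice each utterance out of the long file at access time (e.g. in a `map` that reads only the `[begin_time, end_time]` window with `soundfile` or `torchaudio`) — but that slicing step is exactly the part I’m unsure the library supports natively.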