Creating Batch Sizes for Video Transcription Dataset

Hello Everyone.
I am currently working on a project that involves segmenting a video transcript into homogeneous sections based on the topics discussed in the video. I am implementing a transformer model similar to the one described in this paper: https://arxiv.org/pdf/2110.07160.pdf
My dataset consists of sentences from video transcripts along with a label column whose values are 1 (denoting a change in topic) or 0 (denoting the same topic).
I have stored the entire dataset in a dataframe object (see the attached image). The dataset covers 600+ videos, which I have merged into a single dataframe.
I am currently confused about how to create batches. I know that the input to the transformer model should have the dimensions [batch size, seq len, features]. However, I am not sure how to form batches, since every video has a different number of sentences. A fixed batch of, e.g., 50 sentences could end up containing sentences from more than one video, which I think shouldn't happen :sweat_smile:. Ideally, I would want the batch size to represent the number of videos, so a batch size of 32 would mean 32 videos, just as in a computer vision problem 32 means 32 training images. Is that achievable in this case? If so, can someone guide me on how to achieve it? Your help would be greatly appreciated :smiley:
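
To make the question concrete, here is a rough sketch of what I have in mind. I am assuming the merged dataframe has a `video_id` column, a `label` column, and some columns of precomputed numeric sentence features (those names are placeholders, not my actual columns). The idea is to treat one video as one dataset item and pad each batch to the longest video in it:

```python
import torch
from torch.utils.data import Dataset
from torch.nn.utils.rnn import pad_sequence


class VideoTranscriptDataset(Dataset):
    """One item = one video (all of its sentence features and labels)."""

    def __init__(self, df, feature_cols, video_id_col="video_id", label_col="label"):
        # Split the merged dataframe back into per-video chunks.
        self.videos = []
        for _, group in df.groupby(video_id_col, sort=False):
            feats = torch.tensor(group[feature_cols].to_numpy(), dtype=torch.float)  # [num_sentences, features]
            labels = torch.tensor(group[label_col].to_numpy(), dtype=torch.float)    # [num_sentences]
            self.videos.append((feats, labels))

    def __len__(self):
        return len(self.videos)

    def __getitem__(self, idx):
        return self.videos[idx]


def collate_videos(batch):
    """Pad every video in the batch to the length of the longest one."""
    feats, labels = zip(*batch)
    lengths = torch.tensor([f.size(0) for f in feats])
    padded_feats = pad_sequence(list(feats), batch_first=True)                       # [batch, max_len, features]
    padded_labels = pad_sequence(list(labels), batch_first=True, padding_value=-1)   # -1 marks padded positions
    # True where a position is padding, matching nn.Transformer's src_key_padding_mask convention.
    padding_mask = torch.arange(padded_feats.size(1))[None, :] >= lengths[:, None]
    return padded_feats, padded_labels, padding_mask
```

With this, a `torch.utils.data.DataLoader` built with `batch_size=32` and `collate_fn=collate_videos` would yield tensors of shape [32, longest video in the batch, features], plus a padding mask I could pass to the transformer and use to ignore padded labels in the loss. Does something like this look like a reasonable approach, or is there a more standard way to do it?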