Dataset splitting according to several constraints

I am splitting a dataset for large language model training. The dataset contains about 4,000 meetings, and each meeting has to be assigned to one of three modalities: speech tokens, interleaved (speech and text tokens), or text tokens. For every meeting, the token count in each modality is known. I am looking for an algorithm that partitions the dataset across the modalities according to the following criteria:

  • A meeting assigned to a modality must not exceed that modality's maximum tokens per meeting. Since a meeting contains more tokens in the speech modality than in the others, a meeting may have, say, 20,000 speech tokens, which exceeds the 16k token limit. Such a meeting must instead be assigned to the interleaved or text modality, where it has fewer tokens. Every meeting fits in the text-only modality, as no meeting exceeds the text-token context window.
  • After the assignment is done, the three modalities should contain a similar total number of tokens, each total measured in its own modality's tokens.
  • The meetings are categorised as short, medium, or long, and these categories should be evenly distributed across the modalities, subject to the previous constraints. Additionally, each meeting originates from one of 6 cities; it would be nice if the cities were also evenly distributed, but if that is not possible, it is fine.

Does anyone have a lead on which type of algorithm this maps to? I am looking to implement the splitting process in Python, so any tips on good libraries would be appreciated!


There does seem to be a known family of problems this maps to: it is essentially balanced multiway number partitioning (a relative of bin packing) with assignment restrictions (the per-modality token limits) and stratification constraints (the length categories, and optionally the cities). A clean way to attack it in Python is to write it down as an integer linear program and hand it to a solver.
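Below is a minimal sketch of that formulation using PuLP (`pip install pulp`), an open-source ILP modeller that ships with the CBC solver. The data here is synthetic, and the 16k limits, the field names, and the `slack` value are illustrative assumptions, not your real setup:

```python
import random
import pulp  # pip install pulp

# --- synthetic stand-in for the ~4,000 real meetings ---
random.seed(0)
modalities = ["speech", "interleaved", "text"]
limits = {"speech": 16_000, "interleaved": 16_000, "text": 16_000}  # assumed context windows
categories = ["short", "medium", "long"]

meetings = []
for _ in range(60):
    text = random.randint(2_000, 8_000)
    meetings.append({
        "tokens": {"speech": 3 * text, "interleaved": 2 * text, "text": text},
        "length": random.choice(categories),
    })

prob = pulp.LpProblem("modality_split", pulp.LpMinimize)

# x[i, m] = 1 iff meeting i is assigned to modality m.
# Over-limit (meeting, modality) pairs get no variable at all,
# so the token-limit constraint holds by construction.
x = {
    (i, m): pulp.LpVariable(f"x_{i}_{m}", cat="Binary")
    for i, mt in enumerate(meetings)
    for m in modalities
    if mt["tokens"][m] <= limits[m]
}

# Every meeting goes to exactly one feasible modality.
for i in range(len(meetings)):
    prob += pulp.lpSum(x[i, m] for m in modalities if (i, m) in x) == 1

# Total tokens per modality, each measured in its own modality's tokens.
total = {
    m: pulp.lpSum(meetings[i]["tokens"][m] * x[i, m]
                  for i in range(len(meetings)) if (i, m) in x)
    for m in modalities
}

# Balance objective: minimise the spread between the largest and
# smallest per-modality totals (a standard linearisation of max - min).
hi = pulp.LpVariable("hi")
lo = pulp.LpVariable("lo")
for m in modalities:
    prob += total[m] <= hi
    prob += total[m] >= lo
prob += hi - lo  # objective

# Stratification: each modality gets roughly a third of each length
# category, with `slack` meetings of leeway. Loosen the slack if the
# solver reports infeasibility.
slack = 3
for cat in categories:
    idx = [i for i, mt in enumerate(meetings) if mt["length"] == cat]
    share = len(idx) / len(modalities)
    for m in modalities:
        count = pulp.lpSum(x[i, m] for i in idx if (i, m) in x)
        prob += count >= share - slack
        prob += count <= share + slack

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assert pulp.LpStatus[prob.status] == "Optimal"

assignment = {i: m for (i, m), var in x.items() if var.value() > 0.5}
for m in modalities:
    n = sum(1 for v in assignment.values() if v == m)
    print(m, n, "meetings,", int(pulp.value(total[m])), "tokens")
```

The city preference can be handled the same way as the length categories, either as extra soft constraints with a generous slack or by moving both stratification terms into the objective as penalties. If an exact solve turns out to be slow at 4,000 binary-variable meetings (CBC may struggle; OR-Tools' CP-SAT is a common faster alternative), a greedy fallback is the classic LPT heuristic for multiway partitioning: sort meetings by their largest token count and assign each one to the feasible modality with the smallest running total.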