Dataset splitting according to several constraints

I am splitting a dataset for large language model training. The dataset contains about 4,000 meetings, and each meeting has to be assigned to one of three modalities: speech tokens, interleaved (speech and text tokens), or text tokens. For every meeting, the token count in each modality is known. I am looking for an algorithm that partitions the dataset across the modalities according to the following criteria:

  • A meeting assigned to a modality must not exceed that modality's maximum tokens per meeting. Since a meeting contains more tokens in the speech modality than in the others, a meeting may have, say, 20,000 speech tokens, which exceeds the 16k token limit. Such a meeting must instead be assigned to the interleaved or text modality, where it has fewer tokens. Every meeting fits in the text-only modality, as no meeting exceeds the text-token context window.
  • After the assignment is done, the three modalities should contain a similar total number of tokens, each total measured in its own modality's tokens.
  • The meetings are categorised as short, medium, or long, and these categories should be evenly distributed across the modalities, subject to the previous constraints. Additionally, each meeting originates from one of 6 cities; it would be nice if the cities were also evenly distributed, but if that is not possible, it is fine.

Does anyone have a lead on which type of algorithm this maps to? I am looking to implement the splitting process in Python, so any tips on good libraries would be appreciated!


There does seem to be a known family of problems this maps to: it is essentially balanced multiway number partitioning (a relative of bin packing) with assignment restrictions (the per-modality token limits) and stratification constraints (the length categories, and optionally the cities). A clean way to attack it in Python is to write it down as an integer linear program and hand it to a solver.
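Below is a minimal sketch of that formulation using PuLP (`pip install pulp`), an open-source ILP modeller that ships with the CBC solver. The data here is synthetic, and the 16k limits, the field names, and the `slack` value are illustrative assumptions, not your real setup:

```python
import random
import pulp  # pip install pulp

# --- synthetic stand-in for the ~4,000 real meetings ---
random.seed(0)
modalities = ["speech", "interleaved", "text"]
limits = {"speech": 16_000, "interleaved": 16_000, "text": 16_000}  # assumed context windows
categories = ["short", "medium", "long"]

meetings = []
for _ in range(60):
    text = random.randint(2_000, 8_000)
    meetings.append({
        "tokens": {"speech": 3 * text, "interleaved": 2 * text, "text": text},
        "length": random.choice(categories),
    })

prob = pulp.LpProblem("modality_split", pulp.LpMinimize)

# x[i, m] = 1 iff meeting i is assigned to modality m.
# Over-limit (meeting, modality) pairs get no variable at all,
# so the token-limit constraint holds by construction.
x = {
    (i, m): pulp.LpVariable(f"x_{i}_{m}", cat="Binary")
    for i, mt in enumerate(meetings)
    for m in modalities
    if mt["tokens"][m] <= limits[m]
}

# Every meeting goes to exactly one feasible modality.
for i in range(len(meetings)):
    prob += pulp.lpSum(x[i, m] for m in modalities if (i, m) in x) == 1

# Total tokens per modality, each measured in its own modality's tokens.
total = {
    m: pulp.lpSum(meetings[i]["tokens"][m] * x[i, m]
                  for i in range(len(meetings)) if (i, m) in x)
    for m in modalities
}

# Balance objective: minimise the spread between the largest and
# smallest per-modality totals (a standard linearisation of max - min).
hi = pulp.LpVariable("hi")
lo = pulp.LpVariable("lo")
for m in modalities:
    prob += total[m] <= hi
    prob += total[m] >= lo
prob += hi - lo  # objective

# Stratification: each modality gets roughly a third of each length
# category, with `slack` meetings of leeway. Loosen the slack if the
# solver reports infeasibility.
slack = 3
for cat in categories:
    idx = [i for i, mt in enumerate(meetings) if mt["length"] == cat]
    share = len(idx) / len(modalities)
    for m in modalities:
        count = pulp.lpSum(x[i, m] for i in idx if (i, m) in x)
        prob += count >= share - slack
        prob += count <= share + slack

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assert pulp.LpStatus[prob.status] == "Optimal"

assignment = {i: m for (i, m), var in x.items() if var.value() > 0.5}
for m in modalities:
    n = sum(1 for v in assignment.values() if v == m)
    print(m, n, "meetings,", int(pulp.value(total[m])), "tokens")
```

The city preference can be handled the same way as the length categories, either as extra soft constraints with a generous slack or by moving both stratification terms into the objective as penalties. If an exact solve turns out to be slow at 4,000 binary-variable meetings (CBC may struggle; OR-Tools' CP-SAT is a common faster alternative), a greedy fallback is the classic LPT heuristic for multiway partitioning: sort meetings by their largest token count and assign each one to the feasible modality with the smallest running total.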