Multiple choice with variable number of choices

Hi all,

Similar question to

The reply there is to go full-blown text-to-text—which is a great idea!—but I’m interested in getting a discriminative BERT-esque baseline if possible (due to the dataset’s particular size, structure, and text content).

Since multiple choice models (like RobertaForMultipleChoice) detect the number of choices dynamically per batch, it seems like the main challenge is getting each batch to have a consistent number of choices.
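To make the constraint concrete, here is a minimal sketch in plain Python (nested lists standing in for tensors) of how the number of choices is read off the batch shape and the choices are flattened before encoding. The helper name `flatten_choices` is made up for illustration and is not part of Transformers; it only mirrors the idea of reading the second dimension of a `(batch, num_choices, seq_len)` input.

```python
def flatten_choices(input_ids):
    """Flatten a (batch, num_choices, seq_len) nested list to (batch * num_choices, seq_len).

    Hypothetical helper: it mimics reading num_choices from the second
    dimension of the batch, which only works if every example agrees.
    """
    num_choices = len(input_ids[0])  # taken from the batch itself, not from a config
    for example in input_ids:
        if len(example) != num_choices:
            raise ValueError(
                f"ragged batch: expected {num_choices} choices, got {len(example)}"
            )
    flat = [choice for example in input_ids for choice in example]
    return num_choices, flat


# Two examples, three choices each, three token ids per choice.
batch = [
    [[0, 5, 2], [0, 6, 2], [0, 7, 2]],
    [[0, 8, 2], [0, 9, 2], [0, 4, 2]],
]
num_choices, flat = flatten_choices(batch)
```

A batch mixing, say, 3-choice and 4-choice examples has no rectangular shape, which is why the mismatch has to be handled before the batch is turned into tensors.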

Going off of the example, there are two main user-provided data processing functions:

  1. a preprocess_function() — for adding new features to the Dataset
  2. a collate function — for turning a raw batch into tensors

Unfortunately, while it’s no problem to add the number of choices for an example in 1., by the time 2. comes around, we’ve already been given a batch, so it’s too late to ensure they all have the same number of choices.

In other words, it seems like the sampler is the place where we’d make sure the number of choices is consistent per batch.

To my delight, I found the --group_by_length and --length_column_name options, which enable the (Distributed)LengthGroupedSampler. This opens up a potential way for doing this:

  1. Add a feature with the number of choices
  2. Pass this feature as the --length_column_name :see_no_evil: and use --group_by_length

Unfortunately for me, this constraint is a soft one rather than a hard one. (This makes sense for the original purpose, of course, which is just helping with padding.) This means that some batches do end up with multiple “lengths” (choices).

I wrote a quick test that injects a number from 1-4 as the feature and checked how many batches had multiple “lengths.” I was hoping it would be just three batches (at the borders between 1-2, 2-3, 3-4). Over 500 batches, there were 14 that ended up with mixed numbers. I could truncate these batches, which would only lose ~3% of the data. Not a huge loss, but it makes me wonder whether I can do better!
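For reference, the counting part of such a test can be sketched as below. This is not my actual test and it does not reproduce the (Distributed)LengthGroupedSampler ordering. It only counts mixed batches for a given ordering, and compares a fully sorted order (the best case, at most three mixed batches at the borders) with an ungrouped random order. The function name, batch size, and seed are assumptions chosen for illustration.

```python
import random


def count_mixed_batches(lengths, batch_size):
    """Return (number of batches with more than one distinct length, total batches)."""
    batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    mixed = sum(1 for batch in batches if len(set(batch)) > 1)
    return mixed, len(batches)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
lengths = [rng.randint(1, 4) for _ in range(4000)]  # fake "number of choices" feature

# Best case: fully sorted, so only batches straddling a border can be mixed.
mixed_sorted, total = count_mixed_batches(sorted(lengths), batch_size=8)

# No grouping at all: nearly every batch mixes several lengths.
mixed_shuffled, _ = count_mixed_batches(lengths, batch_size=8)
```

Feeding the index order produced by the real sampler into `count_mixed_batches` would give the number that sits between these two extremes.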

So, to complete the very long-winded question, I wonder whether anyone more familiar with Hugging Face Transformers can recommend an approach to implement multiple choice with a variable number of choices. Right now my main options are:

  1. Write my own sampler to do this. Given this is just for a baseline, and my trepidation at debugging a custom distributed sampler, I worry this might not be worth the investment.

  2. Create multiple Dataset objects, each with a consistent number of choices. Do an outer training loop. (This would be less ideal because each number of choices corresponds to a question format, so this would introduce coarse patterns into training / reduce how shuffled it is.)

  3. Just throw away the ~3% of mixed-data batches (less if we take the majority, so maybe ~1.5%).

  4. ??? (better approach I can’t see?)
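As a rough sketch of option 3 with the "take the majority" variant: a filtering step that could run at the start of a collate function, before padding and tensor conversion. The function name `keep_majority` and the `num_choices` key are assumptions for illustration; the key would be whatever feature name was added in preprocess_function().

```python
from collections import Counter


def keep_majority(batch, key="num_choices"):
    """Keep only the examples whose `key` value is the most common one in the batch.

    Hypothetical helper: `batch` is a list of dicts, as a collate function
    typically receives. Examples with a minority value are dropped.
    """
    counts = Counter(example[key] for example in batch)
    majority_value, _ = counts.most_common(1)[0]
    return [example for example in batch if example[key] == majority_value]


raw_batch = [
    {"id": 0, "num_choices": 4},
    {"id": 1, "num_choices": 4},
    {"id": 2, "num_choices": 3},  # minority: dropped
    {"id": 3, "num_choices": 4},
]
filtered = keep_majority(raw_batch)
```

Since dropped examples never reach the model for that pass, this only makes sense where losing a small fraction of the data is acceptable.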

Huge thanks for your time!

In case anyone in the future is reading this: option 3 above (grouping with the (Distributed)LengthGroupedSampler, then discarding the mixed examples in the collator) does work pretty well, but only for training. For evaluation, the (Distributed)LengthGroupedSampler is not used. Furthermore, even if it were, throwing data away in the collator means skipping part of the evaluation set (which is a no-no for comparing results between methods).

I provided an example implementation of a batch sampler that groups based on a provided feature in the following comment:
Option for `(Distributed)LengthGroupedSampler` to treat groups as a hard constraint · Issue #12995 · huggingface/transformers · GitHub

It also requires a change to Transformers itself to support a batch sampler, which is in a PR linked to that issue.
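For readers who just want the general idea, below is a minimal single-process sketch of a batch sampler that treats the group as a hard constraint. It is not the implementation from the linked issue and it does not handle distributed training. The class name `GroupedBatchSampler` and its parameters are made up for illustration. It yields lists of dataset indices, which is the shape of thing PyTorch's `DataLoader` accepts through its `batch_sampler` argument.

```python
import random
from collections import Counter, defaultdict


class GroupedBatchSampler:
    """Yield batches of indices in which every index shares the same group id.

    Hypothetical sketch: `group_ids[i]` is the number of choices of example i.
    No example is dropped, so it is also usable for evaluation.
    """

    def __init__(self, group_ids, batch_size, shuffle=True, seed=0):
        self.group_ids = list(group_ids)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)  # persistent, so each epoch shuffles differently

    def __iter__(self):
        # Bucket indices by group id.
        buckets = defaultdict(list)
        for index, group in enumerate(self.group_ids):
            buckets[group].append(index)

        # Cut each bucket into batches; the last batch of a bucket may be short.
        batches = []
        for indices in buckets.values():
            if self.shuffle:
                self.rng.shuffle(indices)
            for start in range(0, len(indices), self.batch_size):
                batches.append(indices[start:start + self.batch_size])

        # Shuffle batch order so question formats stay interleaved over the epoch.
        if self.shuffle:
            self.rng.shuffle(batches)
        return iter(batches)

    def __len__(self):
        counts = Counter(self.group_ids)
        return sum((n + self.batch_size - 1) // self.batch_size for n in counts.values())
```

Shuffling at the batch level rather than the example level is the trade-off here: every batch is uniform and nothing is thrown away, at the cost of a few short batches (one per group at most).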