Replicating GPT-2 CBT-CN benchmark results

Hi,

I’m currently trying to replicate the CBT-CN (Children’s Book Test, common nouns) results reported in the GPT-2 paper, but I’m having trouble doing so.

The paper describes the evaluation procedure for the CBT dataset as follows:

“Following the LM approach introduced in the original paper, we compute the probability of each choice and the rest of the sentence conditioned on this choice according to the LM, and predict the one with the highest probability.”
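My reading of that description, as a minimal sketch (the score_candidate name, the leading-space handling, and the exact summing range are my own choices, not from the paper):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def score_candidate(context: str, filled_question: str) -> float:
    # log P(question-with-option | context): sum the token log-probabilities
    # over the filled-in question only, not over the context.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    q_ids = tokenizer(" " + filled_question, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, q_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position i predict token i + 1, so shift targets by one.
    log_probs = logits[:, :-1].log_softmax(dim=-1)
    targets = input_ids[:, 1:]
    token_logps = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions whose target token belongs to the question.
    return token_logps[:, ctx_ids.shape[1] - 1:].sum().item()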

Based on the format of the dataset, I assumed this amounted to something like the following:

def preprocess_function(examples):
    # Repeat the <|endoftext|>-joined context once per candidate option
    first_sentences = [
        [tokenizer.eos_token + tokenizer.eos_token.join(context)] * len(examples["options"][i])
        for i, context in enumerate(examples["sentences"])
    ]
    # Fill the XXXXX blank in the question with each candidate option
    second_sentences = [
        [question.replace("XXXXX", option) for option in examples["options"][i]]
        for i, question in enumerate(examples["question"])
    ]
    # Flatten so the tokenizer sees one (context, question) pair per candidate
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Regroup into chunks of 10, one chunk per example (CBT has 10 options)
    return {k: [v[i : i + 10] for i in range(0, len(v), 10)] for k, v in tokenized_examples.items()}
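I run this over the dataset with something like the following (assuming the cbt dataset on the Hub with its CN config):

from datasets import load_dataset

dataset = load_dataset("cbt", "CN", split="test")
tokenized = dataset.map(preprocess_function, batched=True)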

Which is basically flattened_context || delimiter || question with the option filled in.

To compute accuracy, I loaded AutoModelForCausalLM.from_pretrained("gpt2") and, for each example, took the option with the highest probability out of the 10 possibilities. This, however, resulted in much lower accuracy than the paper reports, which has me very confused. Could someone point me in the right direction? Thanks
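For reference, this is roughly what my evaluation loop looks like (a sketch reusing the score_candidate function above; the field names match the cbt dataset):

correct = 0
for ex in dataset:
    # Same context construction as in preprocess_function
    context = tokenizer.eos_token + tokenizer.eos_token.join(ex["sentences"])
    scores = [
        score_candidate(context, ex["question"].replace("XXXXX", option))
        for option in ex["options"]
    ]
    # Predict the option whose filled-in sentence scores highest
    best = max(range(len(scores)), key=lambda j: scores[j])
    correct += ex["options"][best] == ex["answer"]

accuracy = correct / len(dataset)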