Replicating GPT-2 CBT-CN benchmark results

Hi,

I’m currently trying to replicate the CBT-CN (Children’s Book Test, common nouns) results reported in the GPT-2 paper, but I’m having trouble doing so.

The paper describes the evaluation procedure for the CBT dataset as follows:

“Following the LM approach introduced in the original paper, we compute the probability of each choice and the rest of the sentence conditioned on this choice according to the LM, and predict the one with the highest probability.”
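My reading of that description, as a minimal sketch (the score_candidate name, the leading-space handling, and the exact summing range are my own choices, not from the paper):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def score_candidate(context: str, filled_question: str) -> float:
    # log P(question-with-option | context): sum the token log-probabilities
    # over the filled-in question only, not over the context.
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    q_ids = tokenizer(" " + filled_question, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, q_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position i predict token i + 1, so shift targets by one.
    log_probs = logits[:, :-1].log_softmax(dim=-1)
    targets = input_ids[:, 1:]
    token_logps = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions whose target token belongs to the question.
    return token_logps[:, ctx_ids.shape[1] - 1:].sum().item()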

Based on the format of the dataset, I assumed this amounted to something like the following:

def preprocess_function(examples):
    # Repeat the <|endoftext|>-joined context once per candidate option
    first_sentences = [
        [tokenizer.eos_token + tokenizer.eos_token.join(context)] * len(examples["options"][i])
        for i, context in enumerate(examples["sentences"])
    ]
    # Fill the XXXXX blank in the question with each candidate option
    second_sentences = [
        [question.replace("XXXXX", option) for option in examples["options"][i]]
        for i, question in enumerate(examples["question"])
    ]
    # Flatten so the tokenizer sees one (context, question) pair per candidate
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])

    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    # Regroup into chunks of 10, one chunk per example (CBT has 10 options)
    return {k: [v[i : i + 10] for i in range(0, len(v), 10)] for k, v in tokenized_examples.items()}
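I run this over the dataset with something like the following (assuming the cbt dataset on the Hub with its CN config):

from datasets import load_dataset

dataset = load_dataset("cbt", "CN", split="test")
tokenized = dataset.map(preprocess_function, batched=True)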

Which is basically flattened_context || delimiter || question with the option filled in.

To compute accuracy, I loaded AutoModelForCausalLM.from_pretrained("gpt2") and, for each example, took the option with the highest probability out of the 10 possibilities. This, however, resulted in much lower accuracy than the paper reports, which has me very confused. Could someone point me in the right direction? Thanks
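For reference, this is roughly what my evaluation loop looks like (a sketch reusing the score_candidate function above; the field names match the cbt dataset):

correct = 0
for ex in dataset:
    # Same context construction as in preprocess_function
    context = tokenizer.eos_token + tokenizer.eos_token.join(ex["sentences"])
    scores = [
        score_candidate(context, ex["question"].replace("XXXXX", option))
        for option in ex["options"]
    ]
    # Predict the option whose filled-in sentence scores highest
    best = max(range(len(scores)), key=lambda j: scores[j])
    correct += ex["options"][best] == ex["answer"]

accuracy = correct / len(dataset)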