OK, I admit defeat. The short version is that I start with a pre-trained MLM model, bert-large-uncased-whole-word-masking, fine-tune it on a bunch of documents, then save it. That all works. I then load the saved model with 'BertForQuestionAnswering.from_pretrained' and the matching 'BertTokenizer.from_pretrained', then go on to load and tokenize a bunch of questions/contexts/answers to fine-tune it for QA. At every point I check, the input_ids are included, yet when I finally try to start training, I get:

ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['start_positions', 'end_positions']

Nothing I've tried fixes the error, so I'm assuming it's due to some fundamental ignorance on my part. I'll include the code that generates the examples, then the code that generates the dataset from the examples (happy to add anything else that is needed).
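First, for reference, the load step looks roughly like this (a minimal sketch; './mlm-finetuned' is a placeholder for my actual checkpoint directory):

from transformers import BertForQuestionAnswering, BertTokenizer

# Load the MLM-fine-tuned checkpoint as a QA model; the QA head is newly
# initialised at this point, so transformers warns about uninitialised weights
model = BertForQuestionAnswering.from_pretrained('./mlm-finetuned')
tokenizer = BertTokenizer.from_pretrained('./mlm-finetuned')

The code that generates the examples: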
examples = []
for question, context, answer in zip(questions, contexts, answers):
    # The question is segment A; the context plus answer is segment B
    text = question
    text_pair = f"{context} {answer}"
    encoded_example = tokenizer.encode_plus(
        text,
        text_pair,
        max_length=512,
        truncation='only_second',  # or 'only_first'
        padding='max_length',
        return_tensors='pt',
        return_overflowing_tokens=True,
        add_special_tokens=True,
    )
    example = {
        'question': question,
        'answer': answer,
        'context': context,
        'input_ids': encoded_example['input_ids'].tolist(),
        'attention_mask': encoded_example['attention_mask'].tolist(),
        'overflowing_tokens': encoded_example['overflowing_tokens'],
    }
    examples.append(example)
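At this point a quick sanity check (just illustrative) confirms the input_ids really are there:

assert all('input_ids' in ex for ex in examples)
print(examples[0]['input_ids'][0][:10])  # first ten token ids of the first example

Then the dataset class built from those examples: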
import torch
from torch.utils.data import Dataset

class MathQADataset(Dataset):
    def __init__(self, examples, tokenizer, max_length):
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]
        question = example['question']
        context = example['context']
        answer = example['answer']
        # Tokenize the question, context and answer separately
        question_encodings = self.tokenizer(question, max_length=self.max_length, padding='max_length', truncation=True)
        context_encodings = self.tokenizer(context, max_length=self.max_length, padding='max_length', truncation=True)
        answer_encodings = self.tokenizer(answer, max_length=self.max_length, padding='max_length', truncation=True)
        # Concatenate the question and answer sequences, dropping the answer's
        # leading [CLS] token (note: context_encodings is currently unused here)
        input_ids = question_encodings['input_ids'] + answer_encodings['input_ids'][1:]
        attention_mask = question_encodings['attention_mask'] + answer_encodings['attention_mask'][1:]
        # Truncate the concatenated inputs to the maximum length of 512
        input_ids = input_ids[:512]
        attention_mask = attention_mask[:512]
        # Compute the start and end positions of the answer within the concatenated sequence
        answer_start_idx = len(question_encodings['input_ids']) - 1
        answer_end_idx = answer_start_idx + len(answer_encodings['input_ids'][1:]) - 1
        start_positions = torch.tensor([answer_start_idx])
        end_positions = torch.tensor([answer_end_idx])
        # Create a dictionary of encodings for this example
        encodings = {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'start_positions': start_positions,
            'end_positions': end_positions,
        }
        # Add 'token_type_ids' if the tokenizer returns them
        if 'token_type_ids' in question_encodings:
            encodings['token_type_ids'] = question_encodings['token_type_ids']
        return encodings
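Finally, this is roughly how I kick off training when the error appears (trimmed down; the argument values are placeholders, not what I actually use):

from transformers import Trainer, TrainingArguments

train_dataset = MathQADataset(examples, tokenizer, max_length=512)

training_args = TrainingArguments(
    output_dir='./qa-finetuned',      # placeholder
    per_device_train_batch_size=8,    # placeholder
    num_train_epochs=3,               # placeholder
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)

trainer.train()  # the ValueError above is raised here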
Any suggestions will be gratefully received.
Jeremy