I get 'ValueError: You have to specify either input_ids or inputs_embeds' when training GPT-2 with the Hugging Face Trainer

Below is the traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-38-29d47e6260b2> in <module>()
----> 1 trainer.train( )

4 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions, output_hidden_states, return_dict)
    674             batch_size = inputs_embeds.shape[0]
    675         else:
--> 676             raise ValueError("You have to specify either input_ids or inputs_embeds")
    677 
    678         device = input_ids.device if input_ids is not None else inputs_embeds.device

and here is my code

import os
from transformers import GPT2Tokenizer, GPT2Model, Trainer, trainer_utils, TrainingArguments
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

from datasets import Dataset
dataset = Dataset.from_text('/content/chatbot.txt')

trainArgs = TrainingArguments(
    output_dir=os.path.join(os.getcwd(), 'customGPT2'),
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy='steps',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    eval_accumulation_steps=1,
    weight_decay=0,
    adam_epsilon=1e-08,
    max_grad_norm=1.0,
    num_train_epochs=3.0,
    max_steps=-1,
    lr_scheduler_type=trainer_utils.SchedulerType('linear'),
    logging_dir=os.path.join(os.getcwd(), 'log'),
    logging_steps=2000,
    logging_strategy='steps',
    save_steps=2000,
    save_strategy='steps',
    seed=66,
    fp16=False,
    fp16_opt_level='O1')

trainer = Trainer(model, args=trainArgs, train_dataset=dataset)

As I am quite new to using the Trainer, I tried to follow the docs as closely as possible and only changed a few things: the dataset (I need to use a local file, but I made sure to load it with datasets.Dataset so it is in the format the docs expect) and some of the training arguments. Thank you.

You haven’t processed your dataset: it only contains the raw texts and not the input IDs the model expects. Have a look at the training tutorial to see how you can tokenize it!
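For reference, a minimal sketch of that preprocessing step, assuming the file was loaded with Dataset.from_text so each example has a "text" column (the max_length of 128 and the switch to GPT2LMHeadModel are illustrative choices, not something from your code):

```python
# Sketch: tokenize the raw text so the Trainer receives input_ids instead of plain strings.
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, DataCollatorForLanguageModeling

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(batch):
    # Convert each raw line into input_ids / attention_mask
    return tokenizer(batch['text'], truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=['text'])

# For causal language modeling you also want GPT2LMHeadModel (the bare GPT2Model has no LM
# head and returns no loss); the collator copies input_ids into labels for you.
model = GPT2LMHeadModel.from_pretrained('gpt2')
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(model, args=trainArgs,
                  train_dataset=tokenized_dataset,
                  data_collator=data_collator)
```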


I ran into the same issue and this post was helpful.
Looking into AutoTokenizer addressed it for me.
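Roughly what that looks like, in case it helps anyone else (assuming the same "text" column as above):

```python
# AutoTokenizer resolves 'gpt2' to the same GPT-2 tokenizer class as above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenized_dataset = dataset.map(lambda batch: tokenizer(batch['text'], truncation=True),
                                batched=True, remove_columns=['text'])
```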