I get 'ValueError: You have to specify either input_ids or inputs_embeds' when training GPT-2 with the Hugging Face Trainer

Below is the traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-38-29d47e6260b2> in <module>()
----> 1 trainer.train( )

4 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/modeling_gpt2.py in forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, use_cache, output_attentions, output_hidden_states, return_dict)
    674             batch_size = inputs_embeds.shape[0]
    675         else:
--> 676             raise ValueError("You have to specify either input_ids or inputs_embeds")
    677 
    678         device = input_ids.device if input_ids is not None else inputs_embeds.device

and here is my code

import os
from transformers import GPT2Tokenizer, GPT2Model, Trainer, trainer_utils, TrainingArguments
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

from datasets import Dataset
dataset = Dataset.from_text('/content/chatbot.txt')

trainArgs = TrainingArguments(
    output_dir=os.path.join(os.getcwd(), 'customGPT2'),
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy='steps',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,
    eval_accumulation_steps=1,
    weight_decay=0,
    adam_epsilon=1e-08,
    max_grad_norm=1.0,
    num_train_epochs=3.0,
    max_steps=-1,
    lr_scheduler_type=trainer_utils.SchedulerType('linear'),
    logging_dir=os.path.join(os.getcwd(), 'log'),
    logging_steps=2000,
    logging_strategy='steps',
    save_steps=2000,
    save_strategy='steps',
    seed=66,
    fp16=False,
    fp16_opt_level='O1')

trainer = Trainer(model, args=trainArgs, train_dataset=dataset)

As I am quite new to using the Trainer, I tried to follow the docs as closely as possible and only changed a few things: the dataset (I need to use a local file, but I made sure to load it with datasets.Dataset so it is in the format the docs expect) and some of the training arguments. Thank you.

You haven’t processed your dataset: it only contains the raw texts and not the input IDs the model expects. Have a look at the training tutorial to see how you can tokenize it!
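For reference, a minimal sketch of that preprocessing step, assuming the file was loaded with Dataset.from_text so each example has a "text" column (the max_length of 128 and the switch to GPT2LMHeadModel are illustrative choices, not something from your code):

```python
# Sketch: tokenize the raw text so the Trainer receives input_ids instead of plain strings.
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, DataCollatorForLanguageModeling

tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(batch):
    # Convert each raw line into input_ids / attention_mask
    return tokenizer(batch['text'], truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=['text'])

# For causal language modeling you also want GPT2LMHeadModel (the bare GPT2Model has no LM
# head and returns no loss); the collator copies input_ids into labels for you.
model = GPT2LMHeadModel.from_pretrained('gpt2')
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(model, args=trainArgs,
                  train_dataset=tokenized_dataset,
                  data_collator=data_collator)
```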


I ran into the same issue and this post was helpful.
Looking into AutoTokenizer addressed it for me.
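Roughly what that looks like, in case it helps anyone else (assuming the same "text" column as above):

```python
# AutoTokenizer resolves 'gpt2' to the same GPT-2 tokenizer class as above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenized_dataset = dataset.map(lambda batch: tokenizer(batch['text'], truncation=True),
                                batched=True, remove_columns=['text'])
```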