Pre-training: ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided []

dlearner · May 17, 2023, 1:07am

Hello,
I am following the tutorial here to pre-train a BERT model, and I get the following error:
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided []

The code is very straightforward from the tutorial:

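from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = BertForMaskedLM.from_pretrained('bert-base-multilingual-uncased')
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
streaming_dataset = load_dataset('text', data_files='./train.txt', streaming=True, split="train")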

training_args = TrainingArguments(
    output_dir='/project/bert/model',
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    evaluation_strategy='steps',
    eval_steps=100,
    logging_steps=100,
    num_train_epochs=3,
    save_strategy='steps',
    save_steps=500,
    max_steps=1000,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer,
    mlm=True,
    mlm_probability=0.2
)

trainer = Trainer(
    model=model,
    tokenizer=bert_tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=streaming_dataset,
)

Does anyone have an idea what causes this?

uzumakiusa · June 26, 2023, 8:37pm

You are loading the raw text and passing it straight to the Trainer. You need to encode (tokenize) the data before sending it to the model so that each example contains input_ids. As written, the data reaches the model without any encoding.

isemmanuelolowe · April 4, 2024, 11:51pm

This issue is directly related to the streaming dataset. A quick fix is to not use a streaming dataset, as below:

non_streaming_dataset = load_dataset('text', data_files='./train.txt', split="train")

milanalimova · February 4, 2025, 9:57am

Hi, I have a similar problem, could you please take a look at my error as well?
