Pre-training: ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided []

Hello,
I am following the tutorial here to pre-train a BERT model, and I get the following error:
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided []

The code is taken more or less directly from the tutorial:

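from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = BertForMaskedLM.from_pretrained('bert-base-multilingual-uncased')
bert_tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
streaming_dataset = load_dataset('text', data_files='./train.txt', streaming=True, split="train")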

training_args = TrainingArguments(
    output_dir='/project/bert/model',
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    evaluation_strategy='steps',
    eval_steps=100,
    logging_steps=100,
    num_train_epochs=3,
    save_strategy='steps',
    save_steps=500,
    max_steps=1000,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer,
    mlm=True,
    mlm_probability=0.2
)

trainer = Trainer(
    model=model,
    tokenizer=bert_tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=streaming_dataset,
)

Does anyone have an idea what is causing this?

You are passing the raw text straight to the model. The data has to be encoded (tokenized) first so that each example contains input_ids; as written, the dataset reaches the Trainer without any encoding.
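As a rough illustration of the encoding step, here is a minimal sketch that tokenizes the streamed lines before they are handed to the Trainer. It relies on the 'text' loader putting each line in a column named "text". The helper names tokenize_batch and build_tokenized_stream, and the max_length of 128, are made up for this example and do not come from the tutorial.

```python
def tokenize_batch(batch, tokenizer, max_length=128):
    """Encode a batch of raw lines into model inputs such as input_ids."""
    # batch["text"] is a list of strings when map() is called with batched=True.
    return tokenizer(batch["text"], truncation=True, max_length=max_length)


def build_tokenized_stream(train_file, tokenizer, max_length=128):
    """Load a text file as a streaming dataset and tokenize it lazily."""
    # Imported here so tokenize_batch can be tried out without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset("text", data_files=train_file, streaming=True, split="train")
    # remove_columns drops the raw "text" column, so each example that reaches
    # the data collator holds only the tokenizer outputs.
    return stream.map(
        lambda batch: tokenize_batch(batch, tokenizer, max_length),
        batched=True,
        remove_columns=["text"],
    )
```

With the names from the question, this would be used as tokenized_stream = build_tokenized_stream('./train.txt', bert_tokenizer), and tokenized_stream would then be passed as train_dataset instead of streaming_dataset.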

This issue is specific to the streaming dataset. A quick fix is to not use streaming, as below:

non_streaming_dataset = load_dataset('text', data_files='./train.txt', split="train")