Error in https://huggingface.co/learn/llm-course/chapter3/2?fw=pt#preprocessing-a-dataset

In the lesson "Processing the data" of the Hugging Face LLM Course, there is the following code snippet:

from transformers import AutoTokenizer

# raw_datasets is loaded earlier in the chapter with load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

However, this throws the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[53], line 5
      3 checkpoint = "bert-base-uncased"
      4 tokenizer = AutoTokenizer.from_pretrained(checkpoint)
----> 5 tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
      6 tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

File ~/hugging-face-transformers-course/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2911, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2909     if not self._in_target_context_manager:
   2910         self._switch_to_input_mode()
-> 2911     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2912 if text_target is not None:
   2913     self._switch_to_target_mode()

File ~/hugging-face-transformers-course/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2971, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
   2968         return False
   2970 if not _is_valid_text_input(text):
-> 2971     raise ValueError(
   2972         "text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) "
   2973         "or `list[list[str]]` (batch of pretokenized examples)."
   2974     )
   2976 if text_pair is not None and not _is_valid_text_input(text_pair):
   2977     raise ValueError(
   2978         "text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) "
   2979         "or `list[list[str]]` (batch of pretokenized examples)."
   2980     )

ValueError: text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) or `list[list[str]]` (batch of pretokenized examples).

The problem seems to be that raw_datasets["train"]["sentence1"] returns a Column object rather than a plain list of strings, so this call fails the tokenizer's input-type check:

tokenizer(raw_datasets["train"]["sentence1"])
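For illustration, a minimal type check (assuming datasets 4.x, where column access returns a Column object instead of a list):

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")

col = raw_datasets["train"]["sentence1"]
print(type(col))        # a Column object under datasets >= 4.0.0, not a plain list
print(type(list(col)))  # <class 'list'> -- an input type the tokenizer accepts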

Is this course out of date?


Hmm, it seems to be working. Maybe it’s because the dataset is different and the keys don’t match?

# pip install -U transformers datasets==3.6.0
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
print(tokenized_sentences_1, tokenized_sentences_2) # {'input_ids': [[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, ...

I just tried the exact code snippet you provided and had the same issue. When you say it's working, do you mean that the tokenizer worked? Maybe it's something to do with the kernel version?


Maybe it's something to do with the kernel version?

Yeah. My Python 3.9 with older Transformers works with the code above, but it seems to produce the same error as yours in Colab. Explicitly casting with list() seems to fix it for now.

from datasets import load_dataset
from transformers import AutoTokenizer
import platform, datasets, transformers

print("python", platform.python_version()) # 3.12.11
print("datasets", datasets.__version__) # 4.0.0
print("transformers", transformers.__version__) # 4.55.4

raw_datasets = load_dataset("glue", "mrpc")

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(list(raw_datasets["train"]["sentence1"]))
tokenized_sentences_2 = tokenizer(list(raw_datasets["train"]["sentence2"]))
print(tokenized_sentences_1, tokenized_sentences_2) # {'input_ids': [[101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, ...

Edit:
This seems to occur when using the datasets library version 4.0.0 or later. It can be avoided by casting the columns to lists, as above, or by downgrading datasets:

pip install datasets==3.6.0
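For completeness: the course itself sidesteps this a few cells later by tokenizing inside Dataset.map, which seems to avoid the issue since map hands the function plain Python lists per batch rather than Column objects. A minimal sketch of that approach, assuming the same GLUE MRPC setup as above:

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(example):
    # With batched=True, example["sentence1"] is a plain list of strings,
    # so the Column-type issue does not arise here
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
print(tokenized_datasets["train"].column_names)  # now includes input_ids, token_type_ids, attention_mask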