In the lesson "Processing the data" of the Hugging Face LLM Course, there is the following code snippet:
from transformers import AutoTokenizer
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
However, running it throws the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[53], line 5
3 checkpoint = "bert-base-uncased"
4 tokenizer = AutoTokenizer.from_pretrained(checkpoint)
----> 5 tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
6 tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])
File ~/hugging-face-transformers-course/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2911, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2909 if not self._in_target_context_manager:
2910 self._switch_to_input_mode()
-> 2911 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
2912 if text_target is not None:
2913 self._switch_to_target_mode()
File ~/hugging-face-transformers-course/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2971, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
2968 return False
2970 if not _is_valid_text_input(text):
-> 2971 raise ValueError(
2972 "text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) "
2973 "or `list[list[str]]` (batch of pretokenized examples)."
2974 )
2976 if text_pair is not None and not _is_valid_text_input(text_pair):
2977 raise ValueError(
2978 "text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) "
2979 "or `list[list[str]]` (batch of pretokenized examples)."
2980 )
ValueError: text input must be of type `str` (single example), `list[str]` (batch or single pretokenized example) or `list[list[str]]` (batch of pretokenized examples).
The argument passed to the tokenizer in this call appears to be an object of a column type rather than a plain list of strings:
tokenizer(raw_datasets["train"]["sentence1"])
Is this course out of date?
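For anyone hitting the same error, here is a minimal sketch of one possible workaround: materialize the column into a plain list before tokenizing. It needs no download. `LazyColumn` is a hypothetical stand-in for the column-type object, and `looks_like_text_batch` is only a rough approximation of the input types named in the error message, not the library's actual validation code.

```python
class LazyColumn:
    """Hypothetical stand-in for a column-type object: iterable, but not a list."""

    def __init__(self, values):
        self._values = tuple(values)

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


def looks_like_text_batch(obj) -> bool:
    """Rough approximation of the types the error message asks for (str or list[str])."""
    if isinstance(obj, str):
        return True
    return isinstance(obj, list) and all(isinstance(x, str) for x in obj)


column = LazyColumn(["The cat sat.", "A dog barked."])
sentences = list(column)  # materialize the column into a plain list[str]

print(looks_like_text_batch(column))     # False
print(looks_like_text_batch(sentences))  # True
```

Applied to the snippet from the lesson, the equivalent change would presumably be tokenizer(list(raw_datasets["train"]["sentence1"])), and likewise for "sentence2". This assumes the column object is iterable, which may depend on the installed version of the datasets library.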