Receiving Error When Trying to Tokenize Dataset with DistilBERT

I am totally new to all of this. I have just started using Hugging Face, and I am trying to use the DistilBERT model. I was following along with the textbook Natural Language Processing with Transformers: Building Language Applications with Hugging Face, which shows how to tokenize a dataset and then run it through the DistilBERT model. The dataset they used was one from the Hugging Face Hub, and I was able to replicate their results just fine with it.

Now I am trying to use my own dataset, and I receive the error “TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]”. If I add is_split_into_words=True, the error message changes to “PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]”.

I’ve spent the last several days trying to troubleshoot this error. I looked at posts from others who got it on this website, but none of their cases looked similar to mine, and I combed through the guides and courses on Hugging Face. None of it has been helpful. I’m using a Jupyter notebook in Google Colab. Below is my code:

def tokenize(batch):
  return tokenizer(batch["content"], truncation=True, padding=True, is_split_into_words=True, return_tensors="pt")

print(tokenize(reviews["train"][:2]))

reviews_encoded = reviews.map(tokenize, batched=True, batch_size=None)

Thank you so much, any help is greatly appreciated. Seriously!! :sweat_smile:
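
A note on a possible cause: this error is often raised when the values handed to the tokenizer are not all plain strings, for example None or a float NaN left behind by missing rows in a custom dataset. Below is a minimal sketch for checking that, assuming the text column is named "content" as in the code above. The helper find_non_strings and the sample rows are hypothetical and only for illustration; they are not part of transformers or datasets.

```python
def find_non_strings(values):
    """Return (index, value) pairs for entries that are not plain str."""
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, str)]


# Made-up stand-in for reviews["train"]["content"]; the real column
# would be pulled from the dataset instead.
sample_column = ["Great product", None, "Would buy again", float("nan")]

bad_rows = find_non_strings(sample_column)

# Report each offending row with its type so it can be cleaned or dropped.
for index, value in bad_rows:
    print(f"row {index}: {type(value).__name__}")
```

If the real column reports any rows here, they would likely need to be dropped (for example with the dataset's filter method) or converted to strings before calling the tokenizer. In that case is_split_into_words=True is probably not needed, since that flag is meant for input that is already a list of words per example.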
