Receiving Error When Trying to Tokenize Dataset with DistilBERT

tk648 · August 28, 2022, 8:53am

I am totally new to all of this. I have just started using Hugging Face, and I am trying to use the DistilBERT model. I was following along with the textbook Natural Language Processing with Transformers: Building Language Applications with Hugging Face, which shows how to tokenize a dataset and then run it through the DistilBERT model. The dataset they used was one of the Hugging Face Hub’s datasets, and I was able to replicate what I saw just fine with it.

Now I am trying to use my own dataset, and I receive the error “TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]”. If I add is_split_into_words=True, the error message changes to “PreTokenizedEncodeInput must be Union[PreTokenizedInputSequence, Tuple[PreTokenizedInputSequence, PreTokenizedInputSequence]]”.

I’ve spent the last several days trying to troubleshoot this error. I looked at other posts on this forum from people who got it, but none of their cases looked similar to mine, and I combed through the guides and courses on Hugging Face. None of it has been helpful. I’m using a Jupyter notebook in Google Colab. Below is my code:

def tokenize(batch):
  return tokenizer(batch["content"], truncation=True, padding=True, is_split_into_words=True, return_tensors="pt")

print(tokenize(reviews["train"][:2]))

reviews_encoded = reviews.map(tokenize, batched=True, batch_size=None)

Thank you so much, any help is greatly appreciated. Seriously!!
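
A possible diagnostic, offered as a sketch rather than a confirmed fix: both error messages are commonly reported when the column handed to the tokenizer contains something other than plain strings, for example None or a float NaN left over from an empty cell. The post does not show the data, so this is an assumption. The helper names find_bad_rows and clean_batch below are hypothetical, and "content" is the column name taken from the code above. The sketch uses plain Python only, so it can be run on reviews["train"]["content"] before any tokenizer call.

```python
def find_bad_rows(texts):
    """Return (index, type name) for every entry that is not a plain string."""
    return [(i, type(t).__name__) for i, t in enumerate(texts)
            if not isinstance(t, str)]


def clean_batch(batch, column="content"):
    """Replace None with an empty string and cast any other value to str.

    Hypothetical helper: whether empty rows should be kept or dropped
    depends on the dataset, which the post does not show.
    """
    return {column: ["" if t is None else str(t) for t in batch[column]]}


# Made-up sample standing in for a slice such as reviews["train"][:3].
sample = {"content": ["great product", None, 3.5]}

print(find_bad_rows(sample["content"]))  # [(1, 'NoneType'), (2, 'float')]
print(clean_batch(sample))
```

If find_bad_rows reports any entries, cleaning or filtering those rows first and then calling the tokenizer without is_split_into_words=True (that flag is meant for input that is already split into lists of words) may be enough. This has not been verified against the dataset in question.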
