Adding new tokens to a BERT tokenizer - Getting ValueError

I have a Python list named unique_list containing new words that I want to add to my tokenizer with tokenizer.add_tokens. However, when I run my code I get the following error:

File "/home/kaan/anaconda3/envs/env_backup/lib/python3.8/site-packages/transformers/", line 937, in add_tokens
    if not new_tokens:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

When I tested with a small test array containing 10 random words, it worked fine, but the larger unique_list causes this error.

What am I doing wrong here?

Not sure if you already solved this issue, but I stumbled upon it today.

Looking closely at the error message, it seems the add_tokens() method expects new_tokens to be a Python list rather than a NumPy array. Converting new_tokens from a NumPy array to a list before passing it in resolved the issue:
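To see why the array itself is the problem: the `if not new_tokens:` line in the traceback evaluates the argument in a boolean context, and NumPy refuses that for arrays with more than one element. A minimal sketch (using a made-up two-word array, not the asker's actual data) that reproduces the error outside of transformers:

```python
import numpy as np

# A plain Python list is unambiguous in a boolean context
words = ["foo", "bar"]
print(not words)  # a non-empty list is truthy, so this prints False

# A NumPy array with more than one element is not: NumPy cannot decide
# whether "truthy" should mean any() or all() of its elements
arr = np.array(["foo", "bar"])
try:
    if not arr:  # the same kind of check add_tokens performs on new_tokens
        pass
except ValueError as err:
    print(err)  # "The truth value of an array with more than one element is ambiguous..."
```

So any code path that does a plain truthiness check on its argument will blow up when handed a multi-element array, regardless of the array's contents.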

added_tokens = tokenizer.add_tokens(new_tokens.tolist())
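For completeness, a small sketch of what the .tolist() conversion gives you (unique_arr here is a hypothetical stand-in for the asker's array):

```python
import numpy as np

# Hypothetical array of new words, standing in for unique_list
unique_arr = np.array(["genomics", "tokenizer", "subword"])

# .tolist() converts the array to a plain Python list of native str objects,
# which is what add_tokens can safely truth-test and iterate over
new_tokens = unique_arr.tolist()
assert isinstance(new_tokens, list)
assert all(isinstance(t, str) for t in new_tokens)

# Unlike the array, a plain list behaves normally in a boolean context
assert not []            # an empty list is falsy
assert bool(new_tokens)  # a non-empty list is truthy
```

One follow-up worth noting: if you fine-tune a model after adding tokens, remember to call model.resize_token_embeddings(len(tokenizer)) so the embedding matrix matches the enlarged vocabulary.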

Excuse me, if I need to solve the problem of a word not being found after tokenizing with BERT, can I use your solution?