Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).`

Is there any way to tokenize texts directly from a tf.string tensor? That way we could use transformers inside existing TensorFlow models, and it would be a lot faster.

This would also open up many possibilities, since we could run multiple models in parallel and concatenate their outputs with TensorFlow.

Let's say I have this piece of code:

import tensorflow as tf
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model
from transformers import AutoTokenizer, TFAutoModel

def get_model():
    text_input = Input(shape=(), dtype=tf.string, name='text')

    MODEL = "ping/pong"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    transformer_layer = TFAutoModel.from_pretrained(MODEL)

    preprocessed_text = tokenizer(text_input)
    outputs = transformer_layer(preprocessed_text)

    output_sequence = outputs['sequence_output']
    x = Flatten()(output_sequence)
    x = Dense(NUM_CLASS, activation='sigmoid')(x)  # NUM_CLASS defined elsewhere

    model = Model(inputs=[text_input], outputs=[x])
    return model

But this gives me an error saying:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_27/788693747.py in <module>
      1 optimizer = Adam()
----> 2 model = get_model()
      3 model.compile(loss=CategoricalCrossentropy(from_logits=True),optimizer=optimizer,metrics=[Accuracy(), ],)
      4 model.summary()

/tmp/ipykernel_27/330097806.py in get_model()
      6 
      7     text_input = Input(shape=(), dtype=tf.string, name='text')
----> 8     preprocessed_text = tokenizer(text_input)
      9     outputs = transformer_layer(preprocessed_text)
     10 

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2466         if not _is_valid_text_input(text):
   2467             raise ValueError(
-> 2468                 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2469                 "or `List[List[str]]` (batch of pretokenized examples)."
   2470             )

ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
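The tokenizer is plain Python code that runs eagerly on concrete strings, so it cannot consume the symbolic tf.string tensor produced by Input(); the usual workaround is to tokenize before the data enters the graph and feed the resulting ids to the model. Below is a toy sketch of that eager step (VOCAB and toy_tokenize are illustrative stand-ins invented for this example, not part of transformers):

```python
# Toy stand-in for a Hugging Face tokenizer. The point it illustrates:
# tokenization happens eagerly on a plain List[str], *before* anything
# is turned into a tensor, which is why a symbolic tf.string fails.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}

def toy_tokenize(batch, max_len=4):
    ids = []
    for text in batch:                 # batch must be a plain List[str]
        row = [VOCAB.get(w, VOCAB["[UNK]"]) for w in text.lower().split()]
        row = (row + [VOCAB["[PAD]"]] * max_len)[:max_len]  # pad/truncate
        ids.append(row)
    return ids

print(toy_tokenize(["Hello world", "hello there"]))
# [[2, 3, 0, 0], [2, 1, 0, 0]]
```

With a real tokenizer the pattern is the same: call it on Python strings first, then build the Keras model with Input layers for input_ids and attention_mask instead of a tf.string input.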

Do we have a solution for this issue? I'm facing the same issue.

I am facing the same issue when I try to pass a pandas column of dtype object.

This might be one way of doing it in pandas; I couldn't convert the object dtype to string directly, and .values didn't work the way I wanted:

token_val = [str(i) for i in df['col_name'].values]

Then feed this into the tokenizer.

Thanks, it worked for me. The same approach applies if you have more than one column to tokenize.

I was getting the same issue while using FinBERT, but when I read the error carefully I understood what the problem was. The error itself tells you what it expects, i.e. a string (or List[str] / List[List[str]]), and it gets stuck when it finds anything else. So if you're applying the function to an Excel column, add .dropna() when reading the sheet to drop all empty cells, and make sure none of the rows in the column contain bare integers instead of strings, since that will also throw the error; just convert your numbers to string before proceeding. This worked for me. I spent more than a week on this only to realise in the end how small the error was. Felt like an idiot 😅
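The dropna-plus-convert advice above can be done without pandas as well. A small helper (hypothetical, written just for this thread) that drops missing entries and coerces everything else to str:

```python
import math

def clean_texts(values):
    """Drop None/NaN entries and coerce the rest to str,
    mirroring the dropna() + convert-numbers-to-string advice."""
    cleaned = []
    for v in values:
        if v is None:
            continue
        if isinstance(v, float) and math.isnan(v):  # empty Excel cells read as NaN
            continue
        cleaned.append(str(v))
    return cleaned

raw = ["good report", None, 135123, float("nan"), "strong earnings"]
print(clean_texts(raw))
# ['good report', '135123', 'strong earnings']
```

The cleaned list is a plain List[str], which is exactly what the tokenizer accepts.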

Hi Prashant! I don't have any empty cells in my column, yet I keep encountering this error. I have made X a List[List[int]], and even tried List[List[str]], yet I am not able to use .fit. I'd love to get your input on why this is the case.

Hi,

I was facing a similar error. Then I rechecked my input files: it turns out the tokenizer expects string-type values. The order of my X_train, X_test, y_train, and y_test files was wrong, hence the error. I hope this helps.

I had the same issue. Turns out that one of the text strings, "135123", was interpreted as an int.
To check whether this is the problem, try
df.text.map(len), where text is the name of the text column. If this raises an error, that could be the cause.
df.text = df.text.astype(str)
solved the issue.
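The same check can be done without pandas. A small helper (hypothetical, for illustration) that flags the rows df.text.map(len) would fail on, since ints have no len():

```python
def find_non_string_rows(values):
    """Return indices of entries that are not str — the rows that
    a map(len)-style check would raise on."""
    return [i for i, v in enumerate(values) if not isinstance(v, str)]

texts = ["profit rose", 135123, "loss widened"]
print(find_non_string_rows(texts))  # [1]
```

If this returns a non-empty list, converting those rows with str() (or astype(str) in pandas) fixes the tokenizer error.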

When I change

questions = df['question'].tolist()
answers = df['answer'].tolist()

to

questions = [str(i) for i in df['question'].tolist()]
answers = [str(i) for i in df['answer'].tolist()]

It runs smoothly, so we must make sure the values are str.
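A quick way to see why the str() conversion matters: the tokenizer validates its input types before doing anything else. Below is a rough re-implementation of that check (an approximation of the logic in tokenization_utils_base.py, not the exact transformers source):

```python
def is_valid_text_input(t):
    """Approximate the type check the tokenizer performs."""
    if isinstance(t, str):
        return True                     # str: a single example
    if isinstance(t, (list, tuple)):
        if len(t) == 0:
            return True
        first = t[0]
        if isinstance(first, str):      # List[str]: a batch
            return all(isinstance(x, str) for x in t)
        if isinstance(first, (list, tuple)):  # List[List[str]]: pretokenized batch
            return all(isinstance(x, str) for row in t for x in row)
    return False

print(is_valid_text_input(["what is AI?", "42"]))  # True
print(is_valid_text_input(["what is AI?", 42]))    # False
```

A single non-str element (an int that slipped in from a DataFrame, a NaN, a tensor) makes the whole batch invalid, which is exactly the ValueError seen throughout this thread.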