Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).`

Is there any way to tokenize texts directly from a tf.string tensor? That way we could use transformers inside existing TensorFlow models, and it would be a lot faster.

This would also open up many possibilities, since we could run multiple models in parallel and concatenate their outputs with TensorFlow.

Let's say I have this piece of code:

import tensorflow as tf
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model
from transformers import AutoTokenizer, TFAutoModel

def get_model():
    text_input = Input(shape=(), dtype=tf.string, name='text')

    MODEL = "ping/pong"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    transformer_layer = TFAutoModel.from_pretrained(MODEL)

    preprocessed_text = tokenizer(text_input)
    outputs = transformer_layer(preprocessed_text)

    output_sequence = outputs['sequence_output']
    x = Flatten()(output_sequence)
    x = Dense(NUM_CLASS, activation='sigmoid')(x)  # NUM_CLASS defined elsewhere

    model = Model(inputs=[text_input], outputs=[x])
    return model

But this gives me an error saying:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_27/788693747.py in <module>
      1 optimizer = Adam()
----> 2 model = get_model()
      3 model.compile(loss=CategoricalCrossentropy(from_logits=True),optimizer=optimizer,metrics=[Accuracy(), ],)
      4 model.summary()

/tmp/ipykernel_27/330097806.py in get_model()
      6 
      7     text_input = Input(shape=(), dtype=tf.string, name='text')
----> 8     preprocessed_text = tokenizer(text_input)
      9     outputs = transformer_layer(preprocessed_text)
     10 

/opt/conda/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2466         if not _is_valid_text_input(text):
   2467             raise ValueError(
-> 2468                 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2469                 "or `List[List[str]]` (batch of pretokenized examples)."
   2470             )

ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
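The tokenizer is plain Python code that runs eagerly on concrete strings, so it cannot consume the symbolic tf.string tensor produced by Input(); the usual workaround is to tokenize before the data enters the graph and feed the resulting ids to the model. Below is a toy sketch of that eager step (VOCAB and toy_tokenize are illustrative stand-ins invented for this example, not part of transformers):

```python
# Toy stand-in for a Hugging Face tokenizer. The point it illustrates:
# tokenization happens eagerly on a plain List[str], *before* anything
# is turned into a tensor, which is why a symbolic tf.string fails.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}

def toy_tokenize(batch, max_len=4):
    ids = []
    for text in batch:                 # batch must be a plain List[str]
        row = [VOCAB.get(w, VOCAB["[UNK]"]) for w in text.lower().split()]
        row = (row + [VOCAB["[PAD]"]] * max_len)[:max_len]  # pad/truncate
        ids.append(row)
    return ids

print(toy_tokenize(["Hello world", "hello there"]))
# [[2, 3, 0, 0], [2, 1, 0, 0]]
```

With a real tokenizer the pattern is the same: call it on Python strings first, then build the Keras model with Input layers for input_ids and attention_mask instead of a tf.string input.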

Do we have a solution for this issue? I'm facing the same issue.

I am facing the same issue when I try to pass a pandas column of dtype object.

This might be one way of doing it in pandas; I couldn't convert the object dtype to string directly, and .values didn't work the way I wanted:

token_val = [str(i) for i in df['col_name'].values]

Then feed this into the tokenizer.

Thanks, it worked for me. The same approach applies if you have more than one column to tokenize.

I was getting the same issue while using FinBERT, but when I read the error carefully I understood what the problem was. The error itself tells you what it expects, i.e. a string (or List[str] / List[List[str]]), and it gets stuck when it finds anything else. So if you're applying the function to an Excel column, add .dropna() when reading the sheet to drop all empty cells, and make sure none of the rows in the column contain bare integers instead of strings, since that will also throw the error; just convert your numbers to string before proceeding. This worked for me. I spent more than a week on this only to realise in the end how small the error was. Felt like an idiot 😅
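The dropna-plus-convert advice above can be done without pandas as well. A small helper (hypothetical, written just for this thread) that drops missing entries and coerces everything else to str:

```python
import math

def clean_texts(values):
    """Drop None/NaN entries and coerce the rest to str,
    mirroring the dropna() + convert-numbers-to-string advice."""
    cleaned = []
    for v in values:
        if v is None:
            continue
        if isinstance(v, float) and math.isnan(v):  # empty Excel cells read as NaN
            continue
        cleaned.append(str(v))
    return cleaned

raw = ["good report", None, 135123, float("nan"), "strong earnings"]
print(clean_texts(raw))
# ['good report', '135123', 'strong earnings']
```

The cleaned list is a plain List[str], which is exactly what the tokenizer accepts.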

Hi Prashant! I don't have any empty cells in my column, yet I keep encountering this error. I have made X a List[List[int]], and even tried List[List[str]], yet I am not able to use .fit. I'd love to get your input on why this is the case.

Hi,

I was facing a similar error. Then I rechecked my input files: it turns out the tokenizer expects string-type values. The order of my X_train, X_test, y_train, and y_test files was wrong, hence the error. I hope this helps.

I had the same issue. Turns out that one of the text strings, "135123", was interpreted as an int.
To check whether this is the problem, try
df.text.map(len), where text is the name of the text column. If this raises an error, that could be the cause.
df.text = df.text.astype(str)
solved the issue.
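The same check can be done without pandas. A small helper (hypothetical, for illustration) that flags the rows df.text.map(len) would fail on, since ints have no len():

```python
def find_non_string_rows(values):
    """Return indices of entries that are not str — the rows that
    a map(len)-style check would raise on."""
    return [i for i, v in enumerate(values) if not isinstance(v, str)]

texts = ["profit rose", 135123, "loss widened"]
print(find_non_string_rows(texts))  # [1]
```

If this returns a non-empty list, converting those rows with str() (or astype(str) in pandas) fixes the tokenizer error.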

When I change

questions = df['question'].tolist()
answers = df['answer'].tolist()

to

questions = [str(i) for i in df['question'].tolist()]
answers = [str(i) for i in df['answer'].tolist()]

It runs smoothly, so we must make sure the values are str.
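A quick way to see why the str() conversion matters: the tokenizer validates its input types before doing anything else. Below is a rough re-implementation of that check (an approximation of the logic in tokenization_utils_base.py, not the exact transformers source):

```python
def is_valid_text_input(t):
    """Approximate the type check the tokenizer performs."""
    if isinstance(t, str):
        return True                     # str: a single example
    if isinstance(t, (list, tuple)):
        if len(t) == 0:
            return True
        first = t[0]
        if isinstance(first, str):      # List[str]: a batch
            return all(isinstance(x, str) for x in t)
        if isinstance(first, (list, tuple)):  # List[List[str]]: pretokenized batch
            return all(isinstance(x, str) for row in t for x in row)
    return False

print(is_valid_text_input(["what is AI?", "42"]))  # True
print(is_valid_text_input(["what is AI?", 42]))    # False
```

A single non-str element (an int that slipped in from a DataFrame, a NaN, a tensor) makes the whole batch invalid, which is exactly the ValueError seen throughout this thread.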