Trying to use AutoTokenizer with TensorFlow gives: `ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).`

Is there any way to tokenize text coming in as tf.string? That way we could use transformers inside existing TensorFlow models, and it would be a lot faster.

It would also open up endless possibilities, since we could run multiple models in parallel and combine their outputs with TensorFlow's concat.

Let’s say I have this piece of code:

def get_model():
    text_input = Input(shape=(), dtype=tf.string, name='text')
    MODEL = "ping/pong"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    preprocessed_text = tokenizer(text_input)
    outputs = transformer_layer(preprocessed_text)
    output_sequence = outputs['sequence_output']
    x = Flatten()(output_sequence)
    x = Dense(NUM_CLASS, activation='sigmoid')(x)

    model = Model(inputs=[text_input], outputs=[x])
    return model

But this gives me an error saying:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_27/ in <module>
      1 optimizer = Adam()
----> 2 model = get_model()
      3 model.compile(loss=CategoricalCrossentropy(from_logits=True),optimizer=optimizer,metrics=[Accuracy(), ],)
      4 model.summary()

/tmp/ipykernel_27/ in get_model()
      7     text_input = Input(shape=(), dtype=tf.string, name='text')
----> 8     preprocessed_text = tokenizer(text_input)
      9     outputs = transformer_layer(preprocessed_text)

/opt/conda/lib/python3.7/site-packages/transformers/ in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2466         if not _is_valid_text_input(text):
   2467             raise ValueError(
-> 2468                 "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
   2469                 "or `List[List[str]]` (batch of pretokenized examples)."
   2470             )

ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

Is there a solution for this? I'm facing the same issue.
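One common workaround is to run the tokenizer eagerly on plain Python strings before building the model, and feed the resulting token ids into int32 `Input` layers instead of a `tf.string` input. A minimal sketch, assuming `transformers` and `tensorflow` are installed (`bert-base-uncased` is just an illustrative checkpoint, not the model from the original post):

```python
# Tokenize eagerly with plain Python strings, then feed the resulting ids
# into a Keras model whose inputs are int32 tensors rather than tf.string.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

MODEL = "bert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)

texts = ["first example", "second example"]
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")

def get_model(max_len, num_classes):
    # The model takes token ids and the attention mask, not raw strings
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    transformer = TFAutoModel.from_pretrained(MODEL)
    outputs = transformer(input_ids=input_ids, attention_mask=attention_mask)
    # Pool the sequence output before the classification head
    x = tf.keras.layers.GlobalAveragePooling1D()(outputs.last_hidden_state)
    x = tf.keras.layers.Dense(num_classes, activation="sigmoid")(x)
    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=x)
```

The tokenization happens outside the graph, so the tokenizer only ever sees `List[str]`, which is exactly what it accepts.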

I am facing the same issue when I try to pass a pandas column as an object.

In pandas, this might be one way of doing it; I couldn't convert the object dtype to string directly, and .values didn't work the way I wanted:

token_val = [str(i) for i in df['col_name'].values]

Then feed this into the tokenizer.
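A small self-contained sketch of that conversion, using a hypothetical DataFrame and column name in place of the original data:

```python
import pandas as pd

# Hypothetical DataFrame standing in for the original data;
# the column is object dtype with mixed str/int/float values
df = pd.DataFrame({"col_name": ["some text", 42, 3.5]})

# Cast every value to str so the tokenizer sees List[str],
# not a numpy object array
token_val = [str(i) for i in df["col_name"].values]
# token_val == ["some text", "42", "3.5"]
```

The resulting plain list of strings matches the `List[str]` shape the tokenizer's type check expects.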


Thanks, it worked for me. The above also helps if you have more than one column to tokenize.

I was getting the same issue while using FinBERT, but when I read the error carefully I understood the problem. The error itself says what it expects: a `str`, a `List[str]`, or a `List[List[str]]`, and it gets stuck when it finds anything else. So if you're applying the function to an Excel column, just add `.dropna()` while reading the sheet, which will drop all empty cells, and make sure none of the rows contain bare integers instead of strings; convert any numbers to strings before proceeding, or it will throw the error again. This worked for me. I spent more than a week on this only to realise in the end how small the error was. Felt like an idiot 😅
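The cleanup steps described above can be sketched like this, with an illustrative column name and data (pandas assumed):

```python
import pandas as pd

# Hypothetical data: a text column with an empty cell and a bare number
df = pd.DataFrame({"text": ["good quarter", None, 2023]})

# Drop empty cells, then force everything to str so the tokenizer
# receives List[str] rather than a mix of types
clean = df["text"].dropna().astype(str).tolist()
# clean == ["good quarter", "2023"]
```

Both steps happen before tokenization, so the tokenizer never sees a NaN or an int.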
