I am trying to use TFBertTokenizer instead of BertTokenizer with TFBertForQuestionAnswering. However, when I tokenize a text pair with TFBertTokenizer, I get:
>>> from transformers import TFBertTokenizer, TFBertForQuestionAnswering
>>> import tensorflow as tf
>>> tf_tokenizer = TFBertTokenizer.from_pretrained('bert-base-uncased')
>>> tf_model = TFBertForQuestionAnswering.from_pretrained('bert-base-uncased')
All model checkpoint layers were used when initializing TFBertForQuestionAnswering.
Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
>>> tf_inputs = tf_tokenizer(['Who is Jim Henson?', 'Jim Henson is a puppet master'])
>>> print(tf_inputs)
{'input_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
array([[ 101, 2040, 2003, 3958, 27227, 1029, 102, 0],
[ 101, 3958, 27227, 2003, 1037, 13997, 3040, 102]],
dtype=int64)>, 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)>, 'token_type_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)>}
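As far as I can tell, TFBertTokenizer is treating my list as a batch of two independent sentences and right-padding to the longest one, which would explain the (2, 8) shape. A rough sketch of that interpretation, built by hand from the ids in the output above:

```python
# The two sentences, as TFBertTokenizer apparently encodes them:
# each gets its own [CLS] ... [SEP], then the batch is padded to the longest.
seqs = [
    [101, 2040, 2003, 3958, 27227, 1029, 102],         # "[CLS] who is jim henson ? [SEP]"
    [101, 3958, 27227, 2003, 1037, 13997, 3040, 102],  # "[CLS] jim henson is a puppet master [SEP]"
]
max_len = max(len(s) for s in seqs)                       # longest sequence: 8
input_ids = [s + [0] * (max_len - len(s)) for s in seqs]  # right-pad with 0
attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
print(len(input_ids), max_len)  # 2 8
```

That reproduces the (2, 8) input_ids and the trailing 0 in the first row's attention_mask.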
versus what I get with BertTokenizer:
>>> from transformers import BertTokenizer, TFBertForQuestionAnswering
>>> import tensorflow as tf
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")
>>> inputs = tokenizer("Who was Jim Henson?", "Jim Henson was a nice puppet", return_tensors="tf")
>>> print(inputs)
{'input_ids': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=
array([[ 101, 2040, 2001, 3958, 27227, 1029, 102, 3958, 27227,
2001, 1037, 3835, 13997, 102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])>, 'attention_mask': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
Basically, I get a tensor of shape (2, 8) from TFBertTokenizer versus one of shape (1, 14) from BertTokenizer. How can I get the (1, 14) shape tensor, i.e. a single encoded text pair, from TFBertTokenizer?
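For clarity, here is a sketch of what I believe BertTokenizer is doing for the pair: it joins both texts into one sequence and marks the second segment with token_type_ids of 1. The ids are copied from the output above; only the assembly is my reconstruction:

```python
# Hand-built pair encoding, matching BertTokenizer's (1, 14) output above.
CLS, SEP = 101, 102
question = [2040, 2001, 3958, 27227, 1029]        # "who was jim henson ?"
context = [3958, 27227, 2001, 1037, 3835, 13997]  # "jim henson was a nice puppet"

input_ids = [CLS] + question + [SEP] + context + [SEP]
token_type_ids = [0] * (len(question) + 2) + [1] * (len(context) + 1)  # 0 = question, 1 = context
attention_mask = [1] * len(input_ids)
print(len(input_ids))  # 14
```

This is the encoding I am hoping to get out of TFBertTokenizer directly.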