I am trying to use TFBertTokenizer instead of BertTokenizer with TFBertForQuestionAnswering. However, when I tokenize a text pair with TFBertTokenizer, I get:
>>> from transformers import TFBertTokenizer, TFBertForQuestionAnswering
>>> import tensorflow as tf
>>> tf_tokenizer = TFBertTokenizer.from_pretrained('bert-base-uncased')
>>> tf_model = TFBertForQuestionAnswering.from_pretrained('bert-base-uncased')
All model checkpoint layers were used when initializing TFBertForQuestionAnswering.
Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
>>> tf_inputs = tf_tokenizer(['Who is Jim Henson?', 'Jim Henson is a puppet master'])
>>> print(tf_inputs)
{'input_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
array([[ 101, 2040, 2003, 3958, 27227, 1029, 102, 0],
[ 101, 3958, 27227, 2003, 1037, 13997, 3040, 102]],
dtype=int64)>, 'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 0],
[1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)>, 'token_type_ids': <tf.Tensor: shape=(2, 8), dtype=int64, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)>}
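As far as I can tell, TFBertTokenizer is treating my list as a batch of two independent sentences and right-padding to the longest one, which would explain the (2, 8) shape. A rough sketch of that interpretation, built by hand from the ids in the output above:

```python
# The two sentences, as TFBertTokenizer apparently encodes them:
# each gets its own [CLS] ... [SEP], then the batch is padded to the longest.
seqs = [
    [101, 2040, 2003, 3958, 27227, 1029, 102],         # "[CLS] who is jim henson ? [SEP]"
    [101, 3958, 27227, 2003, 1037, 13997, 3040, 102],  # "[CLS] jim henson is a puppet master [SEP]"
]
max_len = max(len(s) for s in seqs)                       # longest sequence: 8
input_ids = [s + [0] * (max_len - len(s)) for s in seqs]  # right-pad with 0
attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
print(len(input_ids), max_len)  # 2 8
```

That reproduces the (2, 8) input_ids and the trailing 0 in the first row's attention_mask.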
versus what I get with BertTokenizer:
>>> from transformers import BertTokenizer, TFBertForQuestionAnswering
>>> import tensorflow as tf
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
>>> model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")
>>> inputs = tokenizer("Who was Jim Henson?", "Jim Henson was a nice puppet", return_tensors="tf")
>>> print(inputs)
{'input_ids': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=
array([[ 101, 2040, 2001, 3958, 27227, 1029, 102, 3958, 27227,
2001, 1037, 3835, 13997, 102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])>, 'attention_mask': <tf.Tensor: shape=(1, 14), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
Basically, I get a tensor of shape (2, 8) from TFBertTokenizer versus one of shape (1, 14) from BertTokenizer. How can I get the (1, 14) shape tensor, i.e. a single encoded text pair, from TFBertTokenizer?
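For clarity, here is a sketch of what I believe BertTokenizer is doing for the pair: it joins both texts into one sequence and marks the second segment with token_type_ids of 1. The ids are copied from the output above; only the assembly is my reconstruction:

```python
# Hand-built pair encoding, matching BertTokenizer's (1, 14) output above.
CLS, SEP = 101, 102
question = [2040, 2001, 3958, 27227, 1029]        # "who was jim henson ?"
context = [3958, 27227, 2001, 1037, 3835, 13997]  # "jim henson was a nice puppet"

input_ids = [CLS] + question + [SEP] + context + [SEP]
token_type_ids = [0] * (len(question) + 2) + [1] * (len(context) + 1)  # 0 = question, 1 = context
attention_mask = [1] * len(input_ids)
print(len(input_ids))  # 14
```

This is the encoding I am hoping to get out of TFBertTokenizer directly.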