In Chapter 2, in the section on handling multiple sequences, the first code snippet is said to raise an error because of a tensor-shape mismatch. I copied that exact code, but it runs without throwing an error. Can the models now automatically handle a single sequence (as opposed to a batch of sequences)?
More puzzlingly, I can't explain why the following two tokenization approaches end up producing different logits.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
# Method 1: tokenize, convert tokens to ids, tensorize, run the model
tokens_1 = tokenizer.tokenize(sequence)
ids_1 = tokenizer.convert_tokens_to_ids(tokens_1)
input_ids_1 = tf.constant([ids_1])  # wrap in a list to add a batch dimension
output_1 = model(input_ids_1)
# Method 2: call the tokenizer directly (two-liner)
inputs_2 = tokenizer([sequence], return_tensors="tf")
input_ids_2 = inputs_2.input_ids
output_2 = model(input_ids_2)
print(f"{output_1.logits = }")
print(f"{output_2.logits = }")
The output logits are:
output_1.logits = <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-2.727623 , 2.8789375]], dtype=float32)>
output_2.logits = <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-1.5606974, 1.6122818]], dtype=float32)>
What is surprising is that both methods use the exact same tokenizer, and yet input_ids_1 is
<tf.Tensor: shape=(1, 14), dtype=int32, numpy=
array([[ 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
2607, 2026, 2878, 2166, 1012]])>
whereas input_ids_2 is
<tf.Tensor: shape=(1, 16), dtype=int32, numpy=
array([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662,
12172, 2607, 2026, 2878, 2166, 1012, 102]])>
which additionally includes a CLS (id 101) and a SEP (id 102) token. Where did those come from?