Chapter 2: Different logits for otherwise identical tokenization "pipelines"

In Chapter 2, in the section on handling multiple sequences, the first code snippet is said to lead to an error due to a mismatch in the tensor shapes. I copied that very same code, but it runs without throwing an error. Can the models now automatically handle single sequences (as opposed to batches of sequences)?

More problematic, however, is that I can't explain why the following two tokenization methods end up producing different logits.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# Method 1: tokenize, convert to integers, tensorize, run the model
tokens_1 = tokenizer.tokenize(sequence)
ids_1 = tokenizer.convert_tokens_to_ids(tokens_1)
input_ids_1 = tf.constant([ids_1])
output_1 = model(input_ids_1)

# Method 2: two-liner via the tokenizer's __call__
inputs_2 = tokenizer([sequence], return_tensors="tf")
input_ids_2 = inputs_2.input_ids
output_2 = model(input_ids_2)

print(f"{output_1.logits = }")
print(f"{output_2.logits = }")
```

The output logits are

```
output_1.logits = <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-2.727623 ,  2.8789375]], dtype=float32)>
output_2.logits = <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-1.5606974,  1.6122818]], dtype=float32)>
```

What is surprising is that both methods use the exact same tokenizer, and yet input_ids_1 is

```
<tf.Tensor: shape=(1, 14), dtype=int32, numpy=
array([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
         2607,  2026,  2878,  2166,  1012]])>
```

whereas input_ids_2 additionally includes a [CLS] token (id 101) and a [SEP] token (id 102). Where did those come from?

```
<tf.Tensor: shape=(1, 16), dtype=int32, numpy=
array([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662,
        12172,  2607,  2026,  2878,  2166,  1012,   102]])>
```
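
For what it's worth, decoding those two extra ids (continuing from the snippet above) confirms that they really are the tokenizer's [CLS] and [SEP] tokens:

```python
# Look up the token strings for the two extra ids
print(tokenizer.convert_ids_to_tokens([101, 102]))  # ['[CLS]', '[SEP]']
```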

Method two uses the tokenizer's __call__ method, which has an add_special_tokens parameter (True by default) that controls whether the tokenizer's special tokens are added automatically; for this BERT-style checkpoint those are [CLS] at the start and [SEP] at the end. Since the model was fine-tuned on inputs that include those special tokens, the logits from Method 2 are the ones to trust.

Try `tokenizer([sequence], return_tensors="tf", add_special_tokens=False)` and you should get back the same input IDs, and therefore the same logits, as Method 1.
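
As a quick sanity check (reusing tokenizer, model, sequence, and ids_1 from the snippet above), going in either direction should make the two methods agree:

```python
# Method 2 without special tokens -> should reproduce ids_1 and output_1
inputs_no_special = tokenizer([sequence], return_tensors="tf", add_special_tokens=False)
print(f"{model(inputs_no_special.input_ids).logits = }")  # same values as output_1.logits

# Method 1 with the special tokens added by hand -> should reproduce output_2
ids_with_special = tokenizer.build_inputs_with_special_tokens(ids_1)  # [CLS] + ids_1 + [SEP]
print(f"{model(tf.constant([ids_with_special])).logits = }")  # same values as output_2.logits
```

build_inputs_with_special_tokens adds whatever special tokens the loaded tokenizer expects, so the same approach should also work for checkpoints whose special tokens differ from [CLS]/[SEP].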