In Chapter 2, in the section on handling multiple sequences, the first code snippet is said to raise an error because of a tensor-shape mismatch. I copied that exact code, but it runs without throwing an error. Can the models now automatically handle a single sequence (as opposed to a batch of sequences)?
More puzzlingly, I can't explain why the following two tokenization approaches end up producing different logits.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
# Method 1: tokenize, convert tokens to ids, tensorize, run the model
tokens_1 = tokenizer.tokenize(sequence)
ids_1 = tokenizer.convert_tokens_to_ids(tokens_1)
input_ids_1 = tf.constant([ids_1])  # wrap in a list to add a batch dimension
output_1 = model(input_ids_1)
# Method 2: call the tokenizer directly (two-liner)
inputs_2 = tokenizer([sequence], return_tensors="tf")
input_ids_2 = inputs_2.input_ids
output_2 = model(input_ids_2)
print(f"{output_1.logits = }")
print(f"{output_2.logits = }")
The output logits are:
output_1.logits = <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-2.727623 , 2.8789375]], dtype=float32)>
output_2.logits = <tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[-1.5606974, 1.6122818]], dtype=float32)>
What is surprising is that both methods use the exact same tokenizer, and yet input_ids_1 is
<tf.Tensor: shape=(1, 14), dtype=int32, numpy=
array([[ 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
2607, 2026, 2878, 2166, 1012]])>
whereas input_ids_2 is
<tf.Tensor: shape=(1, 16), dtype=int32, numpy=
array([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662,
12172, 2607, 2026, 2878, 2166, 1012, 102]])>
which additionally includes a CLS (id 101) and a SEP (id 102) token. Where did those come from?