Tokenization: different results when tokenizing in one pass vs sample-by-sample

Hi all

I have an issue that has been holding me up and driving me crazy for more than 24 hours.

In a nutshell: tokenizing a long dataset in a single pass of tokenizer() produces different results compared to iterating through the dataset sample by sample, calling tokenizer() on each. I do not understand why, and this is impacting my results.

I am working on a perplexity calculation for quantised transformer models, copying the algorithm used by llama.cpp. I have the code working, but am now trying to integrate it with an LLM evaluation library that’s part of AutoGPTQ. To do this, my code needs to work when the source dataset is iterated sample by sample.

I have so far tested only with Llama models, specifically Llama 7B.

The following simple code demonstrates the problem I am having:

from transformers import AutoTokenizer
from datasets import load_dataset
import torch

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)

wikidata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
wikilist = [' \n' if s == '' else s for s in wikidata['text']]

def tokenize_one_pass():
    # Tokenize the full text in one pass
    text = ''.join(wikilist)

    method1_tokens = tokenizer(text, truncation=False, add_special_tokens=False, return_tensors='pt').input_ids

    return method1_tokens[0]

def tokenize_by_sample():
    # Iterate through each sample of the dataset, tokenizing one by one. Then concat at the end.
    tokens = []
    for sample in wikilist:
        output = tokenizer(sample, truncation=False, add_special_tokens=False, return_tensors='pt').input_ids
        tokens.append(output)

    method2_tokens = torch.LongTensor()
    for item in tokens: # Concat tokenized samples into one tensor
        input_ids = item[0]
        method2_tokens = torch.cat((method2_tokens, input_ids), dim=0)
    return method2_tokens

m1_tokens = tokenize_one_pass()
m2_tokens = tokenize_by_sample()

# Method 2, sample-by-sample, returns 2881 more tokens
print("method 1 (tokenize in one pass) token len:", len(m1_tokens))
print("method 2 (tokenize sample-by-sample) token len:", len(m2_tokens)
print("m1 tokens:", end='')
for i, m1 in enumerate(m1_tokens):
    if i < 20:
        print(f"{m1}, ", end='')

print("\n\n\n")

print("m2 tokens:", end='')
for i, m2 in enumerate(m2_tokens):
    if i < 20:
        print(f"{m2}, ", end='')
print("\n")
# The tokens generated will be slightly different as well!

Here is the output it produces:

method 1 (tokenize in one pass) token len: 335687
method 2 (tokenize sample-by-sample) token len: 338568
m1 tokens:259, 13, 353, 4755, 350, 5059, 357, 353, 29871, 13, 29871, 13, 4755, 350, 5059, 357, 338, 385, 4223, 2706,

m2 tokens:259, 13, 29871, 353, 4755, 350, 5059, 357, 353, 29871, 13, 259, 13, 29871, 4755, 350, 5059, 357, 338, 385,

Notice that method 2 returns more tokens, so somehow extra tokens have been added to the output.

And we can see that in the sample of 20 tokens returned.
m1 is 259, 13, 353, 4755, ..
m2 is 259, 13, 29871, 353, 4755, ..

The token 29871 has been added.

Further debugging showed me that 29871 was added at the beginning of sample 2:

sample1: tensor([[259,  13]])
sample2: tensor([[29871,   353,  4755,   350,  5059,   357,   353, 29871,    13]])
                  ^^^^^

But it is not the BOS token (which anyway I am suppressing with add_special_tokens=False).
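
To double-check, here is the quick sanity check I did using standard tokenizer attributes (for this Llama tokenizer the BOS id should be 1, and 29871 should not appear among the special token ids):

print(tokenizer.bos_token_id)               # expected 1 for this Llama tokenizer
print(29871 in tokenizer.all_special_ids)   # expected False - 29871 is an ordinary vocab token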

Why is this happening? Why does tokenising in smaller samples result in extra tokens being added compared to tokenising in one long string? And is there anything I can do to avoid that (while still tokenising sample-by-sample)?

Thanks in advance for any help.


Here is a really simple demonstration of the problem:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
>>> tokenizer(" \n", add_special_tokens=False).input_ids
[259, 13]
>>> tokenizer(" \n = Robert Boulter = \n", add_special_tokens=False).input_ids
[259, 13, 353, 4755, 350, 5059, 357, 353, 29871, 13]
>>> tokenizer(" = Robert Boulter = \n", add_special_tokens=False).input_ids
[29871, 353, 4755, 350, 5059, 357, 353, 29871, 13]

Where does that 29871 come from at the start of the third call’s output? It is not there in the second call, even though the third call’s text is a substring of the second’s.

OK… I think I finally understand

>>> tokenizer.decode([13])
'\n'
>>> tokenizer.decode([353])
'='
>>> tokenizer.decode([13, 353])
'\n ='

13 on its own decodes to \n and 353 on its own decodes to =, but 13, 353 together decodes with a space in between: \n =

So I suppose the 29871 is needed at the start to add in the leading space, which would otherwise not be there.
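
What made it click was looking at the raw SentencePiece pieces rather than the decoded strings (convert_ids_to_tokens is a standard tokenizer method; the comments below are my reading of the Llama vocab, so take the exact pieces with a grain of salt):

print(tokenizer.convert_ids_to_tokens([13, 353, 29871, 259]))
# As far as I can tell: 13 is the raw newline byte <0x0A>, 353 is '▁=' (the space
# is baked into the piece), 29871 is the bare space piece '▁', and 259 is a double
# space '▁▁'. SentencePiece also prepends a dummy '▁' to every string it encodes,
# which is where the extra 29871 comes from when a sample is tokenized on its own.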

Meaning, more tokens are needed to encode a given piece of text when it is broken into smaller chunks than when it is tokenized in a single pass.

So I probably can’t do what I need to do: I need the tokenization to be 100% identical to the one-pass result, and that apparently is impossible unless I tokenize everything in one pass rather than sample by sample.
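
The closest thing to a workaround I can think of is only a sketch and only an approximation: tokenize each sample with a \n sentinel prepended (so the sample sees the same left context it would have in the joined text, assuming the previous sample ends with a newline), then drop the sentinel’s tokens. The names below (sentinel, tokenize_sample_in_context, m3_tokens) are just illustrative, and I have not verified the result is byte-identical to the one-pass stream:

# Sketch: tokenize each sample as if it were preceded by a newline, then strip
# the sentinel tokens. Only an approximation - it assumes the previous sample in
# the joined text ends with '\n'.
sentinel = "\n"
sentinel_ids = tokenizer(sentinel, add_special_tokens=False).input_ids

def tokenize_sample_in_context(sample):
    ids = tokenizer(sentinel + sample, add_special_tokens=False).input_ids
    # If the sentinel merged with the start of the sample, bail out rather than
    # silently producing a wrong token stream.
    assert ids[:len(sentinel_ids)] == sentinel_ids, "sentinel did not tokenize cleanly"
    return ids[len(sentinel_ids):]

m3_tokens = torch.LongTensor([t for s in wikilist for t in tokenize_sample_in_context(s)])
print("method 3 (sentinel trick) token len:", len(m3_tokens))

Whether that actually matches m1_tokens across the whole of wikitext-2 is something I would still have to verify.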

I’m hitting the same problem, and it’s driving me crazy too.

Anyone got an update on this?