Hi all
I have an issue that has been holding me up and driving me crazy for more than 24 hours.
In a nutshell: tokenizing a long dataset in a single pass of tokenizer()
produces different results compared to iterating through the dataset sample by sample, calling tokenizer()
on each. I do not understand why, and this is impacting my results.
I am working on a perplexity algorithm for quantised transformers models, copying the algorithm used by llama.cpp. I have the code working, but am now trying to integrate it with an LLM evaluation library that’s part of AutoGPTQ. In order to do this, I need my code to be able to work when the source dataset is iterated sample-by-sample.
I have so far tested only with Llama models, specifically Llama 7B.
The following simple code demonstrates the problem I am having:
from transformers import AutoTokenizer
from datasets import load_dataset
import torch
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)
wikidata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
wikilist = [' \n' if s == '' else s for s in wikidata['text']]
def tokenize_one_pass():
    # Tokenize the full text in one pass
    text = ''.join(wikilist)
    method1_tokens = tokenizer(text, truncation=False, add_special_tokens=False, return_tensors='pt').input_ids
    return method1_tokens[0]
def tokenize_by_sample():
    # Iterate through each sample of the dataset, tokenizing one by one. Then concat at the end.
    tokens = []
    for sample in wikilist:
        output = tokenizer(sample, truncation=False, add_special_tokens=False, return_tensors='pt').input_ids
        tokens.append(output)

    method2_tokens = torch.LongTensor()
    for item in tokens:  # Concat tokenized samples into one tensor
        input_ids = item[0]
        method2_tokens = torch.cat((method2_tokens, input_ids), dim=0)
    return method2_tokens
m1_tokens = tokenize_one_pass()
m2_tokens = tokenize_by_sample()
# Method 2, sample-by-sample, returns 2881 more tokens
print("method 1 (tokenize in one pass) token len:", len(m1_tokens))
print("method 2 (tokenize sample-by-sample) token len:", len(m2_tokens)
print("m1 tokens:", end='')
for i, m1 in enumerate(m1_tokens):
    if i < 20:
        print(f"{m1}, ", end='')
print("\n\n\n")

print("m2 tokens:", end='')
for i, m2 in enumerate(m2_tokens):
    if i < 20:
        print(f"{m2}, ", end='')
print("\n")
# The tokens generated will be slightly different as well!
Here is the output it produces:
method 1 (tokenize in one pass) token len: 335687
method 2 (tokenize sample-by-sample) token len: 338568
m1 tokens:259, 13, 353, 4755, 350, 5059, 357, 353, 29871, 13, 29871, 13, 4755, 350, 5059, 357, 338, 385, 4223, 2706,
m2 tokens:259, 13, 29871, 353, 4755, 350, 5059, 357, 353, 29871, 13, 259, 13, 29871, 4755, 350, 5059, 357, 338, 385,
Notice that method 2 returns more tokens, so it has somehow added extra tokens to the output.
We can also see that in the sample of 20 tokens printed above:
m1 is 259, 13, 353, 4755, ...
m2 is 259, 13, 29871, 353, 4755, ...
The token 29871 has been inserted.
Further debug showed me that 29871 was added at the beginning of sample 2:
sample1: tensor([[259, 13]])
sample2: tensor([[29871, 353, 4755, 350, 5059, 357, 353, 29871, 13]])
                  ^^^^^
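For reference, those two tensors can be reproduced by tokenizing the first two entries of wikilist individually, with the same settings as in tokenize_by_sample() (a minimal sketch):

# Tokenize the first two samples of wikilist one at a time
sample1 = tokenizer(wikilist[0], truncation=False, add_special_tokens=False, return_tensors='pt').input_ids
sample2 = tokenizer(wikilist[1], truncation=False, add_special_tokens=False, return_tensors='pt').input_ids
print("sample1:", sample1)
print("sample2:", sample2)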
But it is not the BOS token (which I am suppressing anyway with add_special_tokens=False).
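For completeness, here is a minimal sketch of the kind of check that rules out BOS, using the tokenizer's convert_ids_to_tokens and bos_token_id helpers:

# Inspect what the extra id 29871 maps to, and compare against the special token ids
print("29871 maps to:", tokenizer.convert_ids_to_tokens([29871]))
print("BOS id:", tokenizer.bos_token_id, "EOS id:", tokenizer.eos_token_id)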
Why is this happening? Why does tokenising in smaller samples result in extra tokens being added compared to tokenising in one long string? And is there anything I can do to avoid that (while still tokenising sample-by-sample)?
Thanks in advance for any help.