BartTokenizer with vocab.json and merges.txt created by ByteLevelBPETokenizer encodes <s> into 3 tokens

Hi, I want to create vocab.json and merges.txt and use them with BartTokenizer.
But somehow the tokenizer encodes <s> into [32, 87, 34] instead of the original [0].

Could you show me how to create vocab.json and merges.txt correctly?
Or maybe my way of loading vocab.json and merges.txt is wrong.

Anyway, here is what I did.

# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
# !wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

# import
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# check a sentence.

input1 = "Mi estas Julien."
tokenizer.encode("Mi estas Julien.").tokens
Output: ['Mi', 'Ä estas', 'Ä Juli', 'en', '.'] < looks good.

tokenizer.encode("Mi estas Julien.").ids
Output: [958, 316, 14540, 276, 18] < looks good

# check <s> and </s>
tokenizer.encode("<s>").ids, tokenizer.encode("</s>").ids
Output: ([0], [2]) < looks good
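As an extra sanity check (not something I ran at the time, just a sketch), the special tokens can also be looked up directly; given the output above, this should return 0, 1 and 2:

# look up the special token ids directly
tokenizer.token_to_id("<s>"), tokenizer.token_to_id("<pad>"), tokenizer.token_to_id("</s>")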

# save vocab and merges
!mkdir output
tokenizer.save_model("output", "test")
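Side note (my addition, untested here): the tokenizers library can also dump the whole tokenizer into a single tokenizer.json, which the fast tokenizer classes in transformers accept via tokenizer_file. Keeping that file around may be handy later:

# also save the complete tokenizer definition (path is just my choice)
tokenizer.save("output/tokenizer.json")
# it could later be loaded with e.g.
# from transformers import BartTokenizerFast
# tokenizer = BartTokenizerFast(tokenizer_file="output/tokenizer.json")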

# now let's load vocab.json and merges.txt

# import BartTokenizer
from transformers import BartTokenizer

tokenizer = BartTokenizer(
    vocab_file="output/test-vocab.json",
    merges_file="output/test-merges.txt",
    bos_token="<s>",
    eos_token="</s>",
    sep_token="</s>",
    cls_token="<s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

input1 = "Mi estas Julien."

encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[  958,   316, 14540,   276,    18]])  < looks good

input1 = "<s>Mi estas Julien.</s>"

encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[   32,    87,    34,   958,   316, 14540,   276,    18,   918,    87,  34]]) < ?

# <s> is now [32, 87, 34] ???
input1 = "<s>"

encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[32, 87, 34]]) < ???

It seems encoding and decoding are working, but only the special tokens are not handled correctly.
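For what it's worth (a check I did not include above, just a sketch), the vocabulary mapping itself seems intact; only the splitting of special tokens inside raw text misbehaves:

# the vocab mapping itself looks fine
tokenizer.convert_tokens_to_ids("<s>"), tokenizer.bos_token_id
# expected: (0, 0), matching the ids from training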

Could you give me a hint to fix this problem?
My ultimate goal is to train a Bart model for my language.
Or is it okay that the tokenizer encodes <s> into 3 tokens?
Or can I modify vocab.json and merges.txt manually so that BartTokenizer encodes <s> as [0]?

Thanks in advance.

[UPDATED] I found a workaround.

It seems like initializing BartTokenizer from vocab.json and merges.txt causes the problem.
Even when I initialize BartTokenizer with the vocab.json and merges.txt from RoBERTa's pre-trained tokenizer, the same problem happens.

Here is my code.

# import BartTokenizer
from transformers import BartTokenizer

tokenizer = BartTokenizer(
    vocab_file="roberta/vocab.json",
    merges_file="roberta/merges.txt",
    bos_token="<s>",
    eos_token="</s>",
    sep_token="</s>",
    cls_token="<s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)

vocab.json and merges.txt were downloaded from:
https://huggingface.co/roberta-base/resolve/main/vocab.json
https://huggingface.co/roberta-base/resolve/main/merges.txt

input1 = "This is a pen."
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[ 713,   16,   10, 7670,    4]])

input1 = "<s>This is a pen.</s>"
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[41552,    29, 15698,   713,    16,    10,  7670, 49803,    29, 15698]]) < ???


input1 = "<s>"
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[41552,    29, 15698]]) < ???
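Just to see what those ids are (my addition, not run at the time), converting them back should show that the <s> tag is being split into pieces instead of being kept as one special token:

# inspect the pieces that <s> was split into
tokenizer.convert_ids_to_tokens([41552, 29, 15698])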

I got a similar problem even with from_pretrained.

tokenizer = BartTokenizer.from_pretrained('facebook/bart-base', add_prefix_space=True)

input1 = "This is a pen."
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[ 152,   16,   10, 7670,    4]])

input1 = "<s> This is a pen.</s>"
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[1437,    0,  152,   16,   10, 7670,    4,    2]]) < ???

encoded = tokenizer("<s>", add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[1437,    0]]) <<< ???
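Here the <s> itself does come out as 0, so the extra 1437 probably comes from add_prefix_space=True (my guess is that it is the bare 'Ġ' prefix-space token); a quick, untested way to check:

# inspect the extra leading id
tokenizer.convert_ids_to_tokens([1437, 0])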

But when I use AutoTokenizer, it works fine.

from transformers import AutoTokenizer

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/bart-base",
)

input1 = "This is a pen."
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[ 713,   16,   10, 7670,    4]])

input1 = "<s>This is a pen.</s>"
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[   0,  713,   16,   10, 7670,    4,    2]])

encoded = tokenizer("<s>", add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[0]])  < looks good
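One difference I noticed afterwards (my own check, just a sketch) is the class that AutoTokenizer returns; by default it gives the fast, Rust-backed tokenizer, which is presumably why the special tokens are handled here:

# AutoTokenizer returns the fast tokenizer by default
type(tokenizer).__name__
# expected: 'BartTokenizerFast'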

And here is the workaround I found:
Saving a pre-trained tokenizer first and then replacing its vocab.json and merges.txt with the files created by ByteLevelBPETokenizer works.

# save tokenizer model.
tokenizer.save_pretrained("./saved_model")

# replace vocab.json and merges.txt in ./saved_model with the files created by ByteLevelBPETokenizer
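# (sketch of the manual replacement step above; paths assume the "output"
#  directory created by the ByteLevelBPETokenizer run earlier)
import shutil
shutil.copy("output/test-vocab.json", "./saved_model/vocab.json")
shutil.copy("output/test-merges.txt", "./saved_model/merges.txt")
# my assumption, untested: if save_pretrained also wrote a tokenizer.json,
# it may need to be deleted so that the replaced files are actually used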

# load tokenizer model 
tokenizer = AutoTokenizer.from_pretrained('./saved_model/')

input1 = "Mi estas Julien."
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[  958,   316, 14540,   276,    18]])

input1 = "<s>Mi estas Julien.</s>"
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[    0,   958,   316, 14540,   276,    18,     2]])

input1 = "<s>"
encoded = tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[0]]) < looks good!