Hi, I want to create vocab.json and merges.txt and use them with BartTokenizer.
But somehow the tokenizer encodes <s> into [32, 87, 34], which was originally [0].
Could you show me how to create vocab.json and merges.txt correctly?
Or maybe my way of loading vocab.json and merges.txt is wrong.
Anyway, here is what I did.
# in this notebook we'll only get one of the files (the Oscar one) for the sake of simplicity and performance
# !wget -c https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt
# import
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
paths = [str(x) for x in Path(".").glob("**/*.txt")]
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
# check a sentence.
input1 = "Mi estas Julien."
tokenizer.encode("Mi estas Julien.").tokens
Output: ['Mi', 'Ġestas', 'ĠJuli', 'en', '.'] < looks good.
tokenizer.encode("Mi estas Julien.").ids
Output: [958, 316, 14540, 276, 18] < looks good
# check <s> and </s>
tokenizer.encode("<s>").ids, tokenizer.encode("</s>").ids
Output: ([0], [2]) < looks good
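Just to double-check, I think the special tokens can also be looked up directly in the trained vocab (assuming token_to_id is the right call for this):
# look the special tokens up directly in the trained vocab
tokenizer.token_to_id("<s>"), tokenizer.token_to_id("</s>"), tokenizer.token_to_id("<pad>")
I would expect (0, 2, 1) here, matching the order I passed in special_tokens.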
# save vocab.json and merges.txt
!mkdir output
tokenizer.save_model("output","test")
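As far as I understand, save_model writes two files, test-vocab.json and test-merges.txt, into the output directory, so listing it should show exactly those:
# check which files were written
import os
sorted(os.listdir("output"))
I would expect ['test-merges.txt', 'test-vocab.json'] here.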
# now let's load vocab.json and merges.txt
# import BartTokenizer
from transformers import BartTokenizer
tokenizer = BartTokenizer(
    vocab_file="output/test-vocab.json",
    merges_file="output/test-merges.txt",
    bos_token="<s>",
    eos_token="</s>",
    sep_token="</s>",
    cls_token="<s>",
    unk_token="<unk>",
    pad_token="<pad>",
    mask_token="<mask>",
)
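In case it matters, I think the loaded vocab itself can be checked with convert_tokens_to_ids to confirm the special tokens are there with the expected IDs:
# the special tokens should be in the loaded vocab with their original IDs
tokenizer.convert_tokens_to_ids(["<s>", "<pad>", "</s>"])
I would expect [0, 1, 2] here, so the vocab entries themselves should be fine.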
input1 = "Mi estas Julien."
tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[ 958, 316, 14540, 276, 18]]) < looks good
input1 = "<s>Mi estas Julien.</s>"
tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[ 32, 87, 34, 958, 316, 14540, 276, 18, 918, 87, 34]]) < ?
# <s> is now [32, 87, 34] ???
input1 = "<s>"
tokenizer(input1, add_special_tokens=False, return_tensors="pt").input_ids
Output: tensor([[32, 87, 34]]) < ???
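To see what is actually happening, I guess tokenize can show the pieces before they are converted to IDs:
# how does the string get split?
tokenizer.tokenize("<s>")
I would expect this to show <s> being broken into three pieces (presumably '<', 's', '>') instead of being kept as a single special token.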
It seems encoding and decoding work in general; only the special tokens are not handled correctly.
Could you give me a hint on how to fix this problem?
My ultimate goal is to train a Bart model on my own language.
Or is it okay that the tokenizer encodes <s> into 3 tokens?
Or can I modify vocab.json and merges.txt manually so that BartTokenizer encodes <s> into [0]?
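Just to make the goal concrete, what I would like to end up with (based on the IDs shown above) is something like this:
tokenizer("<s>Mi estas Julien.</s>", add_special_tokens=False).input_ids
# desired: [0, 958, 316, 14540, 276, 18, 2], i.e. <s> and </s> mapped to their single special-token IDs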
Thanks in advance.