Hi there,
I'm trying to train a RoBERTa model from scratch for Chinese.
The first step is to build a new tokenizer.
First, I followed the steps in the quicktour. After the tokenizer training was done, I used run_mlm.py to train the new model.
However, the RoBERTa model training fails, and I have two observations:
- The output of tokenizer(text) is <s> </s>. No matter what the text is, the output is always <s> </s>; nothing gets encoded.
- There is no Ġ symbol in the generated merges.txt file.
The merges.txt contains:
#version: 0.2 - Trained by huggingface/tokenizers
什 么
怎 么
可 以
手 机
...
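For reference, my understanding is that the Ġ in a RoBERTa-style merges.txt comes from byte-level BPE: GPT-2's byte-to-unicode trick maps every raw byte to a printable character, and the space byte ends up as Ġ (U+0120). A minimal sketch of that mapping, reimplemented from memory rather than taken from the actual tokenizers internals:

```python
def bytes_to_unicode():
    """GPT-2-style byte-to-unicode map: printable bytes map to
    themselves, all other bytes are shifted into printable code
    points above 255."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # prints 'Ġ' (U+0120)
```

So a merges.txt without any Ġ suggests the tokenizer was never byte-level to begin with, which matches my setup below.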
The code I used to train the tokenizer:

from typing import List

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def build_BPE_tokenizer(
        train_files: List[str],
        output_dir: str,
        vocab_size: int,
        min_frequency: int):
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
    # Train on the given text files and save vocab.json / merges.txt
    tokenizer.train(train_files, trainer=trainer)
    tokenizer.model.save(output_dir)
And here are some examples of the training data (one sentence per line, pre-segmented with spaces):
喜欢 打篮球 的 男生 喜欢 什么样 的 女生
爱 打篮球 的 男生 喜欢 什么样 的 女生
我 手机 丢 了 , 我想 换个 手机
我想 买个 新手机 , 求 推荐
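For what it's worth, I expected the Whitespace pre-tokenizer to split those pre-segmented lines into word chunks before BPE runs; roughly like this (a plain str.split stand-in; the real pre-tokenizer also splits off punctuation):

```python
# Rough stand-in for tokenizers' Whitespace pre-tokenizer applied to
# one of the pre-segmented training lines above: split on whitespace.
line = "我 手机 丢 了 , 我想 换个 手机"
chunks = line.split()
print(chunks)  # ['我', '手机', '丢', '了', ',', '我想', '换个', '手机']
```

So the trainer should be seeing whole segmented words as its input units.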
How can I fix this problem? Any help is appreciated, thanks!