Build a RoBERTa tokenizer from scratch

Hi there,
I am trying to train a RoBERTa model from scratch for Chinese.
The first step is to build a new tokenizer.

First, I followed the steps in the quicktour. After the tokenizer training was done, I used run_mlm.py to train the new model.

However, the RoBERTa model training fails, and I noticed two things:

  1. The output of tokenizer(text) is <s> </s>. No matter what the text is, the output is always <s> </s>. Nothing is encoded.
  2. There is no Ġ symbol in the generated merges.txt file.

The merges.txt contains:

#version: 0.2 - Trained by huggingface/tokenizers
什 么
怎 么
可 以
手 机
...

The code I used to train the tokenizer:

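from typing import List

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def build_BPE_tokenizer(
        train_files: List[str],   # paths of the training text files
        output_dir: str,          # directory that receives vocab.json and merges.txt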
        # name: str,
        vocab_size: int,
        min_frequency: int):

    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(
        vocab_size=vocab_size, min_frequency=min_frequency,
        special_tokens=["<s>", "<pad>", "</s>",
                        "<unk>", "<mask>"]
    )

    tokenizer.train(trainer, train_files)
    tokenizer.model.save(output_dir)

And examples of training data:

喜欢 打篮球 的 男生 喜欢 什么样 的 女生 
爱 打篮球 的 男生 喜欢 什么样 的 女生
我 手机 丢 了 , 我想 换个 手机 
我想 买个 新手机 , 求 推荐

How can I fix the problem? Any help is appreciated!
Thanks for the help!

Pinging @Narsil

Hi @flyaway,

I can’t reproduce your problem. What version of tokenizers are you using?

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
import tokenizers


print(tokenizers.__version__)
# 0.9.4


def build_BPE_tokenizer(
    train_files,
    output_dir,
    # name: str,
    vocab_size: int,
    min_frequency: int,
):

    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        min_frequency=min_frequency,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )

    tokenizer.train(trainer, train_files)
    tokenizer.model.save(output_dir)
    return tokenizer

# test.txt contains the examples you gave
tokenizer = build_BPE_tokenizer(["test.txt"], "out", 100, 1)

print(tokenizer.encode("喜欢 打篮球 的 男生 喜欢 什么样 的 女生 ").tokens)
# Output is ['喜欢', '打篮球', '的', '男生', '喜欢', '什么样', '的', '女生']

Are you sure you were using the correct tokenizer?

@Narsil Thanks for the reply.
I think I may have misled you. The problem appeared AFTER I used run_mlm.py. The script I used looks like this:

output_dir="../embeddings/roberta-chinese/"
CUDA_VISIBLE_DEVICES=0 python ../run_mlm.py \
    --output_dir=$output_dir \
    --model_type=roberta \
    --do_train \
    --tokenizer_name=$output_dir \
    --config_name=$output_dir \
    --train_file=$train_file \
    --do_eval \
    --validation_file=$validation_file  \
    --line_by_line \
    --overwrite_output_dir \
    --max_steps=$max_steps

In ../embeddings/roberta-chinese/, there is a config file that looks like this:

{
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 30000
}

After training, I used the Hugging Face AutoTokenizer to load the trained model and tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('../embeddings/roberta-chinese')

encode_dict = tokenizer('这是 一个 测试')
print(encode_dict)  #{'input_ids': [0, 2], 'attention_mask': [1, 1]}

I do not understand the problem, because I used BERT (with the WordPiece class in tokenizers) in exactly the same way and it works fine.

If you consider the number of characters/ideograms in Chinese, does it make any sense to have multi-character tokens? You should at least make sure that all CJKV characters are in the vocabulary, so that none are left unencoded. I wonder what a reasonable number of tokens would be then.
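To make the coverage point concrete, here is a minimal sketch of how one might list the characters that occur in a corpus but have no single-character entry in a vocabulary. The helper name uncovered_characters and the toy vocabulary are made up for illustration; with a real tokenizer you would pass in the keys of your own vocab.json.

```python
def uncovered_characters(lines, vocab):
    """Return the non-space characters found in `lines` that are not keys of `vocab`."""
    seen = set()
    for line in lines:
        seen.update(ch for ch in line if not ch.isspace())
    # set(vocab) takes the dictionary keys, i.e. the token strings.
    return sorted(seen - set(vocab))


# Two of the training lines quoted in the question.
corpus = [
    "我 手机 丢 了 , 我想 换个 手机",
    "我想 买个 新手机 , 求 推荐",
]

# Hypothetical toy vocabulary (token -> id), not the one from the thread.
toy_vocab = {"<unk>": 0, "我": 1, "手": 2, "机": 3, "手机": 4, "想": 5, ",": 6}

missing = uncovered_characters(corpus, toy_vocab)
print(missing)
```

Any character reported here could only be encoded as <unk> (or dropped) by a character-level BPE built on that vocabulary, so an empty list is what you would want to see before settling on a vocabulary size.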

Ohhh I see.

You’re only calling tokenizer.model.save(..) in your script, which saves just the model part, but AutoTokenizer.from_pretrained() needs the full tokenizer: call tokenizer.save('tokenizer.json') and put tokenizer.json in the correct directory.

For a bit of background on what we call the model vs. the full tokenizer in the tokenizers library, you can check: https://huggingface.co/docs/tokenizers/python/latest/pipeline.html.
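As a rough sketch of the difference between the two save calls, the snippet below trains a tiny tokenizer on the example sentences from the question and writes both outputs to a temporary directory. Assumptions on my side: it uses train_from_iterator, which needs a more recent tokenizers release than the 0.9.4 shown earlier, and the temporary directory, the unk_token setting, and the in-memory corpus are only there to keep the example self-contained.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=100, min_frequency=1, special_tokens=special_tokens
)

# Example sentences quoted in the question, used as an in-memory corpus.
corpus = [
    "喜欢 打篮球 的 男生 喜欢 什么样 的 女生",
    "爱 打篮球 的 男生 喜欢 什么样 的 女生",
    "我 手机 丢 了 , 我想 换个 手机",
    "我想 买个 新手机 , 求 推荐",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

out_dir = tempfile.mkdtemp()

# Model only: writes vocab.json and merges.txt, without the pre-tokenizer.
tokenizer.model.save(out_dir)

# Full tokenizer: one JSON file that also records the pre-tokenizer
# and the special tokens added by the trainer.
tokenizer_file = os.path.join(out_dir, "tokenizer.json")
tokenizer.save(tokenizer_file)

# Reloading the full file gives back a tokenizer that behaves the same.
reloaded = Tokenizer.from_file(tokenizer_file)
tokens = reloaded.encode("喜欢 打篮球 的 男生").tokens
print(sorted(os.listdir(out_dir)))
print(tokens)
```

The directory then holds all three files, and it is the tokenizer.json one that the fast tokenizers in transformers look for.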

Bonus: a small doctor script that rebuilds your tokenizer so that AutoTokenizer.from_pretrained can load it:

from tokenizers import models, Tokenizer, pre_tokenizers

# These are the two files exported by the previous script and present in ../embeddings/roberta-chinese
tokenizer = Tokenizer(models.BPE.from_file("out/vocab.json", "out/merges.txt"))
# Notice that we need to specify the pre-tokenizer again, as it is not stored in vocab.json or merges.txt.
# Careful: your special tokens are still missing at this point. You should probably re-add them,
# but you need to check your files to see what was done with them at training time.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

tokenizer.save("out/tokenizer.json")


# Now let's check that it works (tested on 4.0)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("out")
print(tokenizer.encode("喜欢 打篮球 的 男生 喜欢 什么样 的 女生 "))
# [32, 40, 28, 39, 32, 35, 28, 36]
# Notice that we don't need `.tokens` here, unlike in the previous script.
# `transformers` tokenizers and `tokenizers.Tokenizer` differ slightly
# in that regard (because of backward compatibility).

Is that clearer?
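A side note on the second observation in the original question (no Ġ in merges.txt). As far as I know, Ġ is how the byte-level BPE used by GPT-2 and RoBERTa writes the space byte, so it only shows up when a byte-level pre-tokenizer is used. With Whitespace() the spaces are dropped before BPE sees the text, so a missing Ġ is expected there rather than a symptom. The sketch below re-implements that byte-to-character table in plain Python for illustration; the function name follows the GPT-2 code, and this is not the tokenizers library's own implementation.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every other byte (control characters, space, ...) is shifted to 256 and above.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


table = bytes_to_unicode()
space_symbol = table[ord(" ")]
print(space_symbol)

# A word preceded by a space, as a byte-level tokenizer would see it.
encoded = "".join(table[b] for b in " 手机".encode("utf-8"))
print(encoded)
```

Note that in this byte-level view each Chinese character becomes three such symbols (one per UTF-8 byte), which is one reason the merges file of a byte-level tokenizer looks very different from the one posted above.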