Using a BertWordPieceTokenizer trained from scratch in transformers

Hey everyone,

I’d like to load a BertWordPieceTokenizer that I trained from scratch using the interfaces built into transformers, either BertTokenizer or BertTokenizerFast. It looks like these two tokenizers expect the data saved by BertWordPieceTokenizer to be loaded in different ways, and I am wondering what the best way is to go about this.

Example

I am training on a couple of test files, saving the tokenizer, and then reloading it with transformers.BertTokenizer (there is a bit of ceremony here creating the test data, but this is everything you need to reproduce the behavior I am seeing):

from pathlib import Path

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer


def test_text():
    text = [
        "This is a test, just a test",
        "nothing more, nothing less"
    ]

    return text


def create_test_files():
    test_path = Path("tmp")
    test_path.mkdir()

    test_data = test_text()

    for idx, text in enumerate(test_data):
        file = test_path.joinpath(f"file{idx}.txt")
        with open(file, "w") as f:
            f.write(text)

    return test_path


def cleanup_test(path):
    path = Path(path)

    for child in path.iterdir():
        if child.is_file():
            child.unlink()
        else:
            cleanup_test(child)  # recurse into subdirectories

    path.rmdir()


def create_tokenizer_savepath():
    savepath = Path("./bert")
    savepath.mkdir()
    return str(savepath)


def main():
    # Saving two text files to train the tokenizer
    test_path = create_test_files()

    files = test_path.glob("**/*.txt")
    files = [str(f) for f in files]

    tokenizer = BertWordPieceTokenizer(
        clean_text=True,
        strip_accents=True,
        lowercase=True,
    )

    tokenizer.train(
        files,
        vocab_size=15,
        min_frequency=1,
        show_progress=True,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        limit_alphabet=1000,
        wordpieces_prefix="##",
    )

    savepath = create_tokenizer_savepath()
    tokenizer.save_model(savepath, "pubmed_bert")

    # Reload by pointing at the saved vocab file directly; this is the
    # call that triggers the deprecation warning shown below
    tokenizer = BertTokenizer.from_pretrained(
        f"{savepath}/pubmed_bert-vocab.txt",
        max_len=512
    )

    print(tokenizer)

    cleanup_test(test_path)
    cleanup_test(savepath)


if __name__ == "__main__":
    main()

Loading the Trained Tokenizer

Specifying the full path to pubmed_bert-vocab.txt works, but emits a deprecation warning:

Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
PreTrainedTokenizer(name_or_path='bert/pubmed_bert-vocab.txt', vocab_size=30, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

But if I just specify the path to the directory containing pubmed_bert-vocab.txt, I get an OSError:

Traceback (most recent call last):
  File "minimal_tokenizer.py", line 86, in <module>
    main()
  File "minimal_tokenizer.py", line 76, in main
    max_len=512
  File "/home/ygx/opt/local/anaconda3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'bert'. Make sure that:

- 'bert' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'bert' is the correct path to a directory containing relevant tokenizer files

The directory I am saving to contains only pubmed_bert-vocab.txt. If specifying the full path to that vocab file is deprecated, what is the best way to load this tokenizer?
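
One workaround I have been considering (a sketch on my part, assuming from_pretrained() looks for a file literally named vocab.txt when it is handed a directory) is to copy the saved vocab over to that canonical filename and then load the directory:

import shutil
from pathlib import Path

from transformers import BertTokenizer

savepath = Path("bert")

# Expose the vocab under the canonical filename that from_pretrained()
# appears to expect when given a directory
shutil.copy(savepath / "pubmed_bert-vocab.txt", savepath / "vocab.txt")

tokenizer = BertTokenizer.from_pretrained(str(savepath), max_len=512)
print(tokenizer)

That avoids the deprecated single-file call, but it relies on the vocab.txt naming convention, which I have not seen documented anywhere.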

Using BertTokenizerFast

If I swap out BertTokenizer for BertTokenizerFast, and pass in the path to the directory where I have saved my tokenizer trained from scratch, I get the same error:

Traceback (most recent call last):
  File "minimal_tokenizer.py", line 86, in <module>
    main()
  File "minimal_tokenizer.py", line 76, in main
    max_len=512
  File "/home/ygx/opt/local/anaconda3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'bert'. Make sure that:

- 'bert' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'bert' is the correct path to a directory containing relevant tokenizer files

And if I specify the path to the file saved by my tokenizer (pubmed_bert-vocab.txt), I get a ValueError (instead of the deprecation warning I was getting with BertTokenizer):

Traceback (most recent call last):
  File "minimal_tokenizer.py", line 86, in <module>
    main()
  File "minimal_tokenizer.py", line 76, in main
    max_len=512
  File "/home/ygx/opt/local/anaconda3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1696, in from_pretrained
    "Use a model identifier or the path to a directory instead.".format(cls.__name__)
ValueError: Calling BertTokenizerFast.from_pretrained() with the path to a single file or url is not supported.Use a model identifier or the path to a directory instead.
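
One possible workaround here (again just a sketch, assuming BertTokenizerFast accepts a vocab_file argument in its constructor the way the slow tokenizer does) would be to bypass from_pretrained() entirely and instantiate the class directly:

from transformers import BertTokenizerFast

# Build the fast tokenizer straight from the WordPiece vocab file,
# sidestepping from_pretrained() and its single-file restriction
tokenizer = BertTokenizerFast(
    vocab_file="bert/pubmed_bert-vocab.txt",
    do_lower_case=True,
)
print(tokenizer)

I am not sure whether this is the intended way to do it, though.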

Current Approach

I am currently using BertTokenizer, specifying the full path to pubmed_bert-vocab.txt, and ignoring the deprecation warning. Ideally I would like to use BertTokenizerFast, but I don’t know how to load my saved tokenizer with it. What is the best way forward on this?
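
If the renaming idea from above is actually the sanctioned approach, I would expect the same trick to work for the fast tokenizer too (again, assuming from_pretrained() looks for vocab.txt inside the directory):

import shutil
from pathlib import Path

from transformers import BertTokenizerFast

savepath = Path("bert")

# Same renaming trick as with the slow tokenizer: expose the vocab
# under the canonical vocab.txt filename, then load the directory
shutil.copy(savepath / "pubmed_bert-vocab.txt", savepath / "vocab.txt")

tokenizer = BertTokenizerFast.from_pretrained(str(savepath), max_len=512)
print(tokenizer)

But I would rather hear what the recommended workflow is than keep guessing at file naming conventions.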

pinging @anthony

Should I open an issue on the tokenizers repository for this?