Hey everyone,

I’d like to load a BertWordPieceTokenizer I trained from scratch using the interface built into tokenizers, either with BertTokenizer or BertTokenizerFast from transformers. It looks like those two transformers tokenizers expect the saved BertWordPieceTokenizer data to be loaded in different ways, and I am wondering what the best way to go about this is.
Example
I am training on a couple of test files, saving the tokenizer, and then reloading it with transformers.BertTokenizer (there is a bit of ceremony here creating the test data, but this is everything you need to reproduce the behavior I am seeing):
```python
from pathlib import Path

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer


def test_text():
    return [
        "This is a test, just a test",
        "nothing more, nothing less",
    ]


def create_test_files():
    test_path = Path("tmp")
    test_path.mkdir()
    for idx, text in enumerate(test_text()):
        file = test_path.joinpath(f"file{idx}.txt")
        with open(file, "w") as f:
            f.write(text)
    return test_path


def cleanup_test(path):
    path = Path(path)
    for child in path.iterdir():
        if child.is_file():
            child.unlink()
        else:
            cleanup_test(child)  # recurse into subdirectories
    path.rmdir()


def create_tokenizer_savepath():
    savepath = Path("./bert")
    savepath.mkdir()
    return str(savepath)


def main():
    # Saving two text files to train the tokenizer
    test_path = create_test_files()
    files = [str(f) for f in test_path.glob("**/*.txt")]

    tokenizer = BertWordPieceTokenizer(
        clean_text=True,
        strip_accents=True,
        lowercase=True,
    )
    tokenizer.train(
        files,
        vocab_size=15,
        min_frequency=1,
        show_progress=True,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
        limit_alphabet=1000,
        wordpieces_prefix="##",
    )

    savepath = create_tokenizer_savepath()
    tokenizer.save_model(savepath, "pubmed_bert")

    tokenizer = BertTokenizer.from_pretrained(
        f"{savepath}/pubmed_bert-vocab.txt",
        max_len=512,
    )
    print(tokenizer)

    cleanup_test(test_path)
    cleanup_test(savepath)


if __name__ == "__main__":
    main()
```
Loading the Trained Tokenizer
Specifying the full path to pubmed_bert-vocab.txt loads the tokenizer, but is deprecated:

```
Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
PreTrainedTokenizer(name_or_path='bert/pubmed_bert-vocab.txt', vocab_size=30, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
```
But if I instead specify just the path to the directory containing pubmed_bert-vocab.txt:

```
Traceback (most recent call last):
  File "minimal_tokenizer.py", line 86, in <module>
    main()
  File "minimal_tokenizer.py", line 76, in main
    max_len=512
  File "/home/ygx/opt/local/anaconda3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'bert'. Make sure that:
- 'bert' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'bert' is the correct path to a directory containing relevant tokenizer files
```
The directory I am saving to contains only pubmed_bert-vocab.txt. If specifying the full path to that vocab file is deprecated, what is the best way to load the tokenizer?
Using BertTokenizerFast
If I swap out BertTokenizer for BertTokenizerFast, and pass in the path to the directory where I have saved my tokenizer trained from scratch, I get the same error:
```
Traceback (most recent call last):
  File "minimal_tokenizer.py", line 86, in <module>
    main()
  File "minimal_tokenizer.py", line 76, in main
    max_len=512
  File "/home/ygx/opt/local/anaconda3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'bert'. Make sure that:
- 'bert' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'bert' is the correct path to a directory containing relevant tokenizer files
```
And if I specify the path to the file saved by my tokenizer (pubmed_bert-vocab.txt), I get a ValueError (versus the deprecation warning I was getting with BertTokenizer):

```
Traceback (most recent call last):
  File "minimal_tokenizer.py", line 86, in <module>
    main()
  File "minimal_tokenizer.py", line 76, in main
    max_len=512
  File "/home/ygx/opt/local/anaconda3/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1696, in from_pretrained
    "Use a model identifier or the path to a directory instead.".format(cls.__name__)
ValueError: Calling BertTokenizerFast.from_pretrained() with the path to a single file or url is not supported.Use a model identifier or the path to a directory instead.
```
Current Approach
I am currently using BertTokenizer, specifying the full path to pubmed_bert-vocab.txt, and ignoring the deprecation warning. Ideally, though, I would like to use BertTokenizerFast, and I don’t know how to load my saved tokenizer that way. What is the best way to go forward on this?
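One direction I am experimenting with for the fast tokenizer is skipping from_pretrained() entirely and constructing BertTokenizerFast directly from the vocab file; my understanding (unverified) is that its constructor accepts a vocab_file argument. A sketch, using a tiny stand-in vocab written on the fly in place of my real bert/pubmed_bert-vocab.txt:

```python
import tempfile
from pathlib import Path

from transformers import BertTokenizerFast

# Stand-in for the vocab produced by BertWordPieceTokenizer.save_model();
# in practice this would be bert/pubmed_bert-vocab.txt.
vocab_dir = Path(tempfile.mkdtemp())
vocab_file = vocab_dir / "vocab.txt"
vocab_file.write_text(
    "\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "test", "##ing"])
)

# Assumption: BertTokenizerFast can be built straight from a WordPiece
# vocab file via vocab_file, bypassing the from_pretrained() machinery.
tokenizer = BertTokenizerFast(vocab_file=str(vocab_file), model_max_length=512)
print(tokenizer.tokenize("testing"))
```

If this is a supported path, it would sidestep both the directory-lookup error and the single-file ValueError, but I would still like to know the intended way to do this.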