I’m trying to create a tokenizer with my own dataset/vocabulary using SentencePiece and then use it with transformers’ AlbertTokenizer.
I followed the tutorial on how to train a model from scratch very closely: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=hO5M3vrAhcuj
```python
# import relevant libraries
from pathlib import Path

from tokenizers import SentencePieceBPETokenizer
from tokenizers.processors import BertProcessing
from transformers import AlbertTokenizer

paths = [str(x) for x in Path("./data").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = SentencePieceBPETokenizer(add_prefix_space=True)

# Customize training
tokenizer.train(
    files=paths,
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    special_tokens=["<unk>"],
)

# Saving model
tokenizer.save_model("Sent-AlBERT")

# Reload the saved tokenizer
tokenizer = SentencePieceBPETokenizer(
    "./Sent-AlBERT/vocab.json",
    "./Sent-AlBERT/merges.txt",
)
tokenizer.enable_truncation(max_length=512)
```
Everything works fine up to this point, but it fails when I try to re-create the tokenizer in transformers:
```python
# Re-create our tokenizer in transformers
tokenizer = AlbertTokenizer.from_pretrained("./Sent-AlBERT", do_lower_case=True)
```
This is the error message I kept receiving:
```
OSError: Model name './Sent-AlBERT' was not found in tokenizers model name list
(albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1,
albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2).
We assumed './Sent-AlBERT' was a path, a model identifier, or url to a directory
containing vocabulary files named ['spiece.model'] but couldn't find such
vocabulary files at this path or url.
```
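As the error says, `AlbertTokenizer.from_pretrained` looks for a file named `spiece.model` in the directory, while `SentencePieceBPETokenizer.save_model` writes `vocab.json` and `merges.txt`. A minimal stdlib-only sketch of that mismatch (the directory name matches my setup; the file contents here are empty placeholders, not real tokenizer output):

```python
import os

# Simulate the files that SentencePieceBPETokenizer.save_model("Sent-AlBERT")
# leaves on disk (placeholders only, for illustration).
os.makedirs("Sent-AlBERT", exist_ok=True)
for name in ("vocab.json", "merges.txt"):
    open(os.path.join("Sent-AlBERT", name), "w").close()

# AlbertTokenizer expects a SentencePiece model file with this exact name:
expected = "spiece.model"
present = sorted(os.listdir("Sent-AlBERT"))
print(present)              # → ['merges.txt', 'vocab.json']
print(expected in present)  # → False, hence the OSError above
```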
For some reason, it works with `RobertaTokenizerFast` but not with `AlbertTokenizer`. If anyone could give me a suggestion or any sort of direction on how to use `AlbertTokenizer` here, I would really appreciate it.
P.S.: I also tried `DistilBertTokenizer`, but it couldn’t recognize the saved tokenizer in transformers either. I’m not sure what I am missing here.