Using a fixed vocab.txt with AutoTokenizer?


I have a special case where I want to use a hand-written vocab with a notebook that’s using AutoTokenizer, but I can’t find a way to do this (it’s for a non-language sequence problem, where I’m pretraining very small models with a vocab designed to optimize sequence length, vocab size, and legibility).

If it’s not possible, what’s the best way to use my fixed vocab? In the past I used BertWordPieceTokenizer, loaded directly with the vocab.txt path, but I don’t know how to use that approach with the newer Trainer-based workflow in the notebook.
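For reference, the old approach looked roughly like this (a minimal sketch, assuming the `tokenizers` library; the tiny vocab written here is just a stand-in for the real hand-written vocab.txt):

```python
from tokenizers import BertWordPieceTokenizer

# Stand-in vocab for illustration: one token per line, line number = token id.
# The real file is the hand-written vocab.txt; BERT's special tokens must be present.
with open("vocab.txt", "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "a", "b", "ab"]))

tokenizer = BertWordPieceTokenizer("vocab.txt")

# Note: encoding goes through the .encode() method, which returns an
# Encoding object -- the tokenizer instance itself is NOT callable.
encoding = tokenizer.encode("ab a")
print(encoding.ids)  # ids wrapped in [CLS] ... [SEP]
```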

UPDATE: More specifically, if I try my old method of using BertWordPieceTokenizer(vocab='vocab.txt') it fails later with:

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_3379/ in <module>
      3 # Setup train dataset if `do_train` is set.
      4 print('Creating train dataset...')
----> 5 train_dataset = get_dataset(model_data_args, tokenizer=tokenizer, evaluate=False) if training_args.do_train else None
      7 # Setup evaluation dataset if `do_eval` is set.

/tmp/ipykernel_3379/ in get_dataset(args, tokenizer, evaluate)
     32   if args.line_by_line:
     33     # Each example in data file is on each line.
---> 34     return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, 
     35                                  block_size=args.block_size)

~/anaconda3/envs/torch_17/lib/python3.8/site-packages/transformers/data/datasets/ in __init__(self, tokenizer, file_path, block_size)
    133             lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
--> 135         batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
    136         self.examples = batch_encoding["input_ids"]
    137         self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]

TypeError: 'BertWordPieceTokenizer' object is not callable

The notebook I’m trying to use is from:

Okay, so obviously I’m not a Python guy… I see there’s some insanity in the language that allows class instances to be callable… (why, Python… WHY???) :sob: :rofl: …so I’m a bit stumped, but presumably the problem is that BertWordPieceTokenizer is not a subclass of PreTrainedTokenizer (which defines `__call__`, the crazy attribute that makes its instances callable).
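For what it’s worth, the “callable instance” bit is just Python’s `__call__` protocol; a toy example:

```python
class Greeter:
    # Defining __call__ makes instances of the class callable,
    # which is exactly what PreTrainedTokenizer does under the hood.
    def __call__(self, name):
        return f"hello {name}"

g = Greeter()
print(g("world"))   # prints "hello world"
print(callable(g))  # True
```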

I’m really stuck. I’d just like to plug in my custom tokenizer, but it seems that as soon as LineByLineTextDataset calls the tokenizer, I’m going to hit the same callable error. I tried running with the default tokenization, and although my vocab went down from 1073 to 399 tokens, my sequence length went up from 128 to 833 tokens. Hence the desire to load my tokenizer from the hand-written vocab.


UPDATE: Okay, I hadn’t realized I could do it with BertTokenizerFast. I haven’t totally verified that this is working, but so far it looks correct. :crossed_fingers:
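In case it helps anyone else, the BertTokenizerFast route is roughly this (a sketch; the tiny vocab written here is a stand-in for the real hand-written vocab.txt):

```python
from transformers import BertTokenizerFast

# Stand-in vocab for illustration; the real file is the hand-written vocab.txt.
with open("vocab.txt", "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "a", "b", "ab"]))

tokenizer = BertTokenizerFast(vocab_file="vocab.txt")

# Unlike BertWordPieceTokenizer, this instance IS callable, so the
# `tokenizer(lines, ...)` call inside LineByLineTextDataset works.
batch = tokenizer(["ab a", "b"], add_special_tokens=True,
                  truncation=True, max_length=128)
print(batch["input_ids"])
```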