Hello,
I have a special case: I want to use a hand-written vocab in a notebook that uses AutoTokenizer, but I can’t find a way to do this. (It’s for a non-language sequence problem, where I’m pretraining very small models with a vocab designed to optimize sequence length, vocab size, and legibility.)
If that’s not possible, what’s the best way to use my fixed vocab? In the past I used BertWordPieceTokenizer, loaded directly from the vocab.txt path, but I don’t know how to carry that approach over to the newer Trainer-based workflow in the notebook.
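For what it’s worth, one route I’ve been considering (a sketch, not tested against the notebook; the tiny vocab, file names, and directory name below are all placeholders) is to load the vocab.txt into BertTokenizerFast, save it with save_pretrained, and then point AutoTokenizer.from_pretrained at that directory instead of a hub model name:

```python
# Sketch: turn a hand-written vocab.txt into a directory that
# AutoTokenizer.from_pretrained can load. Vocab contents and paths
# here are illustrative placeholders.
from transformers import AutoTokenizer, BertTokenizerFast

# One token per line; BERT's default special tokens must be present.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "a", "b", "c"]
with open("vocab.txt", "w") as f:
    f.write("\n".join(vocab))

# BertTokenizerFast reads the plain one-token-per-line vocab.txt format.
tokenizer = BertTokenizerFast("vocab.txt")
tokenizer.save_pretrained("my_tokenizer")

# The notebook's AutoTokenizer call can now take this local directory.
tokenizer = AutoTokenizer.from_pretrained("my_tokenizer")
print(tokenizer("a b c")["input_ids"])
```

The loaded tokenizer is a regular transformers tokenizer, so it should drop into Trainer-based code unchanged.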
UPDATE: More specifically, if I try my old method of BertWordPieceTokenizer(vocab='vocab.txt'), it fails later with:
TypeError Traceback (most recent call last)
/tmp/ipykernel_3379/2783002494.py in <module>
3 # Setup train dataset if `do_train` is set.
4 print('Creating train dataset...')
----> 5 train_dataset = get_dataset(model_data_args, tokenizer=tokenizer, evaluate=False) if training_args.do_train else None
6
7 # Setup evaluation dataset if `do_eval` is set.
/tmp/ipykernel_3379/2486475202.py in get_dataset(args, tokenizer, evaluate)
32 if args.line_by_line:
33 # Each example in data file is on each line.
---> 34 return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path,
35 block_size=args.block_size)
36
~/anaconda3/envs/torch_17/lib/python3.8/site-packages/transformers/data/datasets/language_modeling.py in __init__(self, tokenizer, file_path, block_size)
133 lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
134
--> 135 batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
136 self.examples = batch_encoding["input_ids"]
137 self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]
TypeError: 'BertWordPieceTokenizer' object is not callable
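As far as I can tell, the error is because BertWordPieceTokenizer comes from the standalone tokenizers library and doesn’t implement __call__, while LineByLineTextDataset expects a transformers tokenizer it can call directly. A minimal sketch of swapping in the transformers equivalent (the tiny vocab and input lines below are illustrative placeholders):

```python
# Sketch: BertTokenizerFast reads the same vocab.txt format as
# BertWordPieceTokenizer but is callable, which is what the
# LineByLineTextDataset line in the traceback invokes.
from transformers import BertTokenizerFast

# Illustrative vocab; in practice point at the hand-written vocab.txt.
with open("vocab.txt", "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "a", "b", "c"]))

tokenizer = BertTokenizerFast("vocab.txt")

# Mirrors the failing call inside LineByLineTextDataset.
lines = ["a b", "b c"]
batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=128)
print(batch_encoding["input_ids"])
```

If that’s right, the fix is just to construct the tokenizer with BertTokenizerFast instead of BertWordPieceTokenizer before the dataset is built.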
The notebook I’m trying to use is from: github.com/gmihaila/ml_things.git