Using a fixed vocab.txt with AutoTokenizer?

Hello,

I have a special case where I want to use a hand-written vocab with a notebook that uses AutoTokenizer, but I can’t find a way to do this (it’s for a non-language sequence problem, where I’m pretraining very small models with a vocab designed to optimize sequence length, vocab size, and legibility).

If it’s not possible, what’s the best way to use my fixed vocab? In the past I used BertWordPieceTokenizer, loaded directly with the vocab.txt path, but I don’t know how to carry that over to the newer Trainer-based setup in the notebook.

UPDATE: More specifically, if I try my old method of using BertWordPieceTokenizer(vocab='vocab.txt'), it fails later with:

TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_3379/2783002494.py in <module>
      3 # Setup train dataset if `do_train` is set.
      4 print('Creating train dataset...')
----> 5 train_dataset = get_dataset(model_data_args, tokenizer=tokenizer, evaluate=False) if training_args.do_train else None
      6 
      7 # Setup evaluation dataset if `do_eval` is set.

/tmp/ipykernel_3379/2486475202.py in get_dataset(args, tokenizer, evaluate)
     32   if args.line_by_line:
     33     # Each example in data file is on each line.
---> 34     return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, 
     35                                  block_size=args.block_size)
     36 

~/anaconda3/envs/torch_17/lib/python3.8/site-packages/transformers/data/datasets/language_modeling.py in __init__(self, tokenizer, file_path, block_size)
    133             lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
    134 
--> 135         batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)
    136         self.examples = batch_encoding["input_ids"]
    137         self.examples = [{"input_ids": torch.tensor(e, dtype=torch.long)} for e in self.examples]

TypeError: 'BertWordPieceTokenizer' object is not callable

The notebook I’m trying to use is from: github.com/gmihaila/ml_things.git

Okay, so obviously I’m not a Python guy… I see there’s some insanity in the language that allows class instances to be callable… (why, Python… WHY???) :sob: :rofl: …so I’m a bit stumped, but presumably it has to do with the fact that BertWordPieceTokenizer is not a subclass of PreTrainedTokenizer (which has the crazy attribute of being callable).
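
If I’m reading it right, the tokenizers-library object only exposes encode()/encode_batch(), while LineByLineTextDataset calls the tokenizer instance directly. A rough sketch of the difference (assuming vocab.txt is sitting in the working directory):

from tokenizers import BertWordPieceTokenizer

raw_tok = BertWordPieceTokenizer(vocab="vocab.txt")

lines = ["A B C D", "E F G"]  # toy stand-ins for my non-language sequences

# The tokenizers-library object only exposes encode()/encode_batch()...
ids = [enc.ids for enc in raw_tok.encode_batch(lines)]

# ...but LineByLineTextDataset does tokenizer(lines, add_special_tokens=True, ...),
# i.e. it calls the instance itself, which only works for transformers tokenizers
# (PreTrainedTokenizer / PreTrainedTokenizerFast define __call__).
# raw_tok(lines)  # -> TypeError: 'BertWordPieceTokenizer' object is not callable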

I’m really stuck. I’d just like to plug in my custom tokenizer, but it seems that when I hit “LineByLineTextDataset”, I’m going to hit the same callable error. I tried running with the default tokenization, and although my vocab went down from 1073 to 399 tokens, my sequence length went up from 128 to 833 tokens. Hence the desire to load my tokenizer from the hand-written vocab.

Aack!

UPDATE: Okay, I hadn’t realized I could do it with BertTokenizerFast. I haven’t totally verified that this is working, but so far it looks correct. :crossed_fingers:
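
In case it helps anyone else, loading the hand-written vocab with BertTokenizerFast looks something like this (a sketch, not fully verified; do_lower_case and the special-token names need to match whatever is actually in your vocab.txt):

from transformers import BertTokenizerFast

# Build a fast tokenizer straight from the hand-written vocab file.
# The special tokens listed here must all be present in vocab.txt.
tokenizer = BertTokenizerFast(
    vocab_file="vocab.txt",
    do_lower_case=False,      # assumption: the custom vocab is case-sensitive
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Unlike BertWordPieceTokenizer, this instance is callable, so
# LineByLineTextDataset / Trainer accept it as-is.
enc = tokenizer(["A B C D"], add_special_tokens=True, truncation=True, max_length=128)
print(enc["input_ids"])

Since BertTokenizerFast is a PreTrainedTokenizerFast subclass, the instance is callable, so it drops straight into LineByLineTextDataset and the Trainer.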