Train T5 tokenizer

Hi everyone :slight_smile:

I am trying to train the T5 model on the ARQMath corpus. This corpus contains a lot of mathematical content written in LaTeX, which the original T5Tokenizer does not handle well. So my first step is to “fine-tune” a T5Tokenizer on the ARQMath corpus.
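
For example, a quick check along these lines (just a sketch, the exact splits depend on the tokenizer version) shows what I mean: the LaTeX commands end up split into many sub-tokens or mapped to <unk>.

from transformers import AutoTokenizer

# Quick sanity check (sketch): see how the stock tokenizer splits LaTeX.
# The exact output depends on the tokenizer version; in my case the curly
# braces, for instance, do not seem to be in the vocabulary at all.
tok = AutoTokenizer.from_pretrained('google/flan-t5-base')
sample = r'\frac{a^2 + b^2}{2} \leq \sum_{i=1}^{n} x_i'
print(tok.tokenize(sample))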

To this end, I followed chapter 6 of HF’s NLP course. However, I am getting the following error: AttributeError: 'T5Tokenizer' object has no attribute 'train_new_from_iterator'. The code I am using is given below:

import pandas as pd
from transformers import AutoTokenizer
from datasets import Dataset
from tqdm.notebook import tqdm


df = pd.read_csv('../data/answers.tsv', sep='\t')
# remove rows where answer_body is empty
df = df.dropna(subset=['answer_body'])

raw_dataset = Dataset.from_pandas(df)

# Create a generator object
def get_training_corpus():
    for start_idx in tqdm(range(0, len(raw_dataset), 1000)):
        samples = raw_dataset[start_idx : start_idx + 1000]
        yield samples['answer_body']


training_corpus = get_training_corpus()

model_name = 'google/flan-t5-base'
t5_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

new_vocab_size = t5_tokenizer.vocab_size + 300

new_tokenizer = t5_tokenizer.train_new_from_iterator(
  training_corpus,
  new_vocab_size,
  show_progress=True,
)
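
For reference, this is the variant I expected to work, since as far as I can tell train_new_from_iterator is only implemented on the fast, tokenizers-backed classes (just a sketch, reusing model_name, new_vocab_size and get_training_corpus from above):

# Sketch: same idea, but with the fast (Rust-backed) tokenizer, which does
# provide train_new_from_iterator. use_fast=True is the default anyway.
fast_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

new_tokenizer = fast_tokenizer.train_new_from_iterator(
    get_training_corpus(),   # fresh generator object
    vocab_size=new_vocab_size,
    show_progress=True,
)
new_tokenizer.save_pretrained('arqmath-flan-t5-tokenizer')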

In the resources section of the T5 model page in HF’s documentation, I’ve also found another way of training my tokenizer, but that method seems to train the tokenizer from scratch, which means I would lose all the information captured in the original T5Tokenizer, right?
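
If I do go the from-scratch route, I suppose I could at least quantify what gets lost by checking the vocabulary overlap, something along these lines (sketch; new_tok is a placeholder for whatever tokenizer that script produces):

# Sketch: how much of the original vocabulary would a from-scratch
# tokenizer still cover? 'new_tok' is a placeholder here.
old_vocab = set(t5_tokenizer.get_vocab())
new_vocab = set(new_tok.get_vocab())
overlap = len(old_vocab & new_vocab) / len(old_vocab)
print(f'{overlap:.1%} of the original pieces also appear in the new vocabulary')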

Any help is appreciated :hugs: Cheers! :beer:

I am now using this script to train both versions of the T5 tokenizer:

def main():
    # args parsing and imports were omitted to fit in discord
    col_name = args['col_name']
    model_name = args['model_name']
    output_dir = Path(args['output_dir'])
    dataset_dir = Path(args['dataset_dir'])
    vocab_size = args['vocab_size']
    batch_size = args['batch_size']

    # Constant
    input_sentence_size = None

    df = pd.read_csv(dataset_dir, sep='\t')
    dataset = Dataset.from_pandas(df)

    tokenizer = SentencePieceUnigramTokenizer(
        unk_token="<unk>", eos_token="</s>", pad_token="<pad>"
    )
    fast_tokenizer = T5TokenizerFast.from_pretrained(model_name)

    # Build an iterator over this dataset
    def batch_iterator(input_sentence_size=None):
        if input_sentence_size is None:
            input_sentence_size = len(dataset)
        batch_length = batch_size
        for i in range(0, input_sentence_size, batch_length):
            yield dataset[i : i + batch_length][col_name]

    if not output_dir.exists():
        output_dir.mkdir()

    tokenizer.train_from_iterator(
        iterator=batch_iterator(input_sentence_size=input_sentence_size),
        vocab_size=vocab_size,
        show_progress=True,
    )
    tokenizer.save(str(output_dir / "tokenizer.json"))

    new_fast_tokenizer = fast_tokenizer.train_new_from_iterator(
        batch_iterator(input_sentence_size=input_sentence_size),
        vocab_size,
        show_progress=True,
    )
    # save_pretrained is the supported way to persist a transformers fast tokenizer
    new_fast_tokenizer.save_pretrained(str(output_dir / "fast_tokenizer"))

    config = T5Config.from_pretrained(model_name, vocab_size=tokenizer.get_vocab_size())
    config.save_pretrained(output_dir)
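
For completeness, this is roughly how I plan to load the results back later (a sketch; the wrapping of tokenizer.json in PreTrainedTokenizerFast and the special tokens are my own assumptions):

from transformers import PreTrainedTokenizerFast

# Sketch: reload both tokenizers produced by the script above.
# The from-scratch one only has a tokenizer.json, so it is wrapped manually.
scratch_tok = PreTrainedTokenizerFast(
    tokenizer_file=str(output_dir / "tokenizer.json"),
    unk_token="<unk>", eos_token="</s>", pad_token="<pad>",
)
# The adapted one was saved with save_pretrained, so it loads like any checkpoint.
adapted_tok = T5TokenizerFast.from_pretrained(str(output_dir / "fast_tokenizer"))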

Training the slow tokenizer takes only 10 minutes, but training the fast tokenizer takes 5 hours. Is that normal?

Also, is it okay to train one tokenizer from scratch and the other from an already existing checkpoint?
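
To compare the two, I’m planning a quick check along these lines (sketch; scratch_tok and adapted_tok come from the reload snippet above, model_name from the script):

# Sketch: compare how the two new tokenizers and the original one split
# the same LaTeX formula (names from the snippets above).
sample = r'\int_0^1 \frac{\sin(x)}{x} \, dx'
original_tok = T5TokenizerFast.from_pretrained(model_name)
print('original    :', original_tok.tokenize(sample))
print('from scratch:', scratch_tok.tokenize(sample))
print('adapted     :', adapted_tok.tokenize(sample))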

Have you solved this issue?
I also need to train a pre-trained tokenizer (BertTokenizer) on a new corpus, just like in your case.
Thanks!! :wink: