Train T5 tokenizer

Hi everyone :slight_smile:

I am trying to train the T5 model on the ARQMath corpus. This corpus contains a lot of mathematical content written in LaTeX, which the original T5Tokenizer does not handle well. So my first step is to “fine-tune” a T5Tokenizer on the ARQMath corpus.
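
For example, a quick check along these lines (just a sketch, the exact splits depend on the tokenizer version) shows what I mean: the LaTeX commands end up split into many sub-tokens or mapped to <unk>.

from transformers import AutoTokenizer

# Quick sanity check (sketch): see how the stock tokenizer splits LaTeX.
# The exact output depends on the tokenizer version; in my case the curly
# braces, for instance, do not seem to be in the vocabulary at all.
tok = AutoTokenizer.from_pretrained('google/flan-t5-base')
sample = r'\frac{a^2 + b^2}{2} \leq \sum_{i=1}^{n} x_i'
print(tok.tokenize(sample))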

To this end, I followed chapter 6 of HF’s NLP course. However, I am getting the following error: AttributeError: 'T5Tokenizer' object has no attribute 'train_new_from_iterator'. The code I am using is given below:

import pandas as pd
from transformers import AutoTokenizer
from datasets import Dataset
from tqdm.notebook import tqdm


df = pd.read_csv('../data/answers.tsv', sep='\t')
# remove rows where answer_body is empty
df = df.dropna(subset=['answer_body'])

raw_dataset = Dataset.from_pandas(df)

# Create a generator object
def get_training_corpus():
    for start_idx in tqdm(range(0, len(raw_dataset), 1000)):
        samples = raw_dataset[start_idx : start_idx + 1000]
        yield samples['answer_body']


training_corpus = get_training_corpus()

model_name = 'google/flan-t5-base'
t5_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

new_vocab_size = t5_tokenizer.vocab_size + 300

new_tokenizer = t5_tokenizer.train_new_from_iterator(
  training_corpus,
  new_vocab_size,
  show_progress=True,
)
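
For reference, this is the variant I expected to work, since as far as I can tell train_new_from_iterator is only implemented on the fast, tokenizers-backed classes (just a sketch, reusing model_name, new_vocab_size and get_training_corpus from above):

# Sketch: same idea, but with the fast (Rust-backed) tokenizer, which does
# provide train_new_from_iterator. use_fast=True is the default anyway.
fast_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

new_tokenizer = fast_tokenizer.train_new_from_iterator(
    get_training_corpus(),   # fresh generator object
    vocab_size=new_vocab_size,
    show_progress=True,
)
new_tokenizer.save_pretrained('arqmath-flan-t5-tokenizer')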

In the resources section of the T5 model page in HF’s documentation, I’ve also found another way of training my tokenizer, but that method seems to train the tokenizer from scratch, which means I would lose all the information captured in the original T5Tokenizer, right?
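
If I do go the from-scratch route, I suppose I could at least quantify what gets lost by checking the vocabulary overlap, something along these lines (sketch; new_tok is a placeholder for whatever tokenizer that script produces):

# Sketch: how much of the original vocabulary would a from-scratch
# tokenizer still cover? 'new_tok' is a placeholder here.
old_vocab = set(t5_tokenizer.get_vocab())
new_vocab = set(new_tok.get_vocab())
overlap = len(old_vocab & new_vocab) / len(old_vocab)
print(f'{overlap:.1%} of the original pieces also appear in the new vocabulary')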

Any help is appreciated :hugs: Cheers! :beer:

I am now using this script to train both versions of the T5 tokenizer:

def main():
    # args parsing and imports were omitted to fit in discord
    col_name = args['col_name']
    model_name = args['model_name']
    output_dir = Path(args['output_dir'])
    dataset_dir = Path(args['dataset_dir'])
    vocab_size = args['vocab_size']
    batch_size = args['batch_size']

    # Constant
    input_sentence_size = None

    df = pd.read_csv(dataset_dir, sep='\t')
    dataset = Dataset.from_pandas(df)

    tokenizer = SentencePieceUnigramTokenizer(
        unk_token="<unk>", eos_token="</s>", pad_token="<pad>"
    )
    fast_tokenizer = T5TokenizerFast.from_pretrained(model_name)

    # Build an iterator over this dataset
    def batch_iterator(input_sentence_size=None):
        if input_sentence_size is None:
            input_sentence_size = len(dataset)
        batch_length = batch_size
        for i in range(0, input_sentence_size, batch_length):
            yield dataset[i : i + batch_length][col_name]

    if not output_dir.exists():
        output_dir.mkdir()

    tokenizer.train_from_iterator(
        iterator=batch_iterator(input_sentence_size=input_sentence_size),
        vocab_size=vocab_size,
        show_progress=True,
    )
    tokenizer.save(str(output_dir / "tokenizer.json"))

    new_fast_tokenizer = fast_tokenizer.train_new_from_iterator(
        batch_iterator(input_sentence_size=input_sentence_size),
        vocab_size,
        show_progress=True,
    )
    # save_pretrained is the supported way to persist a transformers fast tokenizer
    new_fast_tokenizer.save_pretrained(str(output_dir / "fast_tokenizer"))

    config = T5Config.from_pretrained(model_name, vocab_size=tokenizer.get_vocab_size())
    config.save_pretrained(output_dir)
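
For completeness, this is roughly how I plan to load the results back later (a sketch; the wrapping of tokenizer.json in PreTrainedTokenizerFast and the special tokens are my own assumptions):

from transformers import PreTrainedTokenizerFast

# Sketch: reload both tokenizers produced by the script above.
# The from-scratch one only has a tokenizer.json, so it is wrapped manually.
scratch_tok = PreTrainedTokenizerFast(
    tokenizer_file=str(output_dir / "tokenizer.json"),
    unk_token="<unk>", eos_token="</s>", pad_token="<pad>",
)
# The adapted one was saved with save_pretrained, so it loads like any checkpoint.
adapted_tok = T5TokenizerFast.from_pretrained(str(output_dir / "fast_tokenizer"))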

Training the slow tokenizer takes only 10 minutes, but training the fast tokenizer takes 5 hours. Is that normal?

Also, is it okay to train one tokenizer from scratch and the other from an already existing checkpoint?
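
To compare the two, I’m planning a quick check along these lines (sketch; scratch_tok and adapted_tok come from the reload snippet above, model_name from the script):

# Sketch: compare how the two new tokenizers and the original one split
# the same LaTeX formula (names from the snippets above).
sample = r'\int_0^1 \frac{\sin(x)}{x} \, dx'
original_tok = T5TokenizerFast.from_pretrained(model_name)
print('original    :', original_tok.tokenize(sample))
print('from scratch:', scratch_tok.tokenize(sample))
print('adapted     :', adapted_tok.tokenize(sample))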

Have you solved this issue?
I also need to train a pre-trained tokenizer (BertTokenizer) on a new corpus, just like in your case.
Thanks!! :wink: