How to train a LlamaTokenizer?

nicholasKluge · December 6, 2023, 8:02pm

I have been trying to train a LlamaTokenizer but I keep running into infinite training times and out of memory problems. For some reason, my script consumes a lot of RAM.

Can someone help me? I am trying to train a LlamaTokenizer in Portuguese so my language model (to be trained) is compatible with the entire Llama ecosystem.

Here is my script:

import yaml
import argparse
from tqdm import tqdm

import torch
import datasets
from datasets import load_dataset 

from transformers import (
    AutoTokenizer,
    TrainingArguments,
)

from specifications import ModelArguments, DataTrainingArguments, ExtraArguments

def main(spec_file):
   
    # Load the arguments from the spec file
    with open(spec_file, "r") as stream:
        kwargs = yaml.safe_load(stream)
    
    # Get the arguments for the model, data, training, and extra
    model_args = ModelArguments(**kwargs['model_args'])
    data_args = DataTrainingArguments(**kwargs['data_args'])
    training_args = TrainingArguments(**kwargs['training_args'])
    extra_args = ExtraArguments(**kwargs['extra_args'])

    # Load the dataset from the huggingface Hub and prepare it for training
    if data_args.dataset_name is not None and not data_args.dataset_is_tokenized:
        dataset = load_dataset(data_args.dataset_name, 
            split=data_args.dataset_split, 
            use_auth_token=training_args.hub_token if training_args.hub_token else None,
            cache_dir=model_args.cache_dir,
            streaming=data_args.streaming,
        )
    else:
        raise ValueError("No dataset name provided or dataset is already tokenized") 

    # Remove non text columns
    dataset = dataset.remove_columns([col for col in dataset.column_names if col != "text"])

    # create a python generator to dynamically load the data
    def batch_iterator(batch_size=10000):
        for i in tqdm(range(0, len(dataset), batch_size)):
            yield dataset[i : i + batch_size]["text"]
    
    # Set the configuration kwargs for the tokenizer
    tokenizer_kwargs = {
        "cache_dir": model_args.cache_dir,
        "revision": model_args.model_revision,
        "use_auth_token": training_args.hub_token,
        "trust_remote_code": model_args.trust_remote_code,
        "bos_token": model_args.bos_token,
        "unk_token": model_args.unk_token,
        "eos_token": model_args.eos_token,
        "pad_token": model_args.eos_token,
    }

    # Create a tokenizer from the model checkpoint you want to train
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name, 
        **tokenizer_kwargs,
    )

    new_tokenizer = tokenizer.train_new_from_iterator(
        text_iterator=batch_iterator(), 
        vocab_size=model_args.vocab_size,
    )

    # Replace the new_tokenizer `max_model_input_sizes` for the `data_args.block_size`
    new_tokenizer.max_model_input_sizes.clear()
    new_tokenizer.max_model_input_sizes[extra_args.logger_name] = data_args.block_size
    new_tokenizer.model_max_length = tokenizer.model_max_length
    new_tokenizer.name_or_path = training_args.hub_model_id + "-tokenizer"

    # Save the new tokenizer
    new_tokenizer.save_pretrained(training_args.output_dir)
    
    # If hub_token is passed, upload the tokenizer to the hub
    if training_args.hub_token is not None and training_args.hub_model_id is not None:
        
        new_tokenizer.push_to_hub(
            repo_id=training_args.hub_model_id + '-tokenizer',
            use_auth_token=training_args.hub_token,
            commit_message=f"Trained tokenizer from scratch on {data_args.dataset_name}",
        )

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a new Llama tokenizer")
    parser.add_argument("--spec-file", help="Path to the spec YAML file")
    args = parser.parse_args()
    main(args.spec_file)

My dataset has fewer than 3M lines/rows. The only time I was able to make this script work was when I reduced the dataset to 100 lines. But that is too little.

Note: This script works fine when using a GPT2 tokenizer as the initial tokenizer. Is the sentencepiece approach that much slower/more memory hungry?

Does anyone know what is going on?!

nicholasKluge · December 9, 2023, 1:00am

In case anyone also wants to train one of these, this is how I managed:

import json
import yaml
import argparse
from tqdm import tqdm

from datasets import load_dataset
from tokenizers import SentencePieceBPETokenizer
from transformers import LlamaTokenizerFast, TrainingArguments, AutoTokenizer

from specifications import ModelArguments, DataTrainingArguments, ExtraArguments

def main(spec_file):
    
    # Load the arguments from the spec file
    with open(spec_file, "r") as stream:
        kwargs = yaml.safe_load(stream)
    
    # Get the arguments for the model, data, training, and extra
    model_args = ModelArguments(**kwargs['model_args'])
    data_args = DataTrainingArguments(**kwargs['data_args'])
    training_args = TrainingArguments(**kwargs['training_args'])
    extra_args = ExtraArguments(**kwargs['extra_args'])

    # Load the dataset from the huggingface Hub and prepare it for training
    if data_args.dataset_name is not None and not data_args.dataset_is_tokenized:
        dataset = load_dataset(data_args.dataset_name, 
            split=data_args.dataset_split, 
            use_auth_token=training_args.hub_token if training_args.hub_token else None,
            cache_dir=model_args.cache_dir,
            streaming=data_args.streaming,
        )
    else:
        raise ValueError("No dataset name provided or dataset is already tokenized") 

    # Remove non text columns
    dataset = dataset.remove_columns([col for col in dataset.column_names if col != "text"])

    # select 2_000_000 random samples from the dataset
    dataset = dataset.shuffle(seed=training_args.seed).select(range(2_000_000))

    # Create a SentencePieceBPETokenizer
    tokenizer = SentencePieceBPETokenizer()

    # Train the SentencePieceBPETokenizer on the dataset
    tokenizer.train_from_iterator(
        iterator=dataset['text'],
        vocab_size=32_000,
        show_progress=True,
        special_tokens=["<unk>", "<s>", "</s>",  "<pad>"],
    )

    # Save the tokenizer
    tokenizer.save(extra_args.logger_name + "-sentencepiece-tokenizer.json", pretty=True)

    # Load the new tokenizer as a LlamaTokenizerFast
    new_llama_tokenizer = LlamaTokenizerFast(
        tokenizer_file=extra_args.logger_name + "-sentencepiece-tokenizer.json",
        name_or_path=training_args.hub_model_id + "-tokenizer",
        unk_token="<unk>",
        unk_token_id=0,
        bos_token="<s>",
        bos_token_id=1,
        eos_token="</s>",
        eos_token_id=2,
        pad_token="<pad>",
        pad_token_id=3,
        padding_side="right",
        max_model_input_sizes={extra_args.logger_name: data_args.block_size},
    )

    # Save the new tokenizer
    new_llama_tokenizer.save_pretrained(extra_args.logger_name + "-tokenizer")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a new Llama tokenizer")
    parser.add_argument("--spec-file", help="Path to the spec YAML file")
    args = parser.parse_args()
    main(args.spec_file)

It takes some time, but at least it gives you a tokenizer.
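
A possible refinement on the memory side, as a sketch rather than part of the script above: `dataset['text']` can materialize the whole column as one Python list, while `train_from_iterator` should also accept an iterator that yields lists of strings. The helper name `batched_texts` and the toy rows are hypothetical; the generator only assumes that iterating the dataset yields dict-like rows with a "text" key, and it never calls `len()`, so it is meant to work on streaming datasets as well.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_texts(
    rows: Iterable[dict], batch_size: int = 1000, column: str = "text"
) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` strings without needing len(rows)."""
    it = iter(rows)
    while True:
        # islice pulls only the next `batch_size` rows from the iterator
        batch = [row[column] for row in islice(it, batch_size)]
        if not batch:
            return
        yield batch


# Stand-in for a (possibly streaming) dataset: any iterable of dict rows
rows = ({"text": f"linha {i}"} for i in range(5))
batches = list(batched_texts(rows, batch_size=2))
print(batches)
```

With the script above, this would be passed as `iterator=batched_texts(dataset)` instead of `iterator=dataset['text']`.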

vingpan · January 14, 2024, 3:41am

Hi,

Thanks for sharing this! Great starting point for me. Would it be possible to share a sample spec file?

Thanks!

nicholasKluge · January 14, 2024, 8:31pm

You can use it without one. Just pass the arguments you want directly. Here is a code snippet you can use:

import json
import argparse
from tqdm import tqdm

from datasets import load_dataset
from tokenizers import SentencePieceBPETokenizer
from transformers import LlamaTokenizerFast, TrainingArguments, AutoTokenizer

def main(args):

    # Load the dataset from the huggingface Hub and prepare it for training
    if args.dataset_name is not None:
        dataset = load_dataset(args.dataset_name, 
            split=args.dataset_split, 
            token=args.hub_token if args.hub_token else None,
        )
    else:
        raise ValueError("No dataset name provided or dataset is already tokenized") 

    # Remove non text columns
    dataset = dataset.remove_columns([col for col in dataset.column_names if col != "text"])

    # select `num_samples` from the dataset
    dataset = dataset.shuffle(seed=42).select(range(args.num_samples))

    # Create a SentencePieceBPETokenizer
    tokenizer = SentencePieceBPETokenizer()

    # Train the SentencePieceBPETokenizer on the dataset
    tokenizer.train_from_iterator(
        iterator=dataset['text'],
        vocab_size=args.vocab_size,
        show_progress=True,
        special_tokens=["<unk>", "<s>", "</s>",  "<pad>"],
    )

    # Save the tokenizer
    tokenizer.save("new-sentencepiece-tokenizer.json", pretty=True)

    # Load reference tokenizer
    if args.reference_tokenizer is not None and args.hub_token is not None:
        reference_tokenizer = AutoTokenizer.from_pretrained(args.reference_tokenizer, token=args.hub_token if args.hub_token else None)
        reference_tokenizer.save_pretrained("reference-tokenizer")
    else:
        raise ValueError("No tokenizer name provided or no hub token provided. Try using `--reference_tokenizer 'meta-llama/Llama-2-7b-hf'`")

    # Read and dump the json file for the new tokenizer and the reference tokenizer
    with open("new-sentencepiece-tokenizer.json") as f:
        new_llama_tokenizer_json = json.load(f)

    with open("reference-tokenizer/tokenizer.json") as f:
        reference_tokenizer_json = json.load(f)
    
    # Add the reference tokenizer's config to the new tokenizer's config
    new_llama_tokenizer_json["normalizer"] = reference_tokenizer_json["normalizer"]
    new_llama_tokenizer_json["pre_tokenizer"] = reference_tokenizer_json["pre_tokenizer"]
    new_llama_tokenizer_json["post_processor"] = reference_tokenizer_json["post_processor"]
    new_llama_tokenizer_json["decoder"] = reference_tokenizer_json["decoder"]
    new_llama_tokenizer_json["model"]['fuse_unk'] = reference_tokenizer_json["model"]['fuse_unk']
    new_llama_tokenizer_json["model"]['byte_fallback'] = reference_tokenizer_json["model"]['byte_fallback']

    # Dump the new tokenizer's config
    with open("new-sentencepiece-tokenizer.json", "w") as f:
        json.dump(new_llama_tokenizer_json, f, indent=2, ensure_ascii=False)

    # Load the new tokenizer as a LlamaTokenizerFast
    new_llama_tokenizer = LlamaTokenizerFast(
        tokenizer_file="new-sentencepiece-tokenizer.json",
        unk_token="<unk>",
        unk_token_id=0,
        bos_token="<s>",
        bos_token_id=1,
        eos_token="</s>",
        eos_token_id=2,
        pad_token="<pad>",
        pad_token_id=3,
        padding_side="right",
    )

    # Save the new tokenizer
    new_llama_tokenizer.save_pretrained("new-llama-tokenizer")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a new Llama tokenizer")
    parser.add_argument(
        "--dataset_name",
        type=str,
        default=None,
        help="The name of the dataset to be tokenized",
    )
    parser.add_argument(
        "--dataset_split",
        type=str,
        default=None,
        help="The split of the dataset to be tokenized",
    )
    parser.add_argument(
        "--hub_token",
        type=str,
        default=None,
        help="The token to access the dataset on the hub",
    )
    parser.add_argument(
        "--reference_tokenizer",
        type=str,
        default=None,
        help="The name of the reference tokenizer to use",
    )
    parser.add_argument(
        "--num_samples",
        type=int,
        default=None,
        help="Number of samples to use from the dataset",
    )
    parser.add_argument(
        "--vocab_size",
        type=int,
        default=None,
        help="Vocabulary size to use for the tokenizer",
    )
    args = parser.parse_args()
    main(args)

# How to run:
# python train_sentencepiece.py --dataset_name "NeelNanda/pile-10k" --dataset_split "train" --hub_token "hf_..." --reference_tokenizer "meta-llama/Llama-2-7b-hf" --num_samples 2000000 --vocab_size 32000

Hope it helps!

amitagh · March 22, 2024, 2:14pm

Is there a way to update the vocabulary by adding new tokens instead of creating a new tokenizer altogether?

TristanBehrens · April 15, 2024, 12:29pm

So cool! This is very helpful, @nicholasKluge!

Do you see a way to save the “tokenizer.model” file? Looks like the LlamaTokenizer class could do this.

github.com

huggingface/transformers/blob/06b1192768220b77d8f5a22031ed081e79df1616/src/transformers/models/llama/tokenization_llama.py

But I did not manage to integrate it into the above code.

Cheers!
Tristan

nicholasKluge · April 16, 2024, 1:23pm

Thanks @TristanBehrens!

Yes, that is a shortcoming of my implementation (it only gives you a fast tokenizer). Training a slow tokenizer is just (you guessed it) slow, while converting a fast tokenizer to a slow one is something I could not do until now.

I’m hoping someone cracks this for the rest of us (and shares it!) …

TristanBehrens · April 16, 2024, 3:46pm

@nicholasKluge. I think I got it!

You can train tokenizers with sentencepiece:

This will give you the tokenizer.model file. It can be loaded with LlamaTokenizer.
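
A minimal sketch of what such a training run could look like with the `sentencepiece` Python package. The option values below are assumptions chosen to resemble a Llama-style BPE setup, not settings confirmed in this thread, and the helper names `llama_style_options` and `train` are hypothetical. The import is kept inside `train` so the option dictionary can be inspected without the package installed.

```python
from typing import Any, Dict, Iterable


def llama_style_options(model_prefix: str, vocab_size: int) -> Dict[str, Any]:
    """Keyword arguments for SentencePieceTrainer.train (assumed values)."""
    return {
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "model_type": "bpe",
        "vocab_size": vocab_size,
        "byte_fallback": True,         # unknown characters become byte pieces
        "character_coverage": 1.0,     # assumption: keep every character seen
        "unk_id": 0,
        "bos_id": 1,
        "eos_id": 2,
        "pad_id": -1,                  # -1 disables a dedicated pad piece
    }


def train(sentences: Iterable[str], options: Dict[str, Any]) -> None:
    """Train from an iterator of raw text lines (requires `pip install sentencepiece`)."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(sentence_iterator=iter(sentences), **options)


options = llama_style_options("tokenizer", 32_000)
print(options)
# train(my_sentences, options)  # `my_sentences` is a placeholder; writes tokenizer.model
```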

nicholasKluge · April 16, 2024, 4:00pm

Do you have some code snippets we can use to reproduce this training?

TristanBehrens · April 16, 2024, 4:21pm

Here it is:

github.com

google/sentencepiece/blob/master/python/README.md#model-training

It is most trivial, which gives me mixed feelings because I worked on this for weeks.

nicholasKluge · April 16, 2024, 4:40pm

I’m sorry, but I still don’t understand your solution. Could you please give the code to train a tokenizer on, for example, NeelNanda/pile-10k? If you can provide a working Colab notebook, that would help even more.

TristanBehrens · April 16, 2024, 5:02pm

Piece of cake! Google Colab

nicholasKluge · April 16, 2024, 5:49pm

Very nice! Were you able to convert this tokenizer into a fast one?

TristanBehrens · April 17, 2024, 11:46am

A fast tokenizer was never a requirement of mine. Thus I did not try. Should work like you outlined above. Wanna give it a try?

nicholasKluge · April 17, 2024, 10:56pm

In the end, you just have to load the new tokenizer as a fast one, and save it, and all is done:

from transformers import LlamaTokenizer, LlamaTokenizerFast

# Load the new sp tokenizer
tokenizer = LlamaTokenizer("./tokenizer.model")

# Save it.
tokenizer.save_pretrained("./new-llama-tokenizer")

# Create a new tokenizer using the `LlamaTokenizerFast` class.
tokenizer = LlamaTokenizerFast("./new-llama-tokenizer/tokenizer.model")

# Save it.
tokenizer.save_pretrained("./new-llama-tokenizer")

And done! Now, you have a fast and slow tokenizer in the “new-llama-tokenizer” folder. This implementation is way simpler than my own. Congrats to @TristanBehrens for showing the way.

TristanBehrens · April 18, 2024, 5:53am

@nicholasKluge it was a pleasure working with you!

MikeMpapa · June 7, 2024, 8:52pm

Thank you @nicholasKluge & @TristanBehrens for the very insightful discussion!! Very cool. I need to do a similar thing for training a model on a “custom” language. I have followed the steps described in this post and trained my tokenizer.

However, it is not intuitive to me how it would be combined with a Llama model if I need to fine-tune it. The pre-trained model will still expect the token IDs from the original tokenizer, while my new tokenizer will have a completely new token ordering. Am I missing something here, or is my only option to train a Llama from scratch with the new tokenizer?

Do you have an example that shows how to combine a custom tokenizer with a Llama training job?

Thank you very much for all the insights already!!

MikeMpapa · June 7, 2024, 9:47pm

nicholasKluge:

# Add the reference tokenizer's config to the new tokenizer's config
new_llama_tokenizer_json["normalizer"] = reference_tokenizer_json["normalizer"]
new_llama_tokenizer_json["pre_tokenizer"] = reference_tokenizer_json["pre_tokenizer"]
new_llama_tokenizer_json["post_processor"] = reference_tokenizer_json["post_processor"]
new_llama_tokenizer_json["decoder"] = reference_tokenizer_json["decoder"]
new_llama_tokenizer_json["model"]['fuse_unk'] = reference_tokenizer_json["model"]['fuse_unk']
new_llama_tokenizer_json["model"]['byte_fallback'] = reference_tokenizer_json["model"]['byte_fallback']

Actually, is this the piece where you combine the original with the new tokenizer? Any hint is greatly appreciated!

TristanBehrens · June 8, 2024, 5:49am

I have doubts that using a pre-trained model with a new custom tokenizer would work.

nicholasKluge · June 8, 2024, 9:06am

Your intuition, @MikeMpapa, and @TristanBehrens's comment are correct. It does not work.

A tokenizer is essentially a look-up table that maps pieces of words (or whole words) to indices into the embedding matrix. The row at each index holds the embedding vector that was learned during training.

Changing this mapping “breaks” the model.

If you train a new tokenizer, you cannot simply use it on an already pre-trained model with its own tokenizer. A new tokenizer is something you do when you want to train a new model from scratch. Once that model learns to map specific tokens to specific embeddings, we don’t go and change the tokenizer.
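
A toy illustration of why the mapping matters, using made-up vocabularies and string placeholders instead of real vectors (nothing here comes from an actual Llama checkpoint). The embedding rows were "learned" against the original token IDs, so a tokenizer that assigns different IDs to the same words retrieves the wrong rows.

```python
# Row i of the embedding table belongs to original token id i.
original_vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
embedding_rows = ["vec(<unk>)", "vec(the)", "vec(cat)", "vec(sat)"]

# A newly trained tokenizer: same words, different id assignment.
new_vocab = {"<unk>": 0, "sat": 1, "the": 2, "cat": 3}


def encode(words, vocab):
    """Map each word to its id, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in words]


def lookup(ids):
    """Fetch the embedding row stored at each id."""
    return [embedding_rows[i] for i in ids]


words = ["the", "cat", "sat"]
with_original = lookup(encode(words, original_vocab))
with_new = lookup(encode(words, new_vocab))

print(with_original)  # ['vec(the)', 'vec(cat)', 'vec(sat)']
print(with_new)       # ['vec(cat)', 'vec(sat)', 'vec(the)']
```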

You can continually expand a tokenizer but remember that you also need to broaden the embedding matrix (and train those embeddings) if you wish your model to learn how to “use them” correctly.
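
The expansion step can be sketched with plain Python lists, assuming token IDs are contiguous from 0 so that the next free ID equals the current vocabulary size. The function `add_tokens` below is a hypothetical toy, not the library API; in `transformers`, the corresponding calls are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. The new rows start as small random values and still have to be trained.

```python
import random


def add_tokens(vocab, embeddings, new_tokens, dim, seed=0):
    """Append unseen tokens to the vocab and one fresh embedding row per token."""
    rng = random.Random(seed)
    added = 0
    for token in new_tokens:
        if token in vocab:
            continue  # existing tokens keep their id and their learned row
        vocab[token] = len(vocab)  # next free id (ids assumed contiguous)
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
        added += 1
    return added


vocab = {"<unk>": 0, "the": 1, "cat": 2}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # pretend these were learned

added = add_tokens(vocab, embeddings, ["gato", "the", "cão"], dim=2)
print(added, vocab, len(embeddings))
```

The existing rows are untouched, which is why previously learned tokens keep working while the added ones need further training.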

About these lines of code:

new_llama_tokenizer_json["normalizer"] = reference_tokenizer_json["normalizer"]
new_llama_tokenizer_json["pre_tokenizer"] = reference_tokenizer_json["pre_tokenizer"]
new_llama_tokenizer_json["post_processor"] = reference_tokenizer_json["post_processor"]
new_llama_tokenizer_json["decoder"] = reference_tokenizer_json["decoder"]
new_llama_tokenizer_json["model"]['fuse_unk'] = reference_tokenizer_json["model"]['fuse_unk']
new_llama_tokenizer_json["model"]['byte_fallback'] = reference_tokenizer_json["model"]['byte_fallback']

Here, we are just inheriting some settings from the original Llama tokenizer (e.g., use byte fallback where the <unk> token would otherwise appear). None of this is related to the actual learning of the look-up table/mapping between pieces of words and integers.
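
To make the byte fallback setting concrete, here is a toy sketch with a made-up vocabulary: three special tokens, 256 byte pieces named like `<0xC3>`, and a single learned piece. The function `encode_with_byte_fallback` is hypothetical and skips the actual BPE merging; it only shows what happens to a piece that is missing from the vocabulary. Note the assumption that the byte pieces exist in the vocabulary, since the flag alone cannot produce IDs for them.

```python
def encode_with_byte_fallback(pieces, vocab, byte_fallback=True):
    """Map pieces to ids; unknown pieces become UTF-8 byte pieces or <unk>."""
    ids = []
    for piece in pieces:
        if piece in vocab:
            ids.append(vocab[piece])
        elif byte_fallback:
            # One id per UTF-8 byte, e.g. "ç" -> <0xC3>, <0xA7>
            ids.extend(vocab[f"<0x{b:02X}>"] for b in piece.encode("utf-8"))
        else:
            ids.append(vocab["<unk>"])
    return ids


# Toy vocabulary: special tokens, then the 256 byte pieces, then one learned piece
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2}
for b in range(256):
    vocab[f"<0x{b:02X}>"] = 3 + b
vocab["▁olá"] = 259

pieces = ["▁olá", "ç"]
with_fallback = encode_with_byte_fallback(pieces, vocab, byte_fallback=True)
without_fallback = encode_with_byte_fallback(pieces, vocab, byte_fallback=False)

print(with_fallback)     # [259, 198, 170]
print(without_fallback)  # [259, 0]
```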

I hope this explanation helps!
