Creating a Custom Token Vocabulary for GPT-2

Hello everyone,

I apologize in advance; I’m new to the HF library. I am currently trying to implement a GPT-2-style model with the transformers library. For v1 of my model, I trained it from scratch without HF, creating my own tokenizer, etc.

Since my data is in a bit of an odd format (I am training on crystallographic information), I would like to make my own tokenizer but have been struggling with tokenizer training in HF.

I already have the vocabulary (token-to-ID and ID-to-token mappings, etc.), consisting of space groups, atoms, digits, coordinate letters…

I was wondering if it is possible to implement a tokenizer without having to go through the phase of training one from an existing tokenizer, as my objective doesn’t seem to fit with the existing ones.

(For info, I will then be training a GPT-2 model from scratch.)

Thanks in advance for any help! I hope I was clear.


It sounds like you’re on the right track with using a custom tokenizer to train a GPT-2 model from scratch, especially since your data comes from a specific domain (crystallographic information). You don’t need to train a new tokenizer from an existing one; instead, you can build a tokenizer directly from the vocabulary and tokenization rules you already have. Here’s how you can approach it:

  1. Custom Tokenizer: Since you already have the vocabulary and mappings (token_to_id and id_to_token), you can build a tokenizer directly from them: wrap the vocabulary in a word-level model from the tokenizers library and load it with PreTrainedTokenizerFast. This gives you a fully functional Hugging Face tokenizer without training one from scratch.

    Here’s a rough outline of how you can do this:

    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel
    from tokenizers.pre_tokenizers import WhitespaceSplit
    from transformers import PreTrainedTokenizerFast
    
    # Your existing vocabulary (token -> ID). A few illustrative entries are
    # shown here; plug in your full crystallographic vocabulary.
    vocab = {
        "<unk>": 0,   # fallback for out-of-vocabulary tokens
        "<pad>": 1,   # padding token
        "H": 2,       # example: atom H
        "O": 3,       # example: atom O
        "x": 4,       # example: coordinate letter
        # ... the rest of your token-to-ID mapping
    }
    
    # Build a word-level tokenizer backend directly from the vocabulary
    # (no training step involved)
    backend = Tokenizer(WordLevel(vocab=vocab, unk_token="<unk>"))
    backend.pre_tokenizer = WhitespaceSplit()  # split the input on whitespace
    
    # Wrap the backend so it behaves like any other Hugging Face tokenizer
    tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=backend,
        unk_token="<unk>",
        pad_token="<pad>",
    )
    
  2. Tokenization Logic: Since you’re working with crystallographic data, you may need rules beyond plain whitespace splitting (e.g., breaking space groups, atoms, or coordinates into separate tokens). The tokenizers library lets you swap in a different pre-tokenizer for this (for instance, a regex-based Split), or you can pre-format your data so that each whitespace-separated word corresponds to one vocabulary entry. The tokenizer’s encode and decode methods then handle the conversion between strings and token IDs, as shown in the quick check below.
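
    As a quick sanity check, here is roughly how the tokenizer sketched above behaves on a whitespace-separated string (the string is a toy example, not real crystallographic data):

    # Encode a whitespace-separated string into token IDs
    ids = tokenizer.encode("H O x")
    print(ids)  # [2, 3, 4] with the toy vocabulary above
    
    # Map the IDs back to their token strings
    print(tokenizer.convert_ids_to_tokens(ids))  # ['H', 'O', 'x']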

  3. Training GPT-2: Once the tokenizer is in place, you can use it to train your GPT-2 model from scratch. Pass your custom tokenizer to the Hugging Face Trainer just as you would any other tokenizer, together with a causal-language-modeling data collator so that labels are generated from the input IDs.

    from transformers import (
        DataCollatorForLanguageModeling,
        GPT2Config,
        GPT2LMHeadModel,
        Trainer,
        TrainingArguments,
    )
    
    # Configure GPT-2 with a vocabulary size matching your custom tokenizer
    config = GPT2Config(vocab_size=len(tokenizer))
    
    # Initialize a fresh (untrained) GPT-2 model
    model = GPT2LMHeadModel(config)
    
    # Collator for causal language modeling: builds labels from the input IDs
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    
    # Define the training arguments
    training_args = TrainingArguments(output_dir="./model", per_device_train_batch_size=4)
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,  # a tokenized dataset (see the sketch below)
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    
    # Start training
    trainer.train()
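
    For completeness, here is one possible sketch of how the train_dataset placeholder above could be prepared with the datasets library before constructing the Trainer (the raw strings are made up for illustration):

    from datasets import Dataset
    
    # Placeholder: your corpus as a list of whitespace-separated structure strings
    raw_texts = ["H O x", "O H x"]
    
    def tokenize_fn(batch):
        # Truncate to a fixed maximum length so examples can be batched;
        # GPT-2 does not use token_type_ids, so leave them out
        return tokenizer(
            batch["text"],
            truncation=True,
            max_length=128,
            return_token_type_ids=False,
        )
    
    train_dataset = Dataset.from_dict({"text": raw_texts})
    train_dataset = train_dataset.map(tokenize_fn, batched=True, remove_columns=["text"])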
    

In summary, you don’t need to start from an existing tokenizer; you can create your own directly from your domain-specific vocabulary. The Hugging Face library is flexible enough to let you plug that tokenizer into model training without going through the tokenizer-training phase that other workflows require.
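
One optional convenience, assuming the setup above: save the custom tokenizer alongside the model so it can be reloaded later without rebuilding the vocabulary by hand (the directory name is just an example).

    # Save the tokenizer next to the model checkpoints
    tokenizer.save_pretrained("./model")
    
    # Reload it later
    from transformers import PreTrainedTokenizerFast
    tokenizer = PreTrainedTokenizerFast.from_pretrained("./model")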
