Creating a Custom Token Vocabulary for GPT-2

Hello everyone,

I apologize in advance; I’m new to the HF library. I am currently trying to implement a GPT-2-style model with the transformers library. For v1 of my model, I trained it from scratch without HF, creating my own tokenizer, etc.

Since my data is in a bit of an odd format (I am training on crystallographic information), I would like to make my own tokenizer but have been struggling with tokenizer training in HF.

I already have the vocabulary (token-to-ID and ID-to-token mappings, etc.), consisting of space groups, atoms, digits, coordinate letters…

I was wondering if it is possible to implement a tokenizer without having to go through the phase of training one from an existing tokenizer, as my objective doesn’t seem to fit with the existing ones.

(For info, I will then be training a GPT-2 model from scratch.)

Thanks in advance for any help! I hope I was clear.


It sounds like you’re on the right track with using a custom tokenizer to train a GPT-2 model from scratch, especially since your data comes from a specific domain (crystallographic information). You don’t need to train a new tokenizer from an existing one; instead, you can build a tokenizer directly from the vocabulary and tokenization rules you already have. Here’s how you can approach it:

  1. Custom Tokenizer: Since you already have the vocabulary and mappings (token_to_id and id_to_token), you can build a tokenizer directly from them: wrap the vocabulary in a word-level model from the tokenizers library and load it with PreTrainedTokenizerFast. This gives you a fully functional Hugging Face tokenizer without training one from scratch.

    Here’s a rough outline of how you can do this:

    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel
    from tokenizers.pre_tokenizers import WhitespaceSplit
    from transformers import PreTrainedTokenizerFast
    
    # Your existing vocabulary (token -> ID). A few illustrative entries are
    # shown here; plug in your full crystallographic vocabulary.
    vocab = {
        "<unk>": 0,   # fallback for out-of-vocabulary tokens
        "<pad>": 1,   # padding token
        "H": 2,       # example: atom H
        "O": 3,       # example: atom O
        "x": 4,       # example: coordinate letter
        # ... the rest of your token-to-ID mapping
    }
    
    # Build a word-level tokenizer backend directly from the vocabulary
    # (no training step involved)
    backend = Tokenizer(WordLevel(vocab=vocab, unk_token="<unk>"))
    backend.pre_tokenizer = WhitespaceSplit()  # split the input on whitespace
    
    # Wrap the backend so it behaves like any other Hugging Face tokenizer
    tokenizer = PreTrainedTokenizerFast(
        tokenizer_object=backend,
        unk_token="<unk>",
        pad_token="<pad>",
    )
    
  2. Tokenization Logic: Since you’re working with crystallographic data, you may need rules beyond plain whitespace splitting (e.g., breaking space groups, atoms, or coordinates into separate tokens). The tokenizers library lets you swap in a different pre-tokenizer for this (for instance, a regex-based Split), or you can pre-format your data so that each whitespace-separated word corresponds to one vocabulary entry. The tokenizer’s encode and decode methods then handle the conversion between strings and token IDs, as shown in the quick check below.
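
    As a quick sanity check, here is roughly how the tokenizer sketched above behaves on a whitespace-separated string (the string is a toy example, not real crystallographic data):

    # Encode a whitespace-separated string into token IDs
    ids = tokenizer.encode("H O x")
    print(ids)  # [2, 3, 4] with the toy vocabulary above
    
    # Map the IDs back to their token strings
    print(tokenizer.convert_ids_to_tokens(ids))  # ['H', 'O', 'x']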

  3. Training GPT-2: Once the tokenizer is in place, you can use it to train your GPT-2 model from scratch. Pass your custom tokenizer to the Hugging Face Trainer just as you would any other tokenizer, together with a causal-language-modeling data collator so that labels are generated from the input IDs.

    from transformers import (
        DataCollatorForLanguageModeling,
        GPT2Config,
        GPT2LMHeadModel,
        Trainer,
        TrainingArguments,
    )
    
    # Configure GPT-2 with a vocabulary size matching your custom tokenizer
    config = GPT2Config(vocab_size=len(tokenizer))
    
    # Initialize a fresh (untrained) GPT-2 model
    model = GPT2LMHeadModel(config)
    
    # Collator for causal language modeling: builds labels from the input IDs
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    
    # Define the training arguments
    training_args = TrainingArguments(output_dir="./model", per_device_train_batch_size=4)
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,  # a tokenized dataset (see the sketch below)
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    
    # Start training
    trainer.train()
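
    For completeness, here is one possible sketch of how the train_dataset placeholder above could be prepared with the datasets library before constructing the Trainer (the raw strings are made up for illustration):

    from datasets import Dataset
    
    # Placeholder: your corpus as a list of whitespace-separated structure strings
    raw_texts = ["H O x", "O H x"]
    
    def tokenize_fn(batch):
        # Truncate to a fixed maximum length so examples can be batched;
        # GPT-2 does not use token_type_ids, so leave them out
        return tokenizer(
            batch["text"],
            truncation=True,
            max_length=128,
            return_token_type_ids=False,
        )
    
    train_dataset = Dataset.from_dict({"text": raw_texts})
    train_dataset = train_dataset.map(tokenize_fn, batched=True, remove_columns=["text"])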
    

In summary, you don’t need to start from an existing tokenizer; you can create your own directly from your domain-specific vocabulary. The Hugging Face library is flexible enough to let you plug that tokenizer into model training without going through the tokenizer-training phase that other workflows require.
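
One optional convenience, assuming the setup above: save the custom tokenizer alongside the model so it can be reloaded later without rebuilding the vocabulary by hand (the directory name is just an example).

    # Save the tokenizer next to the model checkpoints
    tokenizer.save_pretrained("./model")
    
    # Reload it later
    from transformers import PreTrainedTokenizerFast
    tokenizer = PreTrainedTokenizerFast.from_pretrained("./model")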
