How does one create a custom Hugging Face model with an already working tokenizer?

I want to create a new Hugging Face (HF) architecture and use it with some existing tokenizer (any well-tested one is fine). To make it concrete, let's say a decoder, though an example covering both encoder and decoder would be even better.

How does one do this? I found the Create a custom architecture tutorial, but it honestly felt incomplete. (FYI, I also saw this causal-LM one: How to Train a Custom Hugging Face LM for Text Generation? Part A Create a HF dataset from CSV file - YouTube.) Is there a tutorial with a full end-to-end code example that actually runs?

E.g., here is what I have right now:

import torch
import torch.nn as nn
from transformers import PreTrainedTokenizer, PreTrainedModel, PretrainedConfig


class MyTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        # Load a whitespace-delimited vocab file into token <-> id maps
        # before calling super().__init__, which expects the vocab to exist.
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = {tok: i for i, tok in enumerate(f.read().split())}
        self.ids_to_tokens = {i: tok for tok, i in self.vocab.items()}
        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self.vocab.get(token, 0)

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, "")


class MyModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.linear = nn.Linear(config.hidden_size, config.num_labels)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, **kwargs):
        embeddings = self.embedding(input_ids)  # (batch, seq_len, hidden_size)
        pooled = torch.mean(embeddings, dim=1)  # mean-pool over the sequence dimension
        pooled = self.dropout(pooled)
        logits = self.linear(pooled)  # (batch, num_labels)
        return logits  # raw logits, not a ModelOutput


config = PretrainedConfig(vocab_size=1000, hidden_size=128, num_labels=2, hidden_dropout_prob=0.5)
tokenizer = MyTokenizer("path/to/vocab/file")
model = MyModel(config)

inputs = tokenizer("This is a test", return_tensors="pt")
logits = model(inputs["input_ids"])

But I feel a more principled solution should satisfy the following:

  1. It satisfies the standard HF model API, so that it works with the HF Trainer, handles GPU placement for the data and model, and is compatible with PyTorch data loaders (see the sketch after this list).
  2. The tokenizer is also seamless. At what point do we tokenize? Inside the model, or inside the data loader?
  3. What would be the test that it works as an HF model? My guess is: 1. it works with a custom PyTorch training loop, and 2. it works with the HF Trainer.
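
To make 1–3 concrete, here is a rough sketch of the shape I imagine the answer has; I'm not sure it is the recommended way. The assumptions are mine: the GPT-2 tokenizer stands in for "some existing tokenizer", a toy mean-pooling classifier stands in for the new architecture, tokenization happens in the dataset via datasets.map (not inside the model or the loader), and forward returns a loss when labels are passed so the HF Trainer can use it. All names (MyConfig, MyModel, the tiny dataset, the output dir) are just for illustration.

import torch.nn as nn
from datasets import Dataset
from transformers import (AutoTokenizer, PretrainedConfig, PreTrainedModel,
                          Trainer, TrainingArguments)


class MyConfig(PretrainedConfig):
    model_type = "my-toy-model"  # arbitrary name, just for illustration

    def __init__(self, vocab_size=50257, hidden_size=128, num_labels=2,
                 hidden_dropout_prob=0.5, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.hidden_dropout_prob = hidden_dropout_prob


class MyModel(PreTrainedModel):
    config_class = MyConfig

    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.linear = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        pooled = self.embedding(input_ids).mean(dim=1)  # (batch, hidden_size); ignores padding for brevity
        logits = self.linear(self.dropout(pooled))      # (batch, num_labels)
        if labels is not None:
            # Returning a loss when labels are present is what the HF Trainer expects.
            loss = nn.functional.cross_entropy(logits, labels)
            return {"loss": loss, "logits": logits}
        return {"logits": logits}


# 2. Tokenize in the dataset (datasets.map), not inside the model or the loader.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

raw = Dataset.from_dict({"text": ["a good example", "a bad example"], "labels": [1, 0]})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], padding="max_length", truncation=True, max_length=16),
    batched=True,
    remove_columns=["text"],
)

model = MyModel(MyConfig(vocab_size=len(tokenizer)))

# 3. Smoke test: does it run under the HF Trainer?
args = TrainingArguments(output_dir="tmp_out", num_train_epochs=1,
                         per_device_train_batch_size=2, report_to="none")
Trainer(model=model, args=args, train_dataset=tokenized).train()

A plain PyTorch loop (the other half of 3.) would presumably just do tokenized.set_format("torch"), wrap it in a torch.utils.data.DataLoader, and call the model directly.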

One other thing that would be worth trying is opening up an existing model, e.g. T5, seeing how it's implemented, and copying the style?
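
If that is the way to go, the modeling code ships with the installed package, so (assuming transformers is installed) something like this prints the path of the T5 implementation to open and read; the same works for any other architecture module:

import inspect
from transformers.models.t5 import modeling_t5

print(inspect.getsourcefile(modeling_t5))  # path to modeling_t5.py on disk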


refs: