How does one create a custom Hugging Face model with an already working tokenizer?

I want to create a new Hugging Face (HF) architecture and use it with some existing tokenizer (any well-tested one is fine). To make it concrete, let's say a decoder, though an example covering both encoder and decoder would be even better.

How does one do this? I found the Create a custom architecture tutorial, but it honestly felt incomplete. (FYI, I also saw this causal-LM one: How to Train a Custom Hugging Face LM for Text Generation? Part A Create a HF dataset from CSV file - YouTube.) Is there a tutorial with a full end-to-end code example that actually runs?

E.g., here is what I have right now:

import torch
import torch.nn as nn
from transformers import PreTrainedTokenizer, PreTrainedModel, PretrainedConfig


class MyTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        # Load a whitespace-delimited vocab file into token <-> id maps
        # before calling super().__init__, which expects the vocab to exist.
        with open(vocab_file, encoding="utf-8") as f:
            self.vocab = {tok: i for i, tok in enumerate(f.read().split())}
        self.ids_to_tokens = {i: tok for tok, i in self.vocab.items()}
        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self.vocab.get(token, 0)

    def _convert_id_to_token(self, index):
        return self.ids_to_tokens.get(index, "")


class MyModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.linear = nn.Linear(config.hidden_size, config.num_labels)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, **kwargs):
        embeddings = self.embedding(input_ids)  # (batch, seq_len, hidden_size)
        pooled = torch.mean(embeddings, dim=1)  # mean-pool over the sequence dimension
        pooled = self.dropout(pooled)
        logits = self.linear(pooled)  # (batch, num_labels)
        return logits  # raw logits, not a ModelOutput


config = PretrainedConfig(vocab_size=1000, hidden_size=128, num_labels=2, hidden_dropout_prob=0.5)
tokenizer = MyTokenizer("path/to/vocab/file")
model = MyModel(config)

inputs = tokenizer("This is a test", return_tensors="pt")
logits = model(inputs["input_ids"])

But I feel a more principled solution should satisfy the following:

  1. It satisfies the standard HF model API, so that it works with the HF Trainer, handles GPU placement for the data and model, and is compatible with PyTorch data loaders (see the sketch after this list).
  2. The tokenizer is also seamless. At what point do we tokenize? Inside the model, or inside the data loader?
  3. What would be the test that it works as an HF model? My guess is: 1. it works with a custom PyTorch training loop, and 2. it works with the HF Trainer.
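
To make 1–3 concrete, here is a rough sketch of the shape I imagine the answer has; I'm not sure it is the recommended way. The assumptions are mine: the GPT-2 tokenizer stands in for "some existing tokenizer", a toy mean-pooling classifier stands in for the new architecture, tokenization happens in the dataset via datasets.map (not inside the model or the loader), and forward returns a loss when labels are passed so the HF Trainer can use it. All names (MyConfig, MyModel, the tiny dataset, the output dir) are just for illustration.

import torch.nn as nn
from datasets import Dataset
from transformers import (AutoTokenizer, PretrainedConfig, PreTrainedModel,
                          Trainer, TrainingArguments)


class MyConfig(PretrainedConfig):
    model_type = "my-toy-model"  # arbitrary name, just for illustration

    def __init__(self, vocab_size=50257, hidden_size=128, num_labels=2,
                 hidden_dropout_prob=0.5, **kwargs):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_labels = num_labels
        self.hidden_dropout_prob = hidden_dropout_prob


class MyModel(PreTrainedModel):
    config_class = MyConfig

    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.linear = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        pooled = self.embedding(input_ids).mean(dim=1)  # (batch, hidden_size); ignores padding for brevity
        logits = self.linear(self.dropout(pooled))      # (batch, num_labels)
        if labels is not None:
            # Returning a loss when labels are present is what the HF Trainer expects.
            loss = nn.functional.cross_entropy(logits, labels)
            return {"loss": loss, "logits": logits}
        return {"logits": logits}


# 2. Tokenize in the dataset (datasets.map), not inside the model or the loader.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

raw = Dataset.from_dict({"text": ["a good example", "a bad example"], "labels": [1, 0]})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], padding="max_length", truncation=True, max_length=16),
    batched=True,
    remove_columns=["text"],
)

model = MyModel(MyConfig(vocab_size=len(tokenizer)))

# 3. Smoke test: does it run under the HF Trainer?
args = TrainingArguments(output_dir="tmp_out", num_train_epochs=1,
                         per_device_train_batch_size=2, report_to="none")
Trainer(model=model, args=args, train_dataset=tokenized).train()

A plain PyTorch loop (the other half of 3.) would presumably just do tokenized.set_format("torch"), wrap it in a torch.utils.data.DataLoader, and call the model directly.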

One other thing that would be worth trying is opening up an existing model, e.g. T5, seeing how it's implemented, and copying the style?
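
If that is the way to go, the modeling code ships with the installed package, so (assuming transformers is installed) something like this prints the path of the T5 implementation to open and read; the same works for any other architecture module:

import inspect
from transformers.models.t5 import modeling_t5

print(inspect.getsourcefile(modeling_t5))  # path to modeling_t5.py on disk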


refs: