I want to create a new Hugging Face (HF) architecture that uses an existing tokenizer (any good one is fine). Let's say a decoder, to make it concrete (but covering both would be better).
How does one do this? I found the "Create a custom architecture" tutorial, but it honestly felt incomplete. (FYI, I also saw this one for causal LMs: "How to Train a Custom Hugging Face LM for Text Generation? Part A: Create a HF dataset from CSV file" on YouTube.) Is there one with a full, running end-to-end code example?
Here is an example of what I have right now:
import torch
import torch.nn as nn
from transformers import PreTrainedTokenizer, PreTrainedModel, PretrainedConfig


class MyTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab_file, **kwargs):
        super().__init__(vocab_file, **kwargs)

    def __call__(self, text):
        # Whitespace tokenization, then vocabulary lookup.
        tokens = text.split()
        token_ids = self.convert_tokens_to_ids(tokens)
        return token_ids


class MyModel(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.linear = nn.Linear(config.hidden_size, config.num_labels)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, **kwargs):
        embeddings = self.embedding(input_ids)
        # Mean-pool the token embeddings over the sequence dimension.
        pooled = torch.mean(embeddings, dim=1)
        pooled = self.dropout(pooled)
        # One logit per label.
        logits = self.linear(pooled)
        return logits


config = PretrainedConfig(vocab_size=1000, hidden_size=128, num_labels=2, hidden_dropout_prob=0.5)
tokenizer = MyTokenizer("path/to/vocab/file")
model = MyModel(config)
input_ids = tokenizer("This is a test")
logits = model(torch.tensor([input_ids]))
But to be more principled, I think a solution should satisfy the following:
- It satisfies the standard HF model API, so that it works with the HF Trainer, handles GPU placement for the data and model, and is compatible with PyTorch data loaders.
- The tokenizer integration is also seamless. At what point do we tokenize? Inside the model, or inside the data loader?
- What would be the test that it works the way an HF model works? My guess is: 1. it works with a custom PyTorch training loop; 2. it works with the HF Trainer.
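To make these criteria concrete, here is a minimal sketch of roughly what I would expect a solution to look like, not a definitive implementation. The names `TinyDecoderConfig`, `TinyDecoderLM`, `collate`, the `"tiny_decoder"` model type and the token ids are all made up for illustration. It assumes tokenization has already happened before the data loader, so the model only ever sees `input_ids`, and it covers only my first test (a custom PyTorch training loop).

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutput


class TinyDecoderConfig(PretrainedConfig):
    model_type = "tiny_decoder"  # hypothetical name, not a registered HF model type

    def __init__(self, vocab_size=1000, hidden_size=64, num_hidden_layers=2,
                 num_attention_heads=4, max_position_embeddings=128, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.max_position_embeddings = max_position_embeddings
        super().__init__(**kwargs)


class TinyDecoderLM(PreTrainedModel):
    config_class = TinyDecoderConfig

    def __init__(self, config):
        super().__init__(config)
        self.tok_emb = nn.Embedding(config.vocab_size, config.hidden_size)
        self.pos_emb = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        layer = nn.TransformerEncoderLayer(
            config.hidden_size, config.num_attention_heads,
            dim_feedforward=4 * config.hidden_size, batch_first=True,
        )
        # Self-attention layers; the causal mask in forward() makes them decoder-style.
        self.blocks = nn.TransformerEncoder(
            layer, config.num_hidden_layers, enable_nested_tensor=False
        )
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.post_init()

    def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):
        # attention_mask is accepted but ignored in this sketch (assumes right padding;
        # padded positions are excluded from the loss via labels == -100).
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device)
        hidden = self.tok_emb(input_ids) + self.pos_emb(positions)
        # -inf above the diagonal: each position attends only to itself and the past.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=input_ids.device),
            diagonal=1,
        )
        hidden = self.blocks(hidden, mask=causal_mask)
        logits = self.lm_head(hidden)
        loss = None
        if labels is not None:
            # Next-token prediction: logits at position t are scored against token t+1.
            loss = nn.functional.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
                ignore_index=-100,
            )
        return CausalLMOutput(loss=loss, logits=logits)


def collate(batch, pad_id=0):
    # batch: list of already-tokenized examples (lists of token ids).
    max_len = max(len(ids) for ids in batch)
    input_ids = torch.full((len(batch), max_len), pad_id, dtype=torch.long)
    labels = torch.full((len(batch), max_len), -100, dtype=torch.long)
    for i, ids in enumerate(batch):
        input_ids[i, :len(ids)] = torch.tensor(ids)
        labels[i, :len(ids)] = torch.tensor(ids)
    return {"input_ids": input_ids, "labels": labels}


torch.manual_seed(0)
config = TinyDecoderConfig()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyDecoderLM(config).to(device)
data = [[5, 8, 13, 21], [3, 1, 4, 1, 5, 9], [2, 7, 1, 8]]  # made-up token ids
loader = DataLoader(data, batch_size=3, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    out = model(**batch)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

print(out.logits.shape)  # torch.Size([3, 6, 1000])
```

Returning an output object with a `loss` field when `labels` is passed is, as far as I understand, also what the HF Trainer relies on, but I have not verified the Trainer path with this sketch.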
One other thing that might be worth trying is opening up an existing model, e.g. T5, seeing how it’s implemented, and copying the style?
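As a starting point for that, a small sketch that inspects the T5 class shipped with the installed `transformers` package (no weights are downloaded). It only uses the standard `inspect` module, and the variable names are my own.

```python
import inspect
from transformers import PreTrainedModel, T5ForConditionalGeneration

# Class hierarchy: shows which HF base classes T5 builds on.
mro_names = [cls.__name__ for cls in T5ForConditionalGeneration.__mro__]
print(mro_names)

# Argument names that T5's forward() accepts.
forward_params = list(inspect.signature(T5ForConditionalGeneration.forward).parameters)
print(forward_params)

# Path of the source file to read and imitate.
print(inspect.getsourcefile(T5ForConditionalGeneration))
```

The printed file path is the module to open when copying the style of the config, model and head classes.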
refs: