How to load an old (Ko)BERT .pt file and convert the code that ran the old model into modern transformers/Hugging Face code

hey everyone, I essentially have code that was written for the old kobert (Korean BERT) and gluonnlp Python libraries, plus a .pt file with the trained model.
I would like to be able to run the trained model using huggingface.
The details:
The training seems to have been done on one of the KoBERT models (skt/kobert-base-v1 or maybe monologg/kobert on the Hugging Face Hub) with these hyperparameters:
max_len = 512
batch_size = 6
warmup_ratio = 0.1
num_epochs = 20
max_grad_norm = 1
log_interval = 20
learning_rate = 5e-6
num_workers = 2
n_splits = 5
and this category list:
```python
category_list = ["화장품", "패션", "요리음식", "여행아웃도어", "인테리어", "엔터테인먼트", "육아", "아이티", "자동차", "헬스/피트니스", "반려동물"]
```

i.e. it has been trained to take a single sentence and return the most relevant of these categories (cosmetics, fashion, cooking/food, travel/outdoors, interior design, entertainment, parenting, IT, cars, health/fitness, pets).
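
if it helps, my understanding is that the label mapping for a transformers config would come straight from that list; here is a minimal sketch, assuming the order of category_list matches the output indices of the trained classification head (which I haven't verified):

```python
# a minimal sketch, assuming the order of category_list above matches the
# output indices of the trained classification head
id2label = {i: label for i, label in enumerate(category_list)}
label2id = {label: i for i, label in enumerate(category_list)}
# these would then go into e.g. BertConfig(num_labels=11, id2label=id2label, label2id=label2id)
```
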
a dataset class has been defined like this:
```python
from torch.utils.data import Dataset
import gluonnlp as nlp

class BERTDataset(Dataset):
    def __init__(self, dataset, bert_tokenizer, max_len, pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)

        self.sentences = []

        for data in dataset:
            if len(data) <= max_len:
                self.sentences.append(transform([data]))
            else:
                self.sentences.append(transform([data[:max_len]]))

    def __getitem__(self, i):
        return self.sentences[i]

    def __len__(self):
        return len(self.sentences)
```

from what I can tell, beyond the ugly code, this is roughly equivalent to passing truncation=True to a tokenizer (though the data[:max_len] slice cuts by characters rather than tokens), but I'm not 100% sure.
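
if that's right, then the modern equivalent should look something like the sketch below; the model id, and whether KoBERT's tokenizer loads through AutoTokenizer at all, are assumptions on my part:

```python
from transformers import AutoTokenizer

# my guess at the modern replacement for BERTSentenceTransform; whether
# skt/kobert-base-v1's tokenizer actually loads via AutoTokenizer is an assumption
tokenizer = AutoTokenizer.from_pretrained("skt/kobert-base-v1", trust_remote_code=True)

sentences = ["예시 문장입니다"]   # stand-in for the caption list used further down
encoded = tokenizer(
    sentences,
    max_length=512,          # max_len
    padding="max_length",    # pad=True in BERTSentenceTransform
    truncation=True,         # replaces the manual data[:max_len] slicing
    return_tensors="pt",
)
# encoded.input_ids / encoded.token_type_ids / encoded.attention_mask should correspond to
# the (token_ids, segment_ids, valid_length-derived attention mask) triple the old code produces
```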

a classifier has been defined; as far as I can tell it just rebuilds the tokenizer's equivalent of 'attention_mask' from the valid lengths and puts a linear classification head on top of BERT's pooled output:
```python
import torch
from torch import nn

class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size=768,
                 num_classes=11,
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate

        self.classifier = nn.Linear(hidden_size, num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)

    def gen_attention_mask(self, token_ids, valid_length):
        # 1 for the first valid_length positions, 0 for the padding
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)

        _, pooler = self.bert(input_ids=token_ids,
                              token_type_ids=segment_ids.long(),
                              attention_mask=attention_mask.float().to(token_ids.device))
        out = self.dropout(pooler) if self.dr_rate else pooler  # the original crashed here when dr_rate was None
        return self.classifier(out)
```
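
my assumption is that this whole class is basically a BertForSequenceClassification in disguise (BERT encoder → optional dropout → Linear(768, 11) on the pooled output), i.e. something like the sketch below, where the base checkpoint is just my guess:

```python
from transformers import BertConfig, BertForSequenceClassification

# a minimal sketch of what I think BERTClassifier corresponds to in transformers;
# "skt/kobert-base-v1" is an assumption about which base checkpoint was used
config = BertConfig.from_pretrained(
    "skt/kobert-base-v1",
    num_labels=11,            # len(category_list)
    classifier_dropout=None,  # dr_rate in the original class
)
hf_model = BertForSequenceClassification.from_pretrained("skt/kobert-base-v1", config=config)
# logits = hf_model(input_ids=..., token_type_ids=..., attention_mask=...).logits
```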

and a tokenizer & initial model are loaded:
```python
from kobert.utils import get_tokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model

bertmodel, vocab = get_pytorch_kobert_model()

tokenizer = get_tokenizer()
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)
threshold = 5.26
device = torch.device("cuda:0")

model_name = "…\kobertbest_512.pt"
modelbest = torch.load(model_name, map_location=device)
modelbest.to(device)
modelbest.eval()
```
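
for the .pt file itself, torch.load() clearly gives back the whole pickled BERTClassifier object, so my naive idea for the conversion was to copy its weights into a transformers model, roughly like this (completely untested, and it assumes the old kobert/gluonnlp environment is still importable so the pickle can be loaded at all); is that even the right general direction?

```python
import torch
from transformers import BertForSequenceClassification

# completely untested sketch: unpickle the old BERTClassifier and copy its weights
# into a transformers model; "skt/kobert-base-v1" and the output path are assumptions
hf_model = BertForSequenceClassification.from_pretrained("skt/kobert-base-v1", num_labels=11)

old_model = torch.load("kobertbest_512.pt", map_location="cpu")  # the .pt file above

hf_model.bert.load_state_dict(old_model.bert.state_dict())              # encoder weights
hf_model.classifier.load_state_dict(old_model.classifier.state_dict())  # the Linear(768, 11) head
# (this may need strict=False or some key renaming if the state_dict keys don't line up exactly)

hf_model.save_pretrained("kobert-category-classifier")
```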

then the model is used like this:
```python
import gc
import re
import unicodedata

from tqdm import notebook

def GetMediaCategory(captionlist):
    for i in range(len(captionlist)):
        print(captionlist[i])
        print(type(captionlist[i]))
        captionlist[i] = unicodedata.normalize('NFC', captionlist[i])
        # keep only runs of Hangul characters
        captionlist[i] = ' '.join(re.compile('[가-힣]+').findall(captionlist[i]))
        if len(captionlist[i]) == 0:
            captionlist[i] = '기타'  # "other"

    datalist = BERTDataset(captionlist, tok, max_len, True, False)
    test_dataloader = torch.utils.data.DataLoader(datalist, batch_size=batch_size, num_workers=num_workers)
    gc.collect()
    wholeout = []
    wholevalue = []

    for batch_id, (token_ids, valid_length, segment_ids) in enumerate(notebook.tqdm(test_dataloader)):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        outlist = []
        valuelist = []
        out = modelbest(token_ids, valid_length, segment_ids)
        for outi in out:
            valuelist.append(outi.max().tolist())
            if outi.max().tolist() > threshold:
                outlist.append(category_list[outi.argmax()])
            else:
                outlist.append('기타')
        wholeout += outlist
        wholevalue += valuelist

    return wholeout, wholevalue
```

again, please excuse the shitty code. I’ve refactored it, but since I couldn’t run the original or refactored code, it felt wrong to share undebugged code. presumably this was debugged by the original authors :confused:
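
for completeness, this is roughly the shape I'd hope the inference ends up having on the transformers side; again just a sketch that reuses the hf_model / tokenizer / category_list / threshold guessed at above, with the raw-logit thresholding copied from the function as-is:

```python
import re
import unicodedata
import torch

# rough sketch of the inference I'd hope to end up with; hf_model, tokenizer,
# category_list and threshold are the objects defined (or guessed at) above
def get_media_category(captionlist, hf_model, tokenizer, device="cuda:0"):
    hf_model.to(device).eval()
    cleaned = []
    for caption in captionlist:
        caption = unicodedata.normalize("NFC", caption)
        caption = " ".join(re.findall("[가-힣]+", caption))  # keep only Hangul runs
        cleaned.append(caption if caption else "기타")        # "other"

    encoded = tokenizer(cleaned, max_length=512, padding=True, truncation=True,
                        return_tensors="pt").to(device)
    with torch.no_grad():
        logits = hf_model(**encoded).logits                   # shape (batch, 11)

    values, indices = logits.max(dim=-1)
    categories = [category_list[i] if v > threshold else "기타"
                  for v, i in zip(values.tolist(), indices.tolist())]
    return categories, values.tolist()
```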

anyhow, I hope this is enough information to work out how to load and use the trained model with the Hugging Face ecosystem. If more of the original code would be useful, I'll gladly upload it; I just thought these pieces were the most relevant ones.