Pretrain a model on a very specific language for NER

Hello everyone,

I’m new to the world of NLP and more specifically to the transformers library. As part of a project, I need to train a model for named entity recognition (NER) on a very specific language, so I’d like to pretrain a model from scratch.

I started by training a tokenizer so that the tokens are adapted to my language:

# Libraries
import pandas as pd
from tokenizers import Regex
from tokenizers import decoders
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers import normalizers
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import RobertaProcessing
from tokenizers.normalizers import NFD, StripAccents, Replace

# Normalization pipeline
normalizer = normalizers.Sequence([NFD(), StripAccents(), Replace(Regex(r"\d"), "0")])

# Pre-tokenizer
pre_tokenizer = ByteLevel(add_prefix_space = True)

# Model
tokenizer = Tokenizer(BPE(unk_token = "<unk>"))
tokenizer.normalizer = normalizer
tokenizer.pre_tokenizer = pre_tokenizer

# Post-Processing
tokenizer.post_processor = RobertaProcessing(sep = ("</s>", 2), cls = ("<s>", 0))

# Decoder
tokenizer.decoder = decoders.ByteLevel()

# Trainer
trainer = BpeTrainer(vocab_size = 20000, min_frequency = 2, special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Load data
datis = pd.read_csv("../../Data/unlabeled_datis.csv")
data = datis['message'].tolist()

# Train the tokenizer
tokenizer.train_from_iterator(data, trainer = trainer)

# Save the trained tokenizer
tokenizer.save("datis_tokenizer.json")
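
As a quick sanity check, the normalization pipeline above can be approximated in plain Python to preview what a message looks like before BPE sees it. This is only a sketch using the standard library: `normalize_message` is a hypothetical helper, the sample string is made up, and treating every Unicode mark category as an accent is an assumption about how `StripAccents` behaves, so edge cases may differ from the `tokenizers` implementation.

```python
import re
import unicodedata


def normalize_message(text: str) -> str:
    """Approximate the NFD, StripAccents and digit-replacement steps."""
    # NFD splits accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (assumed to match what StripAccents removes).
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )
    # Map every digit to "0", as the Replace step does.
    return re.sub(r"\d", "0", stripped)


if __name__ == "__main__":
    # Made-up sample, not taken from the real dataset.
    print(normalize_message("Café 123, température 18"))
```

With this sample the helper returns "Cafe 000, temperature 00". One consequence worth keeping in mind is that, once all digits collapse to 0, the model can only see the shape of a number and not its value.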

And I’ve written some code to pre-train a RoBERTa model using the previous tokenizer:

# Libraries
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import Dataset
from transformers import RobertaTokenizerFast
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
from transformers import RobertaConfig, RobertaForMaskedLM

# Parameters
model_max_length = 128

# Tokenizer
tokenizer = RobertaTokenizerFast(tokenizer_file = "datis_tokenizer.json")

# Training dataset
class tokenizedDataset(Dataset):
    def __init__(self, filepath):
        self.tokenizer = tokenizer
        self.data = pd.read_csv(filepath)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, ind):
        # The message text is read from the last column of the CSV
        message = self.data.iloc[ind, -1]
        tokenized_text = tokenizer.encode(message, truncation = True, max_length = model_max_length - 2, return_special_tokens_mask = True)
        return {'input_ids': tokenized_text}

training_set = tokenizedDataset("../../Data/unlabeled_datis.csv")

# Data collator
data_collator = DataCollatorForLanguageModeling(tokenizer = tokenizer, mlm = True, mlm_probability = 0.15, return_tensors = "pt")

config = RobertaConfig(vocab_size = tokenizer.vocab_size, max_position_embeddings = model_max_length)
model = RobertaForMaskedLM(config = config)

# Training arguments
training_args = TrainingArguments(output_dir = "./checkpoint_model", overwrite_output_dir = True)

# Training
trainer = Trainer(model = model, args = training_args, data_collator = data_collator, train_dataset = training_set)
trainer.train()

# Save the model (with no argument, it is written to output_dir)
trainer.save_model()
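
For readers less familiar with what the data collator does to each batch, the sketch below mimics the usual masked-language-modelling recipe in plain Python: about 15% of the non-special tokens are selected, and of those roughly 80% become `<mask>`, 10% become a random token and 10% stay unchanged, while unselected positions get the label -100 so the loss ignores them. `mask_tokens` is a hypothetical helper and a simplification, not the library's implementation; the special-token ids 0 to 4 assume the order given to `BpeTrainer` above.

```python
import random

# Ids assumed from the special_tokens order passed to BpeTrainer:
# <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4
SPECIAL_IDS = {0, 1, 2, 3, 4}
MASK_ID = 4
IGNORE_INDEX = -100  # label value skipped by the loss


def mask_tokens(input_ids, vocab_size, mlm_probability=0.15, seed=None):
    """Return (masked_ids, labels) for one sequence of token ids."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Special tokens are never selected for prediction.
        if tok in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID  # about 80%: replace with <mask>
        elif roll < 0.9:
            # about 10%: replace with a random non-special token (simplification)
            masked[i] = rng.randrange(len(SPECIAL_IDS), vocab_size)
        # remaining 10%: keep the original token
    return masked, labels


if __name__ == "__main__":
    ids = [0, 57, 812, 93, 4051, 17, 2]  # made-up ids: <s> ... </s>
    print(mask_tokens(ids, vocab_size=20000, seed=0))
```

Because the selection is random, different epochs see different masked positions for the same message, which is why the masking is done at batch time rather than stored in the dataset.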

For the moment, the code works when I run it on my personal computer (with a single CPU), but before I run it on a more powerful machine (with one or a few GPUs), I’d like answers to a few questions:

  • Is my overall approach correct for doing NER on a very specific language? (I think so, since my approach is similar to that of this article.)
  • Does my code follow the transformers library’s best practices? For example, should the __getitem__ method return a dictionary or just the list of tokens?
  • Does the code need to be adapted to train on a GPU (with etc)? I have the impression that Trainer takes care of this automatically, but I’d like to be sure.
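
To make the second question more concrete: whatever `__getitem__` returns, the collator ultimately has to turn variable-length id lists into one rectangular batch. A rough plain-Python sketch of that dynamic padding step is below. `pad_batch` is a hypothetical helper, and the pad id 1 assumes `<pad>` is the second entry in the `special_tokens` list above. It does not settle which return type is preferable; it only illustrates what a batch built from dictionary items would contain.

```python
PAD_ID = 1  # assumed id of <pad>, from the special_tokens order above


def pad_batch(examples, pad_id=PAD_ID):
    """Pad a list of {'input_ids': [...]} items to the longest sequence."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": []}
    for ex in examples:
        ids = list(ex["input_ids"])
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return batch


if __name__ == "__main__":
    # Made-up items shaped like the output of __getitem__.
    items = [{"input_ids": [0, 5, 2]}, {"input_ids": [0, 5, 6, 7, 2]}]
    print(pad_batch(items))
```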

Sorry if some of the answers can be found in the documentation; I’m working my way through it, but it’s very dense. Thanks for the help!