Fine-tune, or train from scratch?

I have a corpus of about 15,000 documents, totalling about 8 GB of text, which I want to use as the source material for a text generator.

With that much data, would it make sense to use a pre-trained model (like gpt2-large), and then fine-tune it on my corpus? Or would it make more sense to train a new language model from scratch using my data? What would be the trade-offs between those two options?

I know the answer is probably complex, but I’m interested in understanding what considerations to take into account, especially how those two options affect my budget for running the training process in the cloud.

Definitely fine-tune, IMO. It will be much faster and give better results. Would you rather teach a third grader how to predict the next word on your dataset, or a newborn?
One argument for training from scratch is more control: a from-scratch model is less likely to say something racist or otherwise wrong if it is trained on just your (presumably friendly) data.
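
If you do go the fine-tuning route, here is a minimal sketch of what it could look like with the same era of the Trainer API as the from-scratch code further down (the file name my_corpus.txt and the hyperparameters are placeholders you would need to adjust):

from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

# Start from the pre-trained weights instead of a fresh config
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")

# TextDataset chunks the raw text file into block_size-token examples
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="my_corpus.txt",
    block_size=128,
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-large-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,  # gpt2-large is big; adjust to your GPU memory
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()

The key difference from training from scratch is GPT2LMHeadModel.from_pretrained(...) instead of building the model from a bare GPT2Config, so you start from the pre-trained weights rather than a random initialization.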

Gotcha. Good advice all around!

This is the code to train a GPT-2 from scratch:

from transformers import DataCollatorForLanguageModeling
from transformers import BertTokenizerFast
from transformers import Trainer, TrainingArguments, GPT2LMHeadModel, GPT2Config

import torch
import os
from torch.utils.data.dataset import Dataset
from transformers.utils import logging
from transformers.tokenization_utils import PreTrainedTokenizer

logger = logging.get_logger(__name__)


class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach soon.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path), f"Input file path {file_path} not found"
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

        batch_encoding = tokenizer(lines, add_special_tokens=False, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)


tokenizer = BertTokenizerFast(
        vocab_file = r"D:\2020.09.15GPT2\vocab.txt",
        unk_token='<unk>',
        sep_token='<sep>',
        pad_token='<pad>',
        cls_token='</s>',
        mask_token='<mask>') 

special_tokens_dict = {"bos_token": "<s>", "eos_token": "</s>"}
tokenizer.add_special_tokens(special_tokens_dict)
config = GPT2Config.from_pretrained(r'D:\2020.09.15GPT2\config.json')
model = GPT2LMHeadModel(config)  # randomly initialized weights -- this is the "from scratch" part
model.resize_token_embeddings(len(tokenizer))  # update the model embeddings to the new vocabulary size


def load_dataset(train_path, tokenizer):
    train_dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128)

    # mlm=False -> causal language modeling (next-token prediction), not masked LM
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )

    return train_dataset, data_collator

train_path = r'Seven_Lines_Verse_plus_sign.txt'

train_dataset,data_collator = load_dataset(train_path,tokenizer)


training_args = TrainingArguments(
    output_dir=r"D:\2020.09.15GPT2",  # the output directory
    overwrite_output_dir=True,        # overwrite the content of the output directory
    save_total_limit=20,              # keep at most the 20 most recent checkpoints
    num_train_epochs=5,               # number of training epochs
    per_device_train_batch_size=36,   # batch size for training
    per_device_eval_batch_size=36,    # batch size for evaluation
    eval_steps=1000,                  # number of update steps between two evaluations
    save_steps=1000,                  # save a checkpoint every 1000 steps
    warmup_steps=500,                 # number of warmup steps for the learning rate scheduler
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=None,
    prediction_loss_only=True,
)

trainer.train()

trainer.save_model()
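
One thing to note: this example uses a BERT-style WordPiece tokenizer with a custom vocab.txt, which appears to be set up for a Chinese verse corpus (see the generation example below). If you train from scratch on your own (e.g. English) corpus, you would typically also train a byte-level BPE tokenizer on that corpus first. A minimal sketch with the tokenizers library, where the file name, output directory, and vocab size are placeholders:

from tokenizers import ByteLevelBPETokenizer

# Train a GPT-2-style byte-level BPE vocabulary on your own corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_corpus.txt"],
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("my_tokenizer")  # writes vocab.json and merges.txt

The saved files can then be loaded for training with GPT2TokenizerFast.from_pretrained("my_tokenizer").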

The generation:

from transformers import pipeline, BertTokenizer

tokenizer = BertTokenizer.from_pretrained(r"G:\2020.09.07 pytorch_pretrained_models\bert-base-chinese")
# the checkpoint directory already contains the saved config, so it does not need to be passed separately
nlp = pipeline('text-generation', model=r'D:\2020.09.15GPT2\checkpoint-2000', tokenizer=tokenizer)

nlp('漠 漠 水 田')

[{'generated_text': '漠 漠 水 田 不 可 怜 , 然 一 笑 更 相 逢 。 风 流 自 是 三 千 里 , 雨 落 谁 知 一 百 年 。 万 事 有 时 无 处 处 , 一 生 无 处 有 时 时 。 何 人 莫 问 东 风 月 , 只 有 春 风 与 此 身 。 < / s > 吟 未 能 为 此 地 如 何 必 君 非 吾 今 日 空 在 此 去 何 妨 何 须 多 少 人 间 不 知 何 处 处 处'}]
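
If you want more control over the output, the text-generation pipeline forwards generation arguments on to model.generate(); the sampling settings below are just illustrative:

nlp('漠 漠 水 田', max_length=64, do_sample=True, top_k=50, temperature=0.9)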

Hey, what is the source for this code? Did you try it already? Also, how do you use multiple GPUs?

Hi @vikasRajashekar
You can use the run_language_modeling.py script from here; to use multiple GPUs you can use the following template:

python -m torch.distributed.launch \
    --nproc_per_node=NUM_GPUS_YOU_HAVE run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE
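
Here NUM_GPUS_YOU_HAVE, $TRAIN_FILE, and $TEST_FILE are placeholders: replace the first with the number of GPUs on the machine, and point the two variables at your training and evaluation text files.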