Fine-tune, or train from scratch?

I have a corpus of about 15,000 documents, totalling about 8 GB of text, which I want to use as the source material for a text generator.

With that much data, would it make sense to use a pre-trained model (like gpt2-large), and then fine-tune it on my corpus? Or would it make more sense to train a new language model from scratch using my data? What would be the trade-offs between those two options?

I know the answer is probably complex, but I’m interested in understanding what considerations to take into account, especially how those two options affect my budget for running the training process in the cloud.

Definitely fine-tune, IMO. It will be much faster and give better results. Would you rather teach a third grader how to predict the next word on your dataset, or a newborn?
One argument for training from scratch is more control: a from-scratch model is less likely to say something racist or otherwise wrong if it is trained on just your (presumably friendly) data.
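
If you do go the fine-tuning route, here is a minimal sketch of what it could look like with the same era of the Trainer API as the from-scratch code further down (the file name my_corpus.txt and the hyperparameters are placeholders you would need to adjust):

from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

# Start from the pre-trained weights instead of a fresh config
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large")

# TextDataset chunks the raw text file into block_size-token examples
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="my_corpus.txt",
    block_size=128,
)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-large-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,  # gpt2-large is big; adjust to your GPU memory
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()

The key difference from training from scratch is GPT2LMHeadModel.from_pretrained(...) instead of building the model from a bare GPT2Config, so you start from the pre-trained weights rather than a random initialization.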

Gotcha. Good advice all around!

This is the code to train a GPT-2 from scratch:

from transformers import DataCollatorForLanguageModeling
from transformers import BertTokenizerFast
from transformers import Trainer, TrainingArguments, GPT2LMHeadModel, GPT2Config

import torch
import os
from torch.utils.data.dataset import Dataset
from transformers.utils import logging
from transformers.tokenization_utils import PreTrainedTokenizer

logger = logging.get_logger(__name__)


class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach soon.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path), f"Input file path {file_path} not found"
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        logger.info("Creating features from dataset file at %s", file_path)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]

        batch_encoding = tokenizer(lines, add_special_tokens=False, truncation=True, max_length=block_size)
        self.examples = batch_encoding["input_ids"]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)


tokenizer = BertTokenizerFast(
        vocab_file = r"D:\2020.09.15GPT2\vocab.txt",
        unk_token='<unk>',
        sep_token='<sep>',
        pad_token='<pad>',
        cls_token='</s>',
        mask_token='<mask>') 

special_tokens_dict = {"bos_token": "<s>", "eos_token": "</s>"}
tokenizer.add_special_tokens(special_tokens_dict)
config = GPT2Config.from_pretrained(r'D:\2020.09.15GPT2\config.json')
model = GPT2LMHeadModel(config)  # randomly initialized weights -- this is the "from scratch" part
model.resize_token_embeddings(len(tokenizer))  # update the model embeddings to the new vocabulary size


def load_dataset(train_path, tokenizer):
    train_dataset = LineByLineTextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128)

    # mlm=False -> causal language modeling (next-token prediction), not masked LM
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )

    return train_dataset, data_collator

train_path = r'Seven_Lines_Verse_plus_sign.txt'

train_dataset,data_collator = load_dataset(train_path,tokenizer)


training_args = TrainingArguments(
    output_dir=r"D:\2020.09.15GPT2",  # the output directory
    overwrite_output_dir=True,        # overwrite the content of the output directory
    save_total_limit=20,              # keep at most the 20 most recent checkpoints
    num_train_epochs=5,               # number of training epochs
    per_device_train_batch_size=36,   # batch size for training
    per_device_eval_batch_size=36,    # batch size for evaluation
    eval_steps=1000,                  # number of update steps between two evaluations
    save_steps=1000,                  # save a checkpoint every 1000 steps
    warmup_steps=500,                 # number of warmup steps for the learning rate scheduler
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=None,
    prediction_loss_only=True,
)

trainer.train()

trainer.save_model()
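
One thing to note: this example uses a BERT-style WordPiece tokenizer with a custom vocab.txt, which appears to be set up for a Chinese verse corpus (see the generation example below). If you train from scratch on your own (e.g. English) corpus, you would typically also train a byte-level BPE tokenizer on that corpus first. A minimal sketch with the tokenizers library, where the file name, output directory, and vocab size are placeholders:

from tokenizers import ByteLevelBPETokenizer

# Train a GPT-2-style byte-level BPE vocabulary on your own corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_corpus.txt"],
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("my_tokenizer")  # writes vocab.json and merges.txt

The saved files can then be loaded for training with GPT2TokenizerFast.from_pretrained("my_tokenizer").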

The generation:

from transformers import pipeline, BertTokenizer

tokenizer = BertTokenizer.from_pretrained(r"G:\2020.09.07 pytorch_pretrained_models\bert-base-chinese")
# the checkpoint directory already contains the saved config, so it does not need to be passed separately
nlp = pipeline('text-generation', model=r'D:\2020.09.15GPT2\checkpoint-2000', tokenizer=tokenizer)

nlp('漠 漠 水 田')

[{'generated_text': '漠 漠 水 田 不 可 怜 , 然 一 笑 更 相 逢 。 风 流 自 是 三 千 里 , 雨 落 谁 知 一 百 年 。 万 事 有 时 无 处 处 , 一 生 无 处 有 时 时 。 何 人 莫 问 东 风 月 , 只 有 春 风 与 此 身 。 < / s > 吟 未 能 为 此 地 如 何 必 君 非 吾 今 日 空 在 此 去 何 妨 何 须 多 少 人 间 不 知 何 处 处 处'}]
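
If you want more control over the output, the text-generation pipeline forwards generation arguments on to model.generate(); the sampling settings below are just illustrative:

nlp('漠 漠 水 田', max_length=64, do_sample=True, top_k=50, temperature=0.9)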

Hey, what is the source for this code? Did you try it already? Also, how do you use multiple GPUs?

Hi @vikasRajashekar
You can use the run_language_modeling.py script from here; to use multiple GPUs you can use the following template:

python -m torch.distributed.launch \
    --nproc_per_node=NUM_GPUS_YOU_HAVE run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE
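
Here NUM_GPUS_YOU_HAVE, $TRAIN_FILE, and $TEST_FILE are placeholders: replace the first with the number of GPUs on the machine, and point the two variables at your training and evaluation text files.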