Tokenizing my novel for a GPT model

Hello,

I’m a fiction author, and I wanted to fine-tune a pretrained GPT model to see what would happen if I asked it to write more chapters of my novel (30k words) in my style.

Unfortunately I think I’m getting stuck on tokenizing my novel and would appreciate any help.

I’ve written code to train the model on my novel, and it runs fine up until the actual training call:

from transformers import GPTNeoForCausalLM, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import Dataset, load_dataset

# Step 1: Import my novel
import docx
import pandas as pd

# Read each paragraph from a Word file
doc = docx.Document(r"C:\Users\chris\Downloads\The Black Squirrel (1).docx")
paras = [p.text for p in doc.paragraphs if p.text]

# Convert list to dataframe
df = pd.DataFrame(paras)
df.reset_index(drop=False, inplace=True)
df.rename(columns={'index': 'label', 0: 'text'}, inplace=True)

# Split my novel into train and test
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.05)

# Export novel as CSV to be read by Huggingface library
train.to_csv(r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_train.csv", index=False)
test.to_csv(r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_test.csv", index=False)

# Tokenize novel
datasets = load_dataset(
    'csv',
    data_files={'train': r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_train.csv",
                'test': r"C:\Users\chris\OneDrive\Documents\ML\data\black_squirrel_dataset_test.csv"})

# Instantiate tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B",
                                          pad_token='[PAD]')

# Do I need the below?
# tokenizer.enable_padding(pad_id=tokenizer.token_to_id('[PAD]'))

# Length (in tokens) of my longest paragraph -- not currently passed to the tokenizer
paragraphs = df['text']
max_length = max(len(tokenizer.encode(p)) for p in paragraphs)

# Tokenize my novel
def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True)

tokenized_datasets = datasets.map(tokenize_function, batched=True)

# Step 2: Train the model
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir=r"C:\Users\chris\OneDrive\Documents\ML\models",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=32,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    eval_steps=400,                  # number of update steps between two evaluations
    save_steps=800,                  # save a checkpoint every save_steps steps
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

trainer.train()

But when I train, I get a memory error:

The following columns in the training set don't have a corresponding argument in `GPTNeoForCausalLM.forward` and have been ignored: text. If text are not expected by `GPTNeoForCausalLM.forward`,  you can safely ignore this message.
C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 779
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 75
  Number of trainable parameters = 1315577856
  0%|          | 0/75 [19:12<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\chris\AppData\Local\Programs\Python\Python37\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 9, in <module>
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 1547, in train
    ignore_keys_for_eval=ignore_keys_for_eval,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 1791, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 2539, in training_step
    loss = self.compute_loss(model, inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\trainer.py", line 2571, in compute_loss
    outputs = model(**inputs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 752, in forward
    return_dict=return_dict,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 627, in forward
    output_attentions=output_attentions,
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 342, in forward
    feed_forward_hidden_states = self.mlp(hidden_states)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\models\gpt_neo\modeling_gpt_neo.py", line 300, in forward
    hidden_states = self.act(hidden_states)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\torch\nn\modules\module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\chris\PycharmProjects\37venv\lib\site-packages\transformers\activations.py", line 35, in forward
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
RuntimeError: [enforce fail at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 405798912 bytes.

I successfully fine-tuned a pretrained BERT model last night, so I know I should be able to run this. Also, I have 16 GB of RAM, so 405 MB shouldn’t actually be a problem. I am running on CPU, and I know that hurts performance, but I don’t see why that alone should lead to a memory error.
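
One thing I did wonder about is whether a batch size of 32 is just too big for a 1.3B-parameter model on CPU. I could presumably try something like this (untested), but I’m not sure that’s the real problem:

# Untested: smaller per-step batch with gradient accumulation to keep the effective batch at 32
training_args = TrainingArguments(
    output_dir=r"C:\Users\chris\OneDrive\Documents\ML\models",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=1,   # process one paragraph at a time
    gradient_accumulation_steps=32,  # accumulate gradients to mimic a batch of 32
    per_device_eval_batch_size=1,
    eval_steps=400,
    save_steps=800,
    warmup_steps=500,
)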

Since I’m new to tokenization, I feel like maybe my tokens are too big, or I messed up the padding or one of the other arguments that GPT Neo requires. (I can change to any GPT model, but I wanted to try GPT Neo since I’ve read it’s newer.)
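
For example, should I be capping the padded length explicitly instead of letting padding='max_length' pad everything out to the model’s maximum (2048 tokens, I think)? Something like this, maybe (untested; the 512 cap is just a guess on my part):

# Untested idea: pad/truncate to my longest paragraph, capped at an arbitrary 512 tokens,
# instead of the model's full context length. Is this the right way to do it?
def tokenize_function(examples):
    return tokenizer(examples["text"],
                     padding='max_length',
                     truncation=True,
                     max_length=min(max_length, 512))  # max_length computed above; 512 is a guess

tokenized_datasets = datasets.map(tokenize_function, batched=True)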

I’m specifically wondering whether feeding in whole paragraphs is the issue. I write long paragraphs, so the token length (which came out at 387 for my longest paragraph, unless I measured that wrong) could be driving the memory use. I could try reading in sentences instead, but the code I’ve found to split my novel into sentences will be a little inaccurate, since it splits on periods, which won’t work when a sentence ends with an exclamation mark, a dialogue mark, an ellipsis, etc. (rough sketch of what I mean below).
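
For reference, this is roughly the splitting approach I have now, plus an NLTK alternative I haven’t actually tried (I believe sent_tokenize needs the punkt data downloaded first):

# What I have now: a naive split on periods (misses !, ?, ellipses, closing quotes, etc.)
sentences = [s.strip() + '.' for p in paras for s in p.split('.') if s.strip()]

# Alternative I haven't tried: NLTK's sentence tokenizer
# import nltk
# nltk.download('punkt')
# sentences = [s for p in paras for s in nltk.sent_tokenize(p)]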

I would appreciate any help.

For anyone interested in replicating this specifically, here’s a link to the Google Doc of my novel, The Black Squirrel, where you can download it as a .docx.

Thank you!
Christina
