Can I create a dataset for fine-tuning a Llama model like in the main text?

I want to create my own Llama model.
Many people recommended xwin-mlewd-13b, but I ran into a lot of limitations while using it.
The biggest problem was the language.
So I am trying to fine-tune a model on the novel data I have.
However, I am not sure whether what I am doing is right, so I am asking a question.
I asked GPT-3 and it said my approach is right, but GPT's answers are often wrong, so I am asking you all.
What I want to do is simple.
I want to create new content based on the novel dataset, or have conversations with the characters in the novel.
xwin-mlewd-13b is suitable for what I want to do, but its language support is limited.
I made the dataset simply: all I did was correct the grammar of the novel text and convert it to a JSON file.
Each entry in the file looks like this:

    "text": "You will see at a glance how money comes in and goes out."
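
The loading code below reads the file as a list of objects with a "text" key (texts = [item['text'] for item in full_dataset]), so the full file would look roughly like this (the second entry is just an illustrative placeholder):

    [
      {"text": "You will see at a glance how money comes in and goes out."},
      {"text": "Another grammar-corrected sentence from the novel."}
    ]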
I made the code based on the text above:

    import os
    import json
    import random
    import torch
    import argparse
    from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
    from datasets import Dataset
    from peft import get_peft_model, LoraConfig
    from torch.optim import AdamW
    from accelerate import Accelerator

    # Set command line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--lr", type=float, default=3e-4)
    args = parser.parse_args()

    # Set model ID
    model_id = 'llama-3.2-Korean-Bllossom-3B'  # Enter desired model ID

    # Load config.json file
    config_file = f"{model_id}/config.json"
    if not os.path.exists(config_file):
        raise FileNotFoundError(f"Config file not found: {config_file}")
    with open(config_file, 'r') as f:
        config = json.load(f)

    # Get max_position_embeddings value (fallback if missing)
    max_position_embeddings = config.get('max_position_embeddings', 30000)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Set padding token: if pad_token is not present, use eos_token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Set pad_token_id
    pad_token_id = tokenizer.pad_token_id
    if pad_token_id is None or pad_token_id == -1:
        raise ValueError("pad_token_id is not set correctly; please check the tokenizer.")

    # LoRA setup
    lora_config = LoraConfig(
        r=3,  # Reduce r value to reduce memory usage
        lora_alpha=16,
        lora_dropout=0.1,
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "v_proj"]
    )

    # Load model to CPU
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model = get_peft_model(model, lora_config)

    # Set pad_token_id explicitly
    model.generation_config.pad_token_id = pad_token_id

    # Accelerator initialization
    accelerator = Accelerator()

    # Prepare model with Accelerator
    model, tokenizer = accelerator.prepare(model, tokenizer)

    # Load processed_dataset.json file
    dataset_file = 'processed_dataset.json'
    if not os.path.exists(dataset_file):
        raise FileNotFoundError(f"Dataset file not found: {dataset_file}")
    with open(dataset_file, 'r', encoding='utf-8') as f:
        full_dataset = json.load(f)

    # The data is in list form
    texts = [item['text'] for item in full_dataset]
    random.shuffle(texts)

    # Split data: use 80% as training data
    split_index = int(len(texts) * 0.8)
    train_texts = texts[:split_index]
    val_texts = texts[split_index:]

    # Create datasets
    train_dataset = Dataset.from_dict({"text": train_texts})
    val_dataset = Dataset.from_dict({"text": val_texts})

    # Function to remove unnecessary strings
    def remove_unwanted_strings(examples):
        examples['text'] = [text.replace('<>', '').replace('<>', '').strip() for text in examples['text']]
        return examples

    # Apply string removal
    train_dataset = train_dataset.map(remove_unwanted_strings, batched=True)
    val_dataset = val_dataset.map(remove_unwanted_strings, batched=True)

    # Data preprocessing function
    def preprocess_function(examples):
        # Perform tokenization only (no padding/truncation)
        tokenized_examples = tokenizer(examples['text'], return_tensors='pt')

        # Calculate the maximum length in the batch
        max_len = max([t.shape[1] for t in tokenized_examples['input_ids']])
        max_len = min(max_len, max_position_embeddings)  # Limit the maximum embedding length

        # Apply padding and truncation
        model_inputs = tokenizer(
            examples['text'],
            max_length=max_len,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )
        model_inputs['labels'] = model_inputs['input_ids']  # Set labels to input_ids
        return model_inputs

    # Preprocess the datasets
    tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
    tokenized_val_dataset = val_dataset.map(preprocess_function, batched=True)

    # Set TrainingArguments
    training_args = TrainingArguments(
        output_dir='./results',
        per_device_train_batch_size=args.batch_size,
        num_train_epochs=3,
        learning_rate=args.lr,
        logging_dir='./logs',
        logging_steps=10,
        eval_strategy="steps",  # Set validation interval
        eval_steps=500,  # Validate every 500 steps
        save_strategy="epoch",
        report_to="tensorboard",
        logging_first_step=True,
        fp16=True,  # Use mixed precision
        gradient_accumulation_steps=2,  # Adjusted gradient accumulation
        save_total_limit=3,  # Limit the number of models saved
        early_stopping_patience=3,  # Stop if there is no improvement over 3 validations
    )

    # Optimizer setup
    optimizer = AdamW(model.parameters(), lr=training_args.learning_rate)

    # Trainer setup
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_val_dataset,  # Add validation dataset
    )

    # Start training
    trainer.train()

    # Save model and tokenizer after training
    model.save_pretrained('./results')
    tokenizer.save_pretrained('./results')


If the purpose is to write a novel, a 3B model may be too small, but I think the rest is generally correct. However, someone may already have a fine-tuned model or some know-how for the same purpose, so it might be a good idea to look for a model for your own language first.

And here is an answer from HuggingChat:


The user is attempting to fine-tune a Llama-based model using a Korean novel dataset. Their approach has several valid elements, but there are key improvements and clarifications needed for success. Here’s a structured breakdown:


1. Dataset Preparation

  • Current Approach:

    • Cleaned novel text (grammar fixes) and saved as JSON with "text" keys.
    • Split 80/20 for training/validation.
  • Recommendations:

    • Quantity: A single novel may be insufficient for effective fine-tuning. Augment with additional Korean texts (e.g., books, articles) to avoid overfitting.
    • Formatting: Ensure no special characters (e.g., <>) remain after preprocessing. If the goal is role-play conversations, consider formatting the data as dialogue or character-specific prompts (see the sketch after this list).
    • Length: Truncate sequences to a practical maximum (e.g., 512 tokens) to avoid memory issues, especially with LoRA and small batch sizes.
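
For the role-play case, here is a purely illustrative sketch of turning speaker-tagged dialogue into "text" entries; the speaker/line fields and the layout are assumptions for illustration, not something the current dataset necessarily contains:

    # Hypothetical helper: the {"speaker": ..., "line": ...} structure is assumed
    # for illustration; adapt it to however dialogue is marked in the novel.
    def format_dialogue(turns):
        lines = [f'{t["speaker"]}: {t["line"]}' for t in turns]
        return {"text": "\n".join(lines)}

    example = format_dialogue([
        {"speaker": "Narrator", "line": "The ledger lay open on the desk."},
        {"speaker": "Accountant", "line": "You will see at a glance how money comes in and goes out."},
    ])
    # example["text"] can then be written to processed_dataset.json like any other entry.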

2. Code Improvements

A. Tokenization & Preprocessing

  • Issue: Redundant tokenization in preprocess_function (the text is tokenized twice).
  • Fix: Directly tokenize with max_length and dynamic padding/truncation:
    def preprocess_function(examples):
        tokenized = tokenizer(
            examples["text"],
            max_length=512,  # Explicitly set practical length
            truncation=True,
            padding="max_length",
            return_special_tokens_mask=True
        )
        tokenized["labels"] = tokenized["input_ids"].copy()
        # Mask padding tokens in loss (optional)
        tokenized["labels"] = [
            [-100 if mask == 0 else token for mask, token in zip(mask_ids, input_ids)]
            for mask_ids, input_ids in zip(tokenized["attention_mask"], tokenized["labels"])
        ]
        return tokenized
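
To apply it, the same dataset.map calls as in the original script can be used; adding remove_columns drops the raw text column so the Trainer only receives model inputs:

    tokenized_train_dataset = train_dataset.map(
        preprocess_function, batched=True, remove_columns=["text"]
    )
    tokenized_val_dataset = val_dataset.map(
        preprocess_function, batched=True, remove_columns=["text"]
    )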
    

B. Model & Training Configuration

  • LoRA Settings:

    • r=3 is very low. Increase to r=64 for better expressiveness (adjust based on VRAM).
    • Add modules_to_save=["embed_tokens", "lm_head"] if extending vocabulary.
  • Learning Rate:

    • 3e-4 is high for LoRA. Reduce to 1e-4 or 5e-5 for stability.
  • Evaluation:

    • Use metric_for_best_model="eval_loss" with load_best_model_at_end in TrainingArguments.
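
Putting these together, a sketch of the adjusted configuration (the values are starting points to tune, not definitive settings; note that load_best_model_at_end requires matching save and eval strategies):

    from peft import LoraConfig
    from transformers import TrainingArguments

    lora_config = LoraConfig(
        r=64,                          # higher rank than r=3; adjust for VRAM
        lora_alpha=16,
        lora_dropout=0.1,
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "v_proj"],
    )

    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=2,
        num_train_epochs=5,
        learning_rate=1e-4,            # lower than 3e-4 for LoRA stability
        eval_strategy="steps",         # older transformers versions: evaluation_strategy
        eval_steps=500,
        save_strategy="steps",         # must match eval_strategy
        save_steps=500,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=True,
        save_total_limit=3,
        logging_steps=10,
        report_to="tensorboard",
    )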

C. Padding Token Handling

  • Current: Uses eos_token as pad_token.
  • Fix: Add a dedicated pad token (critical for batched training):
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))
    model.generation_config.pad_token_id = tokenizer.pad_token_id
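
If you add a new token and resize the embeddings, remember to save the tokenizer alongside the model after training; otherwise the saved checkpoint and the tokenizer vocabulary can get out of sync at inference time.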
    

3. Training Considerations

  • Epochs: 3 epochs may be insufficient for a novel dataset. Start with 5-10 epochs.
  • Batch Size: Increase per_device_train_batch_size to 4-8 (if VRAM allows) for faster convergence.
  • Monitoring: Use report_to="wandb" to track metrics and visualize training.
  • Early Stopping: Enable via EarlyStoppingCallback with patience=2-3.
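
For the early-stopping point specifically, a minimal sketch (EarlyStoppingCallback needs load_best_model_at_end=True and metric_for_best_model set in TrainingArguments, as in section 2B):

    from transformers import EarlyStoppingCallback, Trainer

    trainer = Trainer(
        model=model,
        args=training_args,               # with load_best_model_at_end=True
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_val_dataset,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    )
    trainer.train()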

4. Post-Training

  • Evaluation: Test the model on held-out novel excerpts to ensure it captures style/characters.
  • Inference: Optimize generation parameters (e.g., temperature, top_p) for creative tasks.
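
As a rough starting point for creative generation, a minimal sketch (the sampling values are illustrative, not tuned):

    prompt = "Write the next scene of the novel."  # placeholder; use a prompt in the novel's language and style
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.pad_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))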

Final Checklist

✅ Augment dataset with diverse Korean texts.
✅ Add dedicated pad token and resize embeddings.
✅ Adjust LoRA rank (r=64) and learning rate (1e-4).
✅ Ensure labels mask padding tokens (-100).
✅ Monitor validation loss and save best model.

By addressing these points, the user can improve training stability and achieve a model better aligned with their novel’s style and dialogue goals.