Would you like to connect with me for a discussion on this?
Sure. This is my email id (and linkedin below)
shekar.ramamurthy@gmail.com
Sure. Email in profile.
If you want to keep adding new data, I don't think fine-tuning makes a lot of sense. You're much better off with a RAG pipeline: any knowledge you add is then instantly available in your system's "knowledge".
50 PDFs are plenty. Not for pretraining, of course. Not for fine-tuning either (unless style emulation is all you're after). But if you want to make a bot that answers based on the ideas/facts/knowledge inside them, use RAG plus an additional augmentation step like the one I described in my previous post.
P.S. I also suspect that fine-tuning on knowledge often makes a model dumber. The reason is that you're never really teaching the model "facts". You're teaching it specific sequences of tokens, i.e. specific ways to phrase answers. This also means you're teaching your model that every other way of answering the question is wrong.
Which is, of course, mistaken. Because the correct answer to a question can be phrased and formatted in many different ways. So when you fine-tune, you often inadvertently punish the model for perfectly good answers that you simply didn't think of when building your training set. It's a bit like sending Albert Einstein to the military and hitting him over the head for each word he utters until all he will do is say "Yes, sir!". You achieved compliance, for sure, but at what cost?
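Here's a minimal sketch of what such a RAG pipeline can look like, just to make it concrete. It assumes the sentence-transformers package and an in-memory index; the embedding model, chunking, and top-k are illustrative choices, not a prescription.

# Minimal RAG sketch (illustrative; model name and k are assumptions).
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly embedding model

def build_index(chunks):
    # Embed every chunk once; normalized vectors make dot product = cosine similarity.
    return np.asarray(embedder.encode(chunks, normalize_embeddings=True))

def retrieve(question, chunks, index, k=4):
    q = embedder.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(-(index @ q))[:k]
    return [chunks[i] for i in top]

def make_prompt(question, passages):
    context = "\n\n".join(passages)
    return ("Answer using ONLY the context below. If the answer is not there, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

# chunks = [...]  # e.g. the text chunks extracted from your PDFs
# index = build_index(chunks)
# prompt = make_prompt(q, retrieve(q, chunks, index))  # feed to whatever LLM you use

The point is in that last comment: any new chunk you embed and append to the index is available immediately, with no retraining.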
Thank you so much, @leobg, for clearing my doubt.
Hi there, I'm also facing the exact same issue. I would greatly appreciate it if you could share some reliable resources or guidance.
If the problem is extracting data from the PDF itself in a structured manner, this might be helpful: Adobe has its own PDF Extract API that outputs JSON. In particular, the documentation mentions that the service extracts text and tables.
I've linked to the docs below:
https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/
Edit: Adding a helpful Reddit post about using the Adobe PDF Extract API for PDF parsing, also linked below:
https://www.reddit.com/r/LocalLLaMA/comments/1anaooi/using_adobe_pdf_extract_api_for_pdf_parsing/
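In case it helps, here is a rough sketch of flattening the Extract API output into plain text. I'm assuming the structuredData.json layout (an "elements" list whose items carry "Path" and "Text" fields), so double-check the exact schema against the docs before relying on it.

# Sketch: flatten Adobe PDF Extract API output (structuredData.json) into plain text.
# Field names ("elements", "Path", "Text") are assumptions based on the docs; verify them.
import json

def extract_text(structured_json_path):
    with open(structured_json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    lines = []
    for el in data.get("elements", []):
        text = el.get("Text", "").strip()
        if not text:
            continue  # figures and table cells may be exported separately
        prefix = "# " if "/H1" in el.get("Path", "") else ""  # crude heading marker
        lines.append(prefix + text)
    return "\n".join(lines)

# print(extract_text("structuredData.json"))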
Hi, I made an example RAG system like you described. My data is around 1000 pages and includes images, tables, and text. I switched my project from Llama 3.2 to DeepSeek. I would like to discuss some topics about this project, but I can't find your email on your profile. My Gmail address is demirbagalper1@gmail.com. Can you email me? If you send me a mail, I can send you the full code and data (it is my personal project). Have a good day!
Hi @imvbhuvan, I am trying to implement a similar solution, i.e. train an existing medium-sized LLM on some documents internal to my organisation. I copied all the text in the Word/PDF documents into a plain text file, loaded it as a dataset with the Hugging Face load_dataset API, and trained the model. I tested the model before and after fine-tuning, and evidently the model produces different text before and after fine-tuning, the latter being more in line with my documents. However, the generated text still isn't quite useful.
Having read this thread, I am now confused as to whether this approach is correct or not.
Anyway, the first thing I need to accomplish is not just copying text from Word/PDF documents but extracting only the meaningful text, i.e. removing headers, etc. So far, I have not been able to extract useful text.
I believe this is what you had asked about in the original question. If so, can you please help me achieve this?
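One heuristic I plan to try is dropping lines that repeat on many pages (running headers, footers, page numbers). A rough sketch, assuming per-page text is already available (e.g. from PyMuPDF, as used elsewhere in this thread); the thresholds are guesses to tune per document:

# Heuristic cleanup: drop lines that recur on many pages (likely headers/footers).
import re
from collections import Counter

def clean_pages(pages, min_repeat_ratio=0.5):
    line_counts = Counter()
    per_page_lines = []
    for page in pages:
        lines = [l.strip() for l in page.splitlines() if l.strip()]
        per_page_lines.append(lines)
        line_counts.update(set(lines))  # count each distinct line once per page

    threshold = max(2, int(len(pages) * min_repeat_ratio))
    cleaned = []
    for lines in per_page_lines:
        kept = [l for l in lines
                if line_counts[l] < threshold                      # repeated header/footer
                and not re.fullmatch(r"(page\s*)?\d+", l, re.I)]   # bare page numbers
        cleaned.append("\n".join(kept))
    return cleaned

# with fitz.open("doc.pdf") as doc:
#     cleaned = clean_pages([page.get_text("text") for page in doc])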
Thank you so much. I only read this today.
Hi @wanderingdeveloper71 @sabber
Seeking Advice: Fine-Tuning LLMs for Complex Document-Based QA Tasks
I'm working on fine-tuning a language model using a dataset derived from unstructured documents (e.g., technical guides or regulatory manuals). My current approach involves extracting paragraphs from the source PDFs and prompting an LLM to generate multiple Q&A pairs per section. While this method is scalable and has helped me build a sizable dataset (e.g., 500,000+ Q&A pairs), it has a major limitation:
The generated answers are constrained to the context of the input section.
This becomes problematic when the correct answer to a question requires referencing multiple sections or tables across the document. For example, answering a question about the conditional usage of a specific variable might require synthesizing information from several chapters, appendices, and rule tables. The model, trained on isolated Q&A pairs, struggles to generalize or reason across sections.
Context: the PDF contains information about rules and regulations that need to be followed while creating a dataset for a public use case. Basically, think of it as a compliance-related module.
My Goal
To fine-tune a model that can:
1. Understand and answer general questions about the document in the context of violation, validation, and compliance with the rules described in the training PDF.
2. Reference and synthesize information from multiple parts of the document, because in real life a human would pull from 5 tables and 10 different pages to curate one answer.
3. Provide accurate and contextually rich responses, similar to how a human expert would.
Challenges
1. Context window limitations during training data generation.
2. Lack of cross-sectional reasoning in the training samples.
3. Risk of overfitting to shallow Q&A patterns.
What I'm Exploring
1. Are there alternative data creation strategies, apart from Q&A pairs, that can better capture cross-sectional dependencies?
2. Should I consider multi-hop QA generation or document-level summarization as part of the dataset? Even with multi-hop generation or document-level summaries, the chances are very slim that the LLM will be handed the most relevant chunks of data, the ones that would actually make a more meaningful Q&A pair (one rough way of grouping related chunks is sketched at the end of this post).
P.S.: I know RAG (Retrieval-Augmented Generation) is a more suitable approach for this use case, but I am exploring the problem above regardless!
I'd love to hear from others who've tackled similar problems. How did you structure your dataset to enable deep document understanding? Any tips or frameworks you recommend?
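As referenced above, here is a rough sketch of one way to build cross-section training samples: embed all chunks, pair each chunk with its nearest neighbours from other sections, and only then ask an LLM to write Q&A pairs over the combined context. The embedding model and neighbour count are assumptions, and generate_qa is a placeholder for whatever LLM call already produces my Q&A pairs.

# Sketch: assemble cross-section contexts for multi-hop Q&A generation.
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def multi_hop_contexts(chunks, neighbours=3):
    # chunks: list of {"section": "4.2", "text": "..."} dicts (assumed format).
    vecs = np.asarray(embedder.encode([c["text"] for c in chunks], normalize_embeddings=True))
    contexts = []
    for i, chunk in enumerate(chunks):
        order = np.argsort(-(vecs @ vecs[i]))
        # keep the closest chunks that come from *other* sections
        picked = [j for j in order if chunks[j]["section"] != chunk["section"]][:neighbours]
        contexts.append("\n\n".join([chunk["text"]] + [chunks[j]["text"] for j in picked]))
    return contexts

# for ctx in multi_hop_contexts(chunks):
#     qa_pairs = generate_qa(ctx)  # placeholder: your existing Q&A generation prompt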
You have this model that's fine-tuned to extract metadata from PDFs. What do you train the LLM on, i.e. what is the data for the fine-tuning? Some synthetic data?
Also, I want to try out this model for creating data from some PDFs, so PDF in, dataset created. Does your model support this? I sent a LinkedIn message.
hi @nielsr
All your notebooks are showing as invalid:
Invalid Notebook
There was an error rendering your Notebook: the 'state' key is missing from 'metadata.widgets'. Add 'state' to each, or remove 'metadata.widgets'.
Using nbformat v5.10.4 and nbconvert v7.16.6
Could you check this?
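In the meantime, a workaround on the reader's side is to strip metadata.widgets from the notebook JSON, which is one of the two fixes the error message suggests. A minimal sketch, operating on the raw .ipynb file:

# Workaround sketch: remove "metadata.widgets" so renderers stop complaining
# about the missing "state" key.
import json

def strip_widget_metadata(path):
    with open(path, "r", encoding="utf-8") as f:
        nb = json.load(f)
    nb.get("metadata", {}).pop("widgets", None)        # notebook-level widget state
    for cell in nb.get("cells", []):
        cell.get("metadata", {}).pop("widgets", None)  # cell-level widget state
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1)

# strip_widget_metadata("notebook.ipynb")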
# File: tools/pdf_peft_runner.py
"""
Interactive runner: PDF → chunks → SFT → QLoRA fine-tune → optional merge.
Now with:
- OCR fallback for scanned PDFs (Tesseract + pytesseract + Pillow)
- Project scaffolder: sample PDFs + Makefile
Usage:
    python tools/pdf_peft_runner.py            # interactive wizard
    RUN_ALL=1 python tools/pdf_peft_runner.py  # non-interactive run-all with saved/default config
    # After scaffolding:
    make run  # runs RUN_ALL
Install (ingest + OCR only):
    pip install pymupdf pillow pytesseract
Install (training/merge too):
    pip install torch transformers datasets peft bitsandbytes accelerate
Requires Tesseract binary for OCR (e.g., `sudo apt-get install tesseract-ocr`).
"""
from __future__ import annotations

import os
import re
import io
import json
import math
import glob
import hashlib
from dataclasses import dataclass, asdict
from typing import Iterable, List, Dict, Any, Optional

# Lazy imports for light startup
try:
    import fitz  # PyMuPDF
except Exception:
    fitz = None  # surfaced when needed


# ----------------------------
# Config
# ----------------------------
@dataclass
class Config:
    # Paths
    pdf_dir: str = "./pdfs"
    artifacts_dir: str = "./artifacts"
    chunks_jsonl: str = "./artifacts/chunks.jsonl"
    sft_jsonl: str = "./artifacts/sft.jsonl"
    lora_output_dir: str = "./artifacts/mistral-lora"
    merged_output_dir: str = "./artifacts/mistral-merged"
    # Model
    base_model: str = "mistralai/Mistral-7B-Instruct-v0.2"
    # Ingest
    min_chars: int = 200
    target_tokens: int = 900
    overlap_tokens: int = 120
    # SFT
    task: str = "summarize"  # or "qa"
    # Train
    batch: int = 2
    accum: int = 8
    epochs: int = 2
    lr: float = 2e-4
    block: int = 2048
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    # OCR
    ocr_enabled: bool = True
    ocr_lang: str = "eng"
    ocr_dpi: int = 300  # higher → better OCR but slower

    @staticmethod
    def load_or_default(path: str) -> "Config":
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                data = json.load(f)
                return Config(**data)
        return Config()

    def save(self, path: str) -> None:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, indent=2)


# ----------------------------
# Utils
# ----------------------------
def ensure_dir(path: str) -> None:
    os.makedirs(path, exist_ok=True)


def ensure_parent(path: str) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)


def file_md5(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def norm_text(s: str) -> str:
    # Why: PDFs often contain hard breaks/hyphenation → hurts downstream quality.
    s = s.replace("\r", "\n")
    s = re.sub(r"(\w)-\n(\w)", r"\1\2", s)
    s = re.sub(r"[ \t]+\n", "\n", s)
    s = re.sub(r"\n{2,}", "\n\n", s)
    s = re.sub(r"[ \t]{2,}", " ", s)
    return s.strip()


def approx_token_count(s: str) -> int:
    return max(1, math.ceil(len(s) / 4))


def sliding_chunks(text: str, target_tokens: int = 900, overlap_tokens: int = 120) -> List[str]:
    if not text:
        return []
    ratio = 4
    target_chars = target_tokens * ratio
    overlap_chars = overlap_tokens * ratio
    chunks = []
    i, n = 0, len(text)
    while i < n:
        j = min(n, i + target_chars)
        window = text[i:j]
        m = re.search(r"(?s).*[\.!?]\s+", window)
        if m and (i + m.end()) > i + target_chars * 0.6:
            j = i + m.end()
        chunk = text[i:j].strip()
        if chunk:
            chunks.append(chunk)
        if j >= n:
            break
        i = max(0, j - overlap_chars)
    return chunks


def jsonl_write(path: str, rows: Iterable[Dict[str, Any]]) -> None:
    ensure_parent(path)
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


def jsonl_stream(path: str) -> Iterable[Dict[str, Any]]:
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def yesno(msg: str, default_yes: bool = True) -> bool:
    default = "Y/n" if default_yes else "y/N"
    ans = input(f"{msg} [{default}]: ").strip().lower()
    if not ans:
        return default_yes
    return ans.startswith("y")


def prompt(msg: str, default: Optional[str] = None) -> str:
    tip = f" [{default}]" if default is not None else ""
    val = input(f"{msg}{tip}: ").strip()
    return val or (default or "")


def prompt_int(msg: str, default: int) -> int:
    val = prompt(msg, str(default))
    try:
        return int(val)
    except Exception:
        return default


def prompt_float(msg: str, default: float) -> float:
    val = prompt(msg, str(default))
    try:
        return float(val)
    except Exception:
        return default


# ----------------------------
# OCR helpers
# ----------------------------
def have_ocr() -> bool:
    try:
        import pytesseract  # noqa: F401
        from pytesseract import get_tesseract_version
        _ = get_tesseract_version()  # ensures binary is available
        return True
    except Exception:
        return False


def ocr_page_with_tesseract(page: "fitz.Page", dpi: int, lang: str) -> str:
    # Why: Fallback when text extraction returns too little (image-only PDFs).
    try:
        from PIL import Image
        import pytesseract
    except Exception:
        return ""
    zoom = dpi / 72.0
    mat = fitz.Matrix(zoom, zoom)
    pix = page.get_pixmap(matrix=mat)
    png_bytes = pix.tobytes("png")
    img = Image.open(io.BytesIO(png_bytes))
    try:
        text = pytesseract.image_to_string(img, lang=lang, config="--psm 6")
    except Exception:
        text = ""
    return norm_text(text)


# ----------------------------
# Steps
# ----------------------------
def extract_and_chunk(cfg: Config) -> int:
    if fitz is None:
        raise RuntimeError("PyMuPDF not installed. Install: pip install pymupdf")
    pdfs = sorted(glob.glob(os.path.join(cfg.pdf_dir, "**", "*.pdf"), recursive=True))
    if not pdfs:
        raise RuntimeError(f"No PDFs found in {cfg.pdf_dir}")
    use_ocr = cfg.ocr_enabled and have_ocr()
    if cfg.ocr_enabled and not use_ocr:
        print("[warn] OCR enabled but pytesseract or Tesseract binary not found; continuing without OCR.")
    rows: List[Dict[str, Any]] = []
    for pdf_path in pdfs:
        try:
            doc = fitz.open(pdf_path)
        except Exception as e:
            print(f"[warn] cannot open {pdf_path}: {e}")
            continue
        base = os.path.basename(pdf_path)
        signature = file_md5(pdf_path)
        for pi in range(len(doc)):
            try:
                page = doc[pi]
                text = page.get_text("text")
            except Exception:
                text = ""
            text = norm_text(text)
            if (not text or len(text) < cfg.min_chars) and use_ocr:
                ocr_text = ocr_page_with_tesseract(page, cfg.ocr_dpi, cfg.ocr_lang)
                if len(ocr_text) > len(text):
                    text = ocr_text
            if not text or len(text) < cfg.min_chars:
                continue
            chunks = sliding_chunks(text, cfg.target_tokens, cfg.overlap_tokens)
            for ci, chunk in enumerate(chunks):
                if len(chunk) < cfg.min_chars:
                    continue
                rows.append({
                    "doc_file": base,
                    "doc_md5": signature,
                    "page": pi + 1,
                    "chunk_id": f"{signature[:8]}_{pi+1}_{ci+1}",
                    "text": chunk,
                })
        doc.close()
    if not rows:
        raise RuntimeError("No extractable text. If PDFs are scanned and OCR is off/missing, enable OCR and install Tesseract.")
    jsonl_write(cfg.chunks_jsonl, rows)
    print(f"[ingest] wrote {len(rows)} chunks → {cfg.chunks_jsonl}")
    return len(rows)


def build_sft_dataset(cfg: Config) -> int:
    rows_out = []
    for r in jsonl_stream(cfg.chunks_jsonl):
        context = r["text"].strip()
        if cfg.task == "summarize":
            instr = (
                "You are a helpful assistant. Using ONLY the context below, write a concise summary "
                "capturing key points, definitions, numbers, and any procedures.\n\n"
                f"Context:\n{context}\n"
            )
            target = context  # self-supervised target
        else:
            instr = (
                "Answer the question using ONLY the context below.\n\n"
                f"Context:\n{context}\n\n"
                "Question: What does this section explain?"
            )
            target = context
        prompt_text = f"<s>[INST] {instr} [/INST] {target} </s>"
        rows_out.append({"text": prompt_text})
    if not rows_out:
        raise RuntimeError("No SFT rows produced. Did you run ingest?")
    jsonl_write(cfg.sft_jsonl, rows_out)
    print(f"[sft] wrote {len(rows_out)} samples → {cfg.sft_jsonl}")
    return len(rows_out)


def train_lora(cfg: Config) -> None:
    try:
        from datasets import load_dataset
        from transformers import (AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig,
                                  DataCollatorForLanguageModeling, Trainer, TrainingArguments)
        from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    except Exception as e:
        raise RuntimeError(
            "Training deps missing. Install: pip install torch transformers datasets peft bitsandbytes accelerate"
        ) from e
    ensure_dir(cfg.lora_output_dir)
    ds = load_dataset("json", data_files=cfg.sft_jsonl, split="train")
    tok = AutoTokenizer.from_pretrained(cfg.base_model, use_fast=True)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    tok.padding_side = "right"
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="bfloat16",
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        cfg.base_model,
        quantization_config=bnb,
        device_map="auto",
        trust_remote_code=True,
    )
    model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(
        r=cfg.lora_r,
        lora_alpha=cfg.lora_alpha,
        lora_dropout=cfg.lora_dropout,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    )
    model = get_peft_model(model, lora)

    def tok_fn(batch):
        return tok(batch["text"], truncation=True, max_length=cfg.block)

    ds_tok = ds.map(tok_fn, batched=True, remove_columns=[c for c in ds.column_names if c != "text"])
    collator = DataCollatorForLanguageModeling(tokenizer=tok, mlm=False)
    steps_per_epoch = max(1, len(ds_tok) // max(1, (cfg.batch * cfg.accum)))
    save_steps = max(50, steps_per_epoch)
    args = TrainingArguments(
        output_dir=cfg.lora_output_dir,
        per_device_train_batch_size=cfg.batch,
        gradient_accumulation_steps=cfg.accum,
        learning_rate=cfg.lr,
        num_train_epochs=cfg.epochs,
        logging_steps=10,
        save_steps=save_steps,
        save_total_limit=2,
        bf16=True,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        weight_decay=0.0,
        gradient_checkpointing=True,
        report_to="none",
        optim="paged_adamw_8bit",
        max_grad_norm=1.0,
    )
    from transformers.trainer_utils import get_last_checkpoint
    last = get_last_checkpoint(cfg.lora_output_dir) if os.path.isdir(cfg.lora_output_dir) else None
    if last:
        print(f"[train] resuming from {last}")
    trainer = Trainer(model=model, args=args, train_dataset=ds_tok, data_collator=collator)
    trainer.train(resume_from_checkpoint=last)
    trainer.save_model(cfg.lora_output_dir)
    tok.save_pretrained(cfg.lora_output_dir)
    print(f"[train] adapter saved → {cfg.lora_output_dir}")


def merge_lora(cfg: Config) -> None:
    try:
        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import PeftModel
    except Exception as e:
        raise RuntimeError("Install training deps to merge: transformers peft torch") from e
    ensure_dir(cfg.merged_output_dir)
    tok = AutoTokenizer.from_pretrained(cfg.base_model, use_fast=True)
    base = AutoModelForCausalLM.from_pretrained(cfg.base_model, torch_dtype="auto", device_map="auto")
    peft_model = PeftModel.from_pretrained(base, cfg.lora_output_dir)
    merged = peft_model.merge_and_unload()
    merged.save_pretrained(cfg.merged_output_dir)
    tok.save_pretrained(cfg.merged_output_dir)
    print(f"[merge] merged model saved → {cfg.merged_output_dir}")


# ----------------------------
# Scaffolder
# ----------------------------
def scaffold_sample_project(cfg: Config) -> None:
    """Create pdfs/, artifacts/, Makefile, requirements.txt and two PDFs (text + scanned)."""
    ensure_dir(cfg.pdf_dir)
    ensure_dir(cfg.artifacts_dir)
    # sample_text.pdf: selectable text
    if fitz is None:
        raise RuntimeError("PyMuPDF is required to scaffold. Install: pip install pymupdf")
    text_pdf_path = os.path.join(cfg.pdf_dir, "sample_text.pdf")
    if not os.path.exists(text_pdf_path):
        doc = fitz.open()
        page = doc.new_page()
        sample_text = (
            "Sample Project - Text PDF\n\n"
            "This PDF contains real text (not an image).\n"
            "Use it to verify non-OCR extraction and chunking.\n\n"
            "Key Points:\n"
            "- Fine-tuning with QLoRA on Mistral.\n"
            "- Chunk size ≈ 900 tokens with overlap.\n"
            "- JSONL output used for SFT training.\n"
        )
        page.insert_text(fitz.Point(72, 72), sample_text, fontsize=12)
        doc.save(text_pdf_path)
        doc.close()
        print(f"[scaffold] wrote {text_pdf_path}")
    # sample_scanned.pdf: image-only page → forces OCR
    scanned_pdf_path = os.path.join(cfg.pdf_dir, "sample_scanned.pdf")
    if not os.path.exists(scanned_pdf_path):
        try:
            from PIL import Image, ImageDraw, ImageFont
        except Exception as e:
            raise RuntimeError("Pillow is required to scaffold scanned PDF. Install: pip install pillow") from e
        # build image
        img = Image.new("RGB", (1654, 2339), "white")  # ~ A4 @ 150dpi
        draw = ImageDraw.Draw(img)
        text = (
            "Sample Project - Scanned PDF (Image Only)\n\n"
            "This page is rendered as an image to simulate a scanned document.\n"
            "If OCR is working, the pipeline should extract this text from the image."
        )
        try:
            font = ImageFont.truetype("DejaVuSans.ttf", 28)
        except Exception:
            font = ImageFont.load_default()
        draw.multiline_text((80, 120), text, fill="black", font=font, spacing=8)
        # save to PDF using PyMuPDF
        img_bytes = io.BytesIO()
        img.save(img_bytes, format="PNG")
        img_bytes.seek(0)
        doc2 = fitz.open()
        page2 = doc2.new_page(width=595, height=842)  # A4 pt
        rect = fitz.Rect(40, 60, 555, 782)
        page2.insert_image(rect, stream=img_bytes.getvalue(), keep_proportion=True)
        doc2.save(scanned_pdf_path)
        doc2.close()
        print(f"[scaffold] wrote {scanned_pdf_path}")
    # Makefile
    mk_path = "Makefile"
    if not os.path.exists(mk_path):
        makefile = f"""# Auto-generated by pdf_peft_runner.py
PY ?= python
RUN_ALL ?= 1
run:
\t@RUN_ALL=$(RUN_ALL) $(PY) tools/pdf_peft_runner.py
wizard:
\t@$(PY) tools/pdf_peft_runner.py
.PHONY: run wizard
"""
        with open(mk_path, "w", encoding="utf-8") as f:
            f.write(makefile)
        print(f"[scaffold] wrote {mk_path}")
    # requirements.txt (optional, helpful)
    req_path = "requirements.txt"
    if not os.path.exists(req_path):
        req = "\n".join([
            "pymupdf>=1.23",
            "pillow>=10.0",
            "pytesseract>=0.3.10",
            "torch>=2.2",
            "transformers>=4.42",
            "datasets>=2.19",
            "peft>=0.11",
            "bitsandbytes>=0.43.1",
            "accelerate>=0.33",
        ]) + "\n"
        with open(req_path, "w", encoding="utf-8") as f:
            f.write(req)
        print(f"[scaffold] wrote {req_path}")
    print("[scaffold] done. Tip: run `make run` or `make wizard`.")


# ----------------------------
# Wizard
# ----------------------------
def show_config(cfg: Config) -> None:
    print("\nCurrent configuration:")
    for k, v in asdict(cfg).items():
        print(f" {k}: {v}")
    print("")


def edit_config(cfg: Config) -> Config:
    print("Edit config (Enter keeps defaults)")
    cfg.pdf_dir = prompt("PDF directory", cfg.pdf_dir)
    cfg.artifacts_dir = prompt("Artifacts directory", cfg.artifacts_dir)
    cfg.chunks_jsonl = prompt("Chunks JSONL", cfg.chunks_jsonl)
    cfg.sft_jsonl = prompt("SFT JSONL", cfg.sft_jsonl)
    cfg.lora_output_dir = prompt("LoRA output dir", cfg.lora_output_dir)
    cfg.merged_output_dir = prompt("Merged model dir", cfg.merged_output_dir)
    cfg.base_model = prompt("Base model repo", cfg.base_model)
    cfg.min_chars = prompt_int("Min chars per chunk", cfg.min_chars)
    cfg.target_tokens = prompt_int("Target tokens per chunk", cfg.target_tokens)
    cfg.overlap_tokens = prompt_int("Overlap tokens", cfg.overlap_tokens)
    task = prompt("SFT task (summarize|qa)", cfg.task)
    cfg.task = task if task in {"summarize", "qa"} else cfg.task
    cfg.batch = prompt_int("Per-device batch", cfg.batch)
    cfg.accum = prompt_int("Grad accumulation", cfg.accum)
    cfg.epochs = prompt_int("Epochs", cfg.epochs)
    cfg.lr = prompt_float("Learning rate", cfg.lr)
    cfg.block = prompt_int("Max sequence length", cfg.block)
    cfg.lora_r = prompt_int("LoRA r", cfg.lora_r)
    cfg.lora_alpha = prompt_int("LoRA alpha", cfg.lora_alpha)
    cfg.lora_dropout = float(prompt("LoRA dropout", str(cfg.lora_dropout)) or cfg.lora_dropout)
    cfg.ocr_enabled = yesno("Enable OCR fallback for scanned PDFs?", cfg.ocr_enabled)
    cfg.ocr_lang = prompt("OCR language (Tesseract lang code)", cfg.ocr_lang)
    cfg.ocr_dpi = prompt_int("OCR rasterization DPI", cfg.ocr_dpi)
    return cfg


def run_all(cfg: Config, save_cfg_path: str) -> None:
    ensure_dir(cfg.artifacts_dir)
    cfg.save(save_cfg_path)
    print("== Step 1/3: Ingest PDFs ==")
    extract_and_chunk(cfg)
    print("== Step 2/3: Build SFT dataset ==")
    build_sft_dataset(cfg)
    print("== Step 3/3: Train QLoRA ==")
    train_lora(cfg)
    if yesno("Merge adapter into base for a single model?", default_yes=False):
        merge_lora(cfg)


def menu(cfg_path: str) -> None:
    cfg = Config.load_or_default(cfg_path)
    ensure_dir(os.path.dirname(cfg_path))
    while True:
        show_config(cfg)
        print("Choose an action:")
        print(" 1) Run ALL (ingest → sft → train)")
        print(" 2) Ingest PDFs only")
        print(" 3) Build SFT only")
        print(" 4) Train QLoRA only")
        print(" 5) Merge LoRA into base")
        print(" 6) Edit & Save config")
        print(" 7) Scaffold sample project (PDFs + Makefile)")
        print(" 8) Quit")
        choice = input("Select [1]: ").strip() or "1"
        try:
            if choice == "1":
                run_all(cfg, cfg_path)
            elif choice == "2":
                extract_and_chunk(cfg)
            elif choice == "3":
                build_sft_dataset(cfg)
            elif choice == "4":
                train_lora(cfg)
            elif choice == "5":
                merge_lora(cfg)
            elif choice == "6":
                cfg = edit_config(cfg)
                cfg.save(cfg_path)
                print("[ok] config saved.")
            elif choice == "7":
                scaffold_sample_project(cfg)
            elif choice == "8":
                return
            else:
                print("Invalid choice.")
        except Exception as e:
            print(f"[error] {e}")


def main():
    cfg_path = "./artifacts/config.json"
    if os.environ.get("RUN_ALL"):
        cfg = Config.load_or_default(cfg_path)
        run_all(cfg, cfg_path)
    else:
        menu(cfg_path)


if __name__ == "__main__":
    main()
Reply generated by TD Ai
For fine-tuning we use metadata from various publications; see NatLibFi/FinGreyLit's README.md:
This repository contains a data set of curated Dublin Core style ground truth metadata from a selection of Finnish "grey literature" publications, along with links to the PDF publications. The dataset is mainly intended to enable and facilitate the development of automated methods for metadata extraction from PDF files, including but not limited to the use of large language models (LLMs).
The publications have been sampled from various DSpace based open repository systems administered by the National Library of Finland. The dataset is trilingual, containing publications in Finnish, Swedish and English language.
All the publication PDF files are openly accessible from the original DSpace systems. Due to copyright concerns, this repository contains only the curated metadata and links to the original PDF files. The repository contains scripts for downloading the PDF publications from the original repositories and extracting the full text.
The preprocessing pipeline scripts/notebooks are in the conversion/ directory of the repository.
The models created by fine-tuning with these data are intended for extracting metadata like this from PDFs, so as you can see they are for library use, which might not be what you want. Anyway, the newest model is here: NatLibFi/gemma-3-4b-it-GreyLitLM-GGUF · Hugging Face
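To give a rough idea of the target output, a record looks something like the following. This is a made-up illustration of a Dublin Core style record, not an actual entry from the dataset; the exact field set and naming are defined in the FinGreyLit README.

# Illustrative only: a made-up Dublin Core style metadata record.
example_record = {
    "title": "An Example Report on Something",
    "creator": ["Doe, Jane"],
    "publisher": ["Example University"],
    "year": "2021",
    "language": "eng",
    "identifier": "urn:nbn:fi:example-123456",  # hypothetical identifier
}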