Question: Which open-source model is best for pruning with 32GB RAM?

Hi everyone :waving_hand:,

I’m currently working on building a Small Language Model (SLM) for structured text parsing and natural language understanding.
My system specifications are:

  • RAM: 32 GB

I’d like to know:

  1. What is the recommended parameter range (in billions) for models that can run efficiently with 32GB RAM (CPU or GPU)?

  2. Which open-source models are suitable for this setup (for example, Mistral, LLaMA, Phi, TinyLlama, etc.)?

  3. Any deployment or optimization tips (like quantization, LoRA, or gguf format for llama.cpp)?

  4. If possible, please suggest tools or frameworks for fine-tuning and evaluation on low-resource systems.

Thanks in advance for your help

Hi everyone,

I’m working on building a Small Language Model (SLM) for life insurance document processing — mainly tasks like:

  • OCR and extraction from scanned policies and forms

  • Policy and customer data understanding

  • Claim document analysis

Our current environment is limited to a 32 GB CPU server (no GPU available).

My plan is to:

  1. Select a base model suitable for OCR and document understanding

  2. Prune / quantize it to make it CPU-friendly without losing much accuracy

I’m looking for advice on:

  • Which vision-language / OCR models are most suitable for this scenario (Qwen2.5-VL, InternVL3.5, Surya OCR, etc.)?

  • Recommended LLMs for insurance text understanding that can be pruned or quantized to fit in 32 GB RAM

  • Practical tips for pruning, quantization, or CPU optimization for document-heavy workflows

Any suggestions, model combinations, or workflow examples would be greatly appreciated!

Thank you :folded_hands:

On a 32 GB CPU-only box, run 3–8B models quantized (GGUF) via llama.cpp or Ollama. Don't use a VLM for OCR; instead do OCR → layout → RAG → small instruct LLM. Fine-tune with LoRA on a rented GPU, then merge + quantize and run on CPU.


What fits in 32 GB (CPU, GGUF)

  • 3–4B (Q5/Q6) ≈ 2–3 GB model; fast on CPU.

  • 7–8B (Q4_K_M) ≈ 5–6 GB model; best quality/latency trade-off.

  • 13B (Q4_K_M) ≈ 9–11 GB; slower but still fits.

  • Keep context at 2–4k; a large context inflates the KV cache (see the rough sizing sketch below).
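
To sanity-check the figures above, you can estimate sizes yourself. A rough sketch, assuming ~4.5 bits per weight for Q4_K_M, an fp16 KV cache, and Llama-3.1-8B-like shapes (32 layers, 8 KV heads, head dim 128); real GGUF files and llama.cpp settings will differ somewhat:

def model_file_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate GGUF file size in GB for a given parameter count (billions)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

print(f"8B @ Q4_K_M   ~{model_file_gb(8):.1f} GB")                # ~4.5 GB
print(f"KV cache @4k  ~{kv_cache_gb(32, 8, 128, 4096):.2f} GB")   # ~0.5 GB
print(f"KV cache @32k ~{kv_cache_gb(32, 8, 128, 32768):.2f} GB")  # ~4.3 GB

At 4k context the KV cache stays around half a gigabyte; at 32k it grows to several gigabytes, which is why short contexts matter on a 32 GB box.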

Good open models (CPU-friendly)

  • Phi-3.5-mini-instruct (3.8B)

  • Qwen2.5-3B/7B-Instruct

  • Llama-3.1-8B-Instruct

  • Mistral-7B-Instruct

For document processing (forms, policies, claims)

  • OCR & layout (CPU): Tesseract or PaddleOCR; pdfplumber/PyMuPDF for digital PDFs; layoutparser/doctr if you need zones.

  • RAG embeddings (CPU): bge-small-en-v1.5 / e5-small-v2 / all-MiniLM-L6-v2 + FAISS.

  • Pattern tasks: add light rules/regex after the LLM to extract fields reliably.
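
As an example of that last point, here is a minimal post-processing sketch: it pulls the JSON object out of the LLM reply and backfills missing fields with regex over the source text. The field names and patterns (policy_number, issue_date) are made up for illustration; adapt them to your documents.

import json, re

def parse_llm_json(raw: str) -> dict:
    """Extract the first {...} block from an LLM reply; return {} on failure."""
    m = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not m:
        return {}
    try:
        return json.loads(m.group(0))
    except json.JSONDecodeError:
        return {}

# Hypothetical field patterns -- adjust to your forms.
POLICY_NO = re.compile(r"\bPolicy\s*(?:No\.?|Number)\s*[:#]?\s*([A-Z0-9-]{6,20})", re.I)
DATE = re.compile(r"\b(\d{2}[/-]\d{2}[/-]\d{4})\b")

def extract_fields(llm_reply: str, source_text: str) -> dict:
    """Prefer the LLM's JSON; fall back to regex for missing fields."""
    fields = parse_llm_json(llm_reply)
    if not fields.get("policy_number"):
        m = POLICY_NO.search(source_text)
        if m: fields["policy_number"] = m.group(1)
    if not fields.get("issue_date"):
        m = DATE.search(source_text)
        if m: fields["issue_date"] = m.group(1)
    return fields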

Fine-tuning

  • Do LoRA on a GPU box (Axolotl / Llama-Factory / Unsloth).

  • Merge → quantize (GGUF Q4_K_M) → serve on CPU. Full model training on CPU is not practical.
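
A minimal sketch of the merge step, assuming a LoRA adapter already trained with one of the tools above; the base model ID and paths below are placeholders:

# Merge a LoRA adapter into its fp16 base, then save for GGUF conversion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base model
ADAPTER_DIR = "out/lora-adapter"              # placeholder LoRA output dir
MERGED_DIR = "out/merged-fp16"

base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
merged = model.merge_and_unload()             # fold LoRA weights into the base
merged.save_pretrained(MERGED_DIR)
AutoTokenizer.from_pretrained(BASE_ID).save_pretrained(MERGED_DIR)

# Afterwards convert and quantize with llama.cpp (its HF-to-GGUF converter plus
# the quantize tool, targeting Q4_K_M); exact script names depend on your llama.cpp version.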


Minimal CPU RAG pipeline (paste-ready)

Plan

  1. Extract text (PDF→text or OCR).

  2. Chunk → embed (CPU) → FAISS index.

  3. Retrieve top-k, build prompt, call local CPU LLM (Ollama).

  4. Evaluate with Recall@k/MRR if desired.

file: cpu_rag_min.py

"""
CPU-only doc QA with Ollama + Sentence-Transformers + FAISS.
Why: Fits 32 GB RAM, no GPU required.
"""

import argparse, glob, json, re
from pathlib import Path
from typing import List, Tuple
import numpy as np, requests

def load_texts(glob_pat: str) -> List[Tuple[str, str]]:
    try:
        import fitz  # PyMuPDF
        has_pdf = True
    except Exception:
        fitz, has_pdf = None, False
    out = []
    for p in glob.glob(glob_pat):
        path = Path(p)
        if path.suffix.lower() == ".pdf" and has_pdf:
            doc = fitz.open(str(path))
            out.append((path.name, "\n".join(page.get_text() for page in doc)))
        elif path.suffix.lower() in {".txt", ".md"}:
            out.append((path.name, path.read_text(encoding="utf-8", errors="ignore")))
    # For image-only PDFs, run OCR separately to .txt and include here.
    return out

def chunk(text: str, max_words: int = 350) -> List[str]:
    sents = re.split(r"(?<=[.!?])\s+", text)
    chunks, cur, n = [], [], 0
    for s in sents:
        w = len(s.split())
        if n + w > max_words and cur:
            chunks.append(" ".join(cur)); cur, n = [s], w
        else:
            cur.append(s); n += w
    if cur: chunks.append(" ".join(cur))
    return chunks

def build_corpus(files: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    ids, texts = [], []
    for fname, txt in files:
        for i, c in enumerate(chunk(txt)):
            ids.append(f"{fname}#chunk{i}"); texts.append(c)
    return ids, texts

def embed(texts: List[str], model_name="sentence-transformers/all-MiniLM-L6-v2") -> np.ndarray:
    from sentence_transformers import SentenceTransformer
    m = SentenceTransformer(model_name)  # CPU
    e = m.encode(texts, batch_size=256, normalize_embeddings=True, convert_to_numpy=True, show_progress_bar=True)
    return e.astype(np.float32)

def build_faiss(embs: np.ndarray):
    import faiss
    idx = faiss.IndexFlatIP(embs.shape[1]); idx.add(embs)
    return idx

PROMPT = """You are a careful analyst. Answer using ONLY the context.
If unsure, say you don't know.

Question:
{q}

Context:
{ctx}

Answer:"""

def ask_ollama(model: str, prompt: str, num_ctx: int = 4096) -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt, "options": {"num_ctx": num_ctx, "temperature": 0.2}},
                      stream=True, timeout=600)
    out = []
    for line in r.iter_lines():
        if line:
            obj = json.loads(line)
            if "response" in obj: out.append(obj["response"])
    return "".join(out)

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--docs", required=True, help="Glob for PDFs/TXT, e.g., data/*.pdf")
    ap.add_argument("--model", default="llama3.1:8b-instruct-q4_K_M", help="Ollama model tag")
    ap.add_argument("--k", type=int, default=5)
    args = ap.parse_args()

    files = load_texts(args.docs)
    if not files: raise SystemExit("No docs found. Provide PDFs or TXT.")
    ids, texts = build_corpus(files)

    print(f"Embedding {len(texts)} chunks on CPU...")
    embs = embed(texts)
    index = build_faiss(embs)  # FAISS index over the chunk embeddings
    from sentence_transformers import SentenceTransformer
    q_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    print("Ready. Ask a question (Ctrl+C to exit).")
    while True:
        try:
            q = input("> ").strip()
        except KeyboardInterrupt:
            break
        q_emb = q_model.encode([q], normalize_embeddings=True, convert_to_numpy=True)[0].astype(np.float32)
        D, I = index.search(q_emb[None, :], args.k)
        ctx = "\n\n---\n\n".join(texts[i] for i in I[0])
        ans = ask_ollama(args.model, PROMPT.format(q=q, ctx=ctx[:12000]))
        print("\n" + ans.strip() + "\n")

if __name__ == "__main__":
    main()
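
For image-only PDFs (see the comment in load_texts above), a small OCR pre-pass can write .txt files next to the scans so the same glob picks them up. A sketch assuming pdf2image and pytesseract, which also require the system poppler and tesseract packages:

# Sketch: OCR scanned PDFs to .txt so cpu_rag_min.py can ingest them.
import glob
from pathlib import Path

from pdf2image import convert_from_path
import pytesseract

for pdf in glob.glob("./docs/*.pdf"):
    pages = convert_from_path(pdf, dpi=300)  # render pages as PIL images
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    Path(pdf).with_suffix(".txt").write_text(text, encoding="utf-8")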

CPU-only dependencies

pip install sentence-transformers faiss-cpu pymupdf requests

Start a local CPU LLM

ollama pull llama3.1:8b-instruct-q4_K_M
ollama serve # in another terminal

Build a tiny knowledge base and chat

python cpu_rag_min.py --docs "./docs/*" --model llama3.1:8b-instruct-q4_K_M
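
For step 4 of the plan (Recall@k / MRR), a small helper like this can score retrieval against a hand-labelled set of questions and relevant chunk ids; the data format here is an assumption, reusing the {fname}#chunk{i} ids produced by build_corpus:

# Sketch: Recall@k and MRR over labelled retrieval results.
from typing import List, Set

def recall_and_mrr(ranked_ids: List[List[str]], relevant: List[Set[str]], k: int = 5) -> dict:
    """ranked_ids[i] = chunk ids retrieved for query i, best first; relevant[i] = gold ids."""
    hits, rr = 0, 0.0
    for ranked, rel in zip(ranked_ids, relevant):
        if any(cid in rel for cid in ranked[:k]):
            hits += 1
        for rank, cid in enumerate(ranked, start=1):
            if cid in rel:
                rr += 1.0 / rank
                break
    n = len(ranked_ids)
    return {"recall@k": hits / n, "mrr": rr / n}

# Toy example:
ranked = [["a.pdf#chunk0", "b.pdf#chunk3"], ["c.pdf#chunk1", "a.pdf#chunk2"]]
relevant = [{"b.pdf#chunk3"}, {"d.pdf#chunk0"}]
print(recall_and_mrr(ranked, relevant, k=2))  # {'recall@k': 0.5, 'mrr': 0.25}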

Response generated by TD Ai

I’m building a local SLM (Small Language Model) framework, where each agent performs a specific function.
One of the key agents is the Parsing Agent, responsible for extracting structured data (JSON) from unstructured insurance documents — e.g., policies, claim forms, and KYC documents.

The system stack:

  • Backend: FastAPI + LangChain

  • Model runtime: Ollama (local, quantized GGUF)

  • Hardware: Ubuntu 22.04, 32GB RAM, CPU-only

  • Goal: Achieve balance between accuracy and speed for offline document parsing.

I’ve tested these models so far:

  • :llama: LLaMA 3.1 8B (Q4_K_M) → Great accuracy but very slow on CPU

  • :llama: LLaMA 3.2 3B (Q4_K_M) → Fast but accuracy drops noticeably

  • :robot: Qwen 2.5 3B (Q4_K_M) → Fastest but misses structured fields during extraction

Questions:

  1. Which open-source instruct models perform best for structured text extraction on CPU (quantized GGUF)?

  2. Are there small instruction-tuned models (2–4B) designed for document understanding or key-value parsing?

  3. Is LoRA fine-tuning on a small dataset (~1000 labeled PDFs) useful for improving JSON accuracy?

  4. Would a hybrid setup (Qwen 2.5 for intent detection + LLaMA 8B for complex docs) make sense in CPU-only deployment?

  5. Any tips for optimizing Ollama / llama.cpp inference speed on CPU (threading, KV cache tuning, etc.)?

Thanks in advance! :folded_hands:
Any suggestions or shared experiences are welcome — especially from those who have balanced performance and accuracy in CPU-only document agents.
