Best way to extract information from ugly PDFs

I'm working on a project where I need to extract the price and discount from a large number of PDF files coming from many sources, all with different styles/layouts. (The daily volume is more than 200 PDFs, averaging 50 pages each, from 30+ sources.)

OCR is not a solution because most of the PDFs are ugly, unstructured, and blurry.

My goal is to parse the PDFs one by one and extract the information in a format like this:

```json
{
  "type": "price_entry",
  "product": {
    "sku": "1213",
    "manufacturer_sku": "ABC-500",
    "brand": "BrandX",
    "name": "Product X"
  },
  "aliases": {
    "vendor_sku": "AC-500"
  },
  "vendor": {
    "name": "ACME",
    "region": "FR"
  },
  "terms": {
    "currency": "EUR",
    "purchase_price": 9.72,
    "list_price": 12.90,
    "min_order_qty": 50,
    "pack_size": "10u",
    "price_breaks": [
      {"min_qty": 1, "price": 10.10},
      {"min_qty": 100, "price": 9.50}
    ],
    "valid_from": "2025-09-01",
    "valid_to": null,
    "vat": 0.20,
    "discount": "5%"
  },
  "source": {
    "file": "pdf.pdf",
    "page": 7
  },
  "confidence": 0.9
}
```
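
For reference, the same target expressed as a Pydantic model (a minimal sketch, assuming Pydantic v2; the field names just mirror the JSON above):

```python
# A minimal sketch of the target schema as Pydantic v2 models.
# Field names mirror the example JSON above.
from typing import Optional
from pydantic import BaseModel

class Product(BaseModel):
    sku: str
    manufacturer_sku: str
    brand: str
    name: str

class PriceBreak(BaseModel):
    min_qty: int
    price: float

class Terms(BaseModel):
    currency: str
    purchase_price: float
    list_price: float
    min_order_qty: int
    pack_size: str
    price_breaks: list[PriceBreak]
    valid_from: str
    valid_to: Optional[str]
    vat: float
    discount: str

class Source(BaseModel):
    file: str
    page: int

class PriceEntry(BaseModel):
    type: str
    product: Product
    aliases: dict[str, str]
    vendor: dict[str, str]
    terms: Terms
    source: Source
    confidence: float
```

Validating each response against this schema (`PriceEntry.model_validate_json(...)`) makes it easy to detect and retry pages where the model drifts from the format.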

Currently I have a POC working with Qwen2.5-VL 72B (the 7B cannot extract discounts at all, and the 32B sometimes misses discounts). It can extract one PDF page in 1-2 minutes on 2x L40S, but I want to scale this because it doesn't meet my throughput target (at most 10 seconds per page on the same hardware).
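
To make the bottleneck concrete: the POC is essentially a per-page loop, whereas batched offline inference would look roughly like this (a sketch, not my actual code; assuming vLLM as the backend, with the checkpoint name, prompt, and page rendering as placeholders):

```python
# A sketch of batched offline inference with vLLM (placeholder model and prompt).
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="Qwen/Qwen2.5-VL-32B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=2,                # the two L40S
    max_model_len=8192,
)
params = SamplingParams(temperature=0.0, max_tokens=1024)

# Qwen2.5-VL-style prompt; in practice, build this from the model's chat template.
PROMPT = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Extract the price entry from this page as JSON.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Feed many pages in one call so vLLM can batch them continuously,
# instead of paying the full model latency once per page.
pages = [Image.open(f"page_{i}.png") for i in range(32)]  # pre-rendered PDF pages
requests = [
    {"prompt": PROMPT, "multi_modal_data": {"image": img}} for img in pages
]
for out in llm.generate(requests, params):
    print(out.outputs[0].text)
```

With continuous batching, aggregate per-page throughput should be much better than the sequential per-page latency, even before changing the model.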

I think a 72B model is way overkill for this, but I don't know how to make the 32B more accurate in my scenario, or how to make the 72B 6-10x faster.

You seem to have achieved a fair amount at this point, so I think things will progress smoothly once you find a more suitable existing VLM and backend.
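
For example, if the backend supports constrained decoding, pinning the output to your JSON schema often makes a smaller model far more reliable on format. A rough sketch with vLLM's guided decoding (assuming a recent vLLM version; `PriceEntry` is the Pydantic model sketched earlier in the thread):

```python
# A sketch of schema-constrained decoding in vLLM (exact API may vary by version).
from vllm.sampling_params import GuidedDecodingParams, SamplingParams

# PriceEntry is the Pydantic model from the schema sketch above.
schema = PriceEntry.model_json_schema()

params = SamplingParams(
    temperature=0.0,
    max_tokens=1024,
    guided_decoding=GuidedDecodingParams(json=schema),  # only schema-valid JSON can be decoded
)
# Pass `params` to llm.generate() exactly as in the batched example above.
```

Constrained decoding guarantees parseable output; whether the 32B actually catches every discount is still a prompting and image-resolution question, but at least failures become detectable and retryable.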