I’m working on a project where I need to extract the price and the discount from a large number of PDF files. They come from many sources, and the PDFs all have different styles/layouts. (The daily volume is more than 200 PDFs, averaging 50 pages per PDF, from 30+ sources.)
OCR is not a solution because most of the PDFs are ugly, unstructured, and blurry.
My goal is to parse the PDFs one by one and extract the information in a format like this:
{
"type": "price_entry",
"product": {
"sku": "1213",
"manufacturer_sku": "ABC-500",
"brand": "BrandX",
"name": "Product X"
},
"aliases": {
"vendor_sku": "AC-500"
},
"vendor": {
"name": "ACME",
"region": "FR"
},
"terms": {
"currency": "EUR",
"purchase_price": 9.72,
"list_price": 12.90,
"min_order_qty": 50,
"pack_size": "10u",
"price_breaks": [
{"min_qty": 1, "price": 10.10},
{"min_qty": 100, "price": 9.50}
],
"valid_from": "2025-09-01",
"valid_to": null,
"vat": 0.20,
"discount": "5%"
},
"source": {
"file": "pdf.pdf",
"page": 7
},
"confidence": 0.9
}
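A minimal sketch of how a record in this format could be sanity-checked after extraction, assuming Python. The function name `validate_price_entry`, the helper `_is_number`, and the specific checks are illustrative assumptions, not part of the POC; the sample record is a trimmed copy of the one above.

```python
from datetime import date


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_price_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []
    if entry.get("type") != "price_entry":
        problems.append("type is not 'price_entry'")
    for section in ("product", "vendor", "terms", "source"):
        if not isinstance(entry.get(section), dict):
            problems.append(f"missing or non-object section: {section}")

    terms = entry.get("terms") if isinstance(entry.get("terms"), dict) else {}
    for field in ("purchase_price", "list_price"):
        if not _is_number(terms.get(field)) or terms[field] < 0:
            problems.append(f"terms.{field} must be a non-negative number")

    # Price breaks should be well-formed and ordered by strictly increasing min_qty.
    breaks = terms.get("price_breaks") or []
    if not isinstance(breaks, list):
        problems.append("terms.price_breaks must be a list")
        breaks = []
    last_qty = None
    for i, brk in enumerate(breaks):
        if (not isinstance(brk, dict) or not _is_number(brk.get("min_qty"))
                or not _is_number(brk.get("price"))):
            problems.append(f"terms.price_breaks[{i}] is malformed")
            continue
        if last_qty is not None and brk["min_qty"] <= last_qty:
            problems.append(f"terms.price_breaks[{i}].min_qty is not increasing")
        last_qty = brk["min_qty"]

    # valid_from is required; valid_to may be null (open-ended validity).
    for field, nullable in (("valid_from", False), ("valid_to", True)):
        value = terms.get(field)
        if value is None:
            if not nullable:
                problems.append(f"terms.{field} is required")
            continue
        try:
            date.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"terms.{field} is not an ISO date (YYYY-MM-DD)")

    confidence = entry.get("confidence")
    if not _is_number(confidence) or not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be a number between 0 and 1")

    source = entry.get("source") if isinstance(entry.get("source"), dict) else {}
    page = source.get("page")
    if not isinstance(page, int) or isinstance(page, bool) or page < 1:
        problems.append("source.page must be a positive integer")
    return problems


sample = {
    "type": "price_entry",
    "product": {"manufacturer_sku": "ABC-500", "brand": "BrandX", "name": "Product X"},
    "vendor": {"name": "ACME", "region": "FR"},
    "terms": {
        "currency": "EUR", "purchase_price": 9.72, "list_price": 12.90,
        "price_breaks": [{"min_qty": 1, "price": 10.10}, {"min_qty": 100, "price": 9.50}],
        "valid_from": "2025-09-01", "valid_to": None, "vat": 0.20, "discount": "5%",
    },
    "source": {"file": "pdf.pdf", "page": 7},
    "confidence": 0.9,
}
print(validate_price_entry(sample))  # prints []
```

A check like this would only catch structurally broken output; it cannot tell whether a discount was missed on the page.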
Currently I have a POC working with qwen2.5vl 72b (7b cannot process discounts at all, 32b sometimes misses discounts) that can extract 1 PDF page in 1-2 min on 2xL40s, but I want to scale this because it doesn't meet my throughput target (under 10 seconds per page on the same hardware).
I think a 72b model is way overkill for this, but I don't know how to make the 32b better in my scenario or how to make the 72b 6-10x faster.
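For reference, the volumes and timings above can be turned into a rough daily budget. This is a back-of-the-envelope sketch in Python that assumes exactly 200 PDFs of 50 pages each (the real volume is higher) and pages processed strictly one at a time with no batching or concurrency; the names are illustrative.

```python
PDFS_PER_DAY = 200        # lower bound from the post ("more than 200")
PAGES_PER_PDF = 50        # stated average
SECONDS_PER_DAY = 24 * 60 * 60


def hours_needed(seconds_per_page: float, pages: int) -> float:
    """Wall-clock hours to process `pages` sequentially at a fixed per-page time."""
    return seconds_per_page * pages / 3600


pages_per_day = PDFS_PER_DAY * PAGES_PER_PDF

for label, seconds_per_page in [
    ("current 72b, best case (1 min/page)", 60),
    ("current 72b, worst case (2 min/page)", 120),
    ("target (10 s/page)", 10),
]:
    hours = hours_needed(seconds_per_page, pages_per_day)
    print(f"{label}: {hours:.1f} h for {pages_per_day} pages")

# Per-page time that would fit one day's pages into 24 hours, sequentially.
budget_seconds_per_page = SECONDS_PER_DAY / pages_per_day
print(f"sequential budget: {budget_seconds_per_page:.2f} s/page")
```

Under these assumptions the daily load is 10,000 pages. That is roughly 167 to 333 hours at the current speed and about 27.8 hours at 10 s/page, so keeping up within a single day would need about 8.64 s/page sequentially or several pages processed at once.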