Hello everyone
I am new to AI, but I run my own home server and have some experience with running services locally, though it's still pretty basic.
I found this model, and it looks awesome for what I want, which is to digitize recipe books. I tried it out in “Spaces” and it's incredibly powerful and accurate.
I have so many questions:
- How do you go about finding the best models?
- Is there a GUI similar to the Spaces edition in the self-hosted version?
- I have a 3060 Ti; is it OK?
- I have a service running called Mealie where you can import recipes via an OpenAI API. Is there a way I can use a local AI for this, or does it have to be done through OpenAI?
- What is a GPT? Is it like a specialized AI that does a certain task? Is it the same as an agent?
- What else can I do with AI? This is new to me, and after browsing Hugging Face I'm excited by the possibilities.
Thank you for reading
Hi.
1. You may come across them by chance on Hugging Face, on social media, or on Discord. If you are looking for a model for a particular purpose, use a search engine with queries such as “VLM (or any model type) OCR (or any purpose) Hugging Face” or “VLM OCR GitHub.” You can also enable web search in Gemini or ChatGPT and ask them.
You can also search Spaces directly, but it is more convenient to use benchmark rankings called leaderboards.
If you know the model’s name, searching from the Hugging Face model page is straightforward.
You can also ask directly on Hugging Face Discord or Hugging Face support.
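If you prefer doing this from code, the huggingface_hub library can also query the Hub directly. A minimal sketch (the search term, sort order, and limit are just example values):
# pip install -U huggingface_hub
from huggingface_hub import HfApi

api = HfApi()
# List OCR-related models on the Hub, sorted by downloads in descending order (example query)
for model in api.list_models(search="OCR", sort="downloads", direction=-1, limit=10):
    print(model.id)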
2. Basically, yes. The Space's app can be used as is when self-hosting.
3. It's the same GPU as mine. The OCR models used in that Space are around 3B to 8B parameters in size: app.py · prithivMLmods/Multimodal-OCR at main
With quantization (think of it as compression), they all work; 3B is no problem. Without quantization it's a bit tough, but it still works, since VRAM can be supplemented with system RAM. It will be slower, though…
You might need to add a few lines of code for quantization. Since all the code in that Space is readable, it should be manageable with some modifications.
4. OpenAI-compatible endpoints such as Ollama, vLLM, and TGI seem to be usable with that service, so you should also be able to use models hosted on your own PC. I haven't tried it myself, though…
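As a rough illustration of what “OpenAI-compatible” means, here is a minimal sketch that talks to a locally running Ollama server through the official openai Python client. The base URL, dummy API key, and model name are assumptions you would swap for your own setup; Mealie would simply be pointed at the same kind of base URL instead of api.openai.com.
# pip install -U openai
from openai import OpenAI

# Ollama (and vLLM, TGI, etc.) expose an OpenAI-compatible endpoint.
# http://localhost:11434/v1 is Ollama's default; the API key is not checked
# locally, but the client requires a non-empty string.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",  # placeholder: whichever model you have pulled locally
    messages=[{"role": "user", "content": "List the ingredients in this recipe: ..."}],
)
print(response.choices[0].message.content)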
5. By ChatGPT:
A GPT (Generative Pre-trained Transformer) is a large language model—a neural network trained on massive text data—that can generate and understand human-like text across many tasks by predicting the next word in a sequence.
An agent, by contrast, is a system that uses models like GPT plus additional components (planning, tool-calling, memory) to act autonomously in an environment—GPT does the “thinking,” while an agent wraps that thinking into goal-directed behavior.
6. Explore Spaces further, and maybe HF Learn, HF Discord, or the massive amount of resources online.
John, thank you so much for this detailed explanation and the time you put into this. I really appreciate it, and it helps a lot.
I wasn't aware they could be run on Docker. I have used Docker a few times, and I think I could probably stumble my way through figuring out how to run this one.
One follow-up question: with quantization, how will I know which models are already quantized, and if not, what modifications do I need to make? I'll also do a bit of searching online about this.
Thank you again
How will I know which models are already quantized, and if not, what modifications do I need to make?
Quantization often loses information that would be needed to further improve (fine-tune) a model, so authors typically distribute the original, unquantized weights. In some cases, already-quantized versions are also distributed for user convenience, but the simplest approach is to quantize on the fly as shown below. This reduces VRAM consumption by roughly a factor of four. Accuracy drops slightly, but the difference is probably imperceptible.
# pip install -U bitsandbytes accelerate
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16) # https://huggingface.co/blog/4bit-transformers-bitsandbytes#advanced-usage
# Load Nanonets-OCR-s
MODEL_ID_V = "nanonets/Nanonets-OCR-s"
processor_v = AutoProcessor.from_pretrained(MODEL_ID_V, trust_remote_code=True)
model_v = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID_V,
    trust_remote_code=True,
    device_map="auto",  # or "cuda"; on-the-fly quantization basically requires a GPU env.
    torch_dtype=torch.bfloat16,  # on a 3060 Ti, bfloat16 is faster than float16
    quantization_config=nf4_config,  # apply on-the-fly quantization
).eval()
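For completeness, here is a minimal sketch of actually running OCR with the quantized model loaded above. It follows the standard Qwen2.5-VL inference pattern (Nanonets-OCR-s is a Qwen2.5-VL fine-tune); the qwen_vl_utils helper comes from the Qwen ecosystem, and the image path and prompt are placeholders you would adapt.
# pip install -U qwen-vl-utils
from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "recipe_page.jpg"},  # placeholder: path to a scanned recipe page
        {"type": "text", "text": "Extract the full recipe text from this page."},
    ],
}]

# Build the chat prompt and collect the image inputs
prompt = processor_v.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor_v(text=[prompt], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model_v.device)

# Generate, then decode only the newly generated tokens
output_ids = model_v.generate(**inputs, max_new_tokens=1024)
trimmed = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor_v.batch_decode(trimmed, skip_special_tokens=True)[0])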