Hello everyone
I am new to AI, but I run my own home server and have some experience with running services locally, though it's still pretty basic.
I found this model, and it looks awesome for what I want, which is to digitize recipe books. I tried it out in “Spaces” and it's incredibly powerful and accurate.
I have so many questions:
- How do you go about finding the best models?
- Is there a GUI similar to the Spaces edition in the self-hosted version?
- I have a 3060 Ti; is it OK?
- I have a service running called Mealie where you can import recipes via an OpenAI API. Is there a way I can use a local AI for this, or does it have to be done through OpenAI?
- What is a GPT? Is it like a specialized AI that does a certain task? Is it the same as an agent?
- What else can I do with AI? This is new to me, and after browsing Hugging Face I'm excited by the possibilities.
Thank you for reading
Hi.
1. You may come across them by chance on Hugging Face, on social media, or on Discord. If you are looking for a model for a particular purpose, use a search engine with queries such as “VLM (or any model type) OCR (or any purpose) Hugging Face” or “VLM OCR GitHub.” You can also enable web search in Gemini or ChatGPT and ask them.
You can also search Spaces directly, but it is more convenient to use benchmark rankings called leaderboards.
If you know the model’s name, searching from the Hugging Face model page is straightforward.
You can also ask directly on Hugging Face Discord or Hugging Face support.
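If you prefer doing this from code, the huggingface_hub library can also query the Hub directly. A minimal sketch (the search term, sort order, and limit are just example values):
# pip install -U huggingface_hub
from huggingface_hub import HfApi

api = HfApi()
# List OCR-related models on the Hub, sorted by downloads in descending order (example query)
for model in api.list_models(search="OCR", sort="downloads", direction=-1, limit=10):
    print(model.id)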
2. Basically, yes. The Space's app can be used as is when self-hosting.
3. It's the same GPU as mine. The OCR models used in that Space are around 3B to 8B parameters in size: app.py · prithivMLmods/Multimodal-OCR at main
With quantization (think of it as compression), they all work; 3B is no problem. Without quantization it's a bit tough, but it still works, since VRAM can be supplemented with system RAM. It will be slower, though…
You might need to add a few lines of code for quantization. Since all the code in that Space is readable, it should be manageable with some modifications.
4. OpenAI-compatible endpoints such as Ollama, vLLM, and TGI seem to be usable with that service, so you should also be able to use models hosted on your own PC. I haven't tried it myself, though…
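As a rough illustration of what “OpenAI-compatible” means, here is a minimal sketch that talks to a locally running Ollama server through the official openai Python client. The base URL, dummy API key, and model name are assumptions you would swap for your own setup; Mealie would simply be pointed at the same kind of base URL instead of api.openai.com.
# pip install -U openai
from openai import OpenAI

# Ollama (and vLLM, TGI, etc.) expose an OpenAI-compatible endpoint.
# http://localhost:11434/v1 is Ollama's default; the API key is not checked
# locally, but the client requires a non-empty string.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",  # placeholder: whichever model you have pulled locally
    messages=[{"role": "user", "content": "List the ingredients in this recipe: ..."}],
)
print(response.choices[0].message.content)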
5. By ChatGPT:
A GPT (Generative Pre-trained Transformer) is a large language model—a neural network trained on massive text data—that can generate and understand human-like text across many tasks by predicting the next word in a sequence.
An agent, by contrast, is a system that uses models like GPT plus additional components (planning, tool-calling, memory) to act autonomously in an environment—GPT does the “thinking,” while an agent wraps that thinking into goal-directed behavior.
6. Explore Spaces further, and maybe HF Learn, HF Discord, or the massive amount of resources online.
John, thank you so much for this detailed explanation and the time you put into this. I really appreciate it, and it helps a lot.
I wasn't aware they could be run on Docker. I have used Docker a few times, and I think I could probably stumble my way through figuring out how to run this one.
One follow-up question: with quantization, how will I know which models are already quantized, and if not, what modifications do I need to make? I'll also do a bit of searching online about this.
Thank you again
How will I know which models are already quantized, and if not, what modifications do I need to make?
Quantization often loses information that would be needed to further improve (fine-tune) a model, so authors typically distribute the original, unquantized weights. In some cases, already-quantized versions are also distributed for user convenience, but the simplest approach is to quantize on the fly as shown below. This reduces VRAM consumption by roughly a factor of four. Accuracy drops slightly, but the difference is probably imperceptible.
# pip install -U bitsandbytes accelerate
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16) # https://huggingface.co/blog/4bit-transformers-bitsandbytes#advanced-usage
# Load Nanonets-OCR-s
MODEL_ID_V = "nanonets/Nanonets-OCR-s"
processor_v = AutoProcessor.from_pretrained(MODEL_ID_V, trust_remote_code=True)
model_v = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID_V,
    trust_remote_code=True,
    device_map="auto",  # or "cuda"; on-the-fly quantization basically requires a GPU env.
    torch_dtype=torch.bfloat16,  # on a 3060 Ti, bfloat16 is faster than float16
    quantization_config=nf4_config,  # apply on-the-fly quantization
).eval()
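For completeness, here is a minimal sketch of actually running OCR with the quantized model loaded above. It follows the standard Qwen2.5-VL inference pattern (Nanonets-OCR-s is a Qwen2.5-VL fine-tune); the qwen_vl_utils helper comes from the Qwen ecosystem, and the image path and prompt are placeholders you would adapt.
# pip install -U qwen-vl-utils
from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "recipe_page.jpg"},  # placeholder: path to a scanned recipe page
        {"type": "text", "text": "Extract the full recipe text from this page."},
    ],
}]

# Build the chat prompt and collect the image inputs
prompt = processor_v.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor_v(text=[prompt], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model_v.device)

# Generate, then decode only the newly generated tokens
output_ids = model_v.generate(**inputs, max_new_tokens=1024)
trimmed = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor_v.batch_decode(trimmed, skip_special_tokens=True)[0])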