Yes, this is actually real.
You can now run unquantized Mistral-7B-Instruct on about 2 GB of RAM.
First off, here is the GitHub link: GitHub - garagesteve1155/ai_splitter: Run Mistral 7b Instruct on less than 4gb of RAM
Here is a video explanation of how it works: https://youtu.be/53Do4whfrqE?si=yMmy5iavmzw_qNgv
Most solutions (Ollama, llama.cpp, etc.) require quantization to get large models working on small hardware. This one doesn’t.
ai_splitter uses a simple but effective layer-by-layer loading strategy that keeps memory usage low (~2 GB) while still using the full FP16 .safetensors weights from Hugging Face.
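To make the idea concrete, here is a rough Python sketch of what layer-by-layer streaming from safetensors shards can look like. This is not the actual ai_splitter code: the shard filenames, the layer count constant, and the run_decoder_layer helper are illustrative placeholders for how the pattern keeps peak RAM near one layer's weights plus activations.

```python
# Rough sketch of the layer-streaming idea (not the actual ai_splitter code).
# Assumes a Llama/Mistral-style checkpoint split across .safetensors shards.
# PyTorch must be installed for framework="pt" tensors.
import gc
from safetensors import safe_open

SHARDS = [                                   # illustrative shard paths
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
]
NUM_LAYERS = 32                              # Mistral-7B has 32 decoder layers

def load_tensors(prefix):
    """Pull only the tensors whose names start with `prefix` into RAM."""
    weights = {}
    for shard in SHARDS:
        with safe_open(shard, framework="pt", device="cpu") as f:
            for name in f.keys():
                if name.startswith(prefix):
                    weights[name] = f.get_tensor(name)   # full FP16, no quantization
    return weights

def forward(hidden_states, run_decoder_layer):
    # Stream the model one decoder layer at a time: load, apply, free.
    # run_decoder_layer is a hypothetical helper that applies one layer's
    # attention + MLP given its weight dict.
    for i in range(NUM_LAYERS):
        layer_weights = load_tensors(f"model.layers.{i}.")
        hidden_states = run_decoder_layer(hidden_states, layer_weights)
        del layer_weights        # drop this layer's ~0.4 GB of FP16 weights
        gc.collect()             # keep peak memory near one layer + activations
    return hidden_states
```

The trade-off is obvious: every token re-reads the weights from disk, which is why it's slow, but total RAM never needs to hold the whole 14 GB of FP16 weights at once.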
Why it matters:
No GPU needed
No quantization
Works on super low-end hardware like Pi 5 or mini PCs
Only requires ~2 GB of RAM
CPU-only, pure Python
More supported models are coming soon.
Yes, it’s SLOW!!! Hahaha… but it works. Bout to boot up my pi lol