I’m looking for publicly available models to extract email text from its HTML code. I can do it via chatgpt or cloud based models like amazon titan or claude. But is there any other choices from free models?
If you’re able to do this with ChatGPT, chances are high you can also do this with open-source LLMs. At the time of writing, the best models to try out include:
- LLaMa-3 (comes in 8B and 70B parameter variants): Meta Llama 3 - a meta-llama Collection
- Mistral-7B (instruction tuned, v2): mistralai/Mistral-7B-Instruct-v0.2 · Hugging Face
- Mixtral-8x7B (instruction tuned): mistralai/Mixtral-8x7B-Instruct-v0.1 · Hugging Face
- Mixtral-8x22B (instruction tuned): mistral-community/Mixtral-8x22B-v0.1 · Hugging Face
Those models are too big to use on my local GPU machines. I wonder if I need cloud resources to use them for inference, why shouldn’t I directly go for big models like GPT, or AWS stack?
It is still not clear to me how people use these big open-source LLMs, especially for production.
Open-source vs. closed-source have their pros and cons. E.g. if you care about latency, data privacy and have high volumes, then open-source might be the better option. If you want to just rely on an API that handles everything for you, don’t care about data privacy, then closed-source might be the better option.
Open-source typically becomes cheaper if you run at scale (high volume), cause then you only need to pay for GPU costs, whereas with APIs you get charged for the amount of tokens that you send through them.
What’s the memory size of your local GPUs? Nowadays people can run very powerful LLMs even on consumer hardware, Microsoft’s Phi-3 is an example of that: microsoft/Phi-3-mini-128k-instruct · Hugging Face.
Also, the Transformers library offers quantization, which significantly shrinks down the size of the models, see for instance Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA.
Hi Nielsr,
Thank you for the information. We have a machine with an RTX A4000 GPU with 16GB RAM and a 32-core CPU with 125GB memory on the system. I also thought about buying more GPUs of the same model, but I guess multiple GPUs won’t add too much value compared to buying a single superpower GPU, for which we may not have budget priority.
So far, I have tested some models, such as FLanT5 or T5, and language translators, like opus-mt-de-en, via GPU-based inference. Even if I manage to test larger models, it’s not a robust plan for deploying them in production.
I tried Peft and post-training quantization on some demo tutorials, but I still need big compute resources for FT training jobs. Also, when it comes to cloud resources, I’d like to use only their compute power and not get locked into their other managed services like Vector embedding, agents, and … I want to create my pipelines locally (e.g. via langchain) and use cloud resources only for inference or training.
Besides openAI, I also started investigating with AWS bedrock. But I guess they try to sell more native services to you than you actually may need.
What other information sources do you recommend for staying current with evolving best practices and available resources in this area?