Starting a project and wanting "in" on the community

I'm 44 today, with a master's in Accounting Science and a bachelor's in Economic Science. Long ago I also completed a Mechatronics Technology degree. I'm still not nearly as “in the know” as I'd like for working on language models and such, so it seemed like a good idea to reach out to the community.

My goal is to build a suite that, while not exactly “top notch”, runs well on a small Linux box with no GPU and up to 8 GB of RAM. To that end I currently have two experimental boxes: a Core i9 with 8 GB of RAM and no GPU, and a Ryzen 9 with 16 GB of RAM, also not using its GPU. For “benchmarking” I often rent a pod with a good setup to tweak things and see what runs, what doesn't, and what fails only because of the “low” machines. The objective is a “boxed” Discord/Web/App chatbot for given applications, which can run from those low-end servers, or from VPSs on better hardware.

Presently I am running a model with one Embedding layer, three LSTM layers, and one Dense layer over Unicode token indices. It performs well on the “learned texts”, but it is only marginally acceptable at “mixing things up”.
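In Keras terms the stack is roughly the following (a minimal sketch, assuming TensorFlow/Keras; the vocabulary size, embedding width, and unit counts are placeholders rather than my exact settings):

```python
# Minimal sketch of the Embedding -> 3x LSTM -> Dense stack (placeholder sizes).
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 4096  # number of distinct Unicode token indices
EMBED_DIM = 128
UNITS = 256

model = keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.LSTM(UNITS, return_sequences=True),  # intermediate layers keep the sequence
    layers.LSTM(UNITS, return_sequences=True),
    layers.LSTM(UNITS),                         # last LSTM collapses to one vector
    layers.Dense(VOCAB_SIZE, activation="softmax"),  # next-token distribution
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```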
On the same machine (on both machines, actually), I run a Python script that generates from TheBloke/Wizard-Vicuna-7B-Uncensored-GGUF, and I have been testing by chatting with both bots at the same time in Discord.
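The core of that script is nothing fancy; a minimal sketch, assuming llama-cpp-python as the GGUF backend (the path, thread count, and sampling parameters are illustrative):

```python
# Minimal sketch: load the GGUF and generate a reply (llama-cpp-python assumed).
from llama_cpp import Llama

llm = Llama(
    model_path="models/Wizard-Vicuna-7B-Uncensored.Q4_K_M.gguf",
    n_ctx=2048,    # context window; bigger costs more RAM
    n_threads=8,   # match the physical cores on the box
)

def vicuna_reply(user_message: str) -> str:
    # Vicuna-style prompt format
    prompt = f"USER: {user_message}\nASSISTANT:"
    out = llm(prompt, max_tokens=256, temperature=0.7, stop=["USER:"])
    return out["choices"][0]["text"].strip()
```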

It performs “surprisingly” well, never going over 90% overall CPU, and barely ever filling the full 8 GB of memory, plus about half the swap.

For the more experienced users: what experiments could I try to step the game up a bit without straying from the objective of a “popular” (low-spec) setup?

The custom LSTM model is used as a failsafe, in case the Vicuna is too loaded to give responses (it is limited to 3 concurrent requests, above which it doesn't respond; the LSTM answers as many requests as it gets, as long as memory and CPU allow, for now). The LSTM also gets content from the Vicuna to train itself on from time to time.
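The routing boils down to a counting semaphore around the Vicuna; a sketch of the idea (asyncio assumed, since the Discord side is async anyway; the two reply functions are placeholders for the real inference calls):

```python
# Sketch of the failsafe: cap Vicuna at 3 concurrent requests,
# fall back to the LSTM when all slots are busy.
import asyncio

def vicuna_reply(msg: str) -> str: ...  # blocking llama.cpp call (placeholder)
def lstm_reply(msg: str) -> str: ...    # custom LSTM inference (placeholder)

VICUNA_SLOTS = asyncio.Semaphore(3)

async def answer(user_message: str) -> str:
    if VICUNA_SLOTS.locked():  # all 3 slots taken -> failsafe path
        return lstm_reply(user_message)
    async with VICUNA_SLOTS:
        # run the blocking generation off the event loop so the bot stays responsive
        return await asyncio.to_thread(vicuna_reply, user_message)
```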

Mostly, what I think could get better here is using another version of the Vicuna, or another type of LM in place of the LSTM, which already “does its best”, I think. The specific file I am using is Wizard-Vicuna-7B-Uncensored.Q4_K_M.gguf.

The Discord bots chat reasonably well, but only as an experiment. I would appreciate some pointers on how to fine-tune the Vicuna for a specific application, even if I have to do it on a “powerful” GPU pod and then bring the result back to my humble server.
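My current (possibly wrong) understanding of that pipeline is: QLoRA fine-tune the fp16 base on the pod, merge the adapter, then convert and re-quantize to GGUF with llama.cpp's scripts. A rough sketch of the training step, assuming transformers + peft + bitsandbytes (the model name, dataset file, and hyperparameters are placeholders, not a recipe):

```python
# Sketch of a QLoRA fine-tune on a GPU pod (transformers + peft + bitsandbytes assumed).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"  # fp16 base weights, not the GGUF
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token  # LLaMA tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # QLoRA: 4-bit base
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

ds = load_dataset("json", data_files="my_app_chats.json")["train"]  # placeholder data
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments("vicuna-lora", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("vicuna-lora")  # adapter: merge, convert to GGUF, re-quantize
```

Is that roughly the right approach, or is there a better-trodden path for getting a tuned 7B back down to a Q4_K_M GGUF?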

Thanks in advance for any tips.