I’m quite new to this field, and despite having spent a good amount of time learning the ins and outs of frameworks like LangChain, and browsing the internet quite a bit, I still don’t know what approach to follow for the use case below.
Let’s say I have a lot of information about a network infrastructure: assets, software, users, etc. (e.g. what a CMDB would contain, or even simpler, a Windows Domain). This information is in a machine-readable format (e.g. JSON, CSV) and is updated daily (although eventual consistency is not a problem).
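For concreteness, a single record might look something like this (all field names and values invented for illustration):

```python
# Hypothetical example of one asset record (fields invented):
asset = {
    "hostname": "srv-web-01",
    "os": "Windows Server 2019",
    "ip": "10.0.12.34",
    "owner": "alice",
    "installed_software": ["nginx", "openssh"],
    "last_seen": "2024-01-15",
}
```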
My goal is to feed this information to an LLM, explain to the LLM what each piece of information means, and then be able to ask arbitrary (natural language) questions about the dataset.
The amount of information is large enough that I believe it can’t simply be sent as context with the prompt (the way LangChain’s basic “stuff everything into the prompt” approach would). To put a rough (invented) number on it: a few hundred thousand records at ~100 tokens each would be tens of millions of tokens, far beyond typical context windows.
I also don’t think it makes sense to store this information in a vector DB, because individual records don’t carry much semantic meaning on their own (they’re essentially rows in a database), so the questions asked of the LLM would not match any stored vectors (this is a guess, as I have not tried it, but it doesn’t seem to make sense conceptually). See the sketch below for what I mean.
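To make that doubt concrete, here is roughly what I imagine the vector-DB approach would look like (a minimal sketch using sentence-transformers; the data, model choice, and question are all invented):

```python
# A minimal sketch of the approach I'm doubting: serialize each row to text,
# embed it, and retrieve rows by similarity to the question's embedding.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

rows = [
    {"hostname": "srv-web-01", "os": "Windows Server 2019", "owner": "alice"},
    {"hostname": "srv-db-02", "os": "Ubuntu 22.04", "owner": "bob"},
]
# Flatten each row into a sentence so it can be embedded at all.
docs = [", ".join(f"{k}: {v}" for k, v in r.items()) for r in rows]

doc_embs = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode("How many Linux servers do we have?", convert_to_tensor=True)

# An aggregate question like this has no single "similar" row, which is
# exactly why I suspect similarity search doesn't fit this kind of data.
print(util.cos_sim(query_emb, doc_embs))
```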
I read a little bit about fine-tuning, but from what I’ve seen it’s about providing a lot of question/answer pairs to the LLM so it can learn from them, which I don’t think fits this use case, though maybe I’m wrong. In the same vein, I don’t believe it makes sense to fine-tune an LLM on a daily basis (though this last point may not make much sense).
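For reference, this is the kind of training data I understand fine-tuning to require (OpenAI-style chat format; content invented), which is why it doesn’t seem to fit a dataset of facts that changes daily:

```python
# My understanding of fine-tuning data: many question/answer pairs,
# one JSON object per training example (OpenAI-style chat format).
import json

example = {
    "messages": [
        {"role": "user", "content": "Which servers does alice own?"},
        {"role": "assistant", "content": "srv-web-01"},
    ]
}
print(json.dumps(example))
```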
I’d appreciate it if someone could point me in the right direction.