Teaching an LLM about your domain with only machine-readable data

Hey folks,

I’m quite new to this field, and despite having spent a good amount of time learning the ins and outs of frameworks like LangChain, and browsing around the internet quite a bit, I still don’t know what approach to follow for the use case below.

Let’s say I have a lot of information regarding a network infrastructure: assets, software, users, etc. (e.g. what a CMDB would contain, or even simpler, a Windows Domain). This information is in a machine-readable format (e.g. JSON, CSV) and gets updated daily (although eventual consistency is not a problem).

My goal is to feed this information to an LLM, explain to the LLM what each piece of information means, and then be able to ask arbitrary natural-language questions about the dataset.

The amount of information is big enough that I believe it can’t all be sent as context in the prompt (which is what LangChain would otherwise do).

I don’t think it makes sense to use a vector DB to store this information either, because the data has little semantic meaning (e.g. a list of rows in a DB), so the questions asked to the LLM would not retrieve related vectors; the embedding of a row full of hostnames and IPs is unlikely to land near the embedding of a question like “which servers are unpatched?”. (This is a guess, as I have not tried it, but I don’t think it makes sense conceptually.)

I read a little bit about fine-tuning, but from what I’ve seen it’s about providing a lot of question/answer pairs to the LLM so it can learn, which I don’t think fits this use case, but maybe I’m wrong. In the same vein, I don’t believe it makes sense to fine-tune an LLM on a daily basis (though maybe that last point doesn’t make sense either).

I’d appreciate it if someone could point me in the right direction.

I was able to create a first implementation of this using a LangChain agent that dynamically queries the dataset: it generates Python and executes it. All of this can be done with, for example, the CSV Agent from LangChain.
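Roughly, the setup looks like this (assuming langchain_experimental and an OpenAI key; the file name and question are made up, and the import paths tend to move between LangChain versions):

```python
# A sketch of the CSV Agent setup described above. "assets.csv" and the
# question are made up; import paths vary between LangChain versions.
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_csv_agent

llm = ChatOpenAI(model="gpt-4o", temperature=0)

agent = create_csv_agent(
    llm,
    "assets.csv",               # hypothetical CMDB export
    verbose=True,
    allow_dangerous_code=True,  # the agent generates and executes Python
)

print(agent.invoke("How many hosts are missing the latest OS patch?")["output"])
```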

I modelled the dataset schema to include more descriptive information, and it does the trick. However, this approach has many gaps: the questions one is able to ask need to be phrased in a way that is coupled to the schema semantics.
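For reference, the kind of schema annotation I mean looks something like this (the column names and descriptions here are invented):

```python
# Hypothetical column descriptions; prepended to the agent's prompt so
# generated queries use the right fields.
SCHEMA_NOTES = {
    "hostname":   "DNS name of the asset, e.g. 'web-01.corp.local'",
    "os_version": "OS name and patch level, e.g. 'Windows Server 2019 1809'",
    "owner_uid":  "Active Directory user ID of the asset's owner",
}

schema_description = "\n".join(
    f"{col}: {desc}" for col, desc in SCHEMA_NOTES.items()
)
```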

The next step is figuring out how to query the dataset using higher-level abstractions in the questions.

Hey newlog. First off, I’m also a complete newbie here, so treat these comments as such.

I’m also looking to do something similar. I went down the LangChain path too and hit the same drawback as you: the language used in the query needs to match the language used in your domain data.

I have come to the conclusion that doing this correctly requires fine-tuning the model. Have you seen this post? Could you include your company’s domain-specific JSON/CSV along with an explanation of the files’ format, and fine-tune an existing model with your new data?
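From what I’ve read, fine-tuning providers want prompt/completion pairs rather than raw data files, so the training set would look something like this (OpenAI-style chat fine-tuning format; the host name and answer are invented):

```python
# What fine-tuning data usually looks like: prompt/completion pairs, not raw
# CSV. The host name and answer are invented; format follows OpenAI's chat
# fine-tuning JSONL layout.
import json

examples = [
    {"messages": [
        {"role": "user", "content": "Which team owns host web-01?"},
        {"role": "assistant", "content": "web-01 is owned by the Platform team."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")  # one JSON object per line
```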

You can just make a Chain:

1. Get the initial question
2. Transform the question into a structured query
3. Get the data
4. Format the data into a user-friendly response, using the original question as the basis, and return it

The flow is much like the Chat with Data example.
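A rough sketch of that chain in plain Python (the model, file name, and prompts are all assumptions, and the bare eval is only for a demo, not production):

```python
# A sketch of the four-step chain. Model, file name, and prompts are
# assumptions; eval() on model output is unsafe outside a sandboxed demo.
import pandas as pd
from openai import OpenAI

client = OpenAI()
df = pd.read_csv("assets.csv")  # hypothetical CMDB export

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def ask(question: str) -> str:
    # Step 2: transform the question into a structured query (here, pandas).
    code = llm(
        f"DataFrame `df` has columns {list(df.columns)}. "
        f"Write a single pandas expression answering: {question} "
        "Reply with the expression only, no backticks."
    )
    # Step 3: get the data by evaluating the generated expression.
    result = eval(code, {"df": df})
    # Step 4: format the data as a user-friendly answer to the original question.
    return llm(f"Question: {question}\nRaw result: {result}\nAnswer briefly.")

print(ask("How many servers run Windows?"))  # Step 1: the initial question
```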