Hi,
I am looking to fine-tune LLaMA 3 for a question-answering task using specific data that I have collected. Each question-answer pair in my dataset includes a timestamp (in the format MM/DD/YY) ranging from as far back as 2005 to as recent as 2024. While I intend to utilize all the data available, I would like the model to place greater emphasis on the most recent information.
Could you recommend a good strategy to achieve this?
One approach is to prompt the LLM with explicit instructions for how it should handle recency. With specific enough instructions and a powerful enough LLM, this can show some promise:
### Instruction
You are tasked with generating an answer to the question you have been given.
You will be provided with source documents that contain information relevant to the question.
The source documents contain a timestamp that can be used to analyze recency.
To generate a final answer, you must follow the steps below:
1. Sort the documents you have been provided from most to least recent.
2. Select the three most recent documents.
3. Generate an answer using only information contained in those documents.
You can test and further improve this approach by trying to implement various prompt-engineering techniques like Chain-of-Thought.
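To make that concrete, here is a minimal sketch of how the instruction above could be assembled into a single prompt. The document structure, the example question, and the `build_prompt` helper are hypothetical placeholders, not part of any particular framework:

```python
# Hypothetical timestamped source documents (MM/DD/YY format, as in your dataset).
documents = [
    {"timestamp": "03/14/19", "text": "Older background information ..."},
    {"timestamp": "06/02/24", "text": "Most recent findings ..."},
    {"timestamp": "11/23/21", "text": "Intermediate update ..."},
]

INSTRUCTION = (
    "You are tasked with generating an answer to the question you have been given.\n"
    "You will be provided with source documents that contain information relevant to the question.\n"
    "The source documents contain a timestamp that can be used to analyze recency.\n"
    "To generate a final answer, you must follow the steps below:\n"
    "1. Sort the documents you have been provided from most to least recent.\n"
    "2. Select the three most recent documents.\n"
    "3. Generate an answer using only information contained in those documents.\n"
)

def build_prompt(question: str, docs: list[dict]) -> str:
    """Assemble the instruction, timestamped documents, and question into one prompt."""
    doc_block = "\n".join(f"[{d['timestamp']}] {d['text']}" for d in docs)
    return f"{INSTRUCTION}\n### Documents\n{doc_block}\n\n### Question\n{question}\n\n### Answer\n"

prompt = build_prompt("What is the latest guidance on this topic?", documents)
print(prompt)
```

The prompt string can then be sent to whichever model you are testing; the point is simply that the sorting and selection instructions travel together with the timestamped documents.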
Another approach would be to implement a retrieval-augmented generation (RAG) system. The benefit of a system like this would be that you can isolate the document searching and sorting from the generation pipeline. This leads to a smaller burden on the LLM, and you can work towards optimizing the document search process and LLM generation process separately.
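As a rough sketch of what the retrieval side could look like, you could blend each candidate document's similarity score with an exponential recency decay and hand only the top few documents to the generator. The half-life, the scoring formula, and the candidate list below are assumptions for illustration, not part of any specific RAG library:

```python
from datetime import datetime

def recency_weight(timestamp: str, half_life_days: float = 365.0) -> float:
    """Exponential decay: a document loses half its weight every `half_life_days`."""
    age_days = (datetime.now() - datetime.strptime(timestamp, "%m/%d/%y")).days
    return 0.5 ** (age_days / half_life_days)

def score(similarity: float, timestamp: str) -> float:
    """Blend semantic similarity with recency; the blend itself is a tunable design choice."""
    return similarity * recency_weight(timestamp)

# Hypothetical retrieval results: (similarity to the query, document).
candidates = [
    (0.82, {"timestamp": "05/10/24", "text": "Recent answer ..."}),
    (0.91, {"timestamp": "07/01/05", "text": "Very old but very similar ..."}),
    (0.78, {"timestamp": "01/15/23", "text": "Fairly recent ..."}),
]

# Keep only the three best-scoring documents for the generation step.
top_docs = sorted(candidates, key=lambda c: score(c[0], c[1]["timestamp"]), reverse=True)[:3]
for sim, doc in top_docs:
    print(doc["timestamp"], round(score(sim, doc["timestamp"]), 3))
```

With a setup like this, the recency policy lives entirely in the retriever, so you can tune the half-life or swap in a hard date cutoff without touching the LLM at all.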
You could use GPT-4o or any other powerful LLM to generate the kind of responses you would want to see from your fine-tuned model, and use that data for supervised fine-tuning. Using a powerful model to generate training data is an idea that has been around for some time now.
A few ideas regarding dataset construction: you might want to prompt GPT so that it generates intermediate steps and not just a final output. For example, an ideal GPT response might look something like this:
Step 1. We want to only use documents that contain relevant information to the user query. Among the documents, Document 1 and Document 2 contain relevant information.
Step 2. We want to obtain the most recent information first, so we sort the documents in reverse chronological order. From most recent to least recent, the documents should be ordered [Document 2, Document 1].
Step 3. Generate a response using the most recent document, Document 2.
Final Response: ~~~~~
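If you go that route, one way to package those GPT-generated responses into a fine-tuning dataset might look like the sketch below. The JSONL layout and field names are assumptions on my part; adapt them to whatever fine-tuning framework you end up using:

```python
import json

# Hypothetical examples: each pairs a prompt (question plus timestamped documents)
# with a GPT-generated response that keeps the intermediate reasoning steps.
examples = [
    {
        "prompt": "### Documents\n[06/02/24] ...\n[03/14/19] ...\n\n### Question\n...",
        "response": (
            "Step 1. Document 1 and Document 2 contain relevant information.\n"
            "Step 2. Sorted from most to least recent: [Document 2, Document 1].\n"
            "Step 3. Generate a response using the most recent document, Document 2.\n"
            "Final Response: ..."
        ),
    },
]

# Write records in a simple instruction/output JSONL layout; most SFT tooling
# (for example, Hugging Face TRL's SFTTrainer) can be pointed at a format like
# this with a small amount of mapping code.
with open("sft_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps({"instruction": ex["prompt"], "output": ex["response"]}) + "\n")
```

Keeping the intermediate steps in the output field is what gives the fine-tuned model worked examples of the sorting and selection behavior you want it to imitate.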
I would hope that fine-tuning on responses that include this chain of thought helps your model “learn” what you want it to do, but it may or may not; you will need a fair amount of testing to confirm it.
As for “how to prevent the model from generating based on old knowledge”, I’m not sure there is any great way to achieve that. Specifying date ranges in your prompt may help, but then it really depends on how well the model follows the instructions you provide. You could gather a large amount of recent data, train the model on it, and hope for the best, but that probably isn’t an efficient use of your time.
Rather than trying to train or fine-tune your model to generate under all sorts of restrictions, I would probably just implement a RAG system. Recency and reliability are what it was designed for.