Hi, I plan to create an AI agent that has around 70 pages of data about a friend of mine and could answer various questions about him (career, family, personality, etc.). The data is currently structured as an interview, with closely related questions about the same subject grouped near each other, something like:
fitness:
Do you have an exercise routine? Tell us about it
Yes, I go to the gym twice a week, also…
Did you ever win a prize in a sport competition? Tell us about it
I once won a silver medal in a local competition back in my hometown at the age of 15, it was a…
…
music:
What is your taste when it comes to music?
…
My question is: should I keep the data as is before embedding it in a vector DB, or should I first remove the questions and create a more summary-like structure?
*Note: this data serves as the general knowledge base of the model and is not meant for fine-tuning (if I ever do that).
The following is a general overview that I had a chatbot summarize, but my personal opinion is that the raw data would be better.
If you have the LLM read the data returned by the DB search and reformat it into an answer, I don't think summarizing beforehand has many advantages other than speed. Unless the facts themselves change, there won't be much difference in the output.
Processing the data is also time-consuming…
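To make that concrete, a retrieve-then-answer flow could look roughly like the sketch below. It assumes the raw interview chunks are already stored in a local Chroma collection and uses the OpenAI Python client; the collection name, path, and model name are just placeholders, not a recommendation.

```python
import chromadb
from openai import OpenAI

chroma = chromadb.PersistentClient(path="./friend_kb")            # assumed local store
collection = chroma.get_or_create_collection("friend_interview")  # placeholder name
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, n_results: int = 4) -> str:
    # 1. Retrieve the most similar raw interview chunks from the vector DB.
    hits = collection.query(query_texts=[question], n_results=n_results)
    context = "\n\n".join(hits["documents"][0])

    # 2. Let the LLM read the raw Q&A excerpts and reformat them into an answer.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer questions about the interviewee using only the "
                        "interview excerpts provided. Say so if the answer is not there."},
            {"role": "user",
             "content": f"Interview excerpts:\n\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer("Has he ever won anything in sports?"))
```

With this kind of flow, the summarization step mostly affects what gets stored and retrieved, not how the final answer is phrased, which is why the raw data tends to lose little.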
When creating an AI agent to answer questions about your friend based on a 70-page interview, the decision to summarize or keep the data as is involves trade-offs between data completeness and processing efficiency. Here’s a structured summary of the considerations:
Approach 1: Summarizing the Data
Advantages:
- Efficiency: Summarized data reduces redundancy, making embeddings more compact and retrieval faster.
- Relevance: Concise summaries may make it easier to retrieve the pertinent information quickly, improving response efficiency.
Disadvantages:
- Loss of Nuance: Simplification risks omitting important details, potentially hindering the AI’s ability to answer detailed or nuanced queries.
- Summarization Effort: Requires careful summarization to ensure all key points are retained, which can be time-consuming (a rough sketch of this step follows below).
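If you do go the summarization route, one hedged sketch of that preprocessing step is shown below, again using the OpenAI client; the prompt wording, the model name, and the hard-coded example section are all illustrative.

```python
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_section(topic: str, section_text: str) -> str:
    # Condense one interview section into dense factual statements about the
    # interviewee, dropping the interviewer's questions but keeping the details.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Rewrite these interview excerpts as short factual statements "
                        "about the interviewee. Keep every name, date and concrete "
                        "detail; do not add anything that is not in the text."},
            {"role": "user", "content": f"Topic: {topic}\n\n{section_text}"},
        ],
    )
    return response.choices[0].message.content

fitness_raw = (
    "Do you have an exercise routine? Tell us about it\n"
    "Yes, I go to the gym twice a week...\n"
    "Did you ever win a prize in a sport competition? Tell us about it\n"
    "I once won a silver medal in a local competition at the age of 15..."
)
print(summarize_section("fitness", fitness_raw))
```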
Approach 2: Keeping Data as Is
Advantages:
- Data Completeness: Preserves all details, allowing the AI to answer detailed questions accurately.
- Contextual Patterns: Maintains the interview structure, which might help the AI recognize conversational patterns, aiding in context-aware responses (see the chunking sketch after this list).
Disadvantages:
- Redundancy and Size: Larger, more redundant data can lead to less efficient embeddings and slower processing.
- Processing Overload: Larger datasets may strain the AI’s processing capabilities, affecting performance.
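For the keep-as-is route, one way to address the size and redundancy concern is to split the interview into one chunk per question/answer pair and keep the section heading as metadata. The sketch below assumes the "topic:" / one-line question / one-line answer layout from the original post, plus Chroma as the vector store; the file name and collection name are placeholders.

```python
import chromadb

def split_interview(text: str) -> list[dict]:
    # Assumes the layout from the post: a "topic:" header line, then alternating
    # one-line questions and one-line answers underneath it.
    chunks, topic, question = [], None, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):            # section header, e.g. "fitness:"
            topic, question = line[:-1], None
        elif question is None:            # interviewer's question
            question = line
        else:                             # answer line -> emit one Q&A chunk
            chunks.append({"topic": topic, "text": f"Q: {question}\nA: {line}"})
            question = None
    return chunks

chroma = chromadb.PersistentClient(path="./friend_kb")
collection = chroma.get_or_create_collection("friend_interview")

with open("interview.txt", encoding="utf-8") as f:   # placeholder file name
    chunks = split_interview(f.read())

collection.add(
    documents=[c["text"] for c in chunks],
    metadatas=[{"topic": c["topic"] or "unknown"} for c in chunks],
    ids=[f"chunk-{i}" for i in range(len(chunks))],
)
```

Keeping the question text inside each chunk can also help retrieval, since user queries often resemble the interviewer's questions more than the answers themselves.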
Conclusion:
If the AI agent needs to handle detailed, nuanced inquiries, preserving the original data structure despite larger embeddings might be preferable. Conversely, if efficiency and quick response times are prioritized, summarizing the data could offer advantages. Consider the specific needs of your AI application and the balance between detail retention and processing efficiency.