Hi, I want to build a RAG chatbot based on a large local codebase of mine. Does anyone know what the best method would be to generate embedding vectors for the codebase?
I am currently using DeepSeek-r1 for chat. For embeddings, I have no idea which model can handle a codebase this large. I also have no idea how to give the model the directory structure and the way files are interlinked.
Building a RAG chatbot for a codebase is a great idea! Here’s a breakdown of how to handle embeddings and directory structure:
- Code Splitting: Don’t embed entire files. Split your code into smaller chunks (e.g., functions, classes, or even smaller logical blocks). This improves retrieval accuracy.
- Embedding Models: For code, consider these options:
- Sentence Transformers (e.g., `all-mpnet-base-v2`, `multi-qa-mpnet-base-dot-v1`): Good general-purpose embeddings, often a solid starting point.
- Code-Specific Models (e.g., CodeBERTa, GraphCodeBERT): These are trained on code and often perform better for code-related tasks. Sentence Transformers also offers code-specific models.
- OpenAI Embeddings (if budget allows): Very high quality but come with usage costs.
- Directory Structure: You don’t directly “feed” the directory structure to the embedding model. Instead:
- Include file paths/names in the metadata of each code chunk. This helps with context.
- Consider creating a “summary” embedding for each file or directory. This can be used for a first-pass filtering before retrieving more granular chunks.
- Vector Database: Use a vector database (e.g., FAISS, Chroma, Weaviate, Pinecone) to store and efficiently query your embeddings.
- Chunking Strategy: Consider splitting code based on Abstract Syntax Trees (ASTs). Libraries like `ast` in Python can help with this and ensure you're embedding logical code units (see the sketch after this list).
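To make the AST idea concrete, here is a minimal sketch of function/class-level chunking using Python's built-in `ast` module. The `chunk_python_file` helper and its output format are illustrative, not a standard API:

```python
# Minimal AST-based chunker: extracts top-level functions and classes
# from a Python file, keeping the file path and line range as metadata.
import ast
from pathlib import Path

def chunk_python_file(path: str) -> list[dict]:
    source = Path(path).read_text(encoding="utf-8")
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:  # top-level definitions only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "text": ast.get_source_segment(source, node),
                "metadata": {
                    "file": path,
                    "name": node.name,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                },
            })
    return chunks
```

For very large classes you could recurse into `node.body` and emit each method as its own chunk, which keeps chunks closer to the size embedding models handle well.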
For DeepSeek-r1, the retrieval process would be:
- Embed the user’s query.
- Query the vector database to find relevant code chunks.
- Include the retrieved chunks (and their metadata) as context for DeepSeek-r1.
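As a rough sketch of that loop, assuming the `chunk_python_file` helper above, Sentence Transformers for the embeddings, and Chroma as the vector store (the model name, file path, and prompt wording are just placeholders):

```python
# Index-then-retrieve sketch: embed code chunks into Chroma, then embed the
# user's query, pull the top matches, and build the context for DeepSeek-r1.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep the index
collection = client.get_or_create_collection("codebase")

# Index: embed each chunk and store it with its file-path metadata.
chunks = chunk_python_file("src/example.py")  # hypothetical file
collection.add(
    ids=[f'{c["metadata"]["file"]}:{c["metadata"]["name"]}' for c in chunks],
    embeddings=model.encode([c["text"] for c in chunks]).tolist(),
    documents=[c["text"] for c in chunks],
    metadatas=[c["metadata"] for c in chunks],
)

# Retrieve: embed the user's query and pull the top matching chunks.
query = "Where is the database connection configured?"
results = collection.query(
    query_embeddings=model.encode([query]).tolist(),
    n_results=3,
)

# Build the context block that gets prepended to the DeepSeek-r1 prompt.
context = "\n\n".join(
    f'# {m["file"]} ({m["name"]})\n{doc}'
    for doc, m in zip(results["documents"][0], results["metadatas"][0])
)
prompt = f"Answer using this code context:\n{context}\n\nQuestion: {query}"
```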
Start with Sentence Transformers and a simple chunking strategy (e.g., splitting by functions). You can then experiment with more advanced techniques like AST-based chunking and code-specific embedding models.
Thanks for the great response.
I have another question: I also have some videos explaining the codebase. How should I structure that data so the model can suggest relevant portions of a video (e.g., for a YouTube video, "seconds x to y are relevant to this question")?
I also have some diagrams inside PDFs.
I don't know much about VLMs, but I've seen a multimodal model that can handle videos, so I'll leave it here.
Thanks. Although it doesn’t solve my issue, it’s nice to have something on hand.