Codebase Embedding

Hi, I want to build a RAG chatbot based on a large local codebase of mine. Does anyone know what the best method would be to generate embedding vectors for the codebase?

I am currently using DeepSeek-r1 for chat. For embeddings, I have no idea which model will handle this large codebase. I also have no idea how to feed the model the directory structure or how the files are interlinked.


Building a RAG chatbot for a codebase is a great idea! Here’s a breakdown of how to handle embeddings and directory structure:

  1. Code Splitting: Don’t embed entire files. Split your code into smaller chunks (e.g., functions, classes, or even smaller logical blocks). This improves retrieval accuracy.
  2. Embedding Models: For code, consider these options:
  • Sentence Transformers (e.g., all-mpnet-base-v2, multi-qa-mpnet-base-dot-v1): Good general-purpose embeddings, often a solid starting point.
  • Code-Specific Models (e.g., CodeBERTa, GraphCodeBERT): These are trained on code and often perform better for code-related tasks. Sentence Transformers also offers code-specific models.
  • OpenAI Embeddings (if budget allows): Very high quality but come with usage costs.
  3. Directory Structure: You don’t directly “feed” the directory structure to the embedding model. Instead:
  • Include file paths/names in the metadata of each code chunk. This helps with context.
  • Consider creating a “summary” embedding for each file or directory. This can be used as a first-pass filter before retrieving more granular chunks.
  4. Vector Database: Use a vector database (e.g., FAISS, Chroma, Weaviate, Pinecone) to store and efficiently query your embeddings.
  5. Chunking Strategy: Consider splitting code based on Abstract Syntax Trees (ASTs). Libraries like ast in Python can help with this. This ensures you’re embedding logical code units (see the sketch after this list).
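
Here’s a minimal sketch of the indexing side, assuming a Python codebase, the sentence-transformers and faiss-cpu packages, and a placeholder repo path (adjust the names to your setup):

```python
import ast
from pathlib import Path

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_python_file(path: Path) -> list[dict]:
    """Split a file into top-level functions/classes, keeping file-path metadata."""
    source = path.read_text(encoding="utf-8")
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:  # top level only, so class methods aren't duplicated
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "text": ast.get_source_segment(source, node),
                "file": str(path),
                "name": node.name,
                "lineno": node.lineno,
            })
    return chunks

repo_root = Path("path/to/your/repo")  # placeholder
chunks = []
for py_file in repo_root.rglob("*.py"):
    try:
        chunks.extend(chunk_python_file(py_file))
    except (SyntaxError, UnicodeDecodeError):
        continue  # skip files that don't parse cleanly

# Embed each chunk; all-mpnet-base-v2 is the general-purpose starting point above.
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

# FAISS index: with normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))
```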

For DeepSeek-r1, the retrieval process would be:

  1. Embed the user’s query.
  2. Query the vector database to find relevant code chunks.
  3. Include the retrieved chunks (and their metadata) as context for DeepSeek-r1.
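
Continuing the sketch above (reusing `model`, `index`, and `chunks` from the indexing step; the example query and prompt format are just illustrations):

```python
def retrieve(query: str, k: int = 5) -> list[dict]:
    """Embed the query and return the k most similar code chunks."""
    q = model.encode([query], normalize_embeddings=True)
    _scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]

def build_prompt(query: str, retrieved: list[dict]) -> str:
    """Format retrieved chunks, with their file paths, as context for the chat model."""
    context = "\n\n".join(
        f"# {c['file']} :: {c['name']} (line {c['lineno']})\n{c['text']}"
        for c in retrieved
    )
    return (
        "Answer the question using the following code from the repository.\n\n"
        f"{context}\n\nQuestion: {query}"
    )

query = "Where is the database connection configured?"  # example query
prompt = build_prompt(query, retrieve(query))
# Send `prompt` to DeepSeek-r1 however you host it (e.g., Ollama or any
# OpenAI-compatible endpoint); the serving setup is outside this sketch.
```

Keeping the file path and line number in the context also lets the model cite where in the repo an answer comes from.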

Start with Sentence Transformers and a simple chunking strategy (e.g., splitting by functions). You can then experiment with more advanced techniques like AST-based chunking and code-specific embedding models.


Thanks for the great response.

I have another question: I have some videos explaining the codebase. How should I structure the data so that the model can suggest relevant portions of the videos (e.g., for a YouTube video, “seconds x to y are relevant to this question”)?
I also have some diagrams inside PDFs.


I don’t know much about VLMs, but I’ve seen a multimodal model that can handle videos, so I’ll leave it here.


Thanks. Although it doesn’t solve my issue, it’s nice to have something on hand.
