Codebase Embedding

Hi, I want to build a RAG chatbot based on a large local codebase of mine. Does anyone know what the best method would be to generate embedding vectors for the codebase?

I am currently using DeepSeek-r1 for chat. For embeddings, I have no idea which model will handle this large codebase. I also have no idea how to feed the model the directory structure or how the files are interlinked.


Building a RAG chatbot for a codebase is a great idea! Here’s a breakdown of how to handle embeddings and directory structure:

  1. Code Splitting: Don’t embed entire files. Split your code into smaller chunks (e.g., functions, classes, or even smaller logical blocks). This improves retrieval accuracy.
  2. Embedding Models: For code, consider these options:
  • Sentence Transformers (e.g., all-mpnet-base-v2, multi-qa-mpnet-base-dot-v1): Good general-purpose embeddings, often a solid starting point.
  • Code-Specific Models (e.g., CodeBERTa, GraphCodeBERT): These are trained on code and often perform better for code-related tasks. Sentence Transformers also offers code-specific models.
  • OpenAI Embeddings (if budget allows): Very high quality but come with usage costs.
  3. Directory Structure: You don’t directly “feed” the directory structure to the embedding model. Instead:
  • Include file paths/names in the metadata of each code chunk. This helps with context.
  • Consider creating a “summary” embedding for each file or directory. This can be used as a first-pass filter before retrieving more granular chunks.
  4. Vector Database: Use a vector database (e.g., FAISS, Chroma, Weaviate, Pinecone) to store and efficiently query your embeddings.
  5. Chunking Strategy: Consider splitting code based on Abstract Syntax Trees (ASTs). Libraries like ast in Python can help with this. This ensures you’re embedding logical code units (see the sketch after this list).
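
Here’s a minimal sketch of the indexing side, assuming a Python codebase, the sentence-transformers and faiss-cpu packages, and a placeholder repo path (adjust the names to your setup):

```python
import ast
from pathlib import Path

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_python_file(path: Path) -> list[dict]:
    """Split a file into top-level functions/classes, keeping file-path metadata."""
    source = path.read_text(encoding="utf-8")
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:  # top level only, so class methods aren't duplicated
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "text": ast.get_source_segment(source, node),
                "file": str(path),
                "name": node.name,
                "lineno": node.lineno,
            })
    return chunks

repo_root = Path("path/to/your/repo")  # placeholder
chunks = []
for py_file in repo_root.rglob("*.py"):
    try:
        chunks.extend(chunk_python_file(py_file))
    except (SyntaxError, UnicodeDecodeError):
        continue  # skip files that don't parse cleanly

# Embed each chunk; all-mpnet-base-v2 is the general-purpose starting point above.
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

# FAISS index: with normalized vectors, inner product equals cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))
```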

For DeepSeek-r1, the retrieval process would be:

  1. Embed the user’s query.
  2. Query the vector database to find relevant code chunks.
  3. Include the retrieved chunks (and their metadata) as context for DeepSeek-r1.
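
Continuing the sketch above (reusing `model`, `index`, and `chunks` from the indexing step; the example query and prompt format are just illustrations):

```python
def retrieve(query: str, k: int = 5) -> list[dict]:
    """Embed the query and return the k most similar code chunks."""
    q = model.encode([query], normalize_embeddings=True)
    _scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]

def build_prompt(query: str, retrieved: list[dict]) -> str:
    """Format retrieved chunks, with their file paths, as context for the chat model."""
    context = "\n\n".join(
        f"# {c['file']} :: {c['name']} (line {c['lineno']})\n{c['text']}"
        for c in retrieved
    )
    return (
        "Answer the question using the following code from the repository.\n\n"
        f"{context}\n\nQuestion: {query}"
    )

query = "Where is the database connection configured?"  # example query
prompt = build_prompt(query, retrieve(query))
# Send `prompt` to DeepSeek-r1 however you host it (e.g., Ollama or any
# OpenAI-compatible endpoint); the serving setup is outside this sketch.
```

Keeping the file path and line number in the context also lets the model cite where in the repo an answer comes from.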

Start with Sentence Transformers and a simple chunking strategy (e.g., splitting by functions). You can then experiment with more advanced techniques like AST-based chunking and code-specific embedding models.


Thanks for the great response.

I have another question: I have some videos explaining the codebase. How should I structure the data so that the model can suggest relevant portions of the videos (e.g., for a YouTube video, “seconds x to y are relevant to this question”)?
I also have some diagrams inside PDFs.


I don’t know much about VLMs, but I’ve seen a multimodal model that can handle videos, so I’ll leave it here.


Thanks. Although it doesn’t solve my issue, it’s nice to have something on hand.
