I’m building a code search system using CodeBERT embeddings (768 dimensions) and ChromaDB as the vector database, but the retrieval results are not satisfactory. The dataset I am currently working with is:
greengerong/leetcode
Current Approach:
- Extract code snippets (C++, Java, Python, JavaScript) from the Parquet file.
- Tokenize and chunk code using tiktoken (cl100k_base) with a chunk size of 512 tokens.
- The chunked code is stored along with metadata like id, slug, title, difficulty, language, and chunk_id.
- Embeddings are computed with microsoft/codebert-base and stored persistently in ChromaDB. A unique chunk identifier (embedding_id) is generated from a SHA-256 hash of the chunk.
- Perform a similarity search in ChromaDB and re-rank the results with the cross-encoder nli-deberta-v3-base for relevance scoring.
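Here is a simplified sketch of how the pipeline is wired (the helper names, the mean pooling over CodeBERT's last hidden state, and the handling of the cross-encoder's NLI outputs are approximations of my actual code):

```python
import hashlib

import chromadb
import torch
from sentence_transformers import CrossEncoder
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
reranker = CrossEncoder("cross-encoder/nli-deberta-v3-base")

client = chromadb.PersistentClient(path="./chroma_store")  # persistent storage
collection = client.get_or_create_collection("leetcode_code")

def embed(text: str) -> list[float]:
    # Mean-pool the last hidden state (CodeBERT has no dedicated sentence head).
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).tolist()

def index_chunk(chunk: str, metadata: dict) -> None:
    # ID each chunk by a SHA-256 hash of its contents.
    embedding_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    collection.add(ids=[embedding_id], embeddings=[embed(chunk)],
                   documents=[chunk], metadatas=[metadata])

def search(query: str, k: int = 10):
    hits = collection.query(query_embeddings=[embed(query)], n_results=k)
    docs = hits["documents"][0]
    # nli-deberta-v3-base outputs three NLI classes; I take the entailment
    # probability as the relevance score (class order per the model config).
    probs = reranker.predict([(query, d) for d in docs], apply_softmax=True)
    return sorted(zip(docs, probs[:, 1]), key=lambda p: p[1], reverse=True)
```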
Issues I am facing:
- Retrieved results are often not highly relevant to the query, even for simpler searches.
- Some highly similar code snippets get ranked lower than less relevant ones.
Hmm… I asked Hugging Chat.
To improve the retrieval relevance and ranking of your code search system, consider implementing the following structured approach:
1. Refine Chunking Strategy
- Increase Context Window: Extend the chunk size from 512 tokens to 1024 tokens to capture a broader context within each code snippet.
- Implement Overlapping Chunks: Overlap adjacent chunks to maintain continuity and ensure that code context isn’t lost at chunk boundaries.
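A rough sketch of overlapping chunking with tiktoken (the 1024-token window and 128-token overlap are starting points to tune, not recommendations):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_code(code: str, chunk_size: int = 1024, overlap: int = 128) -> list[str]:
    """Split code into token windows that overlap by `overlap` tokens."""
    tokens = enc.encode(code)
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

One caveat: CodeBERT itself accepts at most 512 tokens of its own tokenizer, so 1024-token chunks will be truncated at embedding time unless you embed each chunk in windows or switch to a longer-context encoder; the overlap is worth keeping regardless of the size you settle on.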
2. Fine-tune CodeBERT Embeddings
- Custom Training: Fine-tune the CodeBERT model on your specific dataset of LeetCode problems. This tailors the embeddings to the vocabulary and structure of your code, so semantically similar problems and solutions land closer together in the embedding space.
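A minimal fine-tuning sketch using sentence-transformers, assuming you mine (problem description, accepted solution) pairs from the dataset as positives and rely on in-batch negatives; the pair below is only a toy example:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Wrap CodeBERT as a bi-encoder with mean pooling.
word_model = models.Transformer("microsoft/codebert-base", max_seq_length=512)
pooling = models.Pooling(word_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_model, pooling])

# Toy positive pair; in practice, mine these from greengerong/leetcode.
training_pairs = [
    ("Return indices of the two numbers that add up to target.",
     "def twoSum(nums, target):\n    seen = {}\n    ..."),
]
train_examples = [InputExample(texts=[query, code]) for query, code in training_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# In-batch negatives: every other code snippet in the batch acts as a negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=100)
model.save("codebert-leetcode-finetuned")
```

Evaluate on a held-out set of queries before swapping the production index over to the fine-tuned model.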
3. Optimize ChromaDB Indexing
- Tune the Index Parameters: ChromaDB builds an HNSW (Hierarchical Navigable Small World) index by default, which is well suited to high-dimensional spaces; tuning its parameters (distance metric, construction and search ef, M) can improve both retrieval accuracy and speed.
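A sketch of how these settings are applied at collection-creation time (the values are illustrative, and changing the distance metric means re-indexing the collection):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_store")  # assumed path
collection = client.get_or_create_collection(
    name="leetcode_code_hnsw",
    metadata={
        "hnsw:space": "cosine",       # default is "l2"; cosine often suits embeddings better
        "hnsw:construction_ef": 200,  # higher = better graph quality, slower indexing
        "hnsw:search_ef": 100,        # higher = better recall, slower queries
        "hnsw:M": 32,                 # graph connectivity
    },
)
```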
4. Enhance Re-ranking Methodology
- Incorporate Additional Metrics: Combine the cross-encoder’s relevance scoring with other similarity metrics, such as structural comparisons using Abstract Syntax Trees (ASTs), to provide a more comprehensive ranking.
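The structural part is most useful when the query is itself a code snippet (code-to-code search). A rough Python-only sketch with the standard ast module, using an arbitrary 0.7/0.3 weighting that you would want to tune:

```python
import ast
from collections import Counter

def node_type_bag(code: str) -> Counter:
    """Multiset of AST node-type names; empty if the snippet does not parse."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return Counter()
    return Counter(type(node).__name__ for node in ast.walk(tree))

def ast_similarity(a: str, b: str) -> float:
    """Jaccard-style overlap of the two node-type multisets, in [0, 1]."""
    bag_a, bag_b = node_type_bag(a), node_type_bag(b)
    if not bag_a or not bag_b:
        return 0.0
    return sum((bag_a & bag_b).values()) / sum((bag_a | bag_b).values())

def combined_score(cross_encoder_score: float, query_code: str, candidate_code: str,
                   w_semantic: float = 0.7, w_structure: float = 0.3) -> float:
    # Blend the cross-encoder relevance score with the structural similarity.
    return (w_semantic * cross_encoder_score
            + w_structure * ast_similarity(query_code, candidate_code))
```

For the other languages in your corpus (C++, Java, JavaScript) you would need a parser such as tree-sitter to get an equivalent structural signal.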
5. Adjust Tokenization Strategy
- Code-Specific Tokenization: The chunks are currently sized in tiktoken (cl100k_base) tokens but embedded with CodeBERT's own tokenizer, so chunk lengths and boundaries don't line up with what the model actually sees. Chunk with a tokenizer that matches the embedding model, or with one designed specifically for code.
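A sketch of chunking with the embedding model's own tokenizer, so the 512-token budget you measure is the same one CodeBERT enforces (window and overlap values are illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")

def chunk_with_model_tokenizer(code: str, max_tokens: int = 510, overlap: int = 64) -> list[str]:
    """Window the code in CodeBERT tokens, leaving room for the special tokens."""
    ids = tok.encode(code, add_special_tokens=False)
    stride = max_tokens - overlap
    chunks = []
    for start in range(0, len(ids), stride):
        window = ids[start:start + max_tokens]
        chunks.append(tok.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks
```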
6. Leverage Metadata Effectively
- Metadata-Based Ranking: Use metadata more strategically. For example, prioritize chunks based on language, difficulty, or other relevant fields to refine search results before and after the re-ranking process.
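In ChromaDB this can be a where filter at query time plus a small boost after re-ranking. The sketch assumes a collection handle and a query_embedding produced by your existing CodeBERT embedding step; the field names match your metadata and the boost value is arbitrary:

```python
# Pre-filter the vector search by metadata before re-ranking.
hits = collection.query(
    query_embeddings=[query_embedding],
    n_results=20,
    where={"$and": [{"language": {"$eq": "python"}},
                    {"difficulty": {"$eq": "Easy"}}]},
)

def metadata_boost(score: float, metadata: dict, preferred_language: str = "python") -> float:
    # Small additive boost after cross-encoder re-ranking when the language matches.
    return score + (0.05 if metadata.get("language") == preferred_language else 0.0)
```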
Each of these steps targets a different aspect of the current system, aiming to improve both the quality of the embeddings and the effectiveness of the search and ranking stages. Addressing them should give you more relevant retrieval and more accurate ranking of similar code snippets.