I’m building a code search system using CodeBERT embeddings (768 dimensions) and ChromaDB as the vector database, but the retrieval results are not satisfactory. The dataset I am currently working with is:
greengerong/leetcode
Current Approach:
- Extract code snippets (C++, Java, Python, JavaScript) from the Parquet file.
- Tokenize and chunk the code using tiktoken (cl100k_base) with a chunk size of 512 tokens (see the chunking sketch below).
- The chunked code is stored along with metadata like id, slug, title, difficulty, language, and chunk_id.
- Embeddings are computed with microsoft/codebert-base and stored persistently in ChromaDB. A unique chunk identifier (embedding_id) is generated from a SHA-256 hash of the chunk (see the indexing sketch below).
- Perform a similarity search in ChromaDB and re-rank the results with the cross-encoder nli-deberta-v3-base for relevance scoring (see the retrieval sketch below).
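
For reference, the chunking step looks roughly like this. A minimal sketch: the `chunk_code` helper name and the zero-overlap default are my own choices, not anything fixed in the pipeline.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_code(code: str, max_tokens: int = 512, overlap: int = 0) -> list[str]:
    """Split a code string into windows of at most max_tokens tokens."""
    token_ids = enc.encode(code)
    step = max_tokens - overlap if max_tokens > overlap else max_tokens
    return [
        enc.decode(token_ids[start:start + max_tokens])
        for start in range(0, len(token_ids), step)
    ]
```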
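
Indexing is along these lines. This is a sketch that assumes mean pooling over CodeBERT's last hidden state and a local persistence path; both the pooling strategy and the `./chroma_db` path are assumptions on my part.

```python
import hashlib
import torch
import chromadb
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

# Persistent ChromaDB collection; the path is an assumption.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("code_chunks")

@torch.no_grad()
def embed(text: str) -> list[float]:
    # Mean-pool the last hidden state (CLS pooling would be an alternative).
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)
    pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    return pooled.squeeze(0).tolist()

def add_chunk(chunk: str, metadata: dict) -> None:
    # embedding_id is a SHA-256 hash of the chunk text.
    embedding_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    collection.add(
        ids=[embedding_id],
        embeddings=[embed(chunk)],
        documents=[chunk],
        metadatas=[metadata],
    )
```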
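
Retrieval and re-ranking, reusing `embed` and `collection` from the sketch above. Since nli-deberta-v3-base is an NLI cross-encoder, I take the entailment probability as the relevance score; the column index used here is an assumption and should be checked against the model's label mapping. The candidate counts are arbitrary.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def search(query: str, n_candidates: int = 20, top_k: int = 5):
    # First-stage retrieval from ChromaDB by embedding similarity.
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=n_candidates,
    )
    docs = results["documents"][0]
    metas = results["metadatas"][0]

    # Re-rank with the cross-encoder. The NLI model returns one score per label
    # (contradiction / entailment / neutral); the entailment probability is used
    # as a relevance proxy -- check reranker.model.config.id2label to confirm
    # the column order for your model version.
    probs = reranker.predict([(query, doc) for doc in docs], apply_softmax=True)
    relevance = probs[:, 1]

    ranked = sorted(zip(relevance, docs, metas), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]
```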
Issues I am facing:
- Retrieved results are often not relevant to the query, even for simple searches.
- Some highly similar code snippets are ranked lower than less relevant ones.