I’m building a code search system using CodeBERT embeddings (768 dimensions) and ChromaDB as the vector database, but the retrieval results are not satisfactory. The dataset I am currently working with is:
greengerong/leetcode
Current Approach:
- Extract code snippets (C++, Java, Python, JavaScript) from the Parquet file.
- Tokenize and chunk the code using tiktoken (cl100k_base) with a chunk size of 512 tokens (see the chunking sketch below).
- The chunked code is stored along with metadata like id, slug, title, difficulty, language, and chunk_id.
- Embeddings are computed with microsoft/codebert-base and stored persistently in ChromaDB. A unique chunk identifier (embedding_id) is generated from a SHA-256 hash of the chunk (see the indexing sketch below).
- Perform a similarity search in ChromaDB and re-rank the results with the cross-encoder nli-deberta-v3-base for relevance scoring (see the retrieval sketch below).
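
For reference, the chunking step looks roughly like this. A minimal sketch: the `chunk_code` helper name and the zero-overlap default are my own choices, not anything fixed in the pipeline.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_code(code: str, max_tokens: int = 512, overlap: int = 0) -> list[str]:
    """Split a code string into windows of at most max_tokens tokens."""
    token_ids = enc.encode(code)
    step = max_tokens - overlap if max_tokens > overlap else max_tokens
    return [
        enc.decode(token_ids[start:start + max_tokens])
        for start in range(0, len(token_ids), step)
    ]
```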
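
Indexing is along these lines. This is a sketch that assumes mean pooling over CodeBERT's last hidden state and a local persistence path; both the pooling strategy and the `./chroma_db` path are assumptions on my part.

```python
import hashlib
import torch
import chromadb
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

# Persistent ChromaDB collection; the path is an assumption.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("code_chunks")

@torch.no_grad()
def embed(text: str) -> list[float]:
    # Mean-pool the last hidden state (CLS pooling would be an alternative).
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)
    pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
    return pooled.squeeze(0).tolist()

def add_chunk(chunk: str, metadata: dict) -> None:
    # embedding_id is a SHA-256 hash of the chunk text.
    embedding_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    collection.add(
        ids=[embedding_id],
        embeddings=[embed(chunk)],
        documents=[chunk],
        metadatas=[metadata],
    )
```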
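
Retrieval and re-ranking, reusing `embed` and `collection` from the sketch above. Since nli-deberta-v3-base is an NLI cross-encoder, I take the entailment probability as the relevance score; the column index used here is an assumption and should be checked against the model's label mapping. The candidate counts are arbitrary.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def search(query: str, n_candidates: int = 20, top_k: int = 5):
    # First-stage retrieval from ChromaDB by embedding similarity.
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=n_candidates,
    )
    docs = results["documents"][0]
    metas = results["metadatas"][0]

    # Re-rank with the cross-encoder. The NLI model returns one score per label
    # (contradiction / entailment / neutral); the entailment probability is used
    # as a relevance proxy -- check reranker.model.config.id2label to confirm
    # the column order for your model version.
    probs = reranker.predict([(query, doc) for doc in docs], apply_softmax=True)
    relevance = probs[:, 1]

    ranked = sorted(zip(relevance, docs, metas), key=lambda x: x[0], reverse=True)
    return ranked[:top_k]
```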
Issues I am facing:
- Retrieved results are often not relevant to the query, even for simple searches.
- Some highly similar code snippets are ranked lower than less relevant ones.