How to improve Code Search with CodeBERT & ChromaDB

I’m building a code search system using CodeBERT embeddings (768 dimensions) with ChromaDB as the vector database, but the retrieval results are not satisfactory. The dataset I am currently working with is :-

greengerong/leetcode

Current Approach :-

  1. Extract code snippets (C++, Java, Python, JavaScript) from the parquet file.
  2. Tokenize and chunk code using tiktoken (cl100k_base) with a chunk size of 512.
  3. The chunked code is stored along with metadata like id, slug, title, difficulty, language, and chunk_id.
  4. Embeddings are computed using microsoft/codebert-base and stored persistently in ChromaDB. A unique chunk identifier (embedding_id) is generated from a SHA-256 hash of the chunk.
  5. At query time, a similarity search is run in ChromaDB and the results are re-ranked with the cross-encoder nli-deberta-v3-base for relevance scoring.
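
For illustration, steps 2 to 4 amount to roughly the following. This is a simplified sketch, not the actual pipeline code: the helper names (chunk_tokens, embedding_id, build_records) are hypothetical, and a character-level stand-in replaces the tokenizer so the snippet runs with the standard library alone. In the real setup, encode and decode would presumably be the encode and decode methods of the encoding returned by tiktoken.get_encoding("cl100k_base"), and each record would then be embedded and written to ChromaDB.

```python
import hashlib
from typing import Callable, Dict, List

CHUNK_SIZE = 512  # tokens per chunk, as in step 2


def chunk_tokens(tokens: List[int], size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split a token sequence into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def embedding_id(chunk_text: str) -> str:
    """SHA-256 hex digest of the chunk text (step 4).

    Identical chunk text always produces the same id.
    """
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def build_records(
    code: str,
    meta: Dict[str, str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    size: int = CHUNK_SIZE,
) -> List[dict]:
    """Tokenize `code`, chunk it, and attach metadata plus a hashed id (steps 2-4)."""
    records = []
    for chunk_id, window in enumerate(chunk_tokens(encode(code), size)):
        text = decode(window)
        records.append(
            {
                "embedding_id": embedding_id(text),
                "document": text,
                "metadata": {**meta, "chunk_id": chunk_id},
            }
        )
    return records


if __name__ == "__main__":
    # Assumption: a toy character-level tokenizer standing in for tiktoken.
    toy_encode = lambda s: [ord(c) for c in s]
    toy_decode = lambda ids: "".join(chr(i) for i in ids)

    snippet = "def two_sum(nums, target):\n    return []\n"
    meta = {"slug": "two-sum", "language": "python", "difficulty": "Easy"}
    for rec in build_records(snippet, meta, toy_encode, toy_decode, size=16):
        print(rec["metadata"]["chunk_id"], rec["embedding_id"][:12], repr(rec["document"]))
```

Each record's document, metadata, and embedding_id are what get stored alongside the CodeBERT vector for that chunk.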

Issues I am facing :-

  1. Retrieved results are often not very relevant to the query, even for simple searches.
  2. Some highly similar code snippets get ranked lower than less relevant ones.