Help with CodeBERT-based Code Search - Random Results Issue

Hi everyone,
I am new to using AI models and am building a simple code search engine using the pre-trained CodeBERT model. I generate embeddings for code functions using the following implementation:

from transformers import RobertaTokenizer, RobertaModel
import torch

# Load the tokenizer and model once, rather than on every call
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base", add_pooling_layer=False)


def generate_embedding(code):
    inputs = tokenizer(code, return_tensors="pt", truncation=True,
                       max_length=512, padding="max_length")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings into a single fixed-size vector
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embedding.tolist()

Workflow:

  1. Code Preprocessing:
    Input code is lowercased and newline characters are removed.

  2. Embedding Storage: I generate embeddings for code snippets using the above function and store them in Elasticsearch.

  3. Query Handling:

  • The user’s natural language query is converted to embeddings using the same function.

  • I compute cosine similarity between the query embedding and code embeddings stored in Elasticsearch.

  • The top 5 most similar code snippets are returned.
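The similarity step can be sketched locally with NumPy (here `code_embeddings` and `query_embedding` are hypothetical placeholders standing in for the vectors stored in and retrieved from Elasticsearch):

```python
import numpy as np

def top_k_similar(query_embedding, code_embeddings, k=5):
    """Return the indices of the k code embeddings most cosine-similar to the query."""
    q = np.asarray(query_embedding, dtype=float)
    m = np.asarray(code_embeddings, dtype=float)
    # Cosine similarity = dot product of L2-normalized vectors
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(scores)[::-1][:k].tolist()

# Toy example with 3-dimensional vectors
embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(top_k_similar([1.0, 0.0, 0.0], embs, k=2))  # → [0, 2]
```

If the ranking looks sensible on a small hand-built set like this but not in production, the problem is more likely in the embeddings themselves than in the similarity computation.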

Problem:

The search results appear random and irrelevant to the input query. I suspect there might be an issue with how I process the inputs or compute the embeddings.

Questions:

  1. Should the input query (in natural language) be processed differently compared to code snippets?

  2. Are there any recommended ways to align embeddings for code and natural language queries using CodeBERT?

  3. Is my Python code correct for the problem I am trying to solve, or are there improvements I should make?

Any advice, corrections, or suggestions would be greatly appreciated! I am still learning and would love to understand what I might be doing wrong.

Thank you!


After a deep dive into the CodeBERT repository, I’ve discovered how Microsoft’s code search logic works.

In the codesearch directory, there is a binary classifier that takes as input a vector where:

  • The first part of the vector represents the text (e.g., the user’s query).
  • The second part represents the source code.

The classifier then outputs a True or False value, indicating whether the given code snippet is consistent with the query (e.g., a comment or description).
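For reference, that classifier consumes the query and the code as one paired sequence rather than as two separate embeddings. A minimal sketch of the RoBERTa-style pair format (`<s> query </s></s> code </s>`) is below; the exact construction in run_classifier.py may differ in detail:

```python
def build_pair_input(query, code):
    """Concatenate a natural-language query and a code snippet into one
    RoBERTa-style sequence: <s> query </s></s> code </s>."""
    return f"<s> {query} </s></s> {code} </s>"

print(build_pair_input("reverse a list", "def rev(xs): return xs[::-1]"))
# → <s> reverse a list </s></s> def rev(xs): return xs[::-1] </s>
```

In practice you would let the tokenizer build this pair encoding for you by passing both strings, e.g. `tokenizer(query, code, ...)`, instead of assembling the special tokens by hand.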

I also found that when the classifier is initialized the same way as in the repository's run_classifier.py script:

from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base")

the classifier behaves differently each time it is initialized. This happens because the classification head on top of the pre-trained base model is initialized with random weights, so it requires fine-tuning before producing meaningful results.
