Hi everyone,
I am new to AI models and am building a simple code search engine on top of the pre-trained CodeBERT model. I generate embeddings for code functions with the following implementation:
from transformers import RobertaTokenizer, RobertaModel
import torch

# Load the tokenizer and model once at module level so they are not
# re-initialized on every call
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base", add_pooling_layer=False)
model.eval()  # inference mode (disables dropout)

def generate_embedding(code):
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512, padding="max_length")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over all 512 token positions (padding tokens included,
    # since padding="max_length")
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embedding.tolist()
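For reference, a minimal usage example (the snippet and query strings are just placeholders); codebert-base produces 768-dimensional vectors:

code_vec = generate_embedding("def add(a, b):\n    return a + b")
query_vec = generate_embedding("a function that adds two numbers")
print(len(code_vec))  # 768 (hidden size of codebert-base)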
Workflow:

- Code Preprocessing: the input code is lowercased and newline characters are removed.
- Embedding Storage: I generate embeddings for the code snippets with the function above and store them in Elasticsearch (see the indexing sketch after this list).
- Query Handling:
  - The user's natural language query is converted to an embedding with the same function.
  - I compute cosine similarity between the query embedding and the code embeddings stored in Elasticsearch (see the search sketch after this list).
  - The top 5 most similar code snippets are returned.
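To make the setup concrete, here is a minimal sketch of the indexing step. I am assuming the official elasticsearch 8.x Python client with a dense_vector mapping; the index name "code-search" and field name "embedding" are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# dense_vector mapping sized to CodeBERT's 768-dimensional output
es.indices.create(
    index="code-search",
    mappings={
        "properties": {
            "code": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 768},
        }
    },
)

snippet = "def add(a, b):\n    return a + b"
es.index(
    index="code-search",
    document={"code": snippet, "embedding": generate_embedding(snippet)},
)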
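And this is roughly how the search step looks (again a sketch under the same assumptions): a script_score query with Elasticsearch's built-in cosineSimilarity function, where the + 1.0 only shifts scores to be non-negative:

query_vector = generate_embedding("read a file line by line")

response = es.search(
    index="code-search",
    query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
    size=5,  # return the top 5 most similar snippets
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["code"][:80])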
Problem:
The search results appear random and irrelevant to the input query. I suspect there might be an issue with how I process the inputs or compute the embeddings.
Questions:

- Should the natural language query be processed differently from the code snippets?
- Are there recommended ways to align code and natural language query embeddings using CodeBERT?
- Is my Python code correct for what I am trying to do, or are there improvements I should make?
Any advice, corrections, or suggestions would be greatly appreciated! I am still learning and would love to understand what I might be doing wrong.
Thank you!