Hi everyone,
I am new to AI models and am building a simple code search engine on top of the pre-trained CodeBERT model. I generate embeddings for code functions with the following implementation:
from transformers import RobertaTokenizer, RobertaModel
import torch

# Load the tokenizer and model once at module level so they are not
# re-initialized on every call
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base", add_pooling_layer=False)
model.eval()  # inference mode (disables dropout)

def generate_embedding(code):
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512, padding="max_length")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool over all 512 token positions (padding tokens included,
    # since padding="max_length")
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embedding.tolist()
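For reference, a minimal usage example (the snippet and query strings are just placeholders); codebert-base produces 768-dimensional vectors:

code_vec = generate_embedding("def add(a, b):\n    return a + b")
query_vec = generate_embedding("a function that adds two numbers")
print(len(code_vec))  # 768 (hidden size of codebert-base)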
Workflow:

- Code Preprocessing: the input code is lowercased and newline characters are removed.
- Embedding Storage: I generate embeddings for the code snippets with the function above and store them in Elasticsearch (see the indexing sketch after this list).
- Query Handling:
  - The user's natural language query is converted to an embedding with the same function.
  - I compute cosine similarity between the query embedding and the code embeddings stored in Elasticsearch (see the search sketch after this list).
  - The top 5 most similar code snippets are returned.
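To make the setup concrete, here is a minimal sketch of the indexing step. I am assuming the official elasticsearch 8.x Python client with a dense_vector mapping; the index name "code-search" and field name "embedding" are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# dense_vector mapping sized to CodeBERT's 768-dimensional output
es.indices.create(
    index="code-search",
    mappings={
        "properties": {
            "code": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 768},
        }
    },
)

snippet = "def add(a, b):\n    return a + b"
es.index(
    index="code-search",
    document={"code": snippet, "embedding": generate_embedding(snippet)},
)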
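And this is roughly how the search step looks (again a sketch under the same assumptions): a script_score query with Elasticsearch's built-in cosineSimilarity function, where the + 1.0 only shifts scores to be non-negative:

query_vector = generate_embedding("read a file line by line")

response = es.search(
    index="code-search",
    query={
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
    size=5,  # return the top 5 most similar snippets
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["code"][:80])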
Problem:
The search results appear random and irrelevant to the input query. I suspect there might be an issue with how I process the inputs or compute the embeddings.
Questions:

- Should the natural language query be processed differently from the code snippets?
- Are there recommended ways to align code and natural language query embeddings using CodeBERT?
- Is my Python code correct for what I am trying to do, or are there improvements I should make?
Any advice, corrections, or suggestions would be greatly appreciated! I am still learning and would love to understand what I might be doing wrong.
Thank you!