I wanted to know what high-level steps would be required if I was interested in fine-tuning codeT5 for a code similarity task. Ideally I'd like to provide an input (a single function) and get back an embedding that works well for this task (e.g., given 2 such embeddings, minimize cosine distance if they are related, otherwise maximize it).
Input data is in the format
(example1, example2, is_match).
However, I am not sure exactly how to formulate this. Would I feed sentence pairs into codeT5 and create a custom loss function with
Trainer that returns cosine distances? If so, how would I later fetch single embeddings if I wanted to visualize the distribution of my code examples?
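To make the objective concrete, this is the kind of pairwise loss I have in mind (a pure-Python sketch mirroring the behavior of `torch.nn.CosineEmbeddingLoss`; the function name and the margin value are my own choices, not anything from codeT5):

```python
import math

def cosine_embedding_loss(e1, e2, is_match, margin=0.2):
    """Loss for one embedding pair, mirroring torch.nn.CosineEmbeddingLoss.

    matching pair  -> 1 - cos(e1, e2)        (drive similarity toward 1)
    non-match pair -> max(0, cos - margin)   (drive similarity below margin)
    """
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    cos = dot / (norm1 * norm2)
    return 1.0 - cos if is_match else max(0.0, cos - margin)

print(cosine_embedding_loss([1.0, 0.0], [1.0, 0.0], True))    # identical match -> 0.0
print(cosine_embedding_loss([1.0, 0.0], [0.0, 1.0], False))   # orthogonal non-match -> 0.0
```

Presumably this is what a custom `compute_loss` in the Trainer would need to compute in batched form, but I'm unsure how the pair inputs and the single-embedding case fit together.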
Alternatively, would I use a wrapper model which produces two codeT5 embeddings and computes the distance between them? Again, with this setup I am not sure how to get back to the single-example scenario, where I can pass in
example1 alone and retrieve an embedding
example1_emb that minimizes distance to the embedding of a similar
example2. Any guidance on how to go about this?
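Roughly, the wrapper I am imagining looks like this (a minimal PyTorch sketch; `SiameseCodeEncoder`, the mean pooling, and the toy stand-in encoder are my own hypothetical placeholders for codeT5's encoder, so the snippet runs without downloading any weights):

```python
import torch
import torch.nn as nn

class SiameseCodeEncoder(nn.Module):
    """Hypothetical wrapper: `encoder` stands in for codeT5's encoder stack."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.loss_fn = nn.CosineEmbeddingLoss(margin=0.2)

    def embed(self, input_ids):
        # Mean-pool per-token states into one vector per example.
        hidden = self.encoder(input_ids)   # (batch, seq, dim)
        return hidden.mean(dim=1)          # (batch, dim)

    def forward(self, ids1, ids2, labels):
        # labels: +1 for matching pairs, -1 otherwise.
        e1, e2 = self.embed(ids1), self.embed(ids2)
        return self.loss_fn(e1, e2, labels)

# Toy stand-in encoder so the sketch runs end to end.
toy_encoder = nn.Sequential(nn.Embedding(100, 16))
model = SiameseCodeEncoder(toy_encoder)

ids1 = torch.randint(0, 100, (4, 10))
ids2 = torch.randint(0, 100, (4, 10))
labels = torch.tensor([1.0, 1.0, -1.0, -1.0])
loss = model(ids1, ids2, labels)          # pairwise training loss

single_emb = model.embed(ids1[:1])        # single-example embedding, e.g. for visualization
```

If this is the right shape, then the single-example case would just be calling `embed` on its own at inference time, but I'd like confirmation that this is the standard way to set it up.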