I wanted to know what high-level steps would be required if I was interested in fine-tuning codeT5 for a code similarity task. Ideally I'd like to provide an input (a single function) and get back an embedding that works well for this task (e.g., given 2 such embeddings, minimize cosine distance if they are related, otherwise maximize it).
Input data is in the format
(example1, example2, is_match).
However, I am not sure exactly how to formulate this. Would I feed sentence pairs into codeT5 and create a custom loss function with
Trainer that returns cosine distances? If so, how would I later fetch single embeddings if I wanted to visualize the distribution of my code examples?
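To make the objective concrete, this is the kind of pairwise loss I have in mind (a pure-Python sketch mirroring the behavior of `torch.nn.CosineEmbeddingLoss`; the function name and the margin value are my own choices, not anything from codeT5):

```python
import math

def cosine_embedding_loss(e1, e2, is_match, margin=0.2):
    """Loss for one embedding pair, mirroring torch.nn.CosineEmbeddingLoss.

    matching pair  -> 1 - cos(e1, e2)        (drive similarity toward 1)
    non-match pair -> max(0, cos - margin)   (drive similarity below margin)
    """
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    cos = dot / (norm1 * norm2)
    return 1.0 - cos if is_match else max(0.0, cos - margin)

print(cosine_embedding_loss([1.0, 0.0], [1.0, 0.0], True))    # identical match -> 0.0
print(cosine_embedding_loss([1.0, 0.0], [0.0, 1.0], False))   # orthogonal non-match -> 0.0
```

Presumably this is what a custom `compute_loss` in the Trainer would need to compute in batched form, but I'm unsure how the pair inputs and the single-embedding case fit together.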
Alternatively, would I use a wrapper model which produces two codeT5 embeddings and computes the distance between them? Again, with this setup I am not sure how to get back to the single-example scenario, where I can pass in
example1 alone and retrieve an embedding
example1_emb that minimizes distance to the embedding of a similar
example2. Any guidance on how to go about this?
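Roughly, the wrapper I am imagining looks like this (a minimal PyTorch sketch; `SiameseCodeEncoder`, the mean pooling, and the toy stand-in encoder are my own hypothetical placeholders for codeT5's encoder, so the snippet runs without downloading any weights):

```python
import torch
import torch.nn as nn

class SiameseCodeEncoder(nn.Module):
    """Hypothetical wrapper: `encoder` stands in for codeT5's encoder stack."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.loss_fn = nn.CosineEmbeddingLoss(margin=0.2)

    def embed(self, input_ids):
        # Mean-pool per-token states into one vector per example.
        hidden = self.encoder(input_ids)   # (batch, seq, dim)
        return hidden.mean(dim=1)          # (batch, dim)

    def forward(self, ids1, ids2, labels):
        # labels: +1 for matching pairs, -1 otherwise.
        e1, e2 = self.embed(ids1), self.embed(ids2)
        return self.loss_fn(e1, e2, labels)

# Toy stand-in encoder so the sketch runs end to end.
toy_encoder = nn.Sequential(nn.Embedding(100, 16))
model = SiameseCodeEncoder(toy_encoder)

ids1 = torch.randint(0, 100, (4, 10))
ids2 = torch.randint(0, 100, (4, 10))
labels = torch.tensor([1.0, 1.0, -1.0, -1.0])
loss = model(ids1, ids2, labels)          # pairwise training loss

single_emb = model.embed(ids1[:1])        # single-example embedding, e.g. for visualization
```

If this is the right shape, then the single-example case would just be calling `embed` on its own at inference time, but I'd like confirmation that this is the standard way to set it up.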