I wanted to know what high level steps would be required if I was interested in fine tuning codeT5 for a code similarity task. Ideally I’d like to provide an input (single function) and return an embedding which works well with this task (ex: given 2 such embeddings, minimize cosine distance if they are related, otherwise maximize).

Input data is in the format `(example1, example2, is_match)`.

However, I am not sure exactly how to formulate this. Would I feed sentence pairs into codeT5 and create a custom loss function with `Trainer` that returns cosine distances? If so, how would I later fetch single embeddings if I wanted to visualize the distribution of my code examples?
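To make the loss idea concrete, here is a rough sketch of what I imagine the custom loss would compute (the tensors are stand-ins; in practice the embeddings would come from codeT5, and 768 is just codeT5-base's hidden size):

```python
import torch
import torch.nn as nn

# Stand-in embeddings for a batch of 4 code pairs; in a real compute_loss
# override these would be pooled encoder outputs for example1 / example2.
emb1 = torch.randn(4, 768)
emb2 = torch.randn(4, 768)
is_match = torch.tensor([1, 0, 1, 0])  # the third field of each data tuple

# CosineEmbeddingLoss expects +1 (similar) / -1 (dissimilar) targets,
# so map the 0/1 match labels accordingly.
targets = is_match * 2 - 1
loss_fn = nn.CosineEmbeddingLoss(margin=0.5)  # margin is a tunable choice
loss = loss_fn(emb1, emb2, targets)
print(loss.item())
```

My understanding is that this could live in a `compute_loss` override of a `Trainer` subclass, but I am not sure whether that is the idiomatic way.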

Alternatively, would I use a wrapper (siamese) model which produces two codeT5 embeddings and computes the distance between them? Again, with this setup I am not sure how to get back to the single-example scenario, where I can pass in `example1` and retrieve an embedding `example1_emb` that minimizes the distance to the embedding of a similar `example2`. Any guidance on how to go about this?
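For reference, this is roughly the wrapper shape I have in mind. The encoder here is a toy stand-in (`nn.Embedding` plus mean pooling) so the sketch runs without downloading a checkpoint; in practice I would plug in codeT5's encoder (e.g. `T5EncoderModel`) with the same pooling, and all names are illustrative:

```python
import torch
import torch.nn as nn

class SiameseCodeEncoder(nn.Module):
    """One shared encoder: trained on pairs, queried with single examples."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # shared weights -> one embedding space

    def embed(self, input_ids):
        # Mean-pool token states into a single fixed-size vector per example.
        hidden = self.encoder(input_ids)   # (batch, seq_len, dim)
        return hidden.mean(dim=1)          # (batch, dim)

    def forward(self, ids1, ids2, is_match):
        e1, e2 = self.embed(ids1), self.embed(ids2)
        targets = is_match * 2 - 1         # 0/1 labels -> -1/+1 targets
        return nn.functional.cosine_embedding_loss(e1, e2, targets)

# Toy encoder so the sketch is self-contained (vocab 1000, dim 64).
model = SiameseCodeEncoder(nn.Embedding(1000, 64))

ids1 = torch.randint(0, 1000, (2, 16))     # two tokenized "functions"
ids2 = torch.randint(0, 1000, (2, 16))
labels = torch.tensor([1, 0])
loss = model(ids1, ids2, labels)

# At inference time, embed() on a single example answers the
# "retrieve example1_emb" question without a paired input.
single_emb = model.embed(ids1[:1])
print(loss.item(), single_emb.shape)
```

If this is the right direction, I would train the wrapper on the pairs and then keep only `embed()` for visualization and retrieval, but I am unsure whether this plays well with `Trainer`.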