How to fine-tune CodeT5 for code similarity

I wanted to know what high-level steps would be required if I were interested in fine-tuning CodeT5 for a code similarity task. Ideally I'd like to provide an input (a single function) and get back an embedding that works well for this task (e.g. given two such embeddings, cosine distance is minimized if they are related and maximized otherwise).
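For concreteness, this is the kind of usage I have in mind at inference time. It's only a rough sketch: I'm assuming the CodeT5 encoder can be used on its own via `T5EncoderModel`, and that mean-pooling the encoder output is a reasonable way to get one vector per function.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
encoder = T5EncoderModel.from_pretrained("Salesforce/codet5-base")

def embed(code: str) -> torch.Tensor:
    # Tokenize a single function and mean-pool the encoder's last hidden state
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, hidden)

emb1 = embed("def add(a, b): return a + b")
emb2 = embed("def sum_two(x, y): return x + y")
print(F.cosine_similarity(emb1, emb2))  # should be high for related code after fine-tuning
```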

Input data is in the format (example1, example2, is_match).
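As a toy example of that format (the samples are made up, and the `datasets` library is used only for illustration):

```python
from datasets import Dataset

pairs = Dataset.from_dict({
    "example1": ["def add(a, b): return a + b", "def read(p): return open(p).read()"],
    "example2": ["def sum_two(x, y): return x + y", "def square(n): return n * n"],
    "is_match": [1, 0],  # 1 = related pair, 0 = unrelated pair
})
print(pairs[0])
```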

However, I am not sure exactly how to formulate this. Would I feed sentence pairs into CodeT5 and write a custom loss function with Trainer that computes cosine distance? If so, how would I later fetch single embeddings if I wanted to visualize the distribution of my code examples?
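Something like this is what I'm imagining for the Trainer route. It's only a sketch, not tested: I'm assuming the model passed in is the CodeT5 encoder (`T5EncoderModel`), that I tokenize both sides of each pair with my own collator (the keys `input_ids_1`, `attention_mask_1`, etc. are placeholders I invented), and that `CosineEmbeddingLoss` with +1/-1 labels is an acceptable objective.

```python
import torch
from transformers import Trainer

class CosineSimilarityTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # inputs is assumed to hold the tokenized pair plus a +1/-1 label
        def pool(ids, mask):
            hidden = model(input_ids=ids, attention_mask=mask).last_hidden_state
            mask = mask.unsqueeze(-1)
            return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

        emb1 = pool(inputs["input_ids_1"], inputs["attention_mask_1"])
        emb2 = pool(inputs["input_ids_2"], inputs["attention_mask_2"])
        target = inputs["labels"]  # +1 for a matching pair, -1 otherwise
        loss = torch.nn.CosineEmbeddingLoss()(emb1, emb2, target)
        return (loss, (emb1, emb2)) if return_outputs else loss
```

Even if this trains, I'm unclear on whether the single-example embedding I'd pull out afterwards (e.g. with the `embed` helper above) is really the thing being optimized here.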

Alternatively, would I build a wrapper model that produces two CodeT5 embeddings and computes the distance between them? Again, with this setup I am not sure how to get back to the single-example scenario, where I can pass in example1 and retrieve an embedding example1_emb that is close to the embedding of a similar example2. Any guidance on how to go about this?
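i.e. something along these lines for the wrapper (again just a rough sketch; `SiameseCodeT5` and the pooling choice are my own, and I'm assuming the single-example case would then just be a matter of calling `encode` on its own):

```python
import torch.nn as nn
from transformers import T5EncoderModel

class SiameseCodeT5(nn.Module):
    def __init__(self, name="Salesforce/codet5-base"):
        super().__init__()
        # Shared encoder: both sides of the pair go through the same weights
        self.encoder = T5EncoderModel.from_pretrained(name)
        self.loss_fn = nn.CosineEmbeddingLoss()

    def encode(self, input_ids, attention_mask):
        # One forward pass -> one mean-pooled embedding per example
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    def forward(self, input_ids_1, attention_mask_1,
                input_ids_2, attention_mask_2, labels):
        emb1 = self.encode(input_ids_1, attention_mask_1)
        emb2 = self.encode(input_ids_2, attention_mask_2)
        loss = self.loss_fn(emb1, emb2, labels)  # labels are +1 / -1
        return loss, emb1, emb2
```

My hope is that after training I could keep only `model.encoder` (or just call `encode`) to embed single functions for visualization, but I don't know if that is the recommended pattern or if there's a cleaner way.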