I wanted to know what high-level steps would be required if I was interested in fine-tuning codeT5 for a code similarity task. Ideally I'd like to provide an input (a single function) and get back an embedding that works well for this task (e.g., given 2 such embeddings, minimize cosine distance if they are related, otherwise maximize it).
Input data is in the format
(example1, example2, is_match).
However, I am not sure exactly how to formulate this. Would I feed sentence pairs into codeT5 and create a custom loss function with
Trainer that returns cosine distances? If so, how would I later fetch single embeddings if I wanted to visualize the distribution of my code examples?
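To make the objective concrete, this is the kind of pairwise loss I have in mind (a pure-Python sketch mirroring the behavior of `torch.nn.CosineEmbeddingLoss`; the function name and the margin value are my own choices, not anything from codeT5):

```python
import math

def cosine_embedding_loss(e1, e2, is_match, margin=0.2):
    """Loss for one embedding pair, mirroring torch.nn.CosineEmbeddingLoss.

    matching pair  -> 1 - cos(e1, e2)        (drive similarity toward 1)
    non-match pair -> max(0, cos - margin)   (drive similarity below margin)
    """
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    cos = dot / (norm1 * norm2)
    return 1.0 - cos if is_match else max(0.0, cos - margin)

print(cosine_embedding_loss([1.0, 0.0], [1.0, 0.0], True))    # identical match -> 0.0
print(cosine_embedding_loss([1.0, 0.0], [0.0, 1.0], False))   # orthogonal non-match -> 0.0
```

Presumably this is what a custom `compute_loss` in the Trainer would need to compute in batched form, but I'm unsure how the pair inputs and the single-embedding case fit together.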
Alternatively, would I use a wrapper model which produces two codeT5 embeddings and computes the distance between them? Again, with this setup I am not sure how to get back to the single-example scenario, where I can pass in
example1 alone and retrieve an embedding
example1_emb that minimizes distance to the embedding of a similar
example2. Any guidance on how to go about this?
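Roughly, the wrapper I am imagining looks like this (a minimal PyTorch sketch; `SiameseCodeEncoder`, the mean pooling, and the toy stand-in encoder are my own hypothetical placeholders for codeT5's encoder, so the snippet runs without downloading any weights):

```python
import torch
import torch.nn as nn

class SiameseCodeEncoder(nn.Module):
    """Hypothetical wrapper: `encoder` stands in for codeT5's encoder stack."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.loss_fn = nn.CosineEmbeddingLoss(margin=0.2)

    def embed(self, input_ids):
        # Mean-pool per-token states into one vector per example.
        hidden = self.encoder(input_ids)   # (batch, seq, dim)
        return hidden.mean(dim=1)          # (batch, dim)

    def forward(self, ids1, ids2, labels):
        # labels: +1 for matching pairs, -1 otherwise.
        e1, e2 = self.embed(ids1), self.embed(ids2)
        return self.loss_fn(e1, e2, labels)

# Toy stand-in encoder so the sketch runs end to end.
toy_encoder = nn.Sequential(nn.Embedding(100, 16))
model = SiameseCodeEncoder(toy_encoder)

ids1 = torch.randint(0, 100, (4, 10))
ids2 = torch.randint(0, 100, (4, 10))
labels = torch.tensor([1.0, 1.0, -1.0, -1.0])
loss = model(ids1, ids2, labels)          # pairwise training loss

single_emb = model.embed(ids1[:1])        # single-example embedding, e.g. for visualization
```

If this is the right shape, then the single-example case would just be calling `embed` on its own at inference time, but I'd like confirmation that this is the standard way to set it up.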