Fine-tuning a code embedding model for multilingual query-code pairs

I have a dataset with queries and multiple correct code solutions in different programming languages (e.g., Python, C++, Java). Within each language, there are also multiple correct solutions. How should I group query-code pairs for fine-tuning?

Option 1: Separate by language (e.g.,
Query 1 - Python 1, Python 2
Query 1 - Java 1, Java 2
Query 1 - C++ 1, C++ 2)

Option 2: Mix all languages (e.g.,
Query 1 - Python 1, Python 2, Java 1, Java 2, C++ 1, C++ 2)

Which approach is more suitable?
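
To make the two options concrete, here is a rough sketch of the groupings (the query and solution strings are placeholders, not my real data):

```python
# Placeholder data: one query with multiple correct solutions per language.
solutions = {
    "reverse a linked list": {
        "python": ["def reverse(head): ...", "def reverse_iter(head): ..."],
        "java":   ["Node reverse(Node head) { ... }", "Node reverse2(Node head) { ... }"],
        "cpp":    ["Node* reverse(Node* head) { ... }", "Node* reverse2(Node* head) { ... }"],
    }
}

# Option 1: one positive group per (query, language).
option1 = {
    (query, lang): codes
    for query, by_lang in solutions.items()
    for lang, codes in by_lang.items()
}

# Option 2: one positive group per query, all languages pooled together.
option2 = {
    query: [code for codes in by_lang.values() for code in codes]
    for query, by_lang in solutions.items()
}
```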

Hi @palindromeRice05,
I think Option 1 looks better. However, do you have a tokenizer that handles multiple programming languages well?
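
If you go with Option 1, a minimal fine-tuning sketch with sentence-transformers might look like the following. This is only an illustration, not your actual setup: the model name, queries, and code strings are placeholders.

```python
# Sketch of Option 1 with sentence-transformers (v2-style API).
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("some-org/some-code-embedding-model")  # placeholder ID

# Option 1: the same query appears once per correct solution, and each
# group of pairs stays inside a single language.
train_examples = [
    InputExample(texts=["reverse a linked list", "def reverse(head): ..."]),      # Python 1
    InputExample(texts=["reverse a linked list", "def reverse_iter(head): ..."]), # Python 2
    InputExample(texts=["reverse a linked list", "Node reverse(Node h) {...}"]),  # Java 1
]

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
# In-batch negatives: every other pair's code in the batch acts as a negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)
```

One thing to watch with in-batch negatives: if two pairs that share the same query land in the same batch, each pair's code becomes a false negative for the other. Whichever grouping you pick, it's worth sampling batches so a query appears at most once per batch (sentence-transformers ships a no-duplicates data loader for this).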

I am using the SFR code embedding model, with the model's own tokenizer.
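
In that case, a quick sanity check may be worth running: tokenize equivalent snippets in each language and compare token counts, since a tokenizer trained mostly on one language tends to fragment the others into many more tokens. The sketch below assumes the tokenizer loads via transformers; the model ID is illustrative, so substitute the actual SFR checkpoint you use.

```python
# Rough check of how evenly the tokenizer handles each language.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Salesforce/SFR-Embedding-Code-2B_R")  # illustrative ID

snippets = {
    "python": "def add(a, b):\n    return a + b",
    "java":   "int add(int a, int b) { return a + b; }",
    "cpp":    "int add(int a, int b) { return a + b; }",
}
for lang, code in snippets.items():
    n_tokens = len(tok(code)["input_ids"])
    print(f"{lang}: {n_tokens} tokens ({n_tokens / len(code):.2f} per char)")
```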
