Fine-tuning a language model on domain specific embeddings

Palaash · May 4, 2023, 5:29pm

Hi,
I’m trying to fine-tune an LLM for a regression problem. My dataset consists of DNA sequences and a corresponding label value. I have another model that takes a DNA sequence and generates an embedding for it. Each sample is formed by joining multiple DNA sequences one after the other: a sample is a set of DNA sequences with one corresponding label value, and I use my embedding model to generate one embedding per DNA sequence. So, for sample 1, I have 10 DNA sequences and their 10 corresponding embeddings. I want to feed these embeddings to an LLM and fine-tune the LLM on them instead of on the sequences directly. How should I go about solving this problem?

I know transformers use a tokenizer to convert words to IDs and then look up their embeddings, but can’t I feed embeddings (which won’t be trained further) directly into the transformer model, skipping the tokenizer?

cparish · November 21, 2023, 11:15pm

Hey @Palaash, I have a similar interest. Any luck achieving this direct feeding of embeddings?
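
For anyone landing on this thread: a minimal sketch of the idea, not a confirmed solution from the original poster. Most Hugging Face transformers models accept an `inputs_embeds` argument in place of `input_ids`, which skips the tokenizer and the token embedding lookup. The sketch below assumes a BERT-style encoder as a stand-in for the LLM, a linear layer to map the external embedding size to the model's hidden size, and masked mean pooling before a one-output regression head. The class name `EmbeddingRegressor`, the sizes (32-dimensional DNA embeddings, 10 sequences per sample, hidden size 64), and the random tensors are all hypothetical placeholders. The backbone is randomly initialized here so the snippet runs without a download; for actual fine-tuning you would load pretrained weights with `from_pretrained` instead.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel


class EmbeddingRegressor(nn.Module):
    """Hypothetical wrapper: regression over a sequence of precomputed embeddings."""

    def __init__(self, embed_dim: int, backbone: BertModel):
        super().__init__()
        hidden = backbone.config.hidden_size
        # Assumption: a linear map from the external embedding size to the hidden size.
        self.project = nn.Linear(embed_dim, hidden)
        self.backbone = backbone
        self.head = nn.Linear(hidden, 1)

    def forward(self, embeds, attention_mask):
        # embeds: (batch, n_sequences, embed_dim); attention_mask: (batch, n_sequences)
        x = self.project(embeds)
        # inputs_embeds bypasses the tokenizer and the word embedding lookup.
        out = self.backbone(inputs_embeds=x, attention_mask=attention_mask)
        hidden = out.last_hidden_state
        # Masked mean pooling over the sequence positions (an assumed design choice).
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.head(pooled).squeeze(-1)


torch.manual_seed(0)

# Tiny random-init config so the example runs offline; placeholder sizes only.
config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
)
backbone = BertModel(config)
model = EmbeddingRegressor(embed_dim=32, backbone=backbone)

# Stand-ins for the frozen DNA embeddings: 4 samples, 10 sequences each, 32 dims.
embeds = torch.randn(4, 10, 32)
attention_mask = torch.ones(4, 10, dtype=torch.long)
labels = torch.randn(4)

# One training step with an MSE regression loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
preds = model(embeds, attention_mask)
loss = nn.functional.mse_loss(preds, labels)
loss.backward()
optimizer.step()
```

Because `embeds` is a plain tensor that does not require gradients, the external embedding model is not trained further; only the projection, the backbone, and the head are updated. Samples with fewer than 10 sequences could be zero-padded and marked with 0 in `attention_mask`.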
