Fine-tuning a language model on domain-specific embeddings

I’m trying to fine-tune an LLM for a regression problem. Each sample in my dataset consists of several DNA sequences (e.g., 10 for sample 1) and a single label value. I have a separate model that takes a DNA sequence and generates an embedding for it, so for each sample I can produce one embedding per DNA sequence and join them one after the other into a sequence of embeddings. I want to feed this sequence of embeddings to an LLM and fine-tune the LLM on the embeddings rather than on the raw DNA sequences. How should I go about solving this problem?

  1. I know transformers use a tokenizer to convert words into IDs and then into embeddings, but can’t I feed my precomputed embeddings (which won’t be trained further) directly into the transformer model, bypassing the tokenizer entirely?
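Yes, this is possible: most models in Hugging Face `transformers` accept an `inputs_embeds` argument in place of `input_ids`, which skips the tokenizer and embedding lookup. As a minimal sketch of the overall idea, here is a plain-PyTorch regressor over a batch of precomputed per-sequence embeddings; all names (`EmbeddingRegressor`, `emb_dim`, the pooling choice, and the dimensions) are illustrative assumptions, not from any existing library:

```python
import torch
import torch.nn as nn

class EmbeddingRegressor(nn.Module):
    """Transformer over precomputed DNA-sequence embeddings + regression head."""
    def __init__(self, emb_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=emb_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(emb_dim, 1)  # scalar regression target

    def forward(self, embeds):
        # embeds: (batch, n_sequences, emb_dim) -- precomputed, frozen embeddings
        h = self.encoder(embeds)           # contextualize the sequence embeddings
        pooled = h.mean(dim=1)             # mean-pool over the sequence axis
        return self.head(pooled).squeeze(-1)  # (batch,)

model = EmbeddingRegressor()
batch = torch.randn(4, 10, 128)  # 4 samples x 10 embeddings of size 128
preds = model(batch)
loss = nn.functional.mse_loss(preds, torch.randn(4))
loss.backward()  # gradients flow into the transformer, not the frozen embeddings
```

With a pretrained LLM the analogous move is `model(inputs_embeds=my_embeddings, ...)`, usually with a small linear projection first so your embedding dimension matches the model's hidden size.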

Hey @Palaash , I have a similar interest. Any luck achieving this direct feeding of embeddings?