Training a model to autocomplete for a niche domain and a specific style

Hi,

I’m looking to set up an “autocomplete” writing assistant that can complete my sentences/paragraphs. Kind of like GitHub Copilot, but for my writing. I’d appreciate any help or pointers on how to go about this.

Most of my writing is for a particular domain and has to conform to a particular writing style. I have about 5,000 documents, each averaging around 1,000 tokens.

I was wondering whether fine-tuning a LoRA is the way to go, and whether the training should be unsupervised or supervised.
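
To make this concrete, here’s roughly what I’m picturing on the LoRA side (just a sketch using Hugging Face peft; the base model and hyperparameters are placeholders I haven’t settled on):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Base model is only a placeholder - I haven't actually picked one yet
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA adapter config (guessing at sensible values)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```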

Should I just feed raw text into it? But then how do I do inference to autocomplete? Just present the “incomplete” text and wait for it to generate the rest?
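
Something like this is what I imagine inference looking like (reusing the model/tokenizer from the sketch above; the sampling settings are just guesses):

```python
# Give the model the incomplete text and let it continue
prompt = "The incomplete paragraph I want the model to finish goes here"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
# Keep only the newly generated tokens as the "autocomplete"
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(completion)
```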

I’d also like to be able to do “infilling”, where text might be missing in the middle and the model has to fill it in. If unsupervised is the way to go, how would I manage that?
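
For infilling, I’ve seen fill-in-the-middle training formats that look roughly like this (the sentinel token names are borrowed from StarCoder-style FIM; I assume they’d need to match whatever base model I end up using):

```python
def make_fim_example(text, start, end):
    """Turn one document into a fill-in-the-middle training string.
    Sentinel token names are hypothetical / model-dependent."""
    prefix, middle, suffix = text[:start], text[start:end], text[end:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

example = make_fim_example("Some document text with a gap in the middle.", 10, 25)
```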

Or would a supervised approach be better, where I create chunks of incomplete text as the instruction and the completion as the response?

If supervised is the way to go, how many instruction-completion pairs would I need for it to work? Do I need to give multiple chunks per document so the model gets what I’m trying to do, or will it be able to infer what I want if I only make one chunk per document, provided I randomise how I chunk the documents?
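
If it helps clarify what I mean by chunking, this is the kind of thing I had in mind (the 3 pairs per document and the 10–90% cut range are pure guesses, which is exactly what I’m asking about):

```python
import random

def make_pairs(doc_text, n_pairs=3):
    """Randomly split one document into (incomplete text, completion) pairs."""
    pairs = []
    for _ in range(n_pairs):
        # Cut somewhere in the middle 80% of the document
        cut = random.randint(int(len(doc_text) * 0.1), int(len(doc_text) * 0.9))
        pairs.append({
            "instruction": "Continue the following text:\n\n" + doc_text[:cut],
            "response": doc_text[cut:],
        })
    return pairs
```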

Will a model be able to pick up sufficient knowledge of the domain to actually autocomplete accurately, or would it be better to train it with RAG baked into the training samples, i.e. the RAG context is part of the “autocomplete this” instruction? There are quite a few “definitions” and “concepts” that keep repeating in my dataset - maybe a few hundred - but like I said, they repeat with more or less standard wording through most of the documents.
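
By “RAG baked into the training samples” I mean something like this (the format is entirely made up, just to illustrate the idea):

```python
def make_rag_sample(retrieved_definitions, incomplete_text, completion):
    """One training sample where retrieved definitions are part of the instruction."""
    context = "\n".join(f"- {d}" for d in retrieved_definitions)
    return {
        "instruction": (
            "Relevant definitions:\n" + context
            + "\n\nContinue the following text:\n\n" + incomplete_text
        ),
        "response": completion,
    }
```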

Also, there’s nothing confidential about my data, so does it make sense to just fine-tune gpt-3.5 rather than a local model, in terms of quality of output?
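
If I went the gpt-3.5 route, I assume the training data would end up as chat-format JSONL lines, something like this (the contents are just placeholders):

```python
import json

# One line of an OpenAI fine-tuning JSONL file (chat format)
sample = {
    "messages": [
        {"role": "system", "content": "You complete partial documents in my house style."},
        {"role": "user", "content": "Continue the following text:\n\n<incomplete chunk>"},
        {"role": "assistant", "content": "<rest of the document>"},
    ]
}
print(json.dumps(sample))
```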

Thanks for any help.