Training a model to autocomplete for a niche domain and a specific style

Hi,

I’m looking to set up an “autocomplete” writing assistant that can complete my sentences/paragraphs. Kind of like GitHub Copilot, but for my writing. I’d appreciate any help or pointers on how to go about this.

Most of my writing is for a particular domain and has to conform to a particular writing style. I have about 5,000 documents, each averaging around 1,000 tokens.

I was wondering whether fine-tuning a LoRA is the way to go, and whether the training should be unsupervised or supervised.
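
To make this concrete, here’s roughly what I’m picturing on the LoRA side (just a sketch using Hugging Face peft; the base model and hyperparameters are placeholders I haven’t settled on):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Base model is only a placeholder - I haven't actually picked one yet
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA adapter config (guessing at sensible values)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```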

Should I just feed raw text into it? But then how do I do inference to autocomplete? Just present the “incomplete” text and wait for it to generate the rest?
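
Something like this is what I imagine inference looking like (reusing the model/tokenizer from the sketch above; the sampling settings are just guesses):

```python
# Give the model the incomplete text and let it continue
prompt = "The incomplete paragraph I want the model to finish goes here"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)
# Keep only the newly generated tokens as the "autocomplete"
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(completion)
```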

I’d also like to be able to do “infilling”, where text might be missing in the middle and the model has to fill it in. If unsupervised is the way to go, how would I manage that?
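
For infilling, I’ve seen fill-in-the-middle training formats that look roughly like this (the sentinel token names are borrowed from StarCoder-style FIM; I assume they’d need to match whatever base model I end up using):

```python
def make_fim_example(text, start, end):
    """Turn one document into a fill-in-the-middle training string.
    Sentinel token names are hypothetical / model-dependent."""
    prefix, middle, suffix = text[:start], text[start:end], text[end:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

example = make_fim_example("Some document text with a gap in the middle.", 10, 25)
```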

Or would a supervised approach be better, where I create chunks of incomplete text as the instruction and the completion as the response?

If supervised is the way to go, how many instruction-completion pairs would I need for it to work? Do I need to give multiple chunks per document so the model gets what I’m trying to do, or will it be able to infer what I want if I only make one chunk per document, provided I randomise how I chunk the documents?
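
If it helps clarify what I mean by chunking, this is the kind of thing I had in mind (the 3 pairs per document and the 10–90% cut range are pure guesses, which is exactly what I’m asking about):

```python
import random

def make_pairs(doc_text, n_pairs=3):
    """Randomly split one document into (incomplete text, completion) pairs."""
    pairs = []
    for _ in range(n_pairs):
        # Cut somewhere in the middle 80% of the document
        cut = random.randint(int(len(doc_text) * 0.1), int(len(doc_text) * 0.9))
        pairs.append({
            "instruction": "Continue the following text:\n\n" + doc_text[:cut],
            "response": doc_text[cut:],
        })
    return pairs
```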

Will a model be able to pick up sufficient knowledge of the domain to actually autocomplete accurately, or would it be better to train it with RAG baked into the training samples, i.e. the RAG context is part of the “autocomplete this” instruction? There are quite a few “definitions” and “concepts” that keep repeating in my dataset - maybe a few hundred - but like I said, they repeat with more or less standard wording through most of the documents.
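
By “RAG baked into the training samples” I mean something like this (the format is entirely made up, just to illustrate the idea):

```python
def make_rag_sample(retrieved_definitions, incomplete_text, completion):
    """One training sample where retrieved definitions are part of the instruction."""
    context = "\n".join(f"- {d}" for d in retrieved_definitions)
    return {
        "instruction": (
            "Relevant definitions:\n" + context
            + "\n\nContinue the following text:\n\n" + incomplete_text
        ),
        "response": completion,
    }
```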

Also, there’s nothing confidential about my data, so does it make sense to just fine-tune gpt-3.5 rather than a local model, in terms of quality of output?
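
If I went the gpt-3.5 route, I assume the training data would end up as chat-format JSONL lines, something like this (the contents are just placeholders):

```python
import json

# One line of an OpenAI fine-tuning JSONL file (chat format)
sample = {
    "messages": [
        {"role": "system", "content": "You complete partial documents in my house style."},
        {"role": "user", "content": "Continue the following text:\n\n<incomplete chunk>"},
        {"role": "assistant", "content": "<rest of the document>"},
    ]
}
print(json.dumps(sample))
```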

Thanks for any help.