How to fine-tune a pretrained LLM on custom code libraries?

Very, very beginner questions are okay, right?

I’m an experienced software developer, previously in Python but these days mostly in TypeScript, front end and back end. I’ve developed several of my own code libraries and use lots of packages from NPM.

I’ve never done any AI/LLM projects, but I’d like to do a personal project to get familiar.

For my TypeScript projects, I’ve tried several web-based AI chatbots for coding advice, but at best they have provided inconsistent and often contradictory hints, frequently based on outdated code they were trained on.

So I thought it would be a great starter exercise for me to take a pre-trained, publicly available LLM, fine-tune it on my own code library and the current versions of the packages I use, and create my own personal AI coding assistant/chatbot. I’m fine with Python, and I’ve got PyTorch, etc.

I get that Hugging Face provides lots of libraries (transformers, peft, etc.), and that I have to prepare the datasets, tokenize them, and implement a fine-tuning process: PEFT, LoRA, QLoRA, and so on.
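From what I’ve pieced together so far, the data-prep step looks roughly like this. This is only a minimal sketch with my own placeholder names (`load_code_samples`, `chunk_chars`), assuming a local folder of TypeScript files; a real pipeline would then tokenize these chunks with the model’s own tokenizer (e.g. `AutoTokenizer` from transformers) and pack them to a fixed sequence length:

```python
# Sketch only: walk a repo and split each source file into plain-text
# chunks, where each chunk becomes one raw training sample.
# Function and parameter names here are my own placeholders.
from pathlib import Path

def load_code_samples(root: str, exts=(".ts", ".tsx"), chunk_chars: int = 2000):
    """Collect {"text": ..., "source": ...} samples from a code tree."""
    samples = []
    for path in Path(root).rglob("*"):
        if path.suffix in exts:
            text = path.read_text(encoding="utf-8", errors="ignore")
            # Naive fixed-size chunking; an AST-aware splitter could
            # cut on function/class boundaries instead.
            for i in range(0, len(text), chunk_chars):
                samples.append({"text": text[i : i + chunk_chars],
                                "source": str(path)})
    return samples
```

Is something this naive a reasonable starting point, or is AST-aware chunking considered necessary?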

There are countless guides and tutorials online, often conflicting and focused on customizing the “chat” features. I just want to fine-tune an LLM on particular code repositories, which are by nature already structured. I’ve seen hints about using a parser generator to build an AST, tokenizing, etc., but nothing step by step.

But I would think that fine-tuning LLMs on custom code bases would be a REALLY COMMON use case. I’m hoping there are some established tools and processes I just haven’t found yet.

ANY tips or links or advice or whatever would be greatly appreciated.

Or if I’m being overambitious for a part-time personal challenge, that would be great to know as well.

Thanks for any feedback.

Paul


Hi Paul,

Do you have any ideas? I think I have the same request, but all I have is previous code files. Does anyone know whether I could just use those files as the dataset to fine-tune the model?

Thanks,
Caffery

I’d recommend this blog post: Personal Copilot: Train Your Own Coding Assistant.
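On Caffery’s question about using existing code files directly: one minimal approach, sketched here under the assumption that the files are plain UTF-8 text (the function name `files_to_jsonl` is my own, not from any library), is to dump each file as one JSON record per line. The Hugging Face `datasets` library can then load that file with `load_dataset("json", data_files="train.jsonl")`:

```python
# Sketch: turn a list of code files into a JSONL dataset, one
# {"text": ...} record per file. Helper name is illustrative only.
import json
from pathlib import Path

def files_to_jsonl(paths, out_path="train.jsonl"):
    """Write each code file as a single JSON line with a "text" field."""
    with open(out_path, "w", encoding="utf-8") as out:
        for p in paths:
            text = Path(p).read_text(encoding="utf-8", errors="ignore")
            out.write(json.dumps({"text": text}) + "\n")
    return out_path
```

Whether file-level samples are good enough depends on file sizes; very long files would need chunking to fit the model’s context length.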