How to fine-tune a pretrained LLM on custom code libraries?

Very, very beginner questions are okay, right?

I’m an experienced software developer, previously in Python but these days mostly in TypeScript, front end and back end. I’ve developed several of my own code libraries and use lots of packages from NPM.

I’ve never done any AI/LLM projects, but I’d like to do a personal project to get familiar.

For my TypeScript projects, I’ve tried several web-based AI chatbots for coding advice, but at best they have provided inconsistent and often contradictory hints, frequently based on outdated code they were trained on.

So I thought it would be a great starter exercise for me to take a pre-trained, publicly available LLM, fine-tune it on my own code library and the current versions of the packages I use, and create my own personal AI coding assistant/chatbot. I’m fine with Python, and I’ve got PyTorch, etc.

I get that Hugging Face provides lots of libraries (transformers, peft, etc.), and that I have to prepare the datasets, tokenize them, and implement a fine-tuning process: PEFT, LoRA, QLoRA, and so on.
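From what I’ve pieced together so far, the data-prep step looks roughly like this. This is only a minimal sketch with my own placeholder names (`load_code_samples`, `chunk_chars`), assuming a local folder of TypeScript files; a real pipeline would then tokenize these chunks with the model’s own tokenizer (e.g. `AutoTokenizer` from transformers) and pack them to a fixed sequence length:

```python
# Sketch only: walk a repo and split each source file into plain-text
# chunks, where each chunk becomes one raw training sample.
# Function and parameter names here are my own placeholders.
from pathlib import Path

def load_code_samples(root: str, exts=(".ts", ".tsx"), chunk_chars: int = 2000):
    """Collect {"text": ..., "source": ...} samples from a code tree."""
    samples = []
    for path in Path(root).rglob("*"):
        if path.suffix in exts:
            text = path.read_text(encoding="utf-8", errors="ignore")
            # Naive fixed-size chunking; an AST-aware splitter could
            # cut on function/class boundaries instead.
            for i in range(0, len(text), chunk_chars):
                samples.append({"text": text[i : i + chunk_chars],
                                "source": str(path)})
    return samples
```

Is something this naive a reasonable starting point, or is AST-aware chunking considered necessary?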

There are countless guides and tutorials online, often conflicting and focused on customizing the “chat” features. I just want to fine-tune an LLM on particular code repositories, which are by nature already structured. I’ve seen hints about using a parser generator to build an AST, tokenizing, etc., but nothing step by step.

But I would think that fine-tuning LLMs on custom code bases would be a REALLY COMMON use case. I’m hoping there are some established tools and processes I just haven’t found yet.

ANY tips or links or advice or whatever would be greatly appreciated.

Or if I’m being overambitious for a part-time personal challenge, that would be great to know as well.

Thanks for any feedback.

Paul


Hi Paul,

Do you have any ideas? I think I have the same request, but all I have is previous code files. Does anyone know whether I could just use those files as the dataset to fine-tune the model?

Thanks,
Caffery

I’d recommend this blog post: Personal Copilot: Train Your Own Coding Assistant.
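On Caffery’s question about using existing code files directly: one minimal approach, sketched here under the assumption that the files are plain UTF-8 text (the function name `files_to_jsonl` is my own, not from any library), is to dump each file as one JSON record per line. The Hugging Face `datasets` library can then load that file with `load_dataset("json", data_files="train.jsonl")`:

```python
# Sketch: turn a list of code files into a JSONL dataset, one
# {"text": ...} record per file. Helper name is illustrative only.
import json
from pathlib import Path

def files_to_jsonl(paths, out_path="train.jsonl"):
    """Write each code file as a single JSON line with a "text" field."""
    with open(out_path, "w", encoding="utf-8") as out:
        for p in paths:
            text = Path(p).read_text(encoding="utf-8", errors="ignore")
            out.write(json.dumps({"text": text}) + "\n")
    return out_path
```

Whether file-level samples are good enough depends on file sizes; very long files would need chunking to fit the model’s context length.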