As captioned. I want to use an LLM to learn our company's legacy system and program logic, so that we can use the LLM to generate code that follows our existing DB and program design. What is the most common/best approach? And which model is good for this purpose?
Thanks in advance.
I think there are two approaches: one is to train a single large LLM to memorize the data, and the other is to train a reasonably capable LLM to some extent and then link it to a database using a RAG-like approach.
In this case, I think it is fine to choose a base LLM that is good at coding and reasoning. Some examples follow.
If reliability is required, I think RAG is probably better.
Examples of RAG
LLMs good for coding
Thanks for the information @John6666. In our scenario, we have thousands of SPs + tables and hundreds of frontend and backend programs. Would fine-tuning an open-source model give better results? I am afraid it is hard to provide enough data for training, though. May I have your opinion? Many thanks in advance.
It is said that the number of parameters in the early ChatGPT was around 1000B. I think it would probably be reckless to try to teach all of that data to an open-source model (around 130M to 72B parameters) that can be trained for a realistic amount of money…
In that case, if accuracy is important, it would be better to take the RAG approach.
If you are looking for something to do the coding for you, there may be some merit in completing everything within a single LLM, but it would probably be more efficient to have the LLM act as a smart librarian. I think the HF course and Smolagents are useful resources for learning about the concepts and current state of RAG and agent systems.
I don't think there are any open examples of RAG for legacy systems that can be applied directly, but I'll list a few that might be adaptable. The cookbook in particular contains a wide variety of concrete examples, so it's worth skimming through.
RAG is essentially just a program, a kind of batch process, that uses an LM or LLM as building blocks, so depending on your ideas you can come up with countless combinations, which makes the overall picture hard to pin down. I think it's a good idea to first decide what kind of overall structure you want. It's often quickest to find something in Spaces with a structure similar to what you want and then look at its source.
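To make the "smart librarian" idea concrete, here is a minimal RAG sketch in Python. The documents, the model choice, and the `ask_llm` placeholder are all illustrative, not recommendations:

```python
from sentence_transformers import SentenceTransformer, util

docs = [  # stand-ins for your real SP / schema / program chunks
    "CREATE TABLE orders (id INT PRIMARY KEY, customer_id INT, total DECIMAL(10,2))",
    "sp_get_customer_orders: returns all orders for a given customer_id",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Embed the question and pull back the k most similar chunks
    q_emb = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_emb, top_k=k)[0]
    return [docs[h["corpus_id"]] for h in hits]

question = "Write a stored procedure that lists a customer's orders"
prompt = "Context:\n" + "\n".join(retrieve(question)) + f"\n\nTask: {question}"
# answer = ask_llm(prompt)  # placeholder: plug in whatever coding LLM you settle on
```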
Thanks a lot. Let me have a look.
Since I have limited documentation for our system, I want to embed the source code and DB schema into a vector DB. May I know which embedding model is good for code? And what is the best practice for chunking source code while keeping its semantics? I am afraid a source file with a few thousand lines of code cannot fit into one embedding. Thanks in advance.
I have very little knowledge of embedding models themselves, but there are leaderboards ranked by benchmark, so I think it's a good idea to choose a model based on those. There are various leaderboards.
The following is general advice I got from Hugging Chat; for information on specific models, the leaderboards should be more reliable.
by HuggingChat
Embedding source code and database schemas into a vector database is a powerful way to leverage semantic search and metadata extraction. Here’s how you can approach this task using Hugging Face models and best practices:
1. Choosing an Embedding Model for Code
- Code-Specific Models: For embedding source code, models like CodeBERT, GraphCodeBERT, or CodeT5 are well-suited because they are trained on large codebases and understand syntactic and semantic structures [1]. These models are designed to handle programming languages and can capture the context of code effectively.
- General-Purpose Models: If code-specific models are not available, you can use general-purpose models like Sentence Transformers (e.g., `all-MiniLM-L6-v2`) or BPEmb. However, they may not capture the nuances of code as well as specialized models.
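If you want to try a code-specific model, here is a rough sketch of pooled embeddings from CodeBERT (`microsoft/codebert-base`). Mean pooling over the last hidden state is one common choice, not the only one; benchmark against your own retrieval task:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code: str) -> torch.Tensor:
    inputs = tok(code, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state    # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)  # zero out padding positions
    return (out * mask).sum(1) / mask.sum(1)       # mean over real tokens

vec = embed("def add(a, b):\n    return a + b")
print(vec.shape)  # torch.Size([1, 768])
```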
2. Best Practices for Chunking Source Code
Chunking large source code files is essential because embedding models have input size limits (e.g., all-MiniLM-L6-v2 truncates input at 256 word-piece tokens). Here are some strategies:
- Chunk by Logical Units: Split the code into logical units like functions, methods, or classes (see the sketch after this list). For example:
  - Each function or method is one chunk.
  - Each class is one chunk.
  - Break large functions into smaller, semantically meaningful parts.
- Overlap Chunks: To avoid losing context between chunks, overlap adjacent code blocks by a few lines. For example, if you split the code into 500-line chunks, the next chunk could start 250 lines before the end of the previous one (a 50% overlap).
- Preserve Semantics: Avoid splitting code at critical points like variable declarations, function calls, or conditionals, so that each chunk retains its semantic meaning.
- Chunk Size: A good starting point is 300-500 lines of code per chunk, depending on the complexity of the codebase; adjust based on experimentation.
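As a sketch of the "logical units" strategy, for Python sources the standard-library `ast` module can split a file at function and class boundaries; for other languages (SQL, COBOL, C#, …) you would swap in a language-aware parser such as tree-sitter. The file name here is hypothetical:

```python
import ast

# Split a Python file into top-level function/class chunks with line ranges.
def chunk_python_source(source: str) -> list[dict]:
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": ast.get_source_segment(source, node),
            })
    return chunks

with open("legacy_module.py") as f:  # hypothetical file name
    for c in chunk_python_source(f.read()):
        print(c["name"], c["start_line"], c["end_line"])
```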
3. Storing Embeddings in a Vector Database
Once you have the embeddings, you can store them in a vector database like ChromaDB or FAISS. Here’s how:
- Using ChromaDB with Hugging Face: ChromaDB supports embedding functions built on Hugging Face models. You can specify the embedding function when creating a collection, for example:

```python
import chromadb
from chromadb.utils import embedding_functions

# Wrap a Sentence Transformers model as a Chroma embedding function
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
client = chromadb.Client()
collection = client.create_collection(name="code_chunks", embedding_function=ef)
```
- Storing Embeddings: For each chunk of code, generate its embedding and store it in the vector database. Include metadata (e.g., file name, line numbers, function names) to improve search accuracy.
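A sketch of what that could look like, continuing from the collection and the hypothetical `chunk_python_source` helper above (the ids and metadata fields are illustrative):

```python
# Store each chunk together with metadata that helps locate it later.
source = open("legacy_module.py").read()  # hypothetical file name
chunks = chunk_python_source(source)      # helper from the chunking sketch
collection.add(
    documents=[c["text"] for c in chunks],
    metadatas=[{"file": "legacy_module.py", "function": c["name"],
                "start_line": c["start_line"], "end_line": c["end_line"]}
               for c in chunks],
    ids=[f"legacy_module.py::{c['name']}" for c in chunks],
)
```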
4. Retrieval and Semantic Search
- To retrieve relevant code chunks, you can query the vector database with a code snippet or a natural language query. The database returns the most semantically similar chunks, ranked by vector similarity (cosine similarity is a common choice).
- For example, querying with a specific function name or code pattern will return related code chunks from your database.
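For instance, with the Chroma collection above (the query text is just an illustration):

```python
# Chroma embeds the query with the collection's embedding function and
# returns the nearest chunks along with their stored metadata.
results = collection.query(
    query_texts=["where is the order total calculated?"],
    n_results=5,
)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(meta["file"], meta["function"])
```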
5. Best Practices for Preprocessing
- Remove Noise: Strip out unnecessary comments, whitespace, and boilerplate code from the source files before embedding (see the sketch after this list).
- Format Consistently: Ensure that the code is formatted consistently (e.g., using a code formatter like `black`) to improve embedding consistency.
- Tokenization: Use programming language-specific tokenizers to preprocess the code into tokens before embedding. This can improve the quality of the embeddings.
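As a small, Python-only sketch of the noise-removal step (other languages need their own tooling; whether to also drop docstrings is a judgment call, since they often help retrieval):

```python
import io
import tokenize

# Remove comment tokens before embedding; untokenize rebuilds the source.
def strip_comments(source: str) -> str:
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [t for t in tokens if t.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)

print(strip_comments("total = price * qty  # legacy magic number\n"))
```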
6. Tools and Libraries
- Hugging Face Transformers: For generating embeddings (e.g., `sentence-transformers`, CodeBERT).
- LangChain: A framework for building LLM applications that supports vector database integration.
- ChromaDB: A lightweight vector database (usable in-memory or persisted) for storing embeddings.
By following these steps, you can effectively embed your source code and database schemas into a vector database while maintaining their semantic meaning. If you need further assistance, let me know!