I’m working on a project that is 50% to solve a problem and 50% for me to learn LLMs. Let’s say I have a database of 1k widgets and each widget has 1k parts (meaning 1 million parts total). Each widget has an id, name and description. Each part has an id, its parent widget id, name and a description.
I want to create a model where the user asks for a widget and list of parts and the model spits out the id of the widget and list of ids of the parts, up to around 40 parts. (The user won’t see the ids. The ids get processed into something the user eventually sees.) I don’t need to retain any other LLM knowledge/functionality.
My understanding of the fine-tuning process (from the new course here on Hugging Face and other resources) is that I need to create a large list of question/answer pairs for fine-tuning. I was planning to use inference on an LLM to generate the questions, but I have several questions:
- I was planning to start with Qwen2.5-Coder-1.5B as I’m coding/learning and then scale up to larger versions of Qwen and/or test other base models. Is that a good approach?
- It seems like each widget/part id should be a new token, but that would add over a million tokens. Is that feasible? Is there a better way to train an LLM on a million distinct “things” without adding tokens to vocabulary?
- I was planning to use inference on some LLM to generate my question/answer pairs using the widget/part descriptions as input. Some parts are much more common than others, so I was planning to generate more questions for parts that are more commonly used. Does this approach seem reasonable?
- Initially I planned to have my question/answer pairs refer to a single part, but I’m wondering if the model will be able to create lists of (up to 40) parts if I tune like that. Would my training data need to have questions that reference multiple parts? If I start creating combinations of parts it seems like I would need quadrillions of records. Is it possible to fine-tune on a few records per part, but have the model generate lists of records?
- Should I scrap the idea of using LLMs for this and look at approaches based on semantic search?
Thanks for reading this far. I appreciate any advice.
Is that a good approach?
I think so.
It seems like each widget/part id should be a new token, but that would add over a million tokens. Is that feasible? Is there a better way to train an LLM on a million distinct “things” without adding tokens to vocabulary?
To address the challenge of handling a million unique IDs without adding each as a separate token in the LLM’s vocabulary, the following structured approach is proposed:
Solution Approach
1. Use Special Tokens for ID Boundaries: Introduce special tokens such as [ID_START] and [ID_END] to encapsulate the generated IDs within the model’s output. This helps the model understand where the IDs begin and end, allowing it to generate the correct sequence of IDs without needing to tokenize each ID individually.
2. Implement a Predictable Output Template: Create a structured template for the output, such as “Widget ID: X, Parts: Y, Z, …”. This template guides the model to generate the IDs in a consistent format, reducing the need for it to learn each ID as a separate token.
3. Leverage Subword Tokenization: Utilize a subword tokenizer (e.g., BPE) to tokenize IDs into existing tokens. For example, an ID like “12345” can be tokenized into “1”, “2”, “3”, “4”, “5”. During post-processing, these tokens can be combined to reconstruct the original ID.
4. Fine-Tune with Multiple-ID Examples: Include training examples where the model outputs multiple IDs. This helps the model learn to generate lists of IDs, allowing it to produce up to 40 parts without needing all combinations in the training data.
5. Map IDs to Embeddings: Explore mapping each ID to a specific embedding. This enables the model to recognize and generate these embeddings, potentially through additional layers or mechanisms within the model.
6. Integrate a Semantic Search Engine: Combine the LLM with a semantic search engine for efficient ID retrieval. The LLM handles natural language understanding, while the search engine retrieves relevant IDs, creating a scalable solution.
7. Optimize Model Size and Performance: By managing ID representation without adding tokens, the model’s size and inference speed stay manageable, which is crucial for deployment.
8. Use Custom Evaluation Metrics: Develop custom metrics or scripts to evaluate the model’s ability to generate correct ID sequences, ensuring accurate assessment without relying on token-level evaluation.
Conclusion
The proposed solution efficiently handles the challenge of managing a million unique IDs without bloating the LLM’s vocabulary. By using special tokens, structured templates, and integrating external retrieval mechanisms, the model can generate accurate ID outputs while maintaining manageability and performance.
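To make points 1–3 concrete, here is a minimal sketch of formatting a training target with boundary markers and reconstructing whole IDs from generated text. The [ID_START]/[ID_END] markers and the template come from the answer above; the function names, the regex, and the sample IDs are illustrative assumptions:

```python
import re

# Point 2: a predictable output template the model is fine-tuned to emit.
# Point 1: each ID is wrapped in boundary markers, so no ID needs its own token.
def format_target(widget_id, part_ids):
    parts = ", ".join(f"[ID_START]{p}[ID_END]" for p in part_ids)
    return f"Widget: [ID_START]{widget_id}[ID_END] Parts: {parts}"

# Point 3: the subword tokenizer splits an ID like "p1001" into existing
# tokens during training; after generation, a regex recovers whole IDs.
ID_PATTERN = re.compile(r"\[ID_START\](.*?)\[ID_END\]")

def extract_ids(generated_text):
    return ID_PATTERN.findall(generated_text)

target = format_target("w42", ["p1001", "p1002"])
print(extract_ids(target))  # ['w42', 'p1001', 'p1002']
```

In practice you would register [ID_START] and [ID_END] as additional special tokens on the tokenizer (and resize the model's embeddings) so they are never split, which adds only two tokens rather than a million.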
Wow, this is very detailed and seems like exactly what I need. Thanks so much!
Item 6 is particularly interesting, though I’m not sure offhand how that would work. But I will think it through. Did you imagine the semantic search going first and passing the IDs to the LLM as context, or the LLM going first, with the semantic search engine using the words generated by the LLM to search for IDs?
I turned on the search function (the purple globe icon) and asked the assistant, and it answered. It’s quite useful for searching.
I’m not sure this is an LLM problem. Well, maybe an LLM helps turn the English-language question into SQL to find the parts, but once you have the part, its subparts don’t need an LLM — just SQL.
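To illustrate the plain-SQL side of this: once a widget ID is known, pulling its parts is an ordinary query with no LLM involved. A minimal sqlite3 sketch, where the schema and sample rows are hypothetical but follow the fields described in the original post (id, widget id, name, description):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE widgets (id TEXT PRIMARY KEY, name TEXT, description TEXT);
CREATE TABLE parts   (id TEXT PRIMARY KEY,
                      widget_id TEXT REFERENCES widgets(id),
                      name TEXT, description TEXT);
INSERT INTO widgets VALUES ('w1', 'Widget One', 'A sample widget');
INSERT INTO parts VALUES ('p1', 'w1', 'Bolt', 'A small bolt'),
                         ('p2', 'w1', 'Nut', 'A matching nut');
""")

# Once the widget has been resolved (by an LLM, search, or otherwise),
# listing up to 40 of its parts is just SQL.
rows = conn.execute(
    "SELECT id FROM parts WHERE widget_id = ? LIMIT 40", ("w1",)
).fetchall()
print([r[0] for r in rows])  # ['p1', 'p2']
```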
Regardless, thanks for the help. I suspected that AI was used because of how the answer was structured. After I read your answer I pasted my question into Grok and the answer was similar in structure to yours, but it wasn’t useful at all.
I’m also not sure if this is an LLM problem. It seems like a semantic search problem, but I’m not sure how semantic search will deal with a list. I may just need to do trial and error. Since this is a learning opportunity that may not be a bad thing.
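One way semantic search can "deal with a list": embed every part description once, rank all parts against the query, and return the top k (up to 40) IDs. A dependency-free sketch, where the bag-of-words cosine similarity is only a stand-in for a real embedding model and the part catalog is hypothetical:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real sentence embedding: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical part catalog: part id -> description.
parts = {
    "p1": "stainless steel hex bolt",
    "p2": "rubber gasket seal",
    "p3": "steel hex nut",
}

def search_parts(query, k=40):
    # Rank every part against the query and keep the k best matches.
    q = embed(query)
    ranked = sorted(parts, key=lambda pid: cosine(q, embed(parts[pid])),
                    reverse=True)
    return ranked[:k]

print(search_parts("hex bolt and nut", k=2))  # ['p3', 'p1']
```

With a real embedding model the structure is identical — only `embed` changes — and a score threshold can trim the list when fewer than 40 parts are actually relevant.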
The second half is an AI-generated summary of the search results.
Take a look at DataChat.ai and https://corraldata.com/
I don’t know whether you’re looking to solve a problem or want a pet project to learn from, but these tools are custom-designed to answer data questions.