A criticism of instruction fine-tuning datasets

ChatGPT has taken the world by storm, and will go down in history as one of the most important showpieces in the development of AI. However, it has created an unhealthy obsession with chat bots that is hindering the true potential of open-source language models. Allow me to clarify.

A fun demonstration of the abilities of chat bots is to ask them questions about their opinions. Within many instruction fine-tuning datasets there are questions that rely on the LLM’s general knowledge. An example from Databricks Dolly-15k is “Why can camels survive for long without water?” Within the context of fine-tuning, what does this teach the language model?

What value does this kind of instruction provide for the language model? For business applications you need instructions like “generate a title based on [keywords, extracted phrases, full text]” or “given this data, [summarise it, write something from it, convert it to some form]”.
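To make that concrete, here is a made-up example of the kind of task-grounded record I mean (the field names and content are purely illustrative, not taken from any particular dataset):

```python
# A hypothetical task-grounded training record: everything needed to produce
# the response is in the provided context, not in the model's world knowledge.
example = {
    "instruction": "Generate a title based on the extracted phrases.",
    "context": "quarterly revenue up 12%; new warehouse opened in Leeds; hiring freeze lifted",
    "response": "Revenue Growth and Expansion Mark a Strong Quarter",
}
```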

We really need to distinguish between chat bot behaviour (requiring broad general knowledge) and language models for business applications (practical tasks based on the information provided). Both are useful in their own context, but businesses do not need to ask a chat bot for its opinions; they need their workloads reduced.


Strongly agree. For what it’s worth, I’ve been using the Dolly-15k dataset in a heavily filtered manner (mixed with other datasets). If you filter by task type, the examples become less about opinions and more about performing a task. But still, the quality is mediocre at best.
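For reference, the filtering is roughly this, assuming the Hugging Face copy of the dataset and its `category` field (a sketch, not my exact pipeline):

```python
from datasets import load_dataset

# Each Dolly-15k record has "instruction", "context", "response" and a "category" label.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Keep the task-like categories; drop the opinion / general-knowledge ones.
TASK_CATEGORIES = {"summarization", "information_extraction", "closed_qa", "classification"}
task_only = dolly.filter(lambda ex: ex["category"] in TASK_CATEGORIES)

print(f"kept {len(task_only)} of {len(dolly)} examples")
```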

I would love to see more high quality instruction datasets where all the questions were answerable using strictly the context and common sense.

I use some old BART summarisation models during development because inference is very fast and the quality is good enough for proofs of concept. I bring this up because they are based on open datasets (XSum and CNN/DailyMail, example sets of articles and their human-created summaries).
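If anyone wants to try the same trick, something like this is all it takes with the standard checkpoints (facebook/bart-large-cnn is trained on CNN/DailyMail; facebook/bart-large-xsum on XSum):

```python
from transformers import pipeline

# BART fine-tuned on CNN/DailyMail; swap in "facebook/bart-large-xsum"
# for shorter, more abstractive one-sentence summaries.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The council approved the new transport plan on Tuesday. It includes "
    "extra bus lanes, a cycle network, and a freeze on city-centre parking charges."
)
result = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```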

If I may add one more criticism of instruction fine-tuning datasets, it is that they are all reinventing the wheel. There are old school datasets from a time when transformers were being trained for single purposes. As far as I know, no one has ever pulled these together, because the original idea was to distil knowledge from ChatGPT. Dolly, bless its creators’ souls, is literally reinventing the wheel with some of its tasks, and the dataset is small as a result.

I don’t have time for it myself, so I’m putting the idea out there: include old school datasets in your instruction fine-tuning data. The state of the summarisation capacity of most recent models (3B parameters and below) is shocking: the old school BART (around 400M parameters) outperforms all of the LaMini models and all the Evol-Instruct models on summarisation, for example. These deficiencies must show up in larger models tuned on the same datasets too.

Another advantage of this approach is that many of the single-task datasets were created with business implementations in mind, before the chat bot craze. So the dataset you get by adapting them into a single instruction-based dataset is bound to have relevant functionality, and then you can add synthetic data on top for flavour and balance.
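To sketch what that adaptation could look like, using XSum as the example (the instruction wording is just a placeholder you would want to vary, and the field names follow the Dolly convention):

```python
from datasets import load_dataset

# XSum pairs a BBC article ("document") with a one-sentence summary ("summary").
xsum = load_dataset("xsum", split="train[:1000]")

def to_instruction(example):
    # Wrap the original single-purpose task in an instruction template so it
    # can be mixed straight into an instruction fine-tuning dataset.
    return {
        "instruction": "Summarise the following article in one sentence.",
        "context": example["document"],
        "response": example["summary"],
    }

instruction_style = xsum.map(to_instruction, remove_columns=xsum.column_names)
print(instruction_style[0])
```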