Finetuning on a recent topic/domain

Hi,

I’m trying to learn and understand language models as much as possible, but something remains unclear to me. Suppose I want an LLM such as BLOOM to be aware of recent events, let’s say the 2022 FIFA World Cup. As far as I know, BLOOM was trained on data up to July 2022, so its knowledge of how the cup went is very limited. I can do prompt engineering, such as providing some context, but it’s not as good as I want and the context window is limiting.

The solution would be to finetune the model, but it’s hard for me to clearly understand how to collect the data.

  1. If I scrape the Wikipedia page about the World Cup and finetune the model on it, would that be sufficient? And then, if I need a chatbot, I can finetune again with the Alpaca or Vicuna dataset.
  2. A lot of tutorials and blog posts deal with instruction datasets, but in my case, why would I need such a format?

Thanks for your hints

Hi @Alex21j!

I can think of a few naive ways to approach this. First, with respect to data collection, it depends on what your end task is. If a dataset doesn’t already exist, you may have to create one. You could scrape several websites (like Wikipedia or FIFA) and collect all the text related to the 2022 World Cup. You would then need to format that data appropriately for the task.
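For example, a minimal scraping sketch might look like the following, assuming you only want the plain-text extract of a single English Wikipedia article (the page title and output file name are just placeholders; a real dataset would loop over many pages and clean the text):

```python
import requests

# Pull the plain-text extract of one Wikipedia article via the public
# MediaWiki API (TextExtracts). For a real dataset you would iterate over
# many pages (match reports, player pages, news articles).
API_URL = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "extracts",
    "explaintext": 1,
    "format": "json",
    "titles": "2022 FIFA World Cup",
}

response = requests.get(API_URL, params=params, timeout=30)
pages = response.json()["query"]["pages"]
text = next(iter(pages.values()))["extract"]

with open("world_cup_2022.txt", "w", encoding="utf-8") as f:
    f.write(text)
```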

Let’s say you were interested in being able to ask questions of your finetuned model. You would need to format the collected data for the question answering task, then finetune BLOOM. Unfortunately, I cannot think of a good process for evaluating the quality of the finetuning; maybe others on here have something they can share. A naive approach would be to ask the finetuned model “Who won the 2022 FIFA World Cup?” and see what the response is. As this is anecdotal, it’s not a very quantitative way to evaluate how well the finetuned model answers questions about the 2022 World Cup.
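If you just want to continue training on the raw scraped text (no QA formatting yet), a minimal sketch with the transformers Trainer could look like this. The checkpoint, file name, and hyperparameters below are assumptions on my part, not a recipe I have validated on BLOOM:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Assumptions: a small BLOOM checkpoint so it fits in memory, and the raw
# text file produced by the scraping step above.
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

raw = load_dataset("text", data_files={"train": "world_cup_2022.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Causal LM objective: the collator builds the labels from the input ids.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="bloom-worldcup",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```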

With respect to your second question, my understanding is that an instruction dataset provides the model with data formatted in a more conversational, instruction-following style. Taking the example above, you could prompt the model with “Summarize the 2022 FIFA World Cup”, and ideally it would give you a summary of the tournament, the participants, who won, and what the score was. I don’t know this to be the case, but it’s what I could infer from reading the cleaned Alpaca dataset GitHub.
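For reference, Alpaca-style records are simple instruction/input/output triples. The content below is just to illustrate the shape you would aim for (written here as Python dicts mirroring the dataset’s JSON structure):

```python
# Illustrative Alpaca-style instruction records for a World Cup dataset.
alpaca_style_examples = [
    {
        "instruction": "Summarize the 2022 FIFA World Cup final.",
        "input": "",
        "output": "Argentina beat France on penalties after a 3-3 draw in Qatar.",
    },
    {
        "instruction": "Answer the question using the context.",
        "input": "Context: ...match report text...\nQuestion: Who won the Golden Boot?",
        "output": "Kylian Mbappé won the Golden Boot with eight goals.",
    },
]
```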

Lastly, I should mention that I don’t have any experience with BLOOM. Most of what I have dealt with in language modeling comes from finetuning GPT-2. I also found the Tasks page on the HF site to be very insightful; maybe there is something there that better suits your needs.

Apologies I don’t have better insight, but I hope the above is useful.

Hi @aclifton314,

That’s a lot of insight, it makes much more sense, thanks!
So now I’m trying to understand if it’s worth building a QA dataset.
If I specialize an LLM at low cost just by finetuning it on some articles or Wikipedia pages in raw text and then use few-shot QA (something like the sketch below), would that be sufficient?
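By few-shot QA I mean something like this, where the model and checkpoint directory are just placeholders (“bloom-worldcup” being whatever the finetuned checkpoint ends up being):

```python
from transformers import pipeline

# Hypothetical few-shot QA prompt on top of a model finetuned on raw text.
# The example Q/A pairs are only there to set the format.
prompt = """Answer the question about the 2022 FIFA World Cup.

Q: Where was the 2022 FIFA World Cup held?
A: Qatar.

Q: Which teams played in the final?
A: Argentina and France.

Q: Who won the 2022 FIFA World Cup?
A:"""

generator = pipeline("text-generation", model="bloom-worldcup")
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```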
I’m also wondering what the effect of finetuning a chatbot such as Vicuna on raw text could be. Is there any chance the conversational ability will be lost after finetuning?
It’s kind of hard to evaluate the benefit of building a QA/conversational dataset instead of “simply” finetuning the model on domain-specific raw text.