Train Llama 2 on a domain-specific dataset and on an instruction-format dataset

I want to create a chatbot that speaks Bulgarian. There are already some Llama 2 versions fine-tuned on Bulgarian, but they are not very good at understanding prompts written in Bulgarian.
I was thinking of fine-tuning one of the Bulgarian fine-tuned versions on domain-specific data, and then fine-tuning again on {"instruction": <>, "context": <>, "response": <>} records so it can be used as a chatbot. So my question is: is this good practice, or should I fine-tune the original Llama 2 on these separate datasets? I ask because I see there are models trained for text generation and others for chat, and I want to do this as a two-step fine-tuning process.
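For concreteness, the second-stage records above would typically be rendered into a single training string per example. A minimal sketch (the `format_example` helper and the Alpaca-style section headers are my own choices, not a fixed standard; only the three field names come from the format above):

```python
def format_example(record):
    """Render one {"instruction", "context", "response"} record into a
    single training string. The section layout here is one common
    convention; the exact template is up to you."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("context"):  # skip the section when context is empty
        parts.append(f"### Context:\n{record['context']}")
    parts.append(f"### Response:\n{record['response']}")
    return "\n\n".join(parts)

example = {
    "instruction": "Преведи на английски: 'Здравей, свят'",
    "context": "",
    "response": "Hello, world",
}
print(format_example(example))
```

Whatever template you pick, use the same one at inference time, otherwise the model sees prompts in a shape it was never trained on.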

I was thinking about the same thing for Polish. A few thoughts:

  1. I assume you mean fine-tuning with LoRA, not full fine-tuning of Llama 2 itself.

  2. If Llama 2 has not been trained on a significant portion of a foreign language, then fine-tuning a chat model with LoRA might not help much - I recently read a paper that says exactly this, but I don't remember where.

  3. People are still doing this, though I'm not sure how successful it is - for example: davidkim205/komt-Llama-2-13b-hf-lora · Hugging Face

  4. If you want to train the base model (not a chat model) to include more Bulgarian and still behave like a chatbot, you have to go through RLHF, which is difficult and costly.
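On point 1, it may help to see what a LoRA update actually is, independent of any library. A toy NumPy sketch (sizes, init scales, and variable names are illustrative; the structure follows the original LoRA formulation of a frozen weight plus a scaled low-rank update):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r, alpha = 8, 8, 2, 16        # toy sizes; the point is r << d, k
W = rng.normal(size=(d, k))         # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01  # trainable low-rank factor, small init
B = np.zeros((d, r))                # trainable factor, zero init

def lora_forward(x):
    # Base path plus the low-rank path, scaled by alpha / r.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(1, k))
# With B initialized to zero, the LoRA model starts out identical to the base.
assert np.allclose(lora_forward(x), x @ W.T)

# Trainable parameters: r * (d + k) instead of d * k.
print("full params:", d * k, "LoRA params:", r * (d + k))
```

This is why LoRA is cheap: only A and B are trained, and they can later be merged into W. But it also hints at why it can't teach a model much of a language it never saw - the update is low-rank on top of a frozen backbone.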
