How to fine-tune a Mistral-7B model for machine translation?

alvations · March 13, 2024, 9:02pm

Many tutorials online use raw text wrapped in special markers (such as <s>, [INST] and [/INST]) to indicate document and turn boundaries, and access it through the text key of a Hugging Face datasets.Dataset object. For example:

from datasets import load_dataset

dataset_name = "mlabonne/guanaco-llama2-1k"

dataset = load_dataset(dataset_name, split="train")
dataset["text"][42]

[out]:

<s>[INST] ¿Cuáles son los actuales presidentes de la región de Sur América? Enumérelos en una lista con su respectivo país. [/INST] A fecha del 13 de febrero de 2023, estos son los presidentes de los países de Sudamérica, según Wikipedia:
-Argentina: Alberto Fernández
-Bolivia: Luis Arce
-Brasil: Luiz Inácio Lula da Silva
-Chile: Gabriel Boric
-Colombia: Gustavo Petro
-Ecuador: Guillermo Lasso
-Paraguay: Mario Abdo Benítez
-Perú: Dina Boluarte
-Uruguay: Luis Lacalle Pou
-Venezuela: Nicolás Maduro
-Guyana: Irfaan Ali
-Surinam: Chan Santokhi
-Trinidad y Tobago: Paula-Mae Weekes </s>

But machine translation datasets are usually structured in two parts, a source text and a target text, stored under separate keys such as sentence_eng_Latn and sentence_deu_Latn. For example:

from datasets import load_dataset

# FLORES English-German pairs, dev split
valid_data = load_dataset("facebook/flores", "eng_Latn-deu_Latn", streaming=False,
                          split="dev")
valid_data[42]

[out]:

{'id': 43,
 'URL': 'https://en.wikinews.org/wiki/Hurricane_Fred_churns_the_Atlantic',
 'domain': 'wikinews',
 'topic': 'disaster',
 'has_image': 0,
 'has_hyperlink': 0,
 'sentence_eng_Latn': 'The storm, situated about 645 miles (1040 km) west of the Cape Verde islands, is likely to dissipate before threatening any land areas, forecasters say.',
 'sentence_deu_Latn': 'Prognostiker sagen, dass sich der Sturm, der etwa 645 Meilen (1040 km) westlich der Kapverdischen Inseln befindet, wahrscheinlich auflösen wird, bevor er Landflächen bedroht.'}

So how do people fine-tune a Mistral-7B model for the machine translation task?

Is there a recipe somewhere for reshaping multi-part inputs, such as the source/target pairs in machine translation, so that they fit Mistral’s expected format?
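To make the question concrete, here is a minimal sketch of one possible way to do it: collapse each source/target pair into a single text field that mimics the [INST] layout of the guanaco example above. The helper name to_instruct_text and the wording of the prompt are my own assumptions for illustration, not an official Mistral or Hugging Face recipe, and the record is the FLORES example shown earlier.

```python
# Hypothetical helper: the prompt wording and function name are illustrative
# assumptions, not an official Mistral or Hugging Face recipe.
PROMPT = "Translate the following text from English to German:\n{src}"


def to_instruct_text(example, src_key="sentence_eng_Latn", tgt_key="sentence_deu_Latn"):
    """Collapse a source/target pair into a single `text` field."""
    prompt = PROMPT.format(src=example[src_key])
    # Mirrors the <s>[INST] ... [/INST] ... </s> layout of the guanaco sample.
    return {"text": f"<s>[INST] {prompt} [/INST] {example[tgt_key]} </s>"}


# The FLORES record quoted above, reduced to the two keys that matter here.
record = {
    "sentence_eng_Latn": "The storm, situated about 645 miles (1040 km) west of the Cape Verde islands, is likely to dissipate before threatening any land areas, forecasters say.",
    "sentence_deu_Latn": "Prognostiker sagen, dass sich der Sturm, der etwa 645 Meilen (1040 km) westlich der Kapverdischen Inseln befindet, wahrscheinlich auflösen wird, bevor er Landflächen bedroht.",
}

formatted = to_instruct_text(record)
print(formatted["text"])
```

On the full dataset, something like valid_data.map(to_instruct_text, remove_columns=valid_data.column_names) should leave only the text column. One caveat I am unsure how tutorials handle: depending on its settings, the tokenizer may add its own beginning-of-sequence token, in which case the literal <s> in the string might need to be dropped.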

alvations · March 13, 2024, 9:02pm

Also asked on Stack Overflow: How to fine-tune a Mistral-7B model for machine translation?
