Text generation returning multiple languages

I’ve fine-tuned a BLOOM-1B7 model with 9000 proprietary blog posts. When I generate text with this model it returns random languages. I’m trying to narrow down where the failure is. Did I do the text generation wrong? Did I do the training wrong? I was hoping someone has seen this before and knows in general where the failure might be. I’m not even sure what I would need to post to help y’all diagnose the problem.

My limitations: I took the blogs from my company and I don’t want to post them publicly since that might get me in trouble. I’m also wary of posting the fine-tuned model for the same reasons. This is a personal project for me to learn about NLP.

Hi @telavir ,
Thanks for the issue and great job on fine-tuning BLOOM-1B7!
As far as I know, there is no explicit argument you can set to force a generative model to produce text in a specific language, given that the model has been trained on a mixture of languages.

Note that from the model’s perspective, the input is just a sequence of token IDs, and the tokens of the vocabulary are not ordered in any linguistically meaningful way (though I am slightly unsure about that last point).
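For example (a tiny sketch, just to illustrate that point; the two prompts are arbitrary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")

# Both sentences become plain lists of integers; nothing in the input
# explicitly tags the language.
print(tokenizer("Can anyone help me?")["input_ids"])
print(tokenizer("Est-ce que quelqu'un peut m'aider ?")["input_ids"])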

One thing that you can try is “prompt engineering”. Your model has been fine-tuned on next-token prediction (if done correctly), so a sanity check is to give the model an input that provides enough context about both the language and the content of the post. Let’s say, for example, that I have fine-tuned a BLOOM version on Stack Overflow; I can then ask the model to continue the prompt:

I successfully trained the network but got this error during validation: RuntimeError: CUDA error: out of memory. Can anyone help me?

The model should generate a consistent answer, in this case something close to:

The error occurs because you ran out of memory on your GPU.
One way to solve it is to reduce the batch size until your code runs without this error.

(taken from here).
If my model saw similar posts in French, I can try something like:

J'ai réussi à entraîner mon modèle avec succès mais lors de la validation j'obtiens: RuntimeError: CUDA error: out of memory. Est-ce que quelqu'un peut m'aider sur ça?

And hopefully the model can produce the correct answer in French.

So my advice here is to try various combinations of prompts, and do your best to “convince” your model to generate text in your desired language by giving it enough context (e.g. a sufficiently long prompt, similar to your training data, in the desired language).
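Concretely, such a check could look roughly like this (a sketch only; "path/to/your-finetuned-model" is a placeholder for wherever you saved your checkpoint, and the generation settings are just common defaults):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your-finetuned-model"  # placeholder for your saved checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# A prompt long enough to establish both the language and the topic.
prompt = (
    "I successfully trained the network but got this error during validation: "
    "RuntimeError: CUDA error: out of memory. Can anyone help me?"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))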


Interesting. So I know BLOOM is a multilingual model to begin with, but I used 9000 blogs written in English to fine-tune it. That’s why I’m wondering if I did something wrong in the fine-tuning.

I tested text generation with “bigscience/bloom-1b7” and it returns English, as I would expect. I think I set everything up correctly:

from transformers import AutoConfig, AutoTokenizer, BloomForCausalLM, DataCollatorForLanguageModeling

checkpoint = "bigscience/bloom-1b7"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, return_tensors="pt")
context_length = 256
config = AutoConfig.from_pretrained(checkpoint, vocab_size=len(tokenizer), n_ctx=context_length, bos_token_id=tokenizer.bos_token_id, eos_token_id=tokenizer.eos_token_id)
model = BloomForCausalLM(config)
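For reference, a minimal version of that base-model generation check looks roughly like this (a sketch; the prompt is only illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

base_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b7")
base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7")

inputs = base_tokenizer("The best way to learn about NLP is", return_tensors="pt")
outputs = base_model.generate(**inputs, max_new_tokens=50)
# With the stock checkpoint this continuation comes back in English.
print(base_tokenizer.decode(outputs[0], skip_special_tokens=True))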

I configured all the training:

from transformers import Trainer, TrainingArguments

ds_config_json = r"/deepspeed-config.json"
args = TrainingArguments(
    output_dir=model_loc,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    deepspeed=ds_config_json,
)
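For reference, a minimal DeepSpeed config compatible with these TrainingArguments looks roughly like this (a sketch, not necessarily my exact /deepspeed-config.json; the Trainer also accepts such a dict directly via deepspeed=..., and the "auto" values are filled in from TrainingArguments):

# Minimal ZeRO stage-2 + fp16 config for the Hugging Face Trainer integration.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}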

trainer = Trainer(model=model, tokenizer=tokenizer, args=args, data_collator=data_collator, train_dataset=tokenized_datasets["train"], eval_dataset=tokenized_datasets["test"])

I ran trainer.train() and it returned:
“Training completed. Do not forget to share your model on huggingface.co/models =)”

Could I have not saved it properly?

trainer.save_model(model_loc)
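One way to sanity-check the save is to reload the checkpoint from model_loc and generate with it (a rough sketch; the prompt is arbitrary):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the saved checkpoint from disk and run a quick generation with it.
# If the tokenizer wasn't saved alongside the model, load it from the base checkpoint instead.
reloaded_tokenizer = AutoTokenizer.from_pretrained(model_loc)
reloaded_model = AutoModelForCausalLM.from_pretrained(model_loc)

inputs = reloaded_tokenizer("Our latest blog post covers", return_tensors="pt")
outputs = reloaded_model.generate(**inputs, max_new_tokens=50)
print(reloaded_tokenizer.decode(outputs[0], skip_special_tokens=True))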