Interesting. I know BLOOM is a multilingual model to begin with, but I used 9,000 blogs written in English to fine-tune it. That's why I'm wondering if I did something wrong in the fine-tuning.
I tested text generation with "bigscience/bloom-1b7" and it returns English as I would expect. I think I set everything up correctly:
from transformers import AutoConfig, AutoTokenizer, BloomForCausalLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments

checkpoint = "bigscience/bloom-1b7"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# mlm=False because this is causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, return_tensors="pt")
context_length = 256
config = AutoConfig.from_pretrained(checkpoint, vocab_size=len(tokenizer), n_ctx=context_length, bos_token_id=tokenizer.bos_token_id, eos_token_id=tokenizer.eos_token_id)
# build a BLOOM causal-LM model from this config
model = BloomForCausalLM(config)
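For reference, the tokenized_datasets passed to the Trainer below was built beforehand along these lines (a simplified sketch, not my exact code; raw_datasets and the "text" column name stand in for my blog dataset):

def tokenize_function(examples):
    # truncate each blog post to the context length; the collator handles padding and labels
    return tokenizer(examples["text"], truncation=True, max_length=context_length)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)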
Then I configured the training:
# path to my DeepSpeed config (a minimal sketch of the kind of settings it holds is below, after the Trainer setup)
ds_config_json = r"/deepspeed-config.json"
args = TrainingArguments(
    output_dir=model_loc,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    deepspeed=ds_config_json,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)
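In case it's relevant: the file I point to above is just a standard ZeRO setup, and I believe the deepspeed argument also accepts a dict instead of a file path. A minimal example of the shape I mean (simplified, not my exact file) would be:

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}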
I ran trainer.train() and at the end it logged:
"Training completed. Do not forget to share your model on huggingface.co/models =)"
Could it be that I didn't save it properly? This is how I saved it:
trainer.save_model(model_loc)
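For completeness, this is roughly how I then load the saved model back and test generation (the prompt is just an example):

from transformers import pipeline

finetuned_tokenizer = AutoTokenizer.from_pretrained(model_loc)
finetuned_model = BloomForCausalLM.from_pretrained(model_loc)
generator = pipeline("text-generation", model=finetuned_model, tokenizer=finetuned_tokenizer)
print(generator("My favourite breakfast is", max_new_tokens=50))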