Finetune MoLFormer model

I am trying to train the model below on my custom dataset.
model name - ibm-research/MoLFormer-XL-both-10pct

However, I am running into a number of errors while training it. Can anyone help me train this model on my dataset? Any example script would be very helpful.
dataset - nypabhishek/finetunetry
file name - Multi-Labelled_Smiles_Odors_dataset.csv


If you search for each specific error on Google or Bing and fix it, it should work eventually. If you don’t understand an error, there’s no problem with asking on the forum, the HF Discord, or Stack Overflow.

MLM Training Tips

Code example by Hugging Chat


To train the ibm-research/MoLFormer-XL-both-10pct model on your custom dataset with Hugging Face’s datasets library and Trainer, here is a step-by-step guide with sample code. It covers loading the dataset, preprocessing it, and training the model on the masked language modeling (fill-mask) objective.


1. Loading and Preprocessing the Custom Dataset

First, you need to load your custom dataset (CSV file) and preprocess it for the fill-mask task.

Sample Code:

from datasets import load_dataset
from transformers import AutoTokenizer

# Load your custom CSV dataset
def load_custom_dataset(csv_path):
    # load_dataset reads the CSV directly, so there is no need to go through pandas first
    dataset = load_dataset('csv', data_files=csv_path)
    # The CSV loader puts everything into a single 'train' split,
    # so carve out a test split for evaluation
    return dataset['train'].train_test_split(test_size=0.1, seed=42)

# Initialize the tokenizer
# (the model card loads this checkpoint with trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)

# Load your custom dataset
dataset = load_custom_dataset("your_dataset.csv")

# Preprocess the dataset
def preprocess_function(examples):
    # Tokenize the SMILES strings; replace 'text' with the name of the
    # column that actually holds the SMILES in your CSV
    return tokenizer(examples['text'], truncation=True)

# Apply preprocessing; drop the original columns so only the token ids remain
tokenized_datasets = dataset.map(
    preprocess_function, batched=True, remove_columns=dataset['train'].column_names
)
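
Since your CSV is hosted in the nypabhishek/finetunetry dataset repo, you can also load it straight from the Hub instead of from a local path. This is a minimal sketch, assuming the repo is public and the file name is exactly as posted; printing a row lets you check which column holds the SMILES strings before tokenizing:

from datasets import load_dataset

# Load the CSV directly from the Hub repo mentioned in the question
# (assumes the repo is public and the file name matches exactly)
dataset = load_dataset(
    "nypabhishek/finetunetry",
    data_files="Multi-Labelled_Smiles_Odors_dataset.csv",
)
dataset = dataset['train'].train_test_split(test_size=0.1, seed=42)

# Inspect the columns so preprocess_function uses the right one
print(dataset['train'].column_names)
print(dataset['train'][0])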

2. Setting Up the Model and Training Arguments

Next, initialize the model and set up the training arguments using TrainingArguments.

Sample Code:

from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer

# Initialize the model (the model card loads this checkpoint with trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)

# Set up training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",  # named eval_strategy on recent transformers versions
    save_strategy="epoch",
    seed=42,
)

3. Defining the Training Loop

Use the Trainer class to set up the training loop. For masked language modeling you also need a DataCollatorForLanguageModeling, which randomly masks input tokens and builds the labels for each batch; without it the model receives no labels and the Trainer cannot compute a loss.

Sample Code:

import math

from transformers import DataCollatorForLanguageModeling

# The collator randomly masks tokens and builds the MLM labels for each batch
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()
print(f"Perplexity: {math.exp(results['eval_loss']):.2f}")

4. Fine-Tuning the Model

If you want more control over the training run, you can add Trainer callbacks; the example below reuses the setup from step 3 and adds early stopping on the evaluation loss.

Sample Code:

# Fine-tune with a callback for extra control over the run
from transformers import EarlyStoppingCallback

# EarlyStoppingCallback needs these two options; you can also set them
# directly in the TrainingArguments above
training_args.load_best_model_at_end = True
training_args.metric_for_best_model = "eval_loss"

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

# Train the model
trainer.train()

5. Saving and Loading the Model

After training, save the model and tokenizer for future use.

Sample Code:

# Save the trained model
model.save_pretrained("./trained_model")

# Save the tokenizer
tokenizer.save_pretrained("./trained_model")

# Load the saved model for inference
# (trust_remote_code=True is needed again because the checkpoint uses the model's custom code)
loaded_model = AutoModelForMaskedLM.from_pretrained("./trained_model", trust_remote_code=True)
loaded_tokenizer = AutoTokenizer.from_pretrained("./trained_model", trust_remote_code=True)
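
If you want to reuse the fine-tuned model from other machines, you can optionally push it to a Hub repository under your account. A small sketch; the repository name is only a placeholder, and you need to be logged in (for example via huggingface-cli login) first:

# Optional: upload the fine-tuned model and tokenizer to the Hub
# (the repo name below is a placeholder)
model.push_to_hub("your-username/molformer-odor-finetuned")
tokenizer.push_to_hub("your-username/molformer-odor-finetuned")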

6. Evaluating the Model

Evaluate the model on the test dataset to measure its performance.

Sample Code:

import math

# Evaluate the model
evaluation_results = trainer.evaluate()

# eval_loss is the average cross-entropy, so perplexity is its exponential
print(f"Perplexity: {math.exp(evaluation_results['eval_loss']):.2f}")

7. Testing the Model

Once the model is trained, you can test it on custom inputs.

Sample Code:

import torch

# Test the model on a masked SMILES string, using the tokenizer's own mask token
# (assumes the tokenizer defines a mask token; the SMILES below is just an example)
input_text = f"CC(=O)OC1=CC=CC=C1C(=O){tokenizer.mask_token}"
inputs = tokenizer(input_text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring token at the masked position
mask_index = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print("Predicted token:", tokenizer.decode(predicted_id))

Final Notes

  • Make sure your CSV is formatted correctly, with each row containing a raw SMILES string; you do not need to insert mask tokens yourself, because the data collator masks tokens on the fly during training.
  • Ensure the column name used in preprocess_function matches your CSV, and let the data collator handle padding instead of padding everything during tokenization.
  • The masking is done by DataCollatorForLanguageModeling (used in step 3); adjust its mlm_probability argument if you want to mask a larger or smaller fraction of tokens, as in the sketch below.
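
For example, a collator with a slightly more aggressive masking rate (the 0.20 value is only an illustration, not a recommendation):

from transformers import DataCollatorForLanguageModeling

# Same collator as in step 3, but masking 20% of tokens instead of the default 15%
aggressive_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.20
)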

If you encounter specific errors, feel free to provide more details, and I can help troubleshoot them!

For further reference, you can consult the official Hugging Face documentation on datasets and trainers.

Thanks, @John6666!
