Fine-tune BART using the “Fine-tuning with Custom Datasets” doc

I am trying to fine-tune BART for a summarization task using the code on the “Fine-tuning with Custom Datasets” page (https://huggingface.co/transformers/custom_datasets.html). The data is a subset of the CNN/Daily Mail dataset.

I am encountering two different errors. The first comes when I run the exact code from the page: “TypeError: new(): invalid data type 'str'”.

I assume this is because the labels are not encoded/tokenized; they are still strings. This matches the sample code on the page, where the labels are left untokenized.

If I tokenize the labels and run the code, I receive a different error: “Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers”.

I am not sure how to interpret that. I get the same errors whether I use the Trainer class or native PyTorch. Any suggestions? Thanks!

It’s a bit hard to help you without seeing the code you’re running.

Hi @Buckeyes2019,

Not a direct answer to your question, but you can use the scripts in examples/seq2seq here (finetune.py or finetune_trainer.py) for fine-tuning BART and other seq2seq models. They support custom datasets as well. All you’ll need to do is get the data into the required format described in the readme.

Sorry, here is the code that produces the error: “Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers”

```python
from sklearn.model_selection import train_test_split
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
# df is a pandas DataFrame with 'articles' and 'highlights' columns (loaded beforehand)
train_texts, val_texts, train_labels, val_labels = train_test_split(df.articles, df.highlights, test_size=0.2)
train_texts = train_texts.values.tolist()
train_labels = train_labels.values.tolist()
val_texts = val_texts.values.tolist()
val_labels = val_labels.values.tolist()

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
train_label_encodings = tokenizer(train_labels, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
val_label_encodings = tokenizer(val_labels, truncation=True, padding=True)

import torch

class PyTorchDatasetCreate(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = PyTorchDatasetCreate(train_encodings, train_label_encodings)
val_dataset = PyTorchDatasetCreate(val_encodings, val_label_encodings)

from transformers import BartForConditionalGeneration, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=1,   # batch size per device during training
    per_device_eval_batch_size=1,    # batch size for evaluation
    warmup_steps=200,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10)

model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-cnn-12-6")

trainer = Trainer(
    model=model,                  # the instantiated :hugs: Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset)     # evaluation dataset

trainer.train()

```
The code for the other error (“TypeError: new(): invalid data type 'str'”) is identical, except the labels (train and val) are not tokenized. Appreciate your help!
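That integer-indexing error points at `item['labels'] = torch.tensor(self.labels[idx])`: `self.labels` is a `BatchEncoding`, and indexing a `BatchEncoding` with an integer only works with fast (Rust-backed) tokenizers, while `BartTokenizer` is the Python one. A minimal sketch of a workaround, assuming the tokenization above, is to index into the labels' `input_ids` instead:

```python
import torch

class PyTorchDatasetCreate(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Index the plain list under 'input_ids' rather than the BatchEncoding
        # itself; integer indexing on a BatchEncoding requires a fast tokenizer.
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])
        return item

    def __len__(self):
        return len(self.labels['input_ids'])
```

This should also take care of the first error, since the labels reaching `torch.tensor` are then token ids rather than raw strings.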

Is the correct format the following six files,

  • train.source
  • train.target
  • val.source
  • val.target
  • test.source
  • test.target

each with one text per line in the .source file and the corresponding summary on the same line in the .target file?

Is there better documentation on what else needs to be changed (e.g. data_dir or output_dir)?
The readme only states: “you need to specify data_dir, output_dir and model_name_or_path”.
Or does anyone have a public repo where they fine-tuned one of these models?
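Yes, that layout matches how the seq2seq scripts read their data: line i of each .source file pairs with line i of the matching .target file. Here is a rough, untested sketch for writing the six files from a pandas split like the one earlier in this thread (the helper name and directory are made up for illustration):

```python
import os

def write_split(texts, summaries, prefix, data_dir="cnn_subset"):
    # One example per line: line i of <prefix>.source must pair with
    # line i of <prefix>.target, so newlines inside a text are flattened.
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, f"{prefix}.source"), "w") as src, \
         open(os.path.join(data_dir, f"{prefix}.target"), "w") as tgt:
        for article, summary in zip(texts, summaries):
            src.write(article.replace("\n", " ") + "\n")
            tgt.write(summary.replace("\n", " ") + "\n")

write_split(train_texts, train_labels, "train")
write_split(val_texts, val_labels, "val")
# a held-out test split would be written the same way as "test"
```

With the files in place, the three arguments quoted from the readme map onto the script's flags, e.g. `python finetune.py --data_dir cnn_subset --output_dir output --model_name_or_path facebook/bart-large-cnn`; check the readme for the remaining options.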

I was able to get this to work: https://ohmeow.com/posts/2020/05/23/text-generation-with-blurr.html


Thanks for the link! But how can you use a custom dataset with that approach?
They use a :hugs: dataset. In the official documentation they use a single csv file for training (pd.read_csv(path/'cnndm_sample.csv')), but shouldn’t you use at least a train and a validation set?

Edit:
The answer is:
dblock = DataBlock(blocks=blocks, get_x=ColReader('article'), get_y=ColReader('highlights'), splitter=RandomSplitter())
The RandomSplitter splits the single csv file into separate training and validation dataloaders.
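So for a custom dataset, all you need is your own csv; the splitter handles the train/validation division at DataBlock time. A quick self-contained check of what RandomSplitter does (the csv name and column names are placeholders):

```python
import pandas as pd
from fastai.data.transforms import RandomSplitter

# Hypothetical csv with 'article' and 'highlights' columns, like cnndm_sample.csv
df = pd.read_csv('my_summaries.csv')

# RandomSplitter returns a function mapping items -> (train idxs, valid idxs)
train_idxs, valid_idxs = RandomSplitter(valid_pct=0.2, seed=42)(df)
print(len(train_idxs), len(valid_idxs))  # roughly an 80/20 split

# Passing your own DataFrame to the tutorial's dblock.dataloaders(df) then
# yields separate training and validation dataloaders from that one csv.
```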