Fine-tune BART using the “Fine-tuning with Custom Datasets” doc

I am trying to fine-tune BART for a summarization task using the code on the “Fine-tuning with Custom Datasets” page (https://huggingface.co/transformers/custom_datasets.html). The data is a subset of the CNN/Daily Mail dataset.

I am encountering two different errors. The first comes when I run the exact code from the page: “TypeError: new(): invalid data type 'str'”.

I assume this is because the labels are not encoded/tokenized; they are still strings. This matches the sample code on the page, where the labels are left untokenized.

If I tokenize the labels and run the code, I receive a different error: “Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers”.

I am not sure how to interpret that. I get the same errors whether I use the Trainer class or native PyTorch. Any suggestions? Thanks!

It’s a bit hard to help you without seeing the code you’re running.

Hi @Buckeyes2019,

Not a direct answer to your question, but you can use the scripts in examples/seq2seq here (finetune.py or finetune_trainer.py) for fine-tuning BART and other seq2seq models. They support custom datasets as well. All you’ll need to do is get the data into the required format described in the readme.

Sorry, here is the code that produces the error: “Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers”

```python
from sklearn.model_selection import train_test_split
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
# df is a pandas DataFrame with 'articles' and 'highlights' columns (loaded beforehand)
train_texts, val_texts, train_labels, val_labels = train_test_split(df.articles, df.highlights, test_size=0.2)
train_texts = train_texts.values.tolist()
train_labels = train_labels.values.tolist()
val_texts = val_texts.values.tolist()
val_labels = val_labels.values.tolist()

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
train_label_encodings = tokenizer(train_labels, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
val_label_encodings = tokenizer(val_labels, truncation=True, padding=True)

import torch

class PyTorchDatasetCreate(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = PyTorchDatasetCreate(train_encodings, train_label_encodings)
val_dataset = PyTorchDatasetCreate(val_encodings, val_label_encodings)

from transformers import BartForConditionalGeneration, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=1,   # batch size per device during training
    per_device_eval_batch_size=1,    # batch size for evaluation
    warmup_steps=200,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10)

model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-cnn-12-6")

trainer = Trainer(
    model=model,                  # the instantiated :hugs: Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # training dataset
    eval_dataset=val_dataset)     # evaluation dataset

trainer.train()

```
The code for the other error (“TypeError: new(): invalid data type 'str'”) is identical, except the labels (train and val) are not tokenized. Appreciate your help!
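That integer-indexing error points at `item['labels'] = torch.tensor(self.labels[idx])`: `self.labels` is a `BatchEncoding`, and indexing a `BatchEncoding` with an integer only works with fast (Rust-backed) tokenizers, while `BartTokenizer` is the Python one. A minimal sketch of a workaround, assuming the tokenization above, is to index into the labels' `input_ids` instead:

```python
import torch

class PyTorchDatasetCreate(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Index the plain list under 'input_ids' rather than the BatchEncoding
        # itself; integer indexing on a BatchEncoding requires a fast tokenizer.
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])
        return item

    def __len__(self):
        return len(self.labels['input_ids'])
```

This should also take care of the first error, since the labels reaching `torch.tensor` are then token ids rather than raw strings.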

Is the correct format the following six files,

  • train.source
  • train.target
  • val.source
  • val.target
  • test.source
  • test.target

each with one text per line in the .source file and the corresponding summary on the same line in the .target file?

Is there better documentation on what else needs to be changed (e.g. data_dir or output_dir)?
The readme only states: “you need to specify data_dir, output_dir and model_name_or_path”.
Or does anyone have a public repo where they fine-tuned one of these models?
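Yes, that layout matches how the seq2seq scripts read their data: line i of each .source file pairs with line i of the matching .target file. Here is a rough, untested sketch for writing the six files from a pandas split like the one earlier in this thread (the helper name and directory are made up for illustration):

```python
import os

def write_split(texts, summaries, prefix, data_dir="cnn_subset"):
    # One example per line: line i of <prefix>.source must pair with
    # line i of <prefix>.target, so newlines inside a text are flattened.
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, f"{prefix}.source"), "w") as src, \
         open(os.path.join(data_dir, f"{prefix}.target"), "w") as tgt:
        for article, summary in zip(texts, summaries):
            src.write(article.replace("\n", " ") + "\n")
            tgt.write(summary.replace("\n", " ") + "\n")

write_split(train_texts, train_labels, "train")
write_split(val_texts, val_labels, "val")
# a held-out test split would be written the same way as "test"
```

With the files in place, the three arguments quoted from the readme map onto the script's flags, e.g. `python finetune.py --data_dir cnn_subset --output_dir output --model_name_or_path facebook/bart-large-cnn`; check the readme for the remaining options.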

I was able to get this to work: https://ohmeow.com/posts/2020/05/23/text-generation-with-blurr.html


Thanks for the link! But how can you use a custom dataset with that approach?
They use a :hugs: dataset. In the official documentation they use a single csv file for training (pd.read_csv(path/'cnndm_sample.csv')), but shouldn’t you use at least a train and a validation set?

Edit:
The answer is:
dblock = DataBlock(blocks=blocks, get_x=ColReader('article'), get_y=ColReader('highlights'), splitter=RandomSplitter())
The RandomSplitter splits the single csv file into separate training and validation dataloaders.
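So for a custom dataset, all you need is your own csv; the splitter handles the train/validation division at DataBlock time. A quick self-contained check of what RandomSplitter does (the csv name and column names are placeholders):

```python
import pandas as pd
from fastai.data.transforms import RandomSplitter

# Hypothetical csv with 'article' and 'highlights' columns, like cnndm_sample.csv
df = pd.read_csv('my_summaries.csv')

# RandomSplitter returns a function mapping items -> (train idxs, valid idxs)
train_idxs, valid_idxs = RandomSplitter(valid_pct=0.2, seed=42)(df)
print(len(train_idxs), len(valid_idxs))  # roughly an 80/20 split

# Passing your own DataFrame to the tutorial's dblock.dataloaders(df) then
# yields separate training and validation dataloaders from that one csv.
```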