Fine-tune a BERT model in SageMaker using a custom dataset

Hi,

I have been following the tutorial The Partnership: Amazon SageMaker and Hugging Face to fine-tune a BERT model on my own dataset in SageMaker, and I am having trouble understanding how the data needs to be pre-processed and uploaded to S3 so that train.py picks it up properly. It seems that all the examples use datasets hosted on the Hugging Face Hub instead of local files.

I have my dataset in a CSV file, which I read with pd.read_csv instead of the datasets library functions, and this is mainly where my issue comes from. In the tutorial, the data is loaded like this:

# imports used by this snippet (tokenizer_name, sess, s3 and s3_prefix
# are defined earlier in the tutorial notebook)
from datasets import load_dataset
from transformers import AutoTokenizer

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load dataset
train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])
test_dataset = test_dataset.shuffle().select(range(10000)) # shrink the test dataset to 10k samples

# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)
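
If I understand the tutorial correctly, the training script then reads those two channels back from the local directories SageMaker populates for them, roughly like this (my own sketch based on the tutorial, not code I have verified):

import os
from datasets import load_from_disk

# SageMaker copies each channel passed to estimator.fit() into the container
# and exposes the local path via the SM_CHANNEL_* environment variables
train_dataset = load_from_disk(os.environ['SM_CHANNEL_TRAIN'])
test_dataset = load_from_disk(os.environ['SM_CHANNEL_TEST'])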

And this is how I am loading my data:

import pandas as pd
from sklearn.model_selection import train_test_split

# read the local CSV and map the -1 sentiment label to 0
dataframe = pd.read_csv(dataset_filepath, sep=',')
dataframe.Sentiment.replace(-1, 0, inplace=True)

# split into train and test DataFrames
train_dataset, test_dataset = train_test_split(dataframe, test_size=0.2)
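
My assumption is that I first need to turn these pandas DataFrames into datasets.Dataset objects so that I can reuse the tutorial code, something like this (just a sketch, and the variable names are mine):

from datasets import Dataset

# convert the pandas splits into Hugging Face Dataset objects;
# reset_index avoids carrying the old DataFrame index along as an extra column
train_dataset = Dataset.from_pandas(train_dataset.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_dataset.reset_index(drop=True))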

After getting the train and test splits, I need to tokenize them, convert them to the torch format, and upload them to S3, and this is where I am stuck.

How do I run the tokenizer over the text column of the dataframe and then convert the result to the torch format so that I can upload it to S3?
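
My best guess, mirroring the tutorial after converting the DataFrames to Dataset objects as above, is something like the sketch below, but I am not sure whether renaming Sentiment to labels and calling save_to_disk with the S3 filesystem is the right approach (I am assuming here that my text column is called 'text'):

# tokenize, rename the label column and set the torch format, as in the tutorial
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# the Trainer expects the label column to be called 'labels'
train_dataset = train_dataset.rename_column('Sentiment', 'labels')
test_dataset = test_dataset.rename_column('Sentiment', 'labels')

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# upload to S3 the same way the tutorial does
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
train_dataset.save_to_disk(training_input_path, fs=s3)
test_dataset.save_to_disk(test_input_path, fs=s3)

Is this roughly what the tutorial intends, or am I missing a step?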

Thank you so much,