Fine-tune a BERT model in SageMaker using a custom dataset

Hi,

I have been following the tutorial The Partnership: Amazon SageMaker and Hugging Face to fine-tune a BERT model on my own dataset in SageMaker, and I am having trouble understanding how the data needs to be pre-processed and uploaded to S3 so that train.py picks it up properly. It seems that all the examples use datasets hosted on the Hugging Face Hub instead of local files.

I have my dataset in a CSV file, which I read with pd.read_csv instead of the datasets library functions, and this is mainly where my issue comes from. In the tutorial, the data is loaded like this:

# imports used by this snippet (tokenizer_name, sess, s3 and s3_prefix
# are defined earlier in the tutorial notebook)
from datasets import load_dataset
from transformers import AutoTokenizer

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load dataset
train_dataset, test_dataset = load_dataset('imdb', split=['train', 'test'])
test_dataset = test_dataset.shuffle().select(range(10000)) # shrink the test dataset to 10k samples

# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)
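
If I understand the tutorial correctly, the training script then reads those two channels back from the local directories SageMaker populates for them, roughly like this (my own sketch based on the tutorial, not code I have verified):

import os
from datasets import load_from_disk

# SageMaker copies each channel passed to estimator.fit() into the container
# and exposes the local path via the SM_CHANNEL_* environment variables
train_dataset = load_from_disk(os.environ['SM_CHANNEL_TRAIN'])
test_dataset = load_from_disk(os.environ['SM_CHANNEL_TEST'])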

And this is how I am loading my data:

import pandas as pd
from sklearn.model_selection import train_test_split

# read the local CSV and map the -1 sentiment label to 0
dataframe = pd.read_csv(dataset_filepath, sep=',')
dataframe.Sentiment.replace(-1, 0, inplace=True)

# split into train and test DataFrames
train_dataset, test_dataset = train_test_split(dataframe, test_size=0.2)
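
My assumption is that I first need to turn these pandas DataFrames into datasets.Dataset objects so that I can reuse the tutorial code, something like this (just a sketch, and the variable names are mine):

from datasets import Dataset

# convert the pandas splits into Hugging Face Dataset objects;
# reset_index avoids carrying the old DataFrame index along as an extra column
train_dataset = Dataset.from_pandas(train_dataset.reset_index(drop=True))
test_dataset = Dataset.from_pandas(test_dataset.reset_index(drop=True))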

After getting the train and test splits, I need to tokenize them, convert them to the torch format, and upload them to S3, and this is where I am stuck.

How do I run the tokenizer over the text column of the dataframe and then convert the result to the torch format so that I can upload it to S3?
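
My best guess, mirroring the tutorial after converting the DataFrames to Dataset objects as above, is something like the sketch below, but I am not sure whether renaming Sentiment to labels and calling save_to_disk with the S3 filesystem is the right approach (I am assuming here that my text column is called 'text'):

# tokenize, rename the label column and set the torch format, as in the tutorial
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# the Trainer expects the label column to be called 'labels'
train_dataset = train_dataset.rename_column('Sentiment', 'labels')
test_dataset = test_dataset.rename_column('Sentiment', 'labels')

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# upload to S3 the same way the tutorial does
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
train_dataset.save_to_disk(training_input_path, fs=s3)
test_dataset.save_to_disk(test_input_path, fs=s3)

Is this roughly what the tutorial intends, or am I missing a step?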

Thank you so much,