Slow speed when using a fine-tuned BERT for prediction

I am currently experimenting with sentence classification using different BERT models. In particular, I have a training corpus of around 4000 binary-labelled tweets. I have already fine-tuned BERT and BERTweet, and my goal now is to use them to predict labels for a new inflow of tweets.

My main issue is performance. When I load the stored models and use them to predict on a new corpus of tweets (roughly the same size as the one I fine-tuned on), prediction is extremely slow. I haven't timed it precisely, but fine-tuning took around 5-10 minutes with tensorflow-metal on a MacBook Pro M1, while the prediction stage can easily take 2 hours. I am assuming this has something to do with the way I am storing/loading the models.
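
For reference, a rough way to time the prediction pass would be something like this (`model` and `tf_dataset` refer to the objects built further below):

    import time

    start = time.perf_counter()
    preds = model.predict(tf_dataset)          # the slow step
    print(f"Prediction took {time.perf_counter() - start:.1f} s")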

Here is a snippet of the code I use to do so (note that I omit the data preprocessing, as it is done in the same fashion as in the Hugging Face tutorials):

    import tensorflow as tf
    import click
    from tensorflow.keras.optimizers import Adam
    from tensorflow.keras.optimizers.schedules import PolynomialDecay
    from transformers import TFAutoModelForSequenceClassification

    model = TFAutoModelForSequenceClassification.from_pretrained(
        'bert-base-uncased', num_labels=2)
    num_epochs = 3
    # len(tf_train_dataset) is the number of batches per epoch
    num_train_steps = len(tf_train_dataset) * num_epochs
    lr_scheduler = PolynomialDecay(
        initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
    )
    opt = Adam(learning_rate=lr_scheduler)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    with tf.device('/gpu:0'):
        model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])
        click.secho('Fine-tuning the BERT-base model', fg='yellow', bold=True)
        model.fit(
            tf_train_dataset,
            validation_data=tf_test_dataset,
            epochs=num_epochs,  # same number of epochs the LR schedule was set up for
            class_weight=class_weights
        )
         

Now, to save and load the model:

    # save the fine-tuned model, then reload it together with the base tokenizer
    model.save_pretrained(bert_name)
    model = TFAutoModelForSequenceClassification.from_pretrained(f"{cd_models}/{bert_name}")
    tokenizer = AutoTokenizer.from_pretrained(
        "bert-base-uncased", padding='max_length', truncation=True)

The prediction stage follows the same pattern as the training stage (perhaps I am making some mistake here):

    import numpy as np
    import tensorflow as tf
    from datasets import Dataset
    from transformers import DataCollatorWithPadding

    click.secho('Tokenizing with the BERT tokenizer', fg='blue', bg='white')
    dataset = Dataset.from_pandas(X1[['tidy_tweet']])

    def tokenize_function(example):
        return tokenizer(example["tidy_tweet"], truncation=True)

    tokenized_dataset = dataset.map(tokenize_function, batched=True)
    # pad each batch dynamically to the longest tweet in the batch
    data_collator = DataCollatorWithPadding(
        tokenizer=tokenizer, return_tensors="tf")
    tf_dataset = tokenized_dataset.to_tf_dataset(
        columns=["attention_mask", "input_ids", "token_type_ids"],
        collate_fn=data_collator,
        batch_size=1,
        shuffle=False
    )
    # I changed this to cpu to check if it is faster
    with tf.device('/gpu:0'):
        preds = model.predict(tf_dataset)
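
For completeness, the logits can then be turned into class ids with an argmax; a minimal sketch, assuming the two-label model from above:

    import numpy as np

    # depending on the transformers version, preds may be dict-like ("logits") or an
    # output object (preds.logits); either way it holds an array of shape (num_tweets, 2)
    predicted_labels = np.argmax(preds["logits"], axis=-1)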

            
     

Thanks in advance :slightly_smiling_face:

EDIT: I just tried running the prediction on the CPU rather than the GPU; it is still not extremely fast, but it improved a lot. Maybe the GPU does not work that well in this case?
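
In case it is relevant, this is roughly how I would check whether TensorFlow actually sees the Metal GPU and force a CPU run for comparison:

    import tensorflow as tf

    # should list the Metal GPU if tensorflow-metal is set up correctly
    print(tf.config.list_physical_devices("GPU"))

    # run the same prediction explicitly on the CPU to compare timings
    with tf.device('/cpu:0'):
        preds_cpu = model.predict(tf_dataset)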