Hi Guys,
I am currently trying to analyze the sentiment of earnings calls using a pre-trained FinBERT model (yiyanghkust/finbert-tone · Hugging Face).
I want to analyze more than 40,000 earnings calls, and computing the sentiment scores on my notebook alone would take more than 2 weeks. Because of that, I want to use the TPU provided by Kaggle to speed this up. I have never done that before, so I don't really know how to go about it, and all the tutorials and guides I could find only cover using the TPU to train a model. I just want to apply the pre-trained model to the earnings calls without any further training.
This is my code so far. Where do I need to adjust it so that it actually takes advantage of the TPU provided by Kaggle?
First, I import the libraries:
import nltk
import tensorflow as tf
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import pipeline
Then to activate the TPU in Kaggle:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
except ValueError:
    tpu = None
    gpus = tf.config.experimental.list_logical_devices("GPU")
if tpu:
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
elif len(gpus) > 1:
    strategy = tf.distribute.MirroredStrategy([gpu.name for gpu in gpus])
    print('Running on multiple GPUs ', [gpu.name for gpu in gpus])
elif len(gpus) == 1:
    strategy = tf.distribute.get_strategy()
    print('Running on single GPU ', gpus[0].name)
else:
    strategy = tf.distribute.get_strategy()
    print('Running on CPU')
print("Number of accelerators: ", strategy.num_replicas_in_sync)
Then I build the model and run it over the calls:
finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer)
for i in range(len(clean_data)):
    print(i)
    # Get the Q&A text of the call and split it into sentences
    temp = test_data.iloc[i, 3]
    sentences = nltk.sent_tokenize(temp)
    # Classify every sentence of the call
    results = nlp(sentences)
    filename = clean_data.iloc[i, 0]
    # Count the number of positive, neutral and negative sentences in the call
    positive = 0
    neutral = 0
    negative = 0
    for j in range(len(results)):
        label = results[j]["label"]
        if label == "Positive":
            positive = positive + 1
        elif label == "Neutral":
            neutral = neutral + 1
        else:
            negative = negative + 1
    # Calculate the sentiment scores as shares of all sentences in the call
    per_pos_qanda = positive / len(results)
    per_neg_qanda = negative / len(results)
    net_score_qanda = per_pos_qanda - per_neg_qanda
    # Save the results in a DataFrame
    finbert_results.iloc[i, 0] = filename
    finbert_results.iloc[i, 7] = per_pos_qanda
    finbert_results.iloc[i, 8] = per_neg_qanda
    finbert_results.iloc[i, 9] = net_score_qanda
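To make the scoring step easier to look at separately from the model, here is a minimal sketch of the same calculation as a standalone helper. The name `sentiment_scores` is hypothetical and not part of any library. It assumes the pipeline output is a list of dicts with a "label" key, as in the loop above. Unlike the loop, it counts only the label "Negative" as negative instead of everything that is not positive or neutral, and it returns zeros for a call with no sentences, which is my own assumption.

```python
from collections import Counter


def sentiment_scores(results):
    """Return (share_positive, share_negative, net_score) for one call.

    `results` is a list of dicts with a "label" key, one per sentence.
    """
    if not results:
        # Assumption: an empty call gets neutral scores instead of
        # raising a ZeroDivisionError.
        return 0.0, 0.0, 0.0
    counts = Counter(r["label"] for r in results)
    n = len(results)
    share_pos = counts["Positive"] / n
    share_neg = counts["Negative"] / n
    return share_pos, share_neg, share_pos - share_neg


# Hand-written example input, not real model output
example = [
    {"label": "Positive", "score": 0.98},
    {"label": "Positive", "score": 0.91},
    {"label": "Neutral", "score": 0.87},
    {"label": "Negative", "score": 0.95},
]
per_pos, per_neg, net = sentiment_scores(example)
print(per_pos, per_neg, net)
```

This part runs on the CPU and is cheap, so the time is spent in the `nlp(sentences)` call and not in the counting.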
But when I run this code in Kaggle with the TPU accelerator turned on, it is not faster at all. So where do I need to adjust the code to actually take advantage of the TPU?
Many thanks in advance!