I have fine-tuned an ELECTRA model on some data using the PyTorch framework, and now I want to apply the model to new text data.
First I load my best model according to validation loss:
trained_model = CrowdCodedTagger.load_from_checkpoint(
    trainer.checkpoint_callback.best_model_path,
    n_classes=len(LABEL_COLUMNS)
)
trained_model.eval()
trained_model.freeze()
Then I load my data, which looks like this (head(5)):
tweet_id user_username text created_at user_name user_verified sourcetweet_text
443011743288393728 jahimes People are now using @metronorth like a subway... 2014-03-10T13:13:25.000Z Jim Himes True NaN
443011451142537216 jahimes Spent morning on @metronorth issues with Rep. ... 2014-03-10T13:12:15.000Z Jim Himes True NaN
442389699978862592 jahimes Will be interesting to see how that St. Patric... 2014-03-08T20:01:38.000Z Jim Himes True NaN
442387206767136768 jahimes Step dancing, boiled meat, and beer at the Hib... 2014-03-08T19:51:43.000Z Jim Himes True NaN
442356433993363458 jahimes What a reception for #Team26 in Greenwich! htt... 2014-03-08T17:49:27.000Z Jim Himes True NaN
I make the text variable into a list:
congress_head_list = congress_head['text'].tolist()
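(For reference, this is just the standard pandas column-to-list conversion; the tiny frame below is a stand-in for my real congress_head, not the actual data.)

```python
import pandas as pd

# Toy stand-in for congress_head; only the 'text' column matters here
congress_head = pd.DataFrame({"text": ["first tweet", "second tweet"]})

# Series.tolist() turns the column into a plain Python list of strings
congress_head_list = congress_head["text"].tolist()
print(congress_head_list)
```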
And now trouble occurs. I want to add my model's predictions (as probabilities) for each sentence (text) to the dataframe in new columns. So far I've come up with:
def run_model(input_data):
    # tokenize list
    encoding = tokenizer.encode_plus(
        congress_head_list,
        add_special_tokens=True,
        max_length=512,
        return_token_type_ids=False,
        padding="max_length",
        return_attention_mask=True,
        return_tensors='pt',
    )
    # returning probability values for each label
    _, test_prediction = trained_model(encoding["input_ids"], encoding["attention_mask"])
    return test_prediction.flatten().numpy()
Then I apply the function to each row:
congress_head[LABEL_COLUMNS] = congress_head[['text']].apply(run_model, axis=1, result_type='expand')
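To double-check that the apply/result_type='expand' pattern itself behaves the way I expect, I tried a toy version with a dummy scorer in place of the real model (the label names and data below are placeholders):

```python
import numpy as np
import pandas as pd

LABEL_COLUMNS = ["label_a", "label_b"]  # stand-ins for my real label columns
congress_head = pd.DataFrame({"text": ["first tweet", "second tweet"]})

def dummy_scores(row):
    # row is a Series holding one row of the [['text']] sub-frame;
    # returning an array of len(LABEL_COLUMNS) values lets result_type='expand'
    # spread them into separate columns
    return np.array([len(row["text"]) / 100, 0.5])

congress_head[LABEL_COLUMNS] = congress_head[["text"]].apply(
    dummy_scores, axis=1, result_type="expand"
)
print(congress_head)
```

With the dummy scorer each row gets its own values, so the expansion into columns itself seems to work as intended.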
But the resulting dataframe contains the same probabilities for every row, something like:
tweet_id user_username text created_at user_name user_verified sourcetweet_text morality_binary emotion_binary ... negative_binary care_binary fairness_binary authority_binary sanctity_binary harm_binary injustice_binary betrayal_binary subversion_binary degradation_binary
443011743288393728 jahimes People are now using @metronorth like a subway... 2014-03-10T13:13:25.000Z Jim Himes True NaN 0.094099 0.119907 ... 0.098311 0.045509 0.045513 0.044468 0.037584 0.045439 0.051038 0.034683 0.047893 0.053268
443011451142537216 jahimes Spent morning on @metronorth issues with Rep. ... 2014-03-10T13:12:15.000Z Jim Himes True NaN 0.094099 0.119907 ... 0.098311 0.045509 0.045513 0.044468 0.037584 0.045439 0.051038 0.034683 0.047893 0.053268
Can anyone see what I'm doing wrong? The output values don't look like probabilities either. I can share a link to the Colab notebook where I'm working if that's helpful.
Thanks in advance!