Hi there,
I am new to the transformer.
While fine-tuning bloom-560m for phishing-email detection, I try to give the whole email a single label:
```python
def tokenizeInputs(inputs):
    tokenized_inputs = tokenizer(inputs["email"], max_length=512, truncation=True)
    word_ids = tokenized_inputs.word_ids()  # not used below
    label = inputs["label"]  # phishing or not
    tokenized_inputs["labels"] = [label]
    return tokenized_inputs
```
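To show what I mean by "one label per email", here is a toy sketch of the difference between sequence-level and token-level labeling. The whitespace "tokenizer" and the 0/1 label mapping are stand-ins, not the real bloom tokenizer:

```python
# Toy sketch: contrast sequence-level labeling (one label per email)
# with token-level labeling (one label per token). Splitting on
# whitespace here stands in for a real subword tokenizer.
email = "Thank you Katie. I will be with David as well."
label = 1  # assumed mapping: 1 = phishing, 0 = legitimate

tokens = email.split()

# Sequence classification: the whole email gets exactly one label.
seq_example = {"input_ids": tokens, "labels": label}

# Token classification: the model expects one label per token,
# so `labels` must be as long as `input_ids`.
tok_example = {"input_ids": tokens, "labels": [label] * len(tokens)}

print(len(seq_example["input_ids"]), seq_example["labels"])  # 10 1
print(len(tok_example["labels"]))                            # 10
```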
So each email should end up with exactly one label, right?
But after training, when I try to get the output:
```python
inputs = tokenizer(
    # "HuggingFace is a company based in Paris and New York",
    "Thank you Katie.\nI will be with David as well.\n",
    add_special_tokens=False, return_tensors="pt"
)
# inputs = tokenizer(example["email"])

with torch.no_grad():
    logits = model_tuned(**inputs).logits
print(logits)

predicted_token_class_ids = logits.argmax(-1)
print(predicted_token_class_ids[0])

# Note that tokens are classified rather than input words, which means
# there might be more predicted token classes than words.
# Multiple token classes might account for the same word.
predicted_tokens_classes = [model_tuned.config.id2label[t.item()] for t in predicted_token_class_ids[0]]
predicted_tokens_classes
```
The result looks like:

```
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```
I get a prediction for each token, but not a single prediction for the whole email.
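In case it helps explain what I am after: a token-classification head emits one logit vector per token, so to get one email-level answer the per-token logits would have to be pooled somehow. A toy sketch with made-up numbers (no transformers needed), pooling by averaging logits over tokens:

```python
# Toy illustration: a token-classification head yields one logit
# vector per token; averaging the logits over tokens and taking the
# argmax gives a single prediction for the whole sequence.
token_logits = [
    [2.0, -1.0],   # token 1: favors class 0
    [-0.5, 1.5],   # token 2: favors class 1
    [1.0, 0.0],    # token 3: favors class 0
]

num_classes = len(token_logits[0])
mean_logits = [
    sum(tok[c] for tok in token_logits) / len(token_logits)
    for c in range(num_classes)
]
predicted_class = max(range(num_classes), key=lambda c: mean_logits[c])
print(predicted_class)  # 0
```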
I have searched for this topic but found little that helped.
Could you guys advise me on this? Thanks.