I am new to transformers.
When fine-tuning bloom-560m for phishing email classification, I try to assign one label to the whole email:
tokenized_inputs = tokenizer(inputs["email"], max_length=512, truncation=True)
word_ids = tokenized_inputs.word_ids()
label = inputs["label"]
labels = label  # phishing or not
tokenized_inputs["labels"] = [labels]
so it should have the whole email with one label, right?
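To show the shape mismatch I suspect, here is a minimal sketch in plain Python (no transformers needed, hypothetical values only): a token-classification head trains against one label per token, while I supplied a single label for the whole email.

```python
# Hypothetical values for illustration only.
num_tokens = 12   # e.g. len(tokenized_inputs["input_ids"])
email_label = 1   # phishing = 1, not phishing = 0

# What I supplied: a single label for the whole email
labels_sequence = [email_label]

# What a token-classification head trains against: one label per token
labels_per_token = [email_label] * num_tokens

print(len(labels_sequence))   # 1
print(len(labels_per_token))  # 12
```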
but after training when I try to get the output:
inputs = tokenizer(
    # "HuggingFace is a company based in Paris and New York",
    'Thank you Katie.\nI will be with David as well.\n',
    return_tensors="pt",
)
# inputs = tokenizer(example["email"], return_tensors="pt")
logits = model_tuned(**inputs).logits
predicted_token_class_ids = logits.argmax(-1)
# Note that tokens are classified rather than input words, which means
# there might be more predicted token classes than words.
# Multiple token classes might account for the same word.
predicted_tokens_classes = [model_tuned.config.id2label[t.item()] for t in predicted_token_class_ids]
result is like:
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
I got a prediction for each token, but not one for the whole email.
I have tried to search this topic and found little that helps.
Could you advise me on this? Thanks.
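For what it's worth, here is a hedged sketch of how per-token predictions could be collapsed into one email-level label. The `logits` array stands in for `model_tuned(**inputs).logits` with shape (batch, seq_len, num_labels); the values are random placeholders, not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 12, 2))  # hypothetical (batch, seq_len, num_labels)

# Option A: mean-pool the logits over the sequence, then argmax
pooled_pred = logits.mean(axis=1).argmax(axis=-1)  # shape (1,)

# Option B: majority vote over per-token predictions
token_preds = logits.argmax(axis=-1)               # shape (1, 12)
vote_pred = np.bincount(token_preds[0], minlength=2).argmax()

print(pooled_pred, vote_pred)
```

I understand the cleaner fix is probably a sequence-classification head rather than post-hoc pooling, but the sketch shows why I currently see one prediction per token.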