Returning score associated with prediction_value from loaded_tokenizer

trevorwooddmu · October 9, 2024, 4:33pm

Hi. I’m trying to get the %age score corresponding to prediction_value using the following python code snippet

predict_input = loaded_tokenizer.encode(email_text, truncation=True, padding=True, return_tensors="tf")
output = loaded_model(predict_input)[0]

prediction_value = tf.argmax(output, axis=1).numpy()[0]

loaded_model and loaded_tokenizer have been trained earlier using distilbert.

I’ve tried using TextClassificationPipeline, but unfortunately “email_text” has too many tokens (more than 512) so doesn’t work. If I truncate email_text it returns incorrect prediction_value.

Thanks in advance

John6666 · October 10, 2024, 4:32am

I searched to see if it was some kind of bug or configuration issue, but it seems to be a deep-rooted problem.
By the way, you can pass options to the pipeline for the model and tokenizer in addition to the pipeline options. Check the model description to see what options are available. It would be easy to solve this problem.

github.com/huggingface/transformers

processing texts longer than 512 tokens with token-classification pipeline

opened 12:58PM - 17 Jan 22 UTC

closed 03:07PM - 24 Feb 22 UTC

oliverguhr

# 🚀 Feature request Currently, the token-classification pipeline truncates in…put texts longer than 512 tokens. It would be great if the pipeline could process texts of any length. ## Motivation This issue is a result of this feature request: huggingface/hub-docs#11 As suggested I am tagging @Narsil here. ## Your contribution For my [punctuation prediction model](https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large), I wrote an inference code for this task. This code segments a longer text into 512 tokens chunks with some overlap and reconstructs the final output string. The code however is rather complex and tailored to my punctuation prediction use case. So I am not sure if this is useful here.

trevorwooddmu · October 11, 2024, 11:39am

Thank you. It haddn’t occured to me that the input could be chunked. Although it’ll make the classification slower, if it makes it more accurate, that’ll be really helpful. I’ll look into it

Topic		Replies	Views
How do I setup a TextClassificationPipeline that truncates token sequences Beginners	0	334	September 29, 2021
Looking for tool class to do predictions 🤗Transformers	3	565	October 9, 2020
My input sentence is very long(more than 512). What should I do when I want to fintune model about classify?Thanks Beginners	3	1120	September 3, 2021
How to stop at 512 tokens when sending text to pipeline? 🤗Transformers	2	1522	February 7, 2024
Sentiment analysis for long sequences Beginners	3	2318	December 7, 2020

Returning score associated with prediction_value from loaded_tokenizer

Related topics