How to get hidden states when using a custom Pipeline?

Hey, I am using this model (“keyphrase-extraction-kbir-inspec”) to generate keyphrases from text. I intend to deploy it later on.

I intend to use the keywords it generates and also the encoder output embeddings. I saw that the config has output_hidden_states: true, but I don't know how to get at the hidden states. Maybe I am misunderstanding the inheritance.

from transformers import (
    TokenClassificationPipeline,
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from transformers.pipelines import AggregationStrategy
import numpy as np

# Define keyphrase extraction pipeline
class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def postprocess(self, model_outputs):
        results = super().postprocess(
            model_outputs=model_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return np.unique([result.get("word").strip() for result in results])

model_path = "keyphrase-extraction-kbir-inspec"
extractor = KeyphraseExtractionPipeline(model=model_path)

Even though the config is apparently loaded correctly (output_hidden_states: true), the model output contains no hidden states.
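
To illustrate, here is what I see with the extractor above (the input text is just an example):

print(extractor.model.config.output_hidden_states)  # True, so the flag is set
print(extractor("Keyphrase extraction is a technique in text analysis."))
# -> only the post-processed keyphrases (a numpy array of strings),
#    no hidden states anywhere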

Hi there! As far as I know, pipelines don’t natively accept the output_hidden_states parameter, since they’re designed to return the post-processed model outputs. (I might be wrong!)

You could do this without the pipeline interface with something like:

model_path = "ml6team/keyphrase-extraction-kbir-inspec"
config = AutoConfig.from_pretrained(model_path, output_hidden_states=True)
model = AutoModelForTokenClassification.from_pretrained(model_path, config=config)
tokenizer=AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("""
Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document.  Thanks to these keyphrases humans can understand the content of a text very quickly and easily without reading  it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail  and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents,  this process can take a lot of time. 
Here is where Artificial Intelligence comes in. Currently, classical machine learning methods, that use statistical  and linguistic features, are widely used for the extraction process. Now with deep learning, it is possible to capture  the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency,  occurrence and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies  and context of words in a text.
""", padding=True, truncation=True, return_tensors="pt")

outputs = model(**inputs)

And then you'll find that there's an outputs.hidden_states attribute with what you're looking for! If you really want to use the pipeline interface, though, you could subclass the NER pipeline and override the initialization (and any other relevant methods) to make it work. Let me know if you'd want to do that, and I can help you out!
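
For example, to inspect what comes back (the exact numbers below assume KBIR's RoBERTa-large backbone, so treat them as an assumption, not a guarantee):

# hidden_states is a tuple with one tensor per layer (the embedding output
# plus each transformer block), each of shape (batch_size, seq_len, hidden_size)
print(len(outputs.hidden_states))        # 25 for a RoBERTa-large backbone (1 + 24)
print(outputs.hidden_states[-1].shape)   # last layer, e.g. torch.Size([1, seq_len, 1024])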


Thanks for your reply, yeah, I will have to rewrite the pipeline then. I'll check tomorrow where best to start; any help is appreciated 🙂

Getting from the logits to the actual results of the pipeline doesn't look super complicated, but I might still come back to you, if your offer still stands.

I'll post the solution here in any case.


Okay, apparently I just need to override the _forward method so it doesn't “eat” the hidden states:

class KeyphraseExtractionPipeline(TokenClassificationPipeline):
    def __init__(self, model, *args, **kwargs):
        super().__init__(
            model=AutoModelForTokenClassification.from_pretrained(model),
            tokenizer=AutoTokenizer.from_pretrained(model),
            *args,
            **kwargs
        )

    def _forward(self, model_inputs):
        # Forward
        special_tokens_mask = model_inputs.pop("special_tokens_mask")
        offset_mapping = model_inputs.pop("offset_mapping", None)
        sentence = model_inputs.pop("sentence")

        outputs = self.model(**model_inputs)
        logits = outputs.logits

        # Present because the checkpoint's config has output_hidden_states=True:
        # a tuple with one tensor per layer (embeddings + each transformer block)
        hidden_state = outputs.hidden_states

        return {
            "logits": logits,
            "special_tokens_mask": special_tokens_mask,
            "offset_mapping": offset_mapping,
            "sentence": sentence,
            "hidden_state": hidden_state,  # Add hidden state to the returned dictionary
            **model_inputs,
        }

    def postprocess(self, model_outputs):
        results = super().postprocess(
            model_outputs=model_outputs,
            aggregation_strategy=AggregationStrategy.SIMPLE,
        )
        return {"keywords": np.unique([result.get("word").strip() for result in results]),
                "hidden_state": model_outputs["hidden_state"]}
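
For reference, a quick usage sketch of the patched pipeline (reusing model_path from above; the input text is just an example):

extractor = KeyphraseExtractionPipeline(model=model_path)
result = extractor("Keyphrase extraction is a technique in text analysis.")
print(result["keywords"])                # unique keyphrases as a numpy array
print(result["hidden_state"][-1].shape)  # last-layer token embeddings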