Incremental decoding with T5

deathcrush · November 29, 2022, 12:36pm

Recently, we have seen evidence that, across a variety of tasks, it can help for a model to attend over intermediate computation steps while solving the task. An example is ReAct: Synergizing Reasoning and Acting in Language Models – Google AI Blog (googleblog.com). The authors cite work from the neural program synthesis community where this approach was found beneficial.

Let’s assume we are processing conversations, where the context grows progressively longer as the user and agent interact. Typically, we would re-encode the dialogue history and generate the answer from scratch for every interaction. Schematically, this could be represented as follows:

step 1: [usr] sent_1 → answer_1
step 2: [usr] sent_1 [agent] sent_1 [usr] sent_2 → answer_2
…
step k: [usr] sent_1 [agent] sent_1 [usr] sent_2 ... [agent] sent_{k-1} [usr] sent_k → answer_k

Above, “sent” is just an abbreviation for “sentence”. The LHS of “→” is the encoder input and the RHS is the decoder output. However, the answers are highly correlated, so arguably the model could predict more consistently if it were asked to show all the reasoning steps as the conversation progresses, instead of producing a single answer for the task. Schematically:

step 1: [usr] sent_1 → answer_1
step 2: [usr] sent_1 [agent] sent_1 [usr] sent_2 → answer_1 <sep> answer_2
…
step k: [usr] sent_1 [agent] sent_1 [usr] sent_2 ... [agent] sent_{k-1} [usr] sent_k → answer_1 <sep> answer_2 <sep> … <sep> answer_k
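
To make the schematic concrete, here is a minimal sketch of how the encoder input and decoder target for step k could be assembled as plain strings. The function name build_example, its arguments, and the example turns are hypothetical and only mirror the [usr], [agent] and <sep> markers used above; no tokenizer or model is involved.

```python
def build_example(user_turns, agent_turns, answers, k, sep=" <sep> "):
    """Return (encoder_input, decoder_target) for step k (1-indexed).

    Hypothetical helper: user_turns[i] and agent_turns[i] are the i-th
    utterances, answers[i] is the target answer for step i + 1.
    """
    parts = []
    for i in range(k):
        parts.append(f"[usr] {user_turns[i]}")
        # The agent has not replied yet to the latest user turn.
        if i < k - 1:
            parts.append(f"[agent] {agent_turns[i]}")
    encoder_input = " ".join(parts)
    # The target concatenates all answers up to and including step k.
    decoder_target = sep.join(answers[:k])
    return encoder_input, decoder_target


# Made-up turns and answers, purely for illustration.
users = ["hi", "book a table"]
agents = ["hello"]
answers = ["greet", "request_booking"]
for step in (1, 2):
    print(build_example(users, agents, answers, step))
```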

At inference time, this is problematic because concatenating the answers can lead to very long sequences if everything is generated from scratch. However, I was wondering whether the use_cache feature together with past_key_values could be used to effectively implement a memory on the decoder side. In the above, after we decode answer_1, we feed back the keys and values generated during decoding as past_key_values to decode answer_2. Then we would feed back the outputs to generate answer_3, and so on. That way the model could attend over an updated conversational context and its past answers but would not “revise” any of its previous answers.
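
The decoder-side part of this idea can be sketched as follows. This is a minimal illustration under several assumptions, not a confirmed recipe: it uses a tiny, randomly initialised T5 (so nothing is downloaded and the outputs are meaningless), made-up token ids for answer_1 and answer_2, a hypothetical id 5 standing in for <sep>, and it keeps the encoder output fixed between the two decoder passes. It only checks that feeding past_key_values plus the new decoder tokens gives the same logits as re-running the decoder on the full concatenated sequence.

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

torch.manual_seed(0)

# Tiny random T5 (illustrative sizes, an assumption of this sketch).
config = T5Config(vocab_size=100, d_model=32, d_kv=8, d_ff=64,
                  num_layers=2, num_heads=4, dropout_rate=0.0)
model = T5ForConditionalGeneration(config).eval()

# Stand-in for the tokenized dialogue context.
input_ids = torch.randint(2, 100, (1, 12))
# Decoder start token (0) followed by made-up answer_1 token ids.
answer_1 = torch.tensor([[0, 11, 12, 13]])
# Hypothetical <sep> id (5) followed by made-up answer_2 token ids.
answer_2 = torch.tensor([[5, 21, 22]])

with torch.no_grad():
    encoder_outputs = model.get_encoder()(input_ids=input_ids)

    # Pass 1: decode answer_1 and keep the cache.
    out_1 = model(encoder_outputs=encoder_outputs,
                  decoder_input_ids=answer_1, use_cache=True)

    # Pass 2: feed only the new tokens together with the cache.
    out_2 = model(encoder_outputs=encoder_outputs,
                  decoder_input_ids=answer_2,
                  past_key_values=out_1.past_key_values, use_cache=True)

    # Reference: decode the concatenated sequence from scratch.
    full = model(encoder_outputs=encoder_outputs,
                 decoder_input_ids=torch.cat([answer_1, answer_2], dim=1))

incremental_logits = torch.cat([out_1.logits, out_2.logits], dim=1)
max_diff = (incremental_logits - full.logits).abs().max().item()
print(f"max abs difference between cached and full decoding: {max_diff:.2e}")
```

Note that the sketch does not cover the harder part of the question. In the conversational setting the encoder input changes at every step, and the T5 cache also stores cross-attention keys and values derived from the encoder output, so how a reused cache interacts with a new encoder output should be checked against the installed transformers version. The model was also never trained to attend from cached decoder states to a different context, so the quality of such decoding is an open question.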

@patrickvonplaten, am I naive to think that the caching during inference could be implemented with huggingface as is?
