Generating text word by word

Hello

I’m currently using GPT-J to generate text, as shown below. This works well, but it takes up to 5 seconds to generate the 100 tokens.

Is it possible to do the generation word by word or sentence by sentence, similar to what ChatGPT does? (ChatGPT appears to produce its output word by word.)

import torch
import transformers
from transformers import GPTJForCausalLM
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", pad_token='<|endoftext|>', eos_token='<|endoftext|>', truncation_side='left')
model = GPTJForCausalLM.from_pretrained(
            "EleutherAI/gpt-j-6B",
            revision="float16",
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            use_cache=True,
            gradient_checkpointing=True,
        )
model.to("cuda")
prompt = tokenizer("This is a test sentence, which should be completed", return_tensors='pt', truncation=True, max_length=2000)
prompt = {key: value.to("cuda") for key, value in prompt.items()}
out = model.generate(**prompt,
                     min_length=16,
                     max_new_tokens=100,
                     do_sample=True,
                     top_k=15,
                     top_p=0.9,
                     temperature=1,
                     no_repeat_ngram_size=4,
                     use_cache=True,
                     pad_token_id=tokenizer.eos_token_id,
                     )
res = tokenizer.decode(out[0], clean_up_tokenization_spaces=True)

(Note: n and batch_size are not valid arguments to model.generate, and clean_up_tokenization_spaces belongs in tokenizer.decode, so I have removed/moved them above.)

Does somebody know a solution?

Use a streamer. Recent versions of transformers (4.28 or newer) support streaming through the streamer argument of model.generate. A TextIteratorStreamer pushes decoded text chunks into a queue as they are generated; since generate blocks until it is done, you run it in a background thread and iterate over the streamer in the main thread:

from threading import Thread
from transformers import TextIteratorStreamer

# skip_prompt=True streams only the newly generated text, not the prompt
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

generation_kwargs = dict(**prompt,
                         min_length=16,
                         max_new_tokens=100,
                         do_sample=True,
                         top_k=15,
                         top_p=0.9,
                         temperature=1,
                         no_repeat_ngram_size=4,
                         use_cache=True,
                         pad_token_id=tokenizer.eos_token_id,
                         streamer=streamer,
                         )

# generate() blocks, so run it in a background thread
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

# the streamer yields new text as soon as it becomes available
for new_text in streamer:
    print(new_text, end="", flush=True)

thread.join()

There are plenty of examples of this pattern out there.
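If you just want the tokens printed to the terminal as they arrive, there is an even simpler variant that needs no thread (a minimal sketch, reusing the model, tokenizer, and prompt from the question): TextStreamer writes each decoded chunk directly to stdout.

from transformers import TextStreamer

# TextStreamer prints each decoded chunk to stdout as soon as it is generated
streamer = TextStreamer(tokenizer, skip_prompt=True)

out = model.generate(**prompt,
                     max_new_tokens=100,
                     do_sample=True,
                     streamer=streamer,
                     )

Note that both streamer classes currently support only batch size 1, which matches your setup.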