LangGraph supports streaming via the self.graph.stream method. However, the examples provided by Hugging Face only demonstrate direct LLM invocation (e.g., OpenAI LLMs), and it appears that streaming from a HuggingFacePipeline is not supported.
For instance, after building the graph and invoking self.graph.stream with stream_mode set to "messages", the expected behavior is for the response to stream incrementally when the generate node is triggered. Instead of streaming, however, the graph API buffers the output and only pushes the full response after completion, which suggests that the yield inside the node is not having the intended effect.
Has anyone managed to get this working? Below is an example of the generate method:
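A minimal sketch of the kind of node and graph.stream call described above (the state schema, the gpt2 pipeline, and all names here are illustrative placeholders, not the original code):

from typing import TypedDict

from langchain_huggingface.llms import HuggingFacePipeline
from langgraph.graph import StateGraph, START, END
from transformers import pipeline

# Illustrative local model; the real setup wraps its own pipeline.
llm = HuggingFacePipeline(
    pipeline=pipeline("text-generation", model="gpt2", max_new_tokens=50)
)

class State(TypedDict):
    question: str
    answer: str

def generate(state: State) -> dict:
    # The node calls the local pipeline; the question is whether its tokens
    # can be surfaced incrementally through graph.stream.
    return {"answer": llm.invoke(state["question"])}

builder = StateGraph(State)
builder.add_node("generate", generate)
builder.add_edge(START, "generate")
builder.add_edge("generate", END)
graph = builder.compile()

# stream_mode="messages" is expected to yield incremental (chunk, metadata)
# pairs, but with a local HuggingFacePipeline the full text only shows up
# once the node has finished, i.e. the behavior described above.
for chunk, metadata in graph.stream(
    {"question": "Hugging Face is"}, stream_mode="messages"
):
    print(getattr(chunk, "content", chunk), end="", flush=True)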
Hey @sheepyyy,
I’m not too familiar with LangGraph specifically, but I am with LangChain, so I tweaked a few things, and the following reproducer works fine: yield correctly streams the response chunks without any issues.
from langchain_huggingface.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
)

def generate(pipe, prompt):
    # Wrap the transformers pipeline in a LangChain LLM instance and stream from it.
    hf = HuggingFacePipeline(pipeline=pipe)
    for chunk in hf.stream(prompt):
        yield chunk

prompt = "Hugging Face is"
for chunk in generate(pipe, prompt):
    print(chunk, end="", flush=True)
If you’re implementing your own class and generate method, I’d recommend first initializing a HuggingFacePipeline instance, passing in the pipeline, and then calling the stream method on the instance instead of invoking stream as a class method.
Could you try this approach, adapt the yield to your requirements, and check whether it works for you?
If the issue persists, feel free to share more details, and we can look into this further.
Hey there, first of all thank you for your response! The solution you have there is basically where I got to before getting stuck trying to wrap it in LangGraph, which is apparently the “new way of doing things” going forward. Therefore, I was wondering whether that is a limitation of the graph.stream function itself or whether I have missed any sneaky arguments/config/parameters. I guess for now I can use this OG way to stream the LLM tokens. For reference, the graph streaming code I was referring to is this one: How to stream LLM tokens from your graph
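A rough, untested sketch of one possible workaround, based on LangGraph’s custom streaming how-to and reusing the State and llm placeholders from the sketch further up: have the node forward the pipeline’s chunks through a stream writer and read them back with stream_mode="custom". The StreamWriter injection and stream_mode="custom" names come from the LangGraph docs, not from the snippets above, so double-check them against your installed version.

from langgraph.graph import StateGraph, START, END
from langgraph.types import StreamWriter

def generate(state: State, writer: StreamWriter) -> dict:
    # Stream chunks from the HuggingFacePipeline instance (llm, as in the
    # sketch above) and push each one onto the "custom" stream as it arrives,
    # while still returning the full answer as graph state.
    chunks = []
    for chunk in llm.stream(state["question"]):
        writer(chunk)
        chunks.append(chunk)
    return {"answer": "".join(chunks)}

builder = StateGraph(State)
builder.add_node("generate", generate)
builder.add_edge(START, "generate")
builder.add_edge("generate", END)
graph = builder.compile()

# stream_mode="custom" yields whatever the node wrote via the writer.
for chunk in graph.stream({"question": "Hugging Face is"}, stream_mode="custom"):
    print(chunk, end="", flush=True)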