python src/lerobot/scripts/server/robot_client.py \
--server_address=127.0.0.1:8080 \
--robot.type=piper_follower \
--robot.port=can0 \
--robot.id=black \
--robot.cameras="{ \
top: {type: opencv, index_or_path: 0, width: 640, height: 480, fps: 30}, \
left: {type: opencv, index_or_path: 4, width: 640, height: 480, fps: 30}}" \
--task="Grasp the object and put it in the bin" \
--policy_type=smolvla \
--pretrained_name_or_path=outputs/train/piper_smolvla_pick_yellow_cars/checkpoints/020000/pretrained_model \
--policy_device=cuda \
--actions_per_chunk=50 \
--chunk_size_threshold=0.5 \
--aggregate_fn_name=weighted_average \
--debug_visualize_queue_size=True
Server log:
INFO 2025-10-01 16:08:42 y_server.py:191 Observation #0 has been filtered out
INFO 2025-10-01 16:08:42 ort/utils.py:74 <Logger policy_server (NOTSET)> Starting receiver
INFO 2025-10-01 16:08:42 y_server.py:175 Received observation #0 | Avg FPS: 27.46 | Target: 30.00 | One-way latency: 1.73ms
INFO 2025-10-01 16:08:42 y_server.py:191 Observation #0 has been filtered out
INFO 2025-10-01 16:08:42 ort/utils.py:74 <Logger policy_server (NOTSET)> Starting receiver
INFO 2025-10-01 16:08:42 y_server.py:175 Received observation #0 | Avg FPS: 27.46 | Target: 30.00 | One-way latency: 1.79ms
Client log:
INFO 2025-10-01 16:08:42 t_client.py:217 Sent observation #0 |
INFO 2025-10-01 16:08:42 t_client.py:470 Control loop (ms): 4.07
actions_available False
INFO 2025-10-01 16:08:42 t_client.py:217 Sent observation #0 |
INFO 2025-10-01 16:08:42 t_client.py:470 Control loop (ms): 5.66
actions_available False
There’s a robotics channel on the Hugging Face Discord, so asking there is probably the most reliable option.
The recently created Hugging Face science Discord, Hugging Science, may also be worth trying for this.
Missing language input. The StreamActions error on 'observation.language.tokens' means the SmolVLA policy was initialized with a language modality, but your client payload lacks the tokenized instruction. This is a known SmolVLA failure mode and surfaces as a KeyError on that field. Either send an instruction so the client or server produces tokens, or run a checkpoint/config without language. (GitHub)
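To see what the missing field looks like, here is a hedged sketch of attaching the instruction to the payload. The key name comes from the error above; the tokenizer is a toy stand-in, since the real client would use the tokenizer bundled with the SmolVLA checkpoint:

```python
# Toy whitespace tokenizer, used ONLY to illustrate the payload shape.
# A real client would use the checkpoint's own tokenizer instead.
def stub_tokenize(text: str, vocab_offset: int = 100) -> list[int]:
    return [vocab_offset + i for i, _ in enumerate(text.split())]

def build_observation(frames: dict, task: str) -> dict:
    """Attach a tokenized instruction so the policy's language branch has input."""
    obs = dict(frames)
    obs["observation.language.tokens"] = stub_tokenize(task)
    return obs

obs = build_observation(
    {"observation.images.top": None},
    "Grasp the object and put it in the bin",
)
assert "observation.language.tokens" in obs  # the key the KeyError complains about
```

If that key is present with a valid tensor, the language branch no longer raises on lookup; the helper names here (build_observation, stub_tokenize) are hypothetical.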
Duplicate timestep.
“obs number not change” together with “Skipping observation #0 – Timestep predicted already!” means the async server’s dedup logic is seeing the same step ID repeatedly, so it filters those observations out. Async only enqueues actions after accepting a new observation step, so make sure the client increments the step index (or uses fresh timestamps) for every frame. This behavior follows the async streaming contract. (Hugging Face)
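The filtering described above can be illustrated with a minimal dedup sketch (not the actual LeRobot server code, just the behavior it implies):

```python
# Minimal model of the server's dedup: remember the last accepted timestep
# and drop anything already predicted.
class TimestepFilter:
    def __init__(self) -> None:
        self.last_accepted = -1

    def accept(self, timestep: int) -> bool:
        """Return True if this step is new; False if already predicted."""
        if timestep <= self.last_accepted:
            return False  # "Timestep predicted already!" -> filtered out
        self.last_accepted = timestep
        return True

f = TimestepFilter()
assert f.accept(0) is True    # first #0 is accepted
assert f.accept(0) is False   # repeated #0 is filtered, as in the log
assert f.accept(1) is True    # incrementing the step unblocks the pipeline
```

This is why a client that keeps sending step #0 never receives actions: every observation after the first is silently dropped.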
Fixes in order:
A) Provide language or disable it
Fast path: pass a task string from the client so tokens are generated.
Example launch knobs are documented in the async guide; the policy expects language tokens whenever the language modality is enabled. (Hugging Face)
No-language path: use a SmolVLA config or checkpoint that omits the language branch, or stub observation.language.tokens on the server with a valid empty/default tensor to confirm the rest of the pipeline works. The exact KeyError has been reported and tracked. (GitHub)
B) Make steps monotonic
In your robot_client loop, increment the timestep on each send, and make sure client_timestamp increases as well. Verify that the server log advances: Received observation #0, #1, #2, …. The async tutorial describes the per-step flow and acceptance before action chunking. (Hugging Face)
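The client-side fix can be sketched as follows. The field names (timestep, client_timestamp) mirror the contract described above, while get_frames and send are hypothetical stand-ins for the real client's capture and gRPC calls:

```python
import itertools
import time

def observation_stream(get_frames, send, max_steps=None):
    """Send observations whose step IDs and timestamps never repeat."""
    counter = itertools.count()                    # 0, 1, 2, ... never reused
    if max_steps is not None:
        counter = itertools.islice(counter, max_steps)
    for timestep in counter:
        send({
            "timestep": timestep,                  # server dedups on this
            "client_timestamp": time.monotonic(),  # never goes backwards
            "frames": get_frames(),
        })
```

With a stream like this, the server should log Received observation #0, #1, #2, … instead of filtering repeats of #0.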
C) Re-prime the async pipeline
Start policy_server and confirm the model loads and prints its expected inputs.
Start robot_client with the instruction and a corrected, monotonic step counter. Actions appear only after the first valid step is accepted, which matches reports where no actions flow until the inputs are correct. (Hugging Face)
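The "confirm the model loads" part of this step can be automated with a small helper that scans the server's log output for a readiness line before launching the client. The pattern below is an assumption; match whatever your policy_server actually prints:

```python
import re

def wait_for_line(lines, pattern, max_lines=1000):
    """Return the first log line matching pattern, or None if never seen."""
    regex = re.compile(pattern)
    for i, line in enumerate(lines):
        if i >= max_lines:          # give up rather than block forever
            break
        if regex.search(line):
            return line
    return None
```

In practice you would feed this the server process's stdout and only start robot_client once the load line appears.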
Minimal verification checklist:
Server prints “Loaded SmolVLA …” and lists the expected features, including language if enabled. (Hugging Face)
First inference no longer throws the KeyError on observation.language.tokens. (GitHub)
Server log increments the observation index and stops filtering steps as “predicted already.” (Hugging Face)
Client flips to actions_available: True after the first accepted step, consistent with prior async issues. (GitHub)
Context to remember:
SmolVLA expects inputs exactly as configured at train/eval time: if language is part of the policy, the language field is required. The SmolVLA write-ups and docs describe language tokens as a standard input. (arXiv)
Async inference decouples sensing from acting; the server deduplicates per step to avoid recomputing, so reused step IDs are dropped by design. (Hugging Face)
Apply A and B, then recheck C. If anything is still blocked, paste the server’s “expected inputs” line and a dump of the received keys for one accepted step.