In human conversation, a brief delay before a response is barely noticeable. For a 150 lb bipedal robot carrying a heavy box of glass, however, inference latency can be the difference between completing the task and a catastrophic fall.
The Round-Trip Bottleneck
When a robot looks at its environment, the camera frame must be encoded and pushed through a local Vision-Language Model (or relayed over WiFi to a cloud server). The model's output must then be parsed, decoded into a motor torque trajectory, and sent back to the limbs. The industry standard for safe autonomous operation requires this entire loop to complete in roughly 20 to 50 milliseconds. Large language models struggle to meet that budget because each inference pass carries severe compute overhead.
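To make the budget concrete, here is a minimal Python sketch that adds up per-stage latencies for the loop described above and checks the total against the upper end of the 20 to 50 millisecond target. The stage names follow the text, but every per-stage number is a hypothetical placeholder rather than a measurement, and the `Stage`, `round_trip_ms`, and `within_budget` names are invented for illustration.

```python
from dataclasses import dataclass

# Upper end of the 20-50 ms loop target mentioned in the text.
BUDGET_MS = 50.0


@dataclass(frozen=True)
class Stage:
    """One step of the perception-to-actuation loop."""
    name: str
    latency_ms: float


def round_trip_ms(stages):
    """Total loop latency: the stages run one after another, so costs add."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages, budget_ms=BUDGET_MS):
    """True if the whole loop finishes inside the latency budget."""
    return round_trip_ms(stages) <= budget_ms


# Hypothetical numbers, chosen only to illustrate the arithmetic.
LOCAL_PIPELINE = [
    Stage("encode camera frame", 4.0),
    Stage("vision-language model inference", 30.0),
    Stage("parse model output", 1.0),
    Stage("decode to motor torque trajectory", 3.0),
    Stage("send to limbs", 2.0),
]

# Same stages, plus an assumed WiFi round trip to a cloud server.
CLOUD_PIPELINE = LOCAL_PIPELINE + [Stage("WiFi round trip", 25.0)]

if __name__ == "__main__":
    for label, pipeline in (("local", LOCAL_PIPELINE), ("cloud", CLOUD_PIPELINE)):
        total = round_trip_ms(pipeline)
        verdict = "within" if within_budget(pipeline) else "over"
        print(f"{label}: {total:.1f} ms ({verdict} the {BUDGET_MS:.0f} ms budget)")
```

Because the stages are sequential, any single slow step consumes budget that every other step needs. With these placeholder figures, model inference alone takes more than half of the 50 ms allowance, and adding a network hop pushes the loop past it.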