New RSS paper: AR-VLA is an autoregressive action expert that boosts the temporal consistency of VLAs and diffusion policies!

AR-VLA [1] is a standalone autoregressive action expert that generates actions as a continuous causal sequence, conditioned on a (refreshable) vision-language prefix:

✅ Existing VLAs and diffusion policies reset temporal context with each new observation.
✅ Instead, our expert infers actions that are consistent with the long history it maintains internally, as well as with vision-language context that it pulls asynchronously.
✅ This structure addresses the frequency mismatch between fast control and slow perception. It enables efficient independent pre-training of kinematic syntax, and a modular integration with a costly perception backbone.

AR-VLA is a drop-in replacement for traditional chunk-based action heads in specialist or generalist policies. It shows superior history awareness and smoother action trajectories, with success rate equal or superior to SoA!

Check out the code & videos: https://arvla.insait.ai

References

  1. [1]
    AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models.
    Yutong Hu, Jan-Nico Zaech, Nikolay Nikolov, Yuanqi Yao, Sombit Dey, Giuliano Albanese, Renaud Detry, Luc Van Gool, and Danda Paudel.
    In Robotics: Science and Systems, 2026.