New RSS paper: AR-VLA is an autoregressive action expert that boosts the temporal consistency of VLAs and diffusion policies!

AR-VLA [1] is a standalone autoregressive action expert that generates actions as a continuous causal sequence, conditioned on a (refreshable) vision-language prefix:

✅ Existing VLAs and diffusion policies reset temporal context with each new observation.
✅ Instead, our expert infers actions that are consistent with the long history it maintains internally, as well as with vision-language context that it pulls asynchronously.
✅ This structure addresses the frequency mismatch between fast control and slow perception. It enables efficient, independent pre-training of kinematic syntax and modular integration with a costly perception backbone.
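
To make the frequency mismatch concrete, here is a toy sketch of the control flow described in the points above. It is not the AR-VLA implementation. `ToyActionExpert`, `refresh_prefix`, `slow_perception`, the scalar actions, and the smoothing rule are all hypothetical stand-ins chosen for illustration. The only thing it tries to show is that the action history persists across steps while the vision-language prefix is refreshed at a lower rate.

```python
from collections import deque
from typing import Deque, List, Optional


class ToyActionExpert:
    """Hypothetical stand-in for an autoregressive action expert (not AR-VLA itself)."""

    def __init__(self, history_len: int = 16, smoothing: float = 0.8) -> None:
        # Action history kept across perception updates (never reset here).
        self.history: Deque[float] = deque(maxlen=history_len)
        # Latest vision-language context; a scalar "target" in this toy.
        self.prefix: Optional[float] = None
        self.smoothing = smoothing

    def refresh_prefix(self, context: float) -> None:
        # Swap in new context without touching the action history.
        self.prefix = context

    def step(self) -> float:
        # The next action depends on the previous action and the current prefix.
        # A learned expert would condition on the whole history; this toy
        # uses only the most recent action to stay short.
        previous = self.history[-1] if self.history else 0.0
        target = self.prefix if self.prefix is not None else previous
        action = self.smoothing * previous + (1.0 - self.smoothing) * target
        self.history.append(action)
        return action


def slow_perception(step_index: int) -> float:
    # Hypothetical perception backbone: the scene "changes" at step 20.
    return 1.0 if step_index < 20 else -1.0


def run(control_steps: int = 40, perception_period: int = 10) -> List[float]:
    expert = ToyActionExpert()
    actions: List[float] = []
    for t in range(control_steps):
        # Perception runs once every `perception_period` control steps.
        if t % perception_period == 0:
            expert.refresh_prefix(slow_perception(t))
        # Control runs at every step, reusing the most recent prefix.
        actions.append(expert.step())
    return actions


if __name__ == "__main__":
    trajectory = run()
    max_jump = max(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
    print(f"largest step-to-step change: {max_jump:.3f}")
```

In this toy, a change in the perceived target at step 20 shifts the trajectory gradually instead of restarting it, because the history is carried through the prefix refresh. A chunk-based head that rebuilds its context at each observation would have no such carry-over.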

AR-VLA is a drop-in replacement for traditional chunk-based action heads in specialist or generalist policies. It shows superior history awareness and smoother action trajectories, with success rates equal to or better than the state of the art!

Check out the code & videos: https://arvla.insait.ai

References

[1] AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models. Yutong Hu, Jan-Nico Zaech, Nikolay Nikolov, Yuanqi Yao, Sombit Dey, Giuliano Albanese, Renaud Detry, Luc Van Gool, and Danda Paudel. In Robotics: Science and Systems, 2026.

@inproceedings{hu2026a,
  author = {Hu, Yutong and Zaech, Jan-Nico and Nikolov, Nikolay and Yao, Yuanqi and Dey, Sombit and Albanese, Giuliano and Detry, Renaud and Van Gool, Luc and Paudel, Danda},
  title = {AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models},
  year = {2026},
  archiveprefix = {arXiv},
  booktitle = {Robotics: Science and Systems},
  primaryclass = {cs.RO},
  url = {https://arxiv.org/abs/2603.10126}
}