Abstract
Transformers have emerged as powerful implicit rendering models, capable of performing geometric reasoning and producing photorealistic novel views in a single feedforward pass. A central challenge in these architectures is how to inject camera parameters into the transformer in a way that generalises across diverse sensing conditions. In this work, we present RoRE (Rotary Ray Embedding), an approach that embeds image patches directly as rays using a learned rotary position embedding (RoPE). This ray-based formulation provides a unified and general representation, improving robustness to unconventional camera geometries and sensing modalities. We evaluate our approach on conventional perspective imagery, fisheye cameras, and multi-modal RGB-thermal setups, showing that a single network can flexibly integrate arbitrary numbers of cameras and modalities into a coherent scene representation. Experiments demonstrate improved generalisation and cross-modal consistency compared to existing methods, highlighting the potential of relative ray-based embeddings to build adaptable, plug-and-play vision systems.
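To make the ray-based formulation concrete, the snippet below sketches one way per-patch rays can be derived from standard pinhole intrinsics and extrinsics. The function name and the exact ray parameterisation are illustrative assumptions, not the RoRE implementation.

```python
# Minimal sketch (PyTorch): one ray per ViT patch, assuming a pinhole camera.
import torch

def patch_rays(K, cam_to_world, image_size, patch_size):
    """Return (origins, directions) for one ray per patch centre.

    K            : (3, 3) camera intrinsics.
    cam_to_world : (4, 4) camera-to-world extrinsics.
    image_size   : (H, W) in pixels.
    patch_size   : ViT patch size in pixels.
    """
    H, W = image_size
    # Pixel coordinates of each patch centre.
    ys = torch.arange(patch_size / 2, H, patch_size)
    xs = torch.arange(patch_size / 2, W, patch_size)
    v, u = torch.meshgrid(ys, xs, indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).to(K)   # (h, w, 3)

    # Back-project through the intrinsics, then rotate into the world frame.
    dirs_cam = pix @ torch.linalg.inv(K).T                        # (h, w, 3)
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    # All rays of a pinhole camera share the camera centre as origin.
    origins = cam_to_world[:3, 3].expand_as(dirs_world)
    return origins.reshape(-1, 3), dirs_world.reshape(-1, 3)
```

A fisheye or other non-perspective camera would simply swap the back-projection step for its own ray model while keeping the same per-patch ray interface.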
Multiple images with different modalities are encoded using a ViT. This network can render novel views of the captured scene in any of the input modalities.
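For illustration, the sketch below shows how patch tokens from several cameras and modalities could be assembled into a single sequence for a shared ViT encoder; the per-modality projections and token layout are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiModalTokenizer(nn.Module):
    """Project each modality's patches to a shared width and concatenate them."""

    def __init__(self, channels, dim=768, patch=16):
        super().__init__()
        # One patch-embedding layer per modality, all mapping to the same width
        # so a single ViT can attend jointly over every view and modality.
        self.embed = nn.ModuleDict({
            name: nn.Conv2d(c, dim, kernel_size=patch, stride=patch)
            for name, c in channels.items()
        })

    def forward(self, images):
        """images: dict mapping modality name -> (B, C, H, W) tensor."""
        tokens = []
        for name, img in images.items():
            t = self.embed[name](img)                    # (B, dim, h, w)
            tokens.append(t.flatten(2).transpose(1, 2))  # (B, h*w, dim)
        return torch.cat(tokens, dim=1)                  # one joint token sequence

# Example: one RGB view and one thermal view of the same scene.
tokenizer = MultiModalTokenizer({"rgb": 3, "thermal": 1})
tokens = tokenizer({"rgb": torch.rand(1, 3, 224, 224),
                    "thermal": torch.rand(1, 1, 224, 224)})  # (1, 392, 768)
```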
Masked Inputs
Our model is trained to handle masked inputs, allowing it to reason geometrically about the scene even when portions of the input images are missing. As you increase the masking percentage, observe how the model continues to render coherent views of the scene, demonstrating its robust geometric reasoning.
Hint: Drag the slider to change the number of masked tokens.
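As a rough sketch of how input-token masking might be applied during training (the masking ratio and the learned mask token below are assumptions for illustration, not the exact training recipe):

```python
import torch

def mask_tokens(tokens, mask_token, ratio=0.5):
    """Replace a random subset of input tokens with a learned mask token.

    tokens     : (B, N, D) patch tokens.
    mask_token : (D,) learned embedding used in place of dropped tokens.
    ratio      : fraction of tokens to mask.
    """
    B, N, _ = tokens.shape
    n_mask = int(N * ratio)
    # Independently sample which tokens to mask in each batch element.
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_mask]
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, idx, True)
    masked = torch.where(mask.unsqueeze(-1), mask_token, tokens)
    return masked, mask
```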
BibTeX
@inproceedings{rore2026,
  title={RoRE: Rotary Ray Embedding for Generalised Multi-Modal Scene Understanding},
  author={Ryan Griffiths and Donald G. Dansereau},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=BR2ItBcqOo}
}