Abstract
Transformers have emerged as powerful implicit rendering models, capable of performing geometric reasoning and producing photorealistic novel views in a single feedforward pass. A central challenge in these architectures is how to inject camera parameters into the transformer in a way that generalises across diverse sensing conditions. In this work, we present RoRE (Rotary Ray Embedding), an approach that embeds image patches directly as rays using a learned rotary position embedding (RoPE). This ray-based formulation provides a unified and general representation, improving robustness to unconventional camera geometries and sensing modalities. We evaluate our approach on conventional perspective imagery, fisheye cameras, and multi-modal RGB-thermal setups, showing that a single network can flexibly integrate arbitrary numbers of cameras and modalities into a coherent scene representation. Experiments demonstrate improved generalisation and cross-modal consistency compared to existing methods, highlighting the potential of relative ray-based embeddings to build adaptable, plug-and-play vision systems.
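To make the ray-based formulation concrete, the snippet below sketches one way per-patch rays can be derived from standard pinhole intrinsics and extrinsics. The function name and the exact ray parameterisation are illustrative assumptions, not the RoRE implementation.

```python
# Minimal sketch (PyTorch): one ray per ViT patch, assuming a pinhole camera.
import torch

def patch_rays(K, cam_to_world, image_size, patch_size):
    """Return (origins, directions) for one ray per patch centre.

    K            : (3, 3) camera intrinsics.
    cam_to_world : (4, 4) camera-to-world extrinsics.
    image_size   : (H, W) in pixels.
    patch_size   : ViT patch size in pixels.
    """
    H, W = image_size
    # Pixel coordinates of each patch centre.
    ys = torch.arange(patch_size / 2, H, patch_size)
    xs = torch.arange(patch_size / 2, W, patch_size)
    v, u = torch.meshgrid(ys, xs, indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).to(K)   # (h, w, 3)

    # Back-project through the intrinsics, then rotate into the world frame.
    dirs_cam = pix @ torch.linalg.inv(K).T                        # (h, w, 3)
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    # All rays of a pinhole camera share the camera centre as origin.
    origins = cam_to_world[:3, 3].expand_as(dirs_world)
    return origins.reshape(-1, 3), dirs_world.reshape(-1, 3)
```

A fisheye or other non-perspective camera would simply swap the back-projection step for its own ray model while keeping the same per-patch ray interface.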
Multiple images with different modalities are encoded using a ViT. This network can render novel views of the captured scene in any of the input modalities.
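For illustration, the sketch below shows how patch tokens from several cameras and modalities could be assembled into a single sequence for a shared ViT encoder; the per-modality projections and token layout are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiModalTokenizer(nn.Module):
    """Project each modality's patches to a shared width and concatenate them."""

    def __init__(self, channels, dim=768, patch=16):
        super().__init__()
        # One patch-embedding layer per modality, all mapping to the same width
        # so a single ViT can attend jointly over every view and modality.
        self.embed = nn.ModuleDict({
            name: nn.Conv2d(c, dim, kernel_size=patch, stride=patch)
            for name, c in channels.items()
        })

    def forward(self, images):
        """images: dict mapping modality name -> (B, C, H, W) tensor."""
        tokens = []
        for name, img in images.items():
            t = self.embed[name](img)                    # (B, dim, h, w)
            tokens.append(t.flatten(2).transpose(1, 2))  # (B, h*w, dim)
        return torch.cat(tokens, dim=1)                  # one joint token sequence

# Example: one RGB view and one thermal view of the same scene.
tokenizer = MultiModalTokenizer({"rgb": 3, "thermal": 1})
tokens = tokenizer({"rgb": torch.rand(1, 3, 224, 224),
                    "thermal": torch.rand(1, 1, 224, 224)})  # (1, 392, 768)
```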
Masked Inputs
Our model is trained to handle masked inputs, allowing it to reason geometrically about the scene even when portions of the input images are missing. As you increase the masking percentage, observe how the model continues to render coherent views of the scene, demonstrating its robust geometric reasoning.
Hint: Drag the slider to change the number of masked tokens.
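As a rough sketch of how input-token masking might be applied during training (the masking ratio and the learned mask token below are assumptions for illustration, not the exact training recipe):

```python
import torch

def mask_tokens(tokens, mask_token, ratio=0.5):
    """Replace a random subset of input tokens with a learned mask token.

    tokens     : (B, N, D) patch tokens.
    mask_token : (D,) learned embedding used in place of dropped tokens.
    ratio      : fraction of tokens to mask.
    """
    B, N, _ = tokens.shape
    n_mask = int(N * ratio)
    # Independently sample which tokens to mask in each batch element.
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_mask]
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, idx, True)
    masked = torch.where(mask.unsqueeze(-1), mask_token, tokens)
    return masked, mask
```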
BibTeX
@inproceedings{rore2026,
  title={RoRE: Rotary Ray Embedding for Generalised Multi-Modal Scene Understanding},
  author={Ryan Griffiths and Donald G. Dansereau},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=BR2ItBcqOo}
}