Contribution Details

Type	Conference or Workshop Paper
Scope	Discipline-based scholarship
Published in Proceedings	Yes
Title	Recurrent Vision Transformers for Object Detection with Event Cameras
Organization Unit	Department of Informatics (Burkhard Stiller)
Authors	Mathias Gehrig, Davide Scaramuzza
Presentation Type	paper
Item Subtype	Original Work
Refereed	Yes
Status	Published in final form
Language	English
ISBN	979-8-3503-0129-8
ISSN	1063-6919
Page Range	13884 - 13893
Event Title	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Event Type	conference
Event Location	Vancouver, BC, Canada
Event Start Date	June 18, 2023
Event End Date	June 22, 2023
Series Name	IEEE Conference on Computer Vision and Pattern Recognition. Proceedings
Publisher	Institute of Electrical and Electronics Engineers
Abstract Text	We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with submillisecond latency at a high dynamic range and with strong robustness against motion blur. These unique properties offer great potential for low-latency object detection and tracking in time-critical scenarios. Prior work in event-based vision has achieved outstanding detection performance but at the cost of substantial inference time, typically beyond 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, we reduce inference time by a factor of 6 while retaining similar performance. To achieve this, we explore a multi-stage design that utilizes three key concepts in each stage: first, a convolutional prior that can be regarded as a conditional positional embedding; second, local and dilated global self-attention for spatial feature interaction; third, recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection, achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time, RVTs offer fast inference (< 12 ms on a T4 GPU) and favorable parameter efficiency (5× fewer parameters than prior art). Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
Digital Object Identifier	10.1109/CVPR52729.2023.01334