Ruben Gomez-Ojeda, Francisco-Angel Moreno, David Zuniga-Noel, Davide Scaramuzza, Javier Gonzalez-Jimenez, PL-SLAM: A Stereo SLAM System Through the Combination of Points and Line Segments, IEEE Transactions on Robotics, Vol. 35 (3), 2019. (Journal Article)
Traditional approaches to stereo visual simultaneous localization and mapping (SLAM) rely on point features to estimate the camera trajectory and build a map of the environment. In low-textured environments, though, it is often difficult to find a sufficient number of reliable point features and, as a consequence, the performance of such algorithms degrades. This paper proposes PL-SLAM, a stereo visual SLAM system that combines both points and line segments to work robustly in a wider variety of scenarios, particularly in those where point features are scarce or not well-distributed in the image. PL-SLAM leverages both points and line segments at all stages of the process: visual odometry, keyframe selection, bundle adjustment, etc. We also contribute a loop-closure procedure based on a novel bag-of-words approach that exploits the combined descriptive power of the two kinds of features. Additionally, the resulting map is richer and more diverse in three-dimensional elements, which can be exploited to infer valuable, high-level scene structures, such as planes, empty spaces, ground plane, etc. (not addressed in this paper). Our proposal has been tested with several popular datasets (such as EuRoC or KITTI) and compared with state-of-the-art methods such as ORB-SLAM2, revealing a more robust performance in most of the experiments while still running in real time. An open-source version of the PL-SLAM C++ code has been released for the benefit of the community.
Timo Stoffregen, Guillermo Gallego, Tom Drummond, Lindsay Kleeman, Davide Scaramuzza, Event-Based Motion Segmentation by Motion Compensation, In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2019-11-27. (Conference or Workshop Paper published in Proceedings)
In contrast to traditional cameras, whose pixels have a common exposure time, event-based cameras are novel bio-inspired sensors whose pixels work independently and asynchronously output intensity changes (called "events") with microsecond resolution. Since events are caused by the apparent motion of objects, event-based cameras sample visual information based on the scene dynamics and are, therefore, a more natural fit than traditional cameras to acquire motion, especially at high speeds, where traditional cameras suffer from motion blur. However, distinguishing between events caused by different moving objects and by the camera's ego-motion is a challenging task. We present the first per-event segmentation method for splitting a scene into independently moving objects. Our method jointly estimates the event-object associations (i.e., segmentation) and the motion parameters of the objects (or the background) by maximization of an objective function, which builds upon recent results on event-based motion compensation. We provide a thorough evaluation of our method on a public dataset, outperforming the state of the art by as much as 10%. We also show the first quantitative evaluation of a segmentation algorithm for event cameras, yielding around 90% accuracy at 4 pixels relative displacement.
Daniel Gehrig, Antonio Loquercio, Konstantinos Derpanis, Davide Scaramuzza, End-to-End Learning of Representations for Asynchronous Event-Based Data, In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 2019-11-27. (Conference or Workshop Paper published in Proceedings)
Event cameras are vision sensors that record asynchronous streams of per-pixel brightness changes, referred to as "events". They have appealing advantages over frame-based cameras for computer vision, including high temporal resolution, high dynamic range, and no motion blur. Due to the sparse, non-uniform spatio-temporal layout of the event signal, pattern recognition algorithms typically aggregate events into a grid-based representation and subsequently process it by a standard vision pipeline, e.g., a Convolutional Neural Network (CNN). In this work, we introduce a general framework to convert event streams into grid-based representations by means of strictly differentiable operations. Our framework comes with two main advantages: (i) it allows learning the input event representation together with the task-dedicated network in an end-to-end manner, and (ii) it lays out a taxonomy that unifies the majority of extant event representations in the literature and identifies novel ones. Empirically, we show that our approach to learning the event representation end-to-end yields an improvement of approximately 12% on optical flow estimation and object recognition over state-of-the-art methods.
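The grid-based conversion described in this abstract can be illustrated with a small sketch. The snippet below is a hedged, plain-NumPy example of one common representation from the family the paper unifies: a voxel grid in which each event's polarity is distributed over neighboring temporal bins with a fixed bilinear kernel (the paper instead learns such kernels end to end); the function name, grid size, and data layout are illustrative assumptions.

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, H, W, num_bins=5):
    """Accumulate events (pixel coords xs, ys; timestamps ts; polarities ps
    in {-1, +1}) into a (num_bins, H, W) voxel grid with bilinear weights
    along the temporal axis, which keeps the operation differentiable."""
    grid = np.zeros((num_bins, H, W), dtype=np.float32)
    t = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9) * (num_bins - 1)
    t0 = np.floor(t).astype(int)
    frac = t - t0
    # Each event contributes to bins t0 and t0 + 1 with complementary weights.
    np.add.at(grid, (t0, ys, xs), (1.0 - frac) * ps)
    np.add.at(grid, (np.clip(t0 + 1, 0, num_bins - 1), ys, xs), frac * ps)
    return grid

# Example with four synthetic events on a 4x4 sensor.
xs, ys = np.array([0, 1, 2, 3]), np.array([0, 1, 2, 3])
ts, ps = np.array([0.0, 0.3, 0.6, 1.0]), np.array([1, -1, 1, 1])
print(events_to_voxel_grid(xs, ys, ts, ps, H=4, W=4).shape)  # (5, 4, 4)
```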
Florian Fuchs, Reinforcement Learning for Race Car Driving in Gran Turismo Sport, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Master's Thesis)
In this work, we present the application of state-of-the-art Reinforcement Learning algorithms to develop a real-time autonomous control policy for the car racing simulation Gran Turismo Sport. Such a policy needs to deal with the highly non-linear state space experienced during racing. Previous work in the area of Optimal Control often explicitly modeled the dynamics of cars to generate close-to-optimal trajectories for a controller to follow. We present a model-free approach that learns to implicitly plan and follow trajectories, using a neural network to directly generate steering, throttle, and brake signals. The approach does not require a map of the environment but only uses features measured on the go. The trained policy learned human-like behavior, such as cutting curves and stable drifting. In the presented benchmark setting, we achieve lap times comparable to those of human expert drivers, being faster than the median A-ranked driver in the reference dataset used, A being the highest rank in Gran Turismo Sport, held by only ~2% of human drivers. The new agent furthermore outperforms the built-in Gran Turismo AI by a margin of ~4%.
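As a hedged illustration of the policy architecture sketched in this abstract (a network that maps on-the-go feature measurements directly to steering, throttle, and brake), the minimal PyTorch module below uses made-up input and layer sizes and simple output squashing; it is not the thesis' actual network.

```python
import torch
import torch.nn as nn

class RacingPolicy(nn.Module):
    """Tiny feed-forward policy: track/vehicle features -> control signals."""
    def __init__(self, num_features=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(num_features, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, 3)  # raw steering, throttle, brake

    def forward(self, obs):
        out = self.head(self.body(obs))
        steering = torch.tanh(out[..., 0])     # steering angle in [-1, 1]
        throttle = torch.sigmoid(out[..., 1])  # throttle in [0, 1]
        brake = torch.sigmoid(out[..., 2])     # brake in [0, 1]
        return steering, throttle, brake

policy = RacingPolicy()
s, t, b = policy(torch.randn(1, 64))  # one observation of 64 features
```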
Titus Cieslewski, Konstantinos G Derpanis, Davide Scaramuzza, SIPs: Succinct Interest Points from Unsupervised Inlierness Probability Learning, In: 2019 International Conference on 3D Vision (3DV), IEEE, 2019-10-16. (Conference or Workshop Paper published in Proceedings)
A wide range of computer vision algorithms rely on identifying sparse interest points in images and establishing correspondences between them. However, only a subset of the initially identified interest points results in true correspondences (inliers). In this paper, we seek a detector that finds the minimum number of points that are likely to result in an application-dependent "sufficient" number of inliers k. To quantify this goal, we introduce the "k-succinctness" metric. Extracting a minimum number of interest points is attractive for many applications, because it can reduce computational load, memory, and data transmission. Alongside succinctness, we introduce an unsupervised training methodology for interest point detectors that is based on predicting the probability of a given pixel being an inlier. In comparison to previous learned detectors, our method requires the least amount of data pre-processing. Our detector and other state-of-the-art detectors are extensively evaluated with respect to succinctness on popular public datasets covering both indoor and outdoor scenes, and both wide and narrow baselines. In certain cases, our detector is able to obtain an equivalent number of inliers with as few as 60% of the points required by other detectors. The code and trained networks are provided at https://github.com/uzh-rpg/sips2_open.
Titus Cieslewski, Andreas Ziegler, Davide Scaramuzza, Exploration Without Global Consistency Using Local Volume Consolidation, In: IFRR International Symposium on Robotics Research (ISRR), Hanoi, 2019, IEEE, IFRR, 2019-10-06. (Conference or Workshop Paper published in Proceedings)
Titus Cieslewski, Michael Bloesch, Davide Scaramuzza, Matching Features without Descriptors: Implicitly Matched Interest Points, In: British Machine Vision Conference (BMVC), Cardiff, 2019, arXiv, BMVA, 2019-09-09. (Conference or Workshop Paper published in Proceedings)
Jeffrey Delmerico, Stefano Mintchev, Alessandro Giusti, Boris Gromov, Kamilo Melo, Tomislav Horvat, Cesar Cadena, Marco Hutter, Auke Ijspeert, Dario Floreano, Luca M Gambardella, Roland Siegwart, Davide Scaramuzza, The current state and future outlook of rescue robotics, Journal of Field Robotics, Vol. 36 (7), 2019. (Journal Article)
Robotic technologies, whether they are remotely operated vehicles, autonomous agents, assistive devices, or novel control interfaces, offer many promising capabilities for deployment in real world environments. Post-disaster scenarios are a particularly relevant target for applying such technologies, due to the challenging conditions faced by rescue workers and the possibility to increase their efficacy while decreasing the risks they face. However, field deployable technologies for rescue work have requirements for robustness, speed, versatility, and ease of use that may not be matched by the state of the art in robotics research. This paper aims to survey the current state of the art in ground and aerial robots, marine and amphibious systems, and human-robot control interfaces and assess the readiness of these technologies with respect to the needs of first responders and disaster recovery efforts. We have gathered expert opinions from emergency response stakeholders and researchers who conduct field deployments with them in order to understand these needs, and we present this assessment as a way to guide future research toward technologies that will make an impact in real world disaster response and recovery.
Cedric Scheerlinck, Henri Rebecq, Timo Stoffregen, Nick Barnes, Robert Mahony, Davide Scaramuzza, CED: Color Event Camera Dataset, In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, 2019-07-16. (Conference or Workshop Paper published in Proceedings)
Event cameras are novel, bio-inspired visual sensors, whose pixels output asynchronous and independent timestamped spikes at local intensity changes, called 'events'. Event cameras offer advantages over conventional frame-based cameras in terms of latency, high dynamic range (HDR) and temporal resolution. Until recently, event cameras have been limited to outputting events in the intensity channel; however, recent advances have resulted in the development of color event cameras, such as the Color-DAVIS346. In this work, we present and release the first Color Event Camera Dataset (CED), containing 50 minutes of footage with both color frames and events. CED features a wide variety of indoor and outdoor scenes, which we hope will help drive forward event-based vision research. We also present an extension of the event camera simulator ESIM that enables simulation of color events. Finally, we present an evaluation of three state-of-the-art image reconstruction methods that can be used to convert the Color-DAVIS346 into a continuous-time, HDR, color video camera to visualise the event stream, and for use in downstream vision applications.
Guillermo Gallego, Mathias Gehrig, Davide Scaramuzza, Focus Is All You Need: Loss Functions for Event-Based Vision, In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2019-07-15. (Conference or Workshop Paper published in Proceedings)
Event cameras are novel vision sensors that output pixel-level brightness changes ("events") instead of traditional video frames. These asynchronous sensors offer several advantages over traditional cameras, such as high temporal resolution, very high dynamic range, and no motion blur. To unlock the potential of such sensors, motion compensation methods have recently been proposed. We present a collection and taxonomy of twenty-two objective functions to analyze event alignment in motion compensation approaches. We call them focus loss functions since they have strong connections with functions used in traditional shape-from-focus applications. The proposed loss functions allow bringing mature computer vision tools to the realm of event cameras. We compare the accuracy and runtime performance of all loss functions on a publicly available dataset, and conclude that the variance, the gradient magnitude, and the Laplacian magnitude are among the best loss functions. The applicability of the loss functions is shown on multiple tasks: rotational motion, depth, and optical flow estimation. The proposed focus loss functions allow us to unlock the outstanding properties of event cameras.
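One member of the taxonomy, the variance of the image of warped events, is easy to sketch. The hedged NumPy example below assumes a simple constant image-plane velocity as the motion model (the paper covers richer models such as rotation and depth); sharper event alignment yields a higher variance, so the best motion candidate maximizes this focus score.

```python
import numpy as np

def variance_focus(xs, ys, ts, vx, vy, H, W):
    """Warp events to a reference time with candidate velocity (vx, vy),
    accumulate them into an image of warped events (IWE), return its variance."""
    t_ref = ts[0]
    xw = np.round(xs - vx * (ts - t_ref)).astype(int)
    yw = np.round(ys - vy * (ts - t_ref)).astype(int)
    ok = (xw >= 0) & (xw < W) & (yw >= 0) & (yw < H)  # keep in-bounds events
    iwe = np.zeros((H, W))
    np.add.at(iwe, (yw[ok], xw[ok]), 1.0)
    return iwe.var()

def best_velocity(xs, ys, ts, H, W, candidates):
    """Grid search: pick the candidate velocity with the sharpest IWE."""
    return max(candidates, key=lambda v: variance_focus(xs, ys, ts, v[0], v[1], H, W))
```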
Henri Rebecq, Rene Ranftl, Vladlen Koltun, Davide Scaramuzza, Events-To-Video: Bringing Modern Computer Vision to Event Cameras, In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2019-07-15. (Conference or Workshop Paper published in Proceedings)
Event cameras are novel sensors that report brightness changes in the form of asynchronous “events” instead of intensity frames. They have significant advantages over conventional cameras: high temporal resolution, high dynamic range, and no motion blur. Since the output of event cameras is fundamentally different from that of conventional cameras, it is commonly accepted that they require the development of specialized algorithms to accommodate the particular nature of events. In this work, we take a different view and propose to apply existing, mature computer vision techniques to videos reconstructed from event data. We propose a novel recurrent network to reconstruct videos from a stream of events, and train it on a large amount of simulated event data. Our experiments show that our approach surpasses state-of-the-art reconstruction methods by a large margin (> 20%) in terms of image quality. We further apply off-the-shelf computer vision algorithms to videos reconstructed from event data on tasks such as object classification and visual-inertial odometry, and show that this strategy consistently outperforms algorithms that were specifically designed for event data. We believe that our approach opens the door to bringing the outstanding properties of event cameras to an entirely new range of tasks. A video of the experiments is available at https://youtu.be/IdYrC4cUO0I.
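The reconstruction pipeline the abstract describes (event windows in, intensity frames out, with state carried between windows) can be caricatured with a few lines of PyTorch. The ConvGRU-style cell below is a stand-in for the paper's full recurrent architecture, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyRecurrentReconstructor(nn.Module):
    """Minimal stateful network: one event tensor in, one intensity frame out."""
    def __init__(self, num_bins=5, hidden=16):
        super().__init__()
        self.gates = nn.Conv2d(num_bins + hidden, 2 * hidden, 3, padding=1)
        self.cand = nn.Conv2d(num_bins + hidden, hidden, 3, padding=1)
        self.out = nn.Conv2d(hidden, 1, 1)
        self.hidden = hidden

    def forward(self, voxel, state):
        if state is None:
            state = voxel.new_zeros(voxel.size(0), self.hidden,
                                    voxel.size(2), voxel.size(3))
        z, r = torch.sigmoid(self.gates(torch.cat([voxel, state], 1))).chunk(2, 1)
        cand = torch.tanh(self.cand(torch.cat([voxel, r * state], 1)))
        state = (1 - z) * state + z * cand            # ConvGRU state update
        return torch.sigmoid(self.out(state)), state  # frame in [0, 1], new state

net, state = TinyRecurrentReconstructor(), None
for _ in range(3):                     # three consecutive event windows
    voxel = torch.randn(1, 5, 32, 32)  # stand-in for a grid of recent events
    frame, state = net(voxel, state)   # one reconstructed frame per window
```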
Yanchao Yang, Antonio Loquercio, Davide Scaramuzza, Stefano Soatto, Unsupervised Moving Object Detection via Contextual Information Separation, In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2019-07-15. (Conference or Workshop Paper published in Proceedings)
We propose an adversarial contextual model for detecting moving objects in images. A deep neural network is trained to predict the optical flow in a region using information from everywhere else but that region (context), while another network attempts to make such context as uninformative as possible. The result is a model where hypotheses naturally compete with no need for explicit regularization or hyper-parameter tuning. Although our method requires no supervision whatsoever, it outperforms several methods that are pre-trained on large annotated datasets. Our model can be thought of as a generalization of classical variational generative region-based segmentation, but in a way that avoids explicit regularization or solution of partial differential equations at run-time.
Jeffrey Delmerico, Titus Cieslewski, Henri Rebecq, Matthias Faessler, Davide Scaramuzza, Are We Ready for Autonomous Drone Racing? The UZH-FPV Drone Racing Dataset, In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019-06-20. (Conference or Workshop Paper published in Proceedings)
Despite impressive results in visual-inertial state estimation in recent years, high-speed trajectories with six-degree-of-freedom motion remain challenging for existing estimation algorithms. Aggressive trajectories feature large accelerations and rapid rotational motions, and when they pass close to objects in the environment, this induces large apparent motions in the vision sensors, all of which increase the difficulty of estimation. Existing benchmark datasets do not address these types of trajectories, instead focusing on slow-speed or constrained trajectories, targeting other tasks such as inspection or driving. We introduce the UZH-FPV Drone Racing dataset, consisting of over 27 sequences, with more than 10 km of flight distance, captured on a first-person-view (FPV) racing quadrotor flown by an expert pilot. The dataset features camera images, inertial measurements, event-camera data, and precise ground truth poses. These sequences are faster and more challenging, in terms of apparent scene motion, than any existing dataset. Our goal is to enable advancement of the state of the art in aggressive motion estimation by providing a dataset that is beyond the capabilities of existing state estimation algorithms.
Zichao Zhang, Davide Scaramuzza, Beyond Point Clouds: Fisher Information Field for Active Visual Localization, In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019-06-20. (Conference or Workshop Paper published in Proceedings)
For mobile robots to localize robustly, actively considering the perception requirement at the planning stage is essential. In this paper, we propose a novel representation for active visual localization. By carefully formulating the Fisher information and sensor visibility, we are able to summarize the localization information into a discrete grid, namely the Fisher information field. The information for arbitrary poses can then be computed from the field in constant time, without the need to iterate over all the 3D landmarks, which is costly. Experimental results on simulated and real-world data show the great potential of our method in efficient active localization and perception-aware planning. To benefit related research, we release our implementation of the information field to the public.
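A heavily simplified sketch of the idea: precompute, for every cell of a position grid, a scalar localization-information score from the surrounding 3D landmarks, so that evaluating a candidate position later is a constant-time lookup rather than a loop over all landmarks. The isotropic visibility model and the scalar score (instead of a full Fisher information matrix over poses) are simplifying assumptions of this illustration, not the paper's formulation.

```python
import numpy as np

def build_info_field(landmarks, grid_min, cell_size, grid_shape, max_range=10.0):
    """landmarks: (N, 3) array. Returns a grid of information scores."""
    field = np.zeros(grid_shape)
    for idx in np.ndindex(*grid_shape):
        center = grid_min + (np.array(idx) + 0.5) * cell_size
        d = np.linalg.norm(landmarks - center, axis=1)
        visible = d < max_range
        # Nearby visible landmarks contribute more information (1/d^2 falloff).
        field[idx] = np.sum(1.0 / np.maximum(d[visible], 0.5) ** 2)
    return field

def query_info(field, position, grid_min, cell_size):
    """Constant-time lookup of the precomputed score at a query position."""
    idx = tuple(((position - grid_min) / cell_size).astype(int))
    return field[idx]

landmarks = np.random.rand(500, 3) * 20.0
field = build_info_field(landmarks, grid_min=np.zeros(3), cell_size=2.0,
                         grid_shape=(10, 10, 10))
print(query_info(field, np.array([5.0, 5.0, 2.0]), np.zeros(3), 2.0))
```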
Samuel Bryner, Guillermo Gallego, Henri Rebecq, Davide Scaramuzza, Event-based, Direct Camera Tracking from a Photometric 3D Map using Nonlinear Optimization, In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019-06-20. (Conference or Workshop Paper published in Proceedings)
Event cameras are novel bio-inspired vision sensors that output pixel-level intensity changes, called “events”, instead of traditional video images. These asynchronous sensors naturally respond to motion in the scene with very low latency (microseconds) and have a very high dynamic range. These features, along with a very low power consumption, make event cameras an ideal sensor for fast robot localization and wearable applications, such as AR/VR and gaming. Considering these applications, we present a method to track the 6-DOF pose of an event camera in a known environment, which we contemplate to be described by a photometric 3D map (i.e., intensity plus depth information) built via classic dense 3D reconstruction algorithms. Our approach uses the raw events directly, without intermediate features, within a maximum-likelihood framework to estimate the camera motion that best explains the events via a generative model. We successfully evaluate the method using both simulated and real data, and show improved results over the state of the art. We release the datasets to the public to foster reproducibility and research in this topic.
Elia Kaufmann, Mathias Gehrig, Philipp Foehn, Rene Ranftl, Alexey Dosovitskiy, Vladlen Koltun, Davide Scaramuzza, Beauty and the Beast: Optimal Methods Meet Learning for Drone Racing, In: 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019-06-20. (Conference or Workshop Paper published in Proceedings)
Autonomous micro aerial vehicles still struggle with fast and agile maneuvers, dynamic environments, imperfect sensing, and state estimation drift. Autonomous drone racing brings these challenges to the fore. Human pilots can fly a previously unseen track after a handful of practice runs. In contrast, state-of-the-art autonomous navigation algorithms require either a precise metric map of the environment or a large amount of training data collected in the track of interest. To bridge this gap, we propose an approach that can fly a new track in a previously unseen environment without a precise map or expensive data collection. Our approach represents the global track layout with coarse gate locations, which can be easily estimated from a single demonstration flight. At test time, a convolutional network predicts the poses of the closest gates along with their uncertainty. These predictions are incorporated by an extended Kalman filter to maintain optimal maximum-a-posteriori estimates of gate locations. This allows the framework to cope with misleading high-variance estimates that could stem from poor observability or lack of visible gates. Given the estimated gate poses, we use model predictive control to quickly and accurately navigate through the track. We conduct extensive experiments in the physical world, demonstrating agile and robust flight through complex and diverse previously-unseen race tracks. The presented approach was used to win the IROS 2018 Autonomous Drone Race Competition, outracing the second-placed team by a factor of two.
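The fusion step described above can be hedged into a few lines: treat each gate as a 3D position with covariance and fold in the network's prediction with a Kalman update, so that high-variance predictions barely move the estimate. The identity measurement model and the made-up numbers below are illustrative assumptions, not the paper's full EKF over gate poses.

```python
import numpy as np

def gate_kalman_update(x, P, z, R):
    """x, P: current gate position estimate (3,) and covariance (3, 3).
    z, R: predicted gate position and its predicted covariance."""
    S = P + R                        # innovation covariance (H = identity)
    K = P @ np.linalg.inv(S)         # Kalman gain
    x_new = x + K @ (z - x)
    P_new = (np.eye(3) - K) @ P
    return x_new, P_new

x, P = np.array([5.0, 0.0, 1.5]), np.eye(3)         # prior gate estimate
z, R = np.array([5.3, 0.2, 1.4]), np.eye(3) * 0.25  # confident network prediction
x, P = gate_kalman_update(x, P, z, R)                # estimate moves toward z
```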
Zichao Zhang, Davide Scaramuzza, Rethinking Trajectory Evaluation for SLAM: a Probabilistic, Continuous-Time Approach, In: ICRA19 Workshop on Dataset Generation and Benchmarking of SLAM Algorithms for Robotics and VR/AR, arxiv, IEEE, 2019-05-20. (Conference or Workshop Paper published in Proceedings)
Despite the existence of different error metrics for trajectory evaluation in SLAM, their theoretical justifications and connections are rarely studied, and few methods handle temporal association properly. In this work, we propose to formulate the trajectory evaluation problem in a probabilistic, continuous-time framework. By modeling the groundtruth as random variables, the concepts of absolute and relative error are generalized to likelihoods. Moreover, the groundtruth is represented as a piecewise Gaussian Process in continuous time. Within this framework, we are able to establish theoretical connections between relative and absolute error metrics and handle temporal association in a principled manner.
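For contrast with the probabilistic, continuous-time formulation proposed here, the hedged NumPy sketch below shows the conventional discrete-time absolute trajectory error on positions, with temporal association done by linearly interpolating the groundtruth at the estimate timestamps; trajectory alignment (e.g., a similarity transform) is omitted for brevity.

```python
import numpy as np

def ate_position_rmse(t_est, p_est, t_gt, p_gt):
    """t_*: (N,) and (M,) timestamps; p_*: (N, 3) and (M, 3) positions."""
    # Associate in time: interpolate groundtruth at the estimate timestamps.
    p_gt_interp = np.stack(
        [np.interp(t_est, t_gt, p_gt[:, k]) for k in range(3)], axis=1)
    err = np.linalg.norm(p_est - p_gt_interp, axis=1)
    return np.sqrt(np.mean(err ** 2))
```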
Roy Rutishauser, Robust Fiducial Marker Detection with Fully Convolutional Neural Network, University of Zurich, Faculty of Business, Economics and Informatics, 2019. (Master's Thesis)
Fiducial marker systems offer an alternative means of camera pose estimation to keypoint-based methods. Modern systems such as ArUco work with square-shaped markers with an external black border enclosing a black-and-white bit pattern. Their artificial appearance makes them easy to spot in many real-world environments. Nevertheless, state-of-the-art methods still perform poorly under challenging conditions such as motion blur, difficult view angles, small scale, or non-uniform lighting. We propose a new detection system based on fully convolutional neural networks trained on synthetic data. By introducing several visual and spatial transformations to the synthetic markers, we aim to add more robustness than current detection systems provide. With synthetic and real-world experiments, we show that our method is in fact able to detect more markers from a greater distance, distorted by motion blur, or under difficult lighting conditions.
Henri Rebecq, Rene Ranftl, Vladlen Koltun, Davide Scaramuzza, High Speed and High Dynamic Range Video with an Event Camera, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43 (6), 2019. (Journal Article)
Event cameras are novel sensors that report brightness changes in the form of a stream of asynchronous "events" instead of intensity frames. They offer significant advantages with respect to conventional cameras: high temporal resolution, high dynamic range, and no motion blur. While the stream of events encodes in principle the complete visual signal, the reconstruction of an intensity image from a stream of events is an ill-posed problem in practice. Existing reconstruction approaches are based on hand-crafted priors and strong assumptions about the imaging process as well as the statistics of natural images. In this work we propose to learn to reconstruct intensity images from event streams directly from data instead of relying on any hand-crafted priors. We propose a novel recurrent network to reconstruct videos from a stream of events, and train it on a large amount of simulated event data. During training we propose to use a perceptual loss to encourage reconstructions to follow natural image statistics. We further extend our approach to synthesize color images from color event streams. Our quantitative experiments show that our network surpasses state-of-the-art reconstruction methods by a large margin in terms of image quality (>20%), while comfortably running in real time. We show that the network is able to synthesize high framerate videos (> 5,000 frames per second) of high-speed phenomena (e.g. a bullet hitting an object) and is able to provide high dynamic range reconstructions in challenging lighting conditions. As an additional contribution, we demonstrate the effectiveness of our reconstructions as an intermediate representation for event data. We show that off-the-shelf computer vision algorithms can be applied to our reconstructions for tasks such as object classification and visual-inertial odometry and that this strategy consistently outperforms algorithms that were specifically designed for event data. We release the reconstruction code and a pre-trained model.
Florentin Liebmann, Simon Roner, Marco von Atzigen, Davide Scaramuzza, Reto Sutter, Jess Snedeker, Mazda Farshad, Philipp Fürnstahl, Pedicle screw navigation using surface digitization on the Microsoft HoloLens, International Journal of Computer Assisted Radiology and Surgery, Vol. 14 (7), 2019. (Journal Article)
Purpose
In spinal fusion surgery, imprecise placement of pedicle screws can result in poor surgical outcome or may seriously harm a patient. Patient-specific instruments and optical systems have been proposed for improving precision through surgical navigation compared to freehand insertion. However, existing solutions are expensive and cannot provide in situ visualizations. Recent technological advancement enabled the production of more powerful and precise optical see-through head-mounted displays for the mass market. The purpose of this laboratory study was to evaluate whether such a device is sufficiently precise for the navigation of lumbar pedicle screw placement.
Methods
A novel navigation method, tailored to run on the Microsoft HoloLens, was developed. It comprises capturing the intraoperatively reachable surface of vertebrae to achieve registration and tool tracking with real-time visualizations, without the need for intraoperative imaging. For both surface sampling and navigation, 3D printable parts, equipped with fiducial markers, were employed. Accuracy was evaluated within a self-built setup based on two phantoms of the lumbar spine. Computed tomography (CT) scans of the phantoms were acquired to carry out preoperative planning of screw trajectories in 3D. A surgeon placed the guiding wire for the pedicle screw bilaterally on ten vertebrae guided by the navigation method. Postoperative CT scans were acquired to compare trajectory orientation (3D angle) and screw insertion points (3D distance) with respect to the planning.
Results
The mean errors between planned and executed screw insertion were 3.38° ± 1.73° for the screw trajectory orientation and 2.77 ± 1.46 mm for the insertion points (see the sketch after this abstract for how these two metrics are computed). The mean time required for surface digitization was 125 ± 27 s.
Conclusions
First promising results under laboratory conditions indicate that precise lumbar pedicle screw insertion can be achieved by combining HoloLens with our proposed navigation method. As a next step, cadaver experiments need to be performed to confirm the precision on real patient anatomy.
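A hedged NumPy sketch of the two error metrics reported in the Results above: the 3D angle between the planned and executed screw trajectory directions, and the 3D distance between the planned and executed insertion points. The example vectors are made up for illustration.

```python
import numpy as np

def trajectory_angle_deg(d_planned, d_executed):
    """Angle (degrees) between two 3D trajectory direction vectors."""
    d1 = d_planned / np.linalg.norm(d_planned)
    d2 = d_executed / np.linalg.norm(d_executed)
    return np.degrees(np.arccos(np.clip(np.dot(d1, d2), -1.0, 1.0)))

def insertion_point_error(p_planned, p_executed):
    """Euclidean distance (same units as the inputs, e.g., mm)."""
    return np.linalg.norm(p_planned - p_executed)

angle = trajectory_angle_deg(np.array([0.0, 0.2, 1.0]), np.array([0.05, 0.25, 1.0]))
dist = insertion_point_error(np.array([10.0, 5.0, 2.0]), np.array([12.0, 6.0, 2.5]))
```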