Multimodal perception for autonomous driving

  • Author: Jorge Beltrán de la Cita
  • Thesis supervisors: Arturo de la Escalera Hueso (supervisor), Fernando García Fernández (co-supervisor)
  • Thesis defense: Universidad Carlos III de Madrid (Spain), 2022
  • Language: English
  • Examination committee: Jesus Garcia Herrero (chair), Ignacio Parra Alonso (secretary), Gustavo Adolfo Peláez Coronado (member)
  • Doctoral programme: Programa de Doctorado en Ingeniería Eléctrica, Electrónica y Automática, Universidad Carlos III de Madrid
  • Abstract
    • Autonomous driving is set to play an important role in intelligent transportation systems in the coming decades. The advantages of its large-scale implementation, such as fewer accidents, shorter travel times, and optimized energy consumption, have made its development a priority for academia and industry.

      In recent years, major advances in driving assistance systems, as well as the emergence of vehicles that can perform automated tasks such as parking or following a route through relatively simple environments with little traffic, suggest that the arrival of driverless vehicles will be a tangible reality in the near future. However, there is still a long way to go to achieve true autonomous driving, capable of facing any scenario without the need for human intervention. Current systems still require, to a greater or lesser extent, a person behind the wheel, either assuming primary control, monitoring the correct operation of the software, or acting as a backup driver in case of software failure. To reach this goal, advances are still needed in some of the technologies involved, such as control, navigation, and, especially, perception of the environment. In particular, the correct detection of other road users that may interfere with the vehicle's path is fundamental, since it allows modeling the current state of traffic and, thus, making decisions accordingly.

      Understanding what is happening around the vehicle is the biggest challenge when replacing a driver. The human ability to interpret and reason about the information captured through sight is very difficult to replicate in software, due to the great variability of environments, objects, and unforeseen situations that occur during driving. However, the recent evolution of sensors, the increase in computational capacity, and the rise of deep learning have opened up the possibility of developing perception systems capable of generating detections with a precision that makes safe navigation possible. To achieve this, the mounted sensor configurations should be able to capture meaningful data from the environment under any lighting or weather conditions. Hence, most platforms are equipped with complementary devices, the most popular setup being the combination of cameras and LiDARs. The joint use of video and 3D scene geometry provides obvious advantages over systems using a single modality. However, it also poses numerous challenges that must be overcome in order to effectively exploit the signals captured by these sensors.

      The objective of this thesis is to provide solutions to some of the main challenges faced by onboard perception systems, such as extrinsic sensor calibration, object detection, and deployment on real platforms.

      First, we present an original self-calibration method tailored to automotive sensor setups composed of vision devices and multi-layer LiDAR scanners, eliminating the complex manual adjustment of these parameters. The approach makes use of a novel fiducial calibration target endowed with geometrical and visual characteristics that enable the unambiguous extraction of robust reference points in LiDAR, stereo, and monocular modalities. On the one hand, four circular holes are used to take advantage of geometrical discontinuities in LiDAR and stereo point clouds. On the other hand, four ArUco markers are placed near the corners so that 3D information can be inferred from monocular images. Once the pose of the target is estimated in each of the involved modalities, the optimal transform relating the pair of sensors is obtained through the registration of the detected 3D key points.
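
      As an illustration of the final alignment step, the sketch below estimates the rigid transform that registers the reference points detected in two modalities. It is a minimal example assuming the four corresponding 3D key points have already been extracted; the function name and the use of NumPy's closed-form SVD solution (Kabsch) are illustrative choices, not the exact implementation of the thesis.

        import numpy as np

        def register_keypoints(pts_src, pts_dst):
            """Rigid transform (R, t) mapping pts_src onto pts_dst.

            pts_src, pts_dst: (N, 3) arrays of corresponding reference points,
            e.g. the four circle centers seen by two different sensors.
            """
            c_src, c_dst = pts_src.mean(axis=0), pts_dst.mean(axis=0)
            H = (pts_src - c_src).T @ (pts_dst - c_dst)        # 3x3 cross-covariance
            U, _, Vt = np.linalg.svd(H)
            d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
            R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
            t = c_dst - R @ c_src
            return R, t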

      The method does not impose severe limits on the relative pose between the devices and is therefore suitable for sensor setups where the magnitudes of the translation and rotation parameters are substantial. Only two reasonable constraints are required. First, there must be an overlapping area between the sensors' fields of view, where the calibration target is to be placed. Second, the holes in the pattern must be clearly visible in the data retrieved by the sensors; in particular, whenever range data is involved in the calibration, each circle must be represented by at least three points. In the case of multi-layer LiDAR sensors, this means that at least two scan planes must intersect each of the circles. Moreover, the intrinsic parameters of each device are assumed to be known.
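
      The three-point requirement follows from the geometry of circle fitting: at least three non-collinear samples are needed to determine a circle's center and radius. A hypothetical algebraic least-squares fit, applied to rim points already projected onto the target plane, could look as follows (a generic Kåsa-style fit, shown only to make the constraint concrete).

        import numpy as np

        def fit_circle_2d(xy):
            """Least-squares circle fit for N >= 3 points on the target plane.

            Solves x^2 + y^2 + a*x + b*y + c = 0 and returns (cx, cy, radius).
            """
            x, y = xy[:, 0], xy[:, 1]
            A = np.column_stack([x, y, np.ones_like(x)])
            rhs = -(x ** 2 + y ** 2)
            (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
            cx, cy = -a / 2.0, -b / 2.0
            return cx, cy, np.sqrt(cx ** 2 + cy ** 2 - c)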

      The procedure is designed to be performed in a static environment. Although the method can provide a quick estimate of the extrinsic parameters with just one pose of the target, it is possible to increase the accuracy and robustness of the results by accumulating several positions, as will be shown later.

      Along with the aforementioned method, we also introduce a novel framework for the assessment of extrinsic calibration algorithms based on a simulation environment. This software provides a perfect ground truth of the transform between sensors and establishes a fair benchmark for comparing different calibration approaches through metrics that truly represent the accuracy of the final estimation. Besides, it allows testing a virtually unlimited number of sensor devices and relative poses to guarantee the generality of the results.
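
      A sketch of the kind of metric such a benchmark can report, assuming access to the simulator's ground-truth extrinsics: the translation error as the Euclidean distance between translation vectors, and the rotation error as the geodesic angle of the residual rotation (function and variable names are illustrative).

        import numpy as np

        def calibration_errors(R_est, t_est, R_gt, t_gt):
            """Compare an estimated extrinsic transform against the ground truth.

            Returns (translation_error_in_meters, rotation_error_in_degrees).
            """
            t_err = np.linalg.norm(t_est - t_gt)
            R_delta = R_est @ R_gt.T                   # residual rotation
            cos_a = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
            return t_err, np.degrees(np.arccos(cos_a))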

      An extensive set of experiments using the proposed evaluation benchmark shows that the solution outperforms all previous works based on calibration markers and offers better generalization capabilities than DNN-based alternatives, which require annotated training samples for each specific setup. Additional tests on real sensors corroborate the results obtained in the simulation environment, confirming the adequacy of the method for self-driving applications.

      Second, this thesis addresses the detection and classification of other traffic participants in 3D. Although most efforts in the field of object detection have historically been devoted to the image space, a 2D characterization of the surrounding elements is not sufficient for safe navigation.

      Even though some approaches have used the pinhole model to infer depth from the signal captured by monocular cameras, the results are not accurate enough for automotive applications. Similarly, the degradation with distance of the depth estimates computed from stereo pairs prevents their use in medium- and high-speed use cases.
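
      This degradation follows directly from the stereo relation Z = f·B / d: depth is inversely proportional to disparity, so a fixed sub-pixel matching error produces a depth error that grows roughly quadratically with range. A small numerical illustration (the focal length and baseline values are assumptions, not taken from the thesis):

        # Depth from a rectified stereo pair via the pinhole model: Z = f * B / d.
        f_px, baseline_m = 720.0, 0.54          # illustrative focal length and baseline

        def depth(disparity_px):
            return f_px * baseline_m / disparity_px

        for d in (40.0, 10.0, 4.0):
            z = depth(d)
            err = depth(d - 0.25) - z           # effect of a constant 0.25 px disparity error
            print(f"disparity {d:4.1f} px -> depth {z:6.2f} m, error {err:5.2f} m")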

      To this effect, the incorporation of range sensors into the perception pipeline is key to finding the spatial position of the surrounding road users. In this regard, LiDAR scanners have shown superior performance over radar devices, as they provide more precise and consistent measurements. Indeed, thanks to the increase in resolution of modern multi-layer LiDARs over the last decade, they have become part of the sensor configuration of most autonomous prototypes under development. With a single device of this kind, vehicles can capture dense geometrical information of the environment in 360°. As a consequence, the detection and classification of objects in the scene can rely not only on image cues but also on laser features, traditionally relegated to object localization purposes. Although this opens up opportunities for new advances in the 3D object detection field, it also poses some challenges that need to be addressed.

      On the one hand, detection frameworks that take LiDAR information as input need to handle the bulkiness and sparsity of the point clouds in real time, which is a non-trivial task. In this regard, several research lines are being explored, either considering the point cloud as-is or through a simplified representation. Using the raw cloud preserves all features and avoids any loss of information, although the uneven layout of the points in space requires new model architectures that encode meaningful descriptors from unordered data, which are often computationally expensive. To ease the design of inference networks and deal with sparsity, different alternatives have emerged. Voxel-based approaches discretize the LiDAR points into spatial cells, losing some information along the way; however, the resulting format is an ordered grid with a regular distribution of data that is much easier to process. Similarly, LiDAR projections discard some of the original information to produce 2D representations, which significantly enhance efficiency during inference. On the other hand, new fusion strategies need to be developed to effectively combine features from complementary sensors such as cameras and LiDARs, so that previous single-modality object detection approaches can be outperformed in terms of both accuracy and robustness.
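
      A minimal sketch of the discretization idea behind voxel-based representations: the raw, unordered cloud is binned into a regular occupancy grid that convolutional backbones can consume directly (the grid bounds and resolution below are arbitrary assumptions).

        import numpy as np

        def voxelize(points, grid_min=(-40.0, -40.0, -2.0),
                     voxel_size=0.2, grid_shape=(400, 400, 20)):
            """Bin an (N, 3) LiDAR cloud into a dense binary occupancy grid."""
            idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(int)
            inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
            grid = np.zeros(grid_shape, dtype=np.float32)
            grid[tuple(idx[inside].T)] = 1.0           # mark occupied cells
            return grid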

      We propose the LiDAR bird's eye view (BEV) projection as an effective trade-off between a detailed representation of the scene geometry and a relatively simple data structure that can be processed efficiently. Unlike other approaches in the literature, our novel encoding contains distance-invariant features that enable the estimation of 3D cuboids for objects of distinct categories using a single frame as input.
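
      The sketch below illustrates the general idea of a BEV encoding with a height channel and a density channel; the log-based density normalization shown here is only a stand-in for the distance-invariant encoding proposed in the thesis.

        import numpy as np

        def bev_encode(points, x_range=(0.0, 80.0), y_range=(-40.0, 40.0),
                       cell=0.1, z_min=-2.0):
            """Project an (N, 3) LiDAR cloud onto a 2-channel bird's eye view map.

            Channel 0: maximum height above z_min per cell.
            Channel 1: point density per cell, log-normalized so the feature
            does not vanish for distant, sparsely sampled objects.
            """
            nx = int((x_range[1] - x_range[0]) / cell)
            ny = int((y_range[1] - y_range[0]) / cell)
            bev = np.zeros((2, nx, ny), dtype=np.float32)
            ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
            iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
            ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
            for x, y, z in zip(ix[ok], iy[ok], points[ok, 2]):
                bev[0, x, y] = max(bev[0, x, y], z - z_min)    # height map
                bev[1, x, y] += 1.0                            # raw point count
            bev[1] = np.log1p(bev[1]) / np.log1p(64.0)         # normalized density
            return bev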

      The convenience of its use for onboard 3D detection is studied in a twofold approach. First, a two-stage object detection pipeline based on RGB-oriented architectures is presented. Experiments show that our network provides state-of-the-art results while working at almost 10 Hz. This pipeline leverages the feature encoders and region proposal networks (RPNs) tailored for image detectors to classify and estimate the bounding boxes of objects in a single forward pass, using the BEV as input. The vanilla version, which requires a post-processing stage to compute the vertical parameters of the objects, outperformed all prior BEV-based approaches, even though those focused solely on vehicle detection. The use of pyramid networks in the second iteration of the method boosts the accuracy on small objects, leading to unprecedented results among works based solely on the LiDAR top-view projection. Besides, the learned estimation of the box parameters along the vertical axis proves beneficial for the overall performance of the network.

      Second, different fusion strategies are explored to exploit the joint use of image and BEV features in 3D object detection frameworks. To this end, a single-stage model is proposed so that real-time performance is guaranteed both when the network is used on its own and as part of a more complex perception solution. The baseline architecture provides multi-class 3D detection from BEV images in an end-to-end fashion. Despite the use of a light backbone and the removal of slow proposal-generation branches, the network outperforms other non-RPN approaches and yields results comparable or superior to some existing two-stage frameworks. Moreover, its efficient design enables a throughput of nearly 50 Hz, making it an ideal solution for deployment on resource-constrained embedded computers. Regarding the multimodal fusion schemes, a variety of strategies is evaluated, including early and sequential configurations. Besides, we propose the first single-stage network that leverages explicit LiDAR-image correspondences in the BEV space for multi-class object detection at the feature level. The new layer aims to mimic the functionality of the ROI pooling operation of RPNs at a much lower computational cost. The tested approaches have delivered uneven results across object categories; however, the comprehensive experimentation lays a solid basis for further investigation.
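
      As a rough illustration of what an explicit BEV-to-image correspondence can look like, the sketch below projects each BEV cell center into the camera through the calibrated extrinsics and intrinsics, yielding the pixel from which image features could be sampled for that cell. It is a generic projection example, not the fusion layer proposed in the thesis.

        import numpy as np

        def bev_to_image_lookup(nx, ny, cell, x0, y0, z_plane, K, R, t):
            """Pixel coordinates of every BEV cell center in the camera image.

            K: (3, 3) camera intrinsics; (R, t): LiDAR-to-camera extrinsics;
            z_plane: assumed height of the sampling plane in LiDAR coordinates.
            Returns an (nx, ny, 2) array of (u, v) coordinates; cells projecting
            behind the camera should be masked before sampling features.
            """
            xs = x0 + (np.arange(nx) + 0.5) * cell
            ys = y0 + (np.arange(ny) + 0.5) * cell
            gx, gy = np.meshgrid(xs, ys, indexing="ij")
            pts = np.stack([gx, gy, np.full_like(gx, z_plane)], axis=-1)
            cam = pts @ R.T + t                          # LiDAR frame -> camera frame
            pix = cam @ K.T
            return pix[..., :2] / np.clip(pix[..., 2:3], 1e-6, None)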

      Finally, the performance drop of current LiDAR-based detection frameworks at the deployment stage is studied. The unavailability of a sufficient number of multi-modal datasets for the training and evaluation of machine learning detection frameworks, together with the sensitivity of these approaches to changes in the sensor configuration, leads to a notable accuracy loss when they are deployed on automated vehicles, which usually mount ad-hoc sensor sets designed to meet the requirements of specific use cases. Significant differences in the positioning of the sensors or in the LiDAR characteristics produce substantial changes in the perceived representation of the objects. As a consequence, these differences often lead to a degradation of the models' performance, making them unsuitable for demanding applications such as autonomous driving.

      To address these limitations, modern datasets often include recordings from several countries and climate conditions. This increases the generalization capabilities of the networks and partially alleviates the problem. However, trained models still struggle when fed with data from custom sensor configurations, hampering their deployment on real vehicles.

      In this regard, some solutions have been developed to mitigate this performance drop under the name of domain adaptation (DA). Their goal is to learn domain-invariant features so that the model provides similar accuracy in both the source (training) and target (test) domains. Although these methods have been successfully applied to tasks such as 2D detection or semantic segmentation in images, their application to LiDAR data is still limited to certain projections.

      To tackle this problem, we propose a method to automate the generation of new annotated datasets for 3D object detection using the information from existing ones. Concretely, the presented approach builds a 3D mesh representation of the driving scenario from consecutive LiDAR sweeps, so that a synthetic point cloud of the scene can be simulated for any possible rangefinder model. This way, both the inner specifications of the device and its position relative to the car coordinates can be chosen to recreate available 3D detection benchmarks as if they had been captured with a different LiDAR sensor, eliminating the need to record and label new datasets whenever a novel sensor configuration is built. As the sensitivity to light reflections does not behave uniformly across LiDAR devices, the work focuses on generating realistic spatial coordinates for the points in the cloud, disregarding intensity values.

      The presented approach is divided into three stages. First, dynamic objects are isolated from the static parts of the scene so that multiple frames can be aggregated properly. Second, a triangle mesh is fitted to the 3D point cloud resulting from accumulating a sequence of LiDAR clouds received within a given time frame. Lastly, a virtual LiDAR device is simulated by ray-tracing the laser beams defined by its internal parameters, e.g., layer distribution and horizontal resolution. Even though the formulation of the solution allows for single-frame operation, the joint use of multiple LiDAR clouds has proved to enhance the synthetic output.
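
      A minimal sketch of the beam pattern used in the third stage: given the vertical layer angles and the horizontal resolution of the virtual device, the ray directions to be traced against the reconstructed mesh can be generated as below (the mesh intersection itself, typically handled by a ray-casting library, is omitted).

        import numpy as np

        def lidar_ray_directions(layer_angles_deg, horizontal_resolution_deg=0.2):
            """Unit ray directions for a virtual spinning LiDAR.

            layer_angles_deg: vertical angles of the scanner's layers.
            Returns an (n_layers * n_azimuths, 3) array of directions; each ray,
            cast from the virtual sensor origin, contributes at most one point
            to the synthetic cloud.
            """
            az = np.deg2rad(np.arange(0.0, 360.0, horizontal_resolution_deg))
            el = np.deg2rad(np.asarray(layer_angles_deg))
            az_g, el_g = np.meshgrid(az, el)
            dirs = np.stack([np.cos(el_g) * np.cos(az_g),
                             np.cos(el_g) * np.sin(az_g),
                             np.sin(el_g)], axis=-1)
            return dirs.reshape(-1, 3)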

      The experimental analysis of the proposed method for synthetic point cloud generation proves its ability to reduce the performance drop of 3D object detectors caused by the often significant differences between the training and testing domains. The conducted tests demonstrate that the solution is a leap forward in the deployment of LiDAR-based perception networks on real vehicles, allowing the use of custom sensor configurations, which were previously limited to those found in available datasets. Moreover, the presented pipeline opens the door to the standardization of existing 3D detection benchmarks for any LiDAR sensor and may facilitate the adoption of upcoming devices in the incipient and fast-changing market of laser scanners.

      In order to obtain on-field insights into the suitability of the synthetic LiDAR datasets created through the method introduced above, a 3D object detection network has been trained to serve as the main perception stack of a self-driving car prototype. The proposed perception stack is composed of three separate stages: 2D detection, 3D estimation, and tracking. Since the inference of the final 3D boxes relies on the well-known Frustum-PointNet framework, a preceding step is required to feed the network with the necessary image object detections. Once the object instances have been located and classified, the tracking module provides consistency over time. By tracking the dynamic agents across frames, the impact of instantaneous misdetections in the preceding stage can be mitigated. The combination of these three components enables accurate and robust identification of the different road participants surrounding the vehicle.
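
      The link between the first two stages can be illustrated by the frustum-cropping step of Frustum-PointNet-style pipelines: the LiDAR points whose image projection falls inside a 2D detection box are selected and passed on to the 3D estimation network. The sketch below is a simplified version of that selection (the actual preprocessing also normalizes the frustum orientation).

        import numpy as np

        def points_in_frustum(points_cam, box_2d, K):
            """Select the points (in camera coordinates) whose projection falls
            inside a 2D detection box given as (u_min, v_min, u_max, v_max)."""
            u_min, v_min, u_max, v_max = box_2d
            front = points_cam[points_cam[:, 2] > 0.1]      # keep points in front of the camera
            proj = front @ K.T
            u = proj[:, 0] / proj[:, 2]
            v = proj[:, 1] / proj[:, 2]
            inside = (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
            return front[inside]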

      Field tests in open traffic confirm the validity of the method: the presented perception pipeline enables safe real-time navigation during almost the entire duration of the rides within the operational design domain.

      The different works included in this thesis are tailored to improve the understanding of the traffic situation around the vehicle, with a focus on multimodal perception systems. The extensive experimentation demonstrates the accuracy of the proposed methods both on reference datasets in the field of computer vision and in real environments. The results obtained show the relevance of the presented contributions and their feasibility for commercial use.

