Lamarr Institute for Machine Learning and Artificial Intelligence
Facility: Sankt Augustin, Germany
Research output, citation impact, and the most-cited recent papers from Lamarr Institute for Machine Learning and Artificial Intelligence. Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Lamarr Institute for Machine Learning and Artificial Intelligence
Accurate mapping of large-scale environments is an essential building block of most outdoor autonomous systems. Challenges of traditional mapping methods include the balance between memory consumption and mapping accuracy. This paper addresses the problem of achieving large-scale 3D reconstruction using implicit representations built from 3D LiDAR measurements. We learn and store implicit features through an octree-based, hierarchical structure, which is sparse and extensible. The implicit features can be turned into signed distance values through a shallow neural network. We leverage binary cross entropy loss to optimize the local features with the 3D measurements as supervision. Based on our implicit representation, we design an incremental mapping system with regularization to tackle the issue of forgetting in continual learning. Our experiments show that our 3D reconstructions are more accurate, complete, and memory-efficient than current state-of-the-art 3D mapping methods.
Autonomous vehicles need to understand their surroundings geometrically and semantically to plan and act appropriately in the real world. Panoptic segmentation of LiDAR scans provides a description of the surroundings by unifying semantic and instance segmentation. It is usually solved in a bottom-up manner consisting of two steps: predicting the semantic class for each 3D point, then using this information to filter out “stuff” points and cluster “thing” points to obtain the instance segmentation. This clustering is a post-processing step with associated hyperparameters, which usually do not adapt to instances of different sizes or different datasets. To this end, we propose MaskPLS, an approach to perform panoptic segmentation of LiDAR scans in an end-to-end manner by predicting a set of non-overlapping binary masks and semantic classes, fully avoiding the clustering step. As a result, each mask represents a single instance belonging to a “thing” class or a “stuff” class. Experiments on SemanticKITTI show that the end-to-end learnable mask generation leads to superior performance compared to state-of-the-art heuristic approaches.
Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Based on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as Kalman Filter but require many manually tuned parameters. On the other hand, learning-based approaches face the problem of adapting the training to the online setting, leading to inevitable distribution mismatch between training and inference as well as suboptimal performance. In this work, we propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture. We use an Edge-Augmented Graph Transformer to reason on the track-detection bipartite graph frame-by-frame and conduct data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at: https://github.com/dsx0511/3DMOTFormer.
Mobile robots that navigate in unknown environments need to be constantly aware of the dynamic objects in their surroundings for mapping, localization, and planning. It is key to reason about moving objects in the current observation and at the same time to also update the internal model of the static world to ensure safety. In this letter, we address the problem of jointly estimating moving objects in the current 3D LiDAR scan and a local map of the environment. We use sparse 4D convolutions to extract spatio-temporal features from scan and local map and segment all 3D points into moving and non-moving ones. Additionally, we propose to fuse these predictions in a probabilistic representation of the dynamic environment using a Bayes filter. This volumetric belief models which parts of the environment can be occupied by moving objects. Our experiments show that our approach outperforms existing moving object segmentation baselines and even generalizes to different types of LiDAR sensors. We demonstrate that our volumetric belief fusion can increase the precision and recall of moving object segmentation and even retrieve previously missed moving objects in an online mapping scenario.
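The probabilistic fusion mentioned in the abstract above can be illustrated with a standard binary Bayes filter in log-odds form. This is a minimal sketch, not the authors' implementation: the integer voxel keys, the uniform 0.5 prior, and the names `logit`, `fuse`, and `belief` are assumptions made for illustration.

```python
import math
from collections import defaultdict

PRIOR = 0.5  # assumed uniform prior: a voxel is equally likely moving or static


def logit(p: float) -> float:
    """Convert a probability into log-odds."""
    return math.log(p / (1.0 - p))


def fuse(log_odds: dict, voxel: tuple, p_moving: float) -> None:
    """Recursive binary Bayes update of one voxel with one prediction."""
    # Clamp so that predictions of exactly 0 or 1 do not give infinite log-odds.
    p = min(max(p_moving, 1e-6), 1.0 - 1e-6)
    log_odds[voxel] += logit(p) - logit(PRIOR)


def belief(log_odds: dict, voxel: tuple) -> float:
    """Recover the probability that the voxel is occupied by a moving object."""
    return 1.0 / (1.0 + math.exp(-log_odds[voxel]))


# Hypothetical voxel grid keyed by integer (x, y, z) indices, starting at the prior.
grid = defaultdict(lambda: logit(PRIOR))
for p in (0.7, 0.7, 0.4):  # three noisy predictions for the same voxel
    fuse(grid, (1, 2, 0), p)
```

In this sketch, agreeing predictions accumulate in the log-odds, so a single contradicting prediction only partially reverts the belief. Voxels that were never observed stay at the prior.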
Plant phenotyping is a central task in agriculture, as it describes plants' growth stage, development, and other relevant quantities. Robots can help automate this process by accurately estimating plant traits such as the number of leaves, leaf area, and the plant size. In this paper, we address the problem of joint semantic, plant instance, and leaf instance segmentation of crop fields from RGB data. We propose a single convolutional neural network that addresses the three tasks simultaneously, exploiting their underlying hierarchical structure. We introduce task-specific skip connections, which our experimental evaluation proves to be more beneficial than the usual schemes. We also propose a novel automatic post-processing, which explicitly addresses the problem of spatially close instances, common in the agricultural domain because of overlapping leaves. Our architecture tackles these problems jointly in the agricultural context. Previous works either focus on plant or leaf segmentation, or do not optimise for semantic segmentation. Results show that our system has superior performance compared to state-of-the-art approaches, while having fewer parameters and operating at camera frame rate.
Monitoring plants and fruits is important in modern agriculture, with applications ranging from high-throughput phenotyping to autonomous harvesting. Obtaining highly accurate 3D measurements under real agricultural conditions is a challenging task. In this letter, we address the problem of estimating the 3D shape of fruits when only a partial view is available. We propose a pipeline that exploits high-resolution 3D data in the learning phase but only requires a single RGB-D frame to predict the 3D shape of a complete fruit during operation. To achieve this, we first learn a latent space of potential fruit appearances that we can decode into an SDF volume. With the pretrained, frozen decoder, we subsequently learn an encoder that can produce meaningful latent vectors from a single RGB-D frame. The experiments presented in this letter suggest that our approach can predict the 3D shape of whole fruits online, needing only 4 ms for inference. We evaluate our approach in controlled environments and illustrate its deployment in greenhouses without modifications.
Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation, namely 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
The lack of interpretability in AI-based intrusion detection systems poses a critical barrier to their adoption in forensic cybersecurity, which demands high levels of reliability and verifiable evidence. To address this challenge, the integration of explainable artificial intelligence (XAI) into forensic cybersecurity offers a powerful approach to enhancing transparency, trust, and legal defensibility in network intrusion detection. This study presents a comparative analysis of SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) applied to Extreme Gradient Boosting (XGBoost) and Attentive Interpretable Tabular Learning (TabNet), using the UNSW-NB15 dataset. XGBoost achieved 97.8% validation accuracy and outperformed TabNet in explanation stability and global coherence. In addition to classification performance, we evaluate the fidelity, consistency, and forensic relevance of the explanations. The results confirm the complementary strengths of SHAP and LIME, supporting their combined use in building transparent, auditable, and trustworthy AI systems in digital forensic applications.
Agriculture faces several challenges including climate change and biodiversity loss while, at the same time, the demand for food, feed, biofuels, and fiber is increasing. Sustainable intensification aims to increase productivity and input-use efficiency while enhancing the resilience of agricultural systems to adverse environmental conditions through improved management and technology. Recent advances in sensing, machine learning, modeling, and robotics offer opportunities for novel smart digital technologies to enable sustainable intensification. However, developing smart digital technologies and putting them into agricultural practice requires closing major research gaps, related in particular to (1) the utilization of multi-scale multi-sensor monitoring in space and time, (2) using artificial intelligence for linking process and data-driven methods, (3) improving decision making and intervention in plant production, and finally (4) modeling conditions and consequences of farmers' acceptance. Closing these gaps requires an interdisciplinary approach. Here, we present a research agenda and steps forward to steer research efforts, highlighting research priorities, and identifying required interdisciplinary research collaboration. Following this agenda will leverage the full potential of smart digital technologies for sustainable crop production.
Scene understanding is crucial for autonomous robots in dynamic environments for making future state predictions, avoiding collisions, and path planning. Camera and LiDAR perception made tremendous progress in recent years, but face limitations under adverse weather conditions. To leverage the full potential of multi-modal sensor suites, radar sensors are essential for safety critical tasks and are already installed in most new vehicles today. In this letter, we address the problem of semantic segmentation of moving objects in radar point clouds to enhance the perception of the environment with another sensor modality. Instead of aggregating multiple scans to densify the point clouds, we propose a novel approach based on the self-attention mechanism to accurately perform sparse, single-scan segmentation. Our approach, called Gaussian Radar Transformer, includes the newly introduced Gaussian transformer layer, which replaces the softmax normalization by a Gaussian function to decouple the contribution of individual points. To tackle the challenge of the transformer to capture long-range dependencies, we propose our attentive up- and downsampling modules to enlarge the receptive field and capture strong spatial relations. We compare our approach to other state-of-the-art methods on the RadarScenes data set and show superior segmentation quality in diverse environments, even without exploiting temporal information.
Semantic perception is a core building block in autonomous driving, since it provides information about the drivable space and location of other traffic participants. For learning-based perception, often a large amount of diverse training data is necessary to achieve high performance. Data labeling is usually a bottleneck for developing such methods, especially for dense prediction tasks, e.g., semantic segmentation or panoptic segmentation. For 3D LiDAR data, the annotation process demands even more effort than for images. Especially in autonomous driving, point clouds are sparse, and an object's appearance depends on its distance from the sensor, making it harder to acquire large amounts of labeled training data. This paper aims at taking an alternative path proposing a self-supervised representation learning method for 3D LiDAR data. Our approach exploits the vehicle motion to match objects across time viewed in different scans. We then train a model to maximize the point-wise feature similarities from points of the associated object in different scans, which enables learning a consistent representation across time. The experimental results show that our approach performs better than previous state-of-the-art self-supervised representation learning methods when fine-tuning to different downstream tasks. We furthermore show that with only 10% of labeled data, a network pre-trained with our approach can achieve better performance than the same network trained from scratch with all labels for semantic segmentation on SemanticKITTI. Code: https://github.com/PRBonn/TARL
Determining the state of a mobile robot is an essential building block of robot navigation systems. In this letter, we address the problem of estimating the robot's pose in an indoor environment using 2D LiDAR data and investigate how modern environment models can improve gold standard Monte-Carlo localization (MCL) systems. We propose a neural occupancy field to implicitly represent the scene using a neural network. With the pretrained network, we can synthesize 2D LiDAR scans for an arbitrary robot pose through volume rendering. Based on the implicit representation, we can obtain the similarity between a synthesized and actual scan as an observation model and integrate it into an MCL system to perform accurate localization. We evaluate our approach on self-recorded datasets and three publicly available ones. We show that we can accurately and efficiently localize a robot using our approach surpassing the localization performance of state-of-the-art methods. The experiments suggest that the presented implicit representation is able to predict more accurate 2D LiDAR scans leading to an improved observation model for our particle filter-based localization.
Mapping an environment is essential for several robotic tasks, particularly for localization. In this letter, we address the problem of mapping the environment using LiDAR point clouds with the goal to obtain a map representation that is well suited for robot localization. To this end, we utilize a neural network to learn a discretization-free distance field of a given scene for localization. In contrast to prior approaches, we directly work on the sensor data and do not assume a perfect model of the environment or rely on normals. Inspired by the recently proposed NeRF representations, we supervise the network by points sampled along the measured beams, and our loss is designed to learn a valid distance field. Additionally, we show how to perform scan registration and global localization directly within the neural distance field. We illustrate the capabilities to globally localize within an indoor environment utilizing a particle filter as well as to perform scan registration by tracking the pose of a car based on matching LiDAR scans to the neural distance field.
Active perception for fruit mapping and harvesting is a difficult task since occlusions occur frequently and the location as well as size of fruits change over time. State-of-the-art viewpoint planning approaches utilize computationally expensive ray casting operations to find good viewpoints aiming at maximizing information gain and covering the fruits in the scene. In this paper, we present a novel viewpoint planning approach that explicitly uses information about the predicted fruit shapes to compute targeted viewpoints that observe as yet unobserved parts of the fruits. Furthermore, we formulate the concept of viewpoint dissimilarity to reduce the sampling space for more efficient selection of useful, dissimilar viewpoints. Our simulation experiments with a UR5e arm equipped with an RGB-D sensor provide a quantitative demonstration of the efficacy of our iterative next best view planning method based on shape completion. In comparative experiments with a state-of-the-art viewpoint planner, we demonstrate improvement not only in the estimation of the fruit sizes, but also in their reconstruction, while significantly reducing the planning time. Finally, we show the viability of our approach for mapping sweet pepper plants with a real robotic system in a commercial glasshouse.
Precision farming robots offer the potential to reduce the amount of used agrochemicals through targeted interventions and thus are a promising step towards sustainable agriculture. A prerequisite for such systems is a robust plant classification system that can identify crops and weeds in various agricultural fields. Most vision-based systems train convolutional neural networks (CNNs) on a given dataset, i.e., the source domain, to perform semantic segmentation of images. However, deploying these models on unseen fields, i.e., in the target domain, often shows a low generalization capability. Enhancing the generalization capability of CNNs is critical to increasing their performance on target domains with different operational conditions. In this letter, we present a domain generalized semantic segmentation approach for robust crop and weed detection by effectively extending and diversifying the source domain to achieve high performance across different agricultural field conditions. We propose to leverage unlabeled images captured from various agricultural fields during training in a two-step framework. First, we suggest a method to automatically compute sparse annotations and use them to present the model more plant varieties and growth stages to enhance its generalization capability. Among others, we exploit unlabeled images from fields containing crops sown in rows. Second, we propose a style transfer method that renders the source domain images in the style of images from various fields to achieve increased diversification. We conduct extensive experiments and show that we achieve superior performance in crop-weed segmentation across various fields compared to state-of-the-art methods.
Accurately perceiving instances and predicting their future motion are key tasks for autonomous vehicles, enabling them to navigate safely in complex urban traffic. While bird’s-eye view (BEV) representations are commonplace in perception for autonomous driving, their potential in a motion prediction setting is less explored. Existing approaches for BEV instance prediction from surround cameras rely on a multi-task auto-regressive setup coupled with complex post-processing to predict future instances in a spatio-temporally consistent manner. In this paper, we depart from this paradigm and propose an efficient novel end-to-end framework named PowerBEV, which differs in several design choices aimed at reducing the inherent redundancy in previous methods. First, rather than predicting the future in an auto-regressive fashion, PowerBEV uses a parallel, multi-scale module built from lightweight 2D convolutional networks. Second, we show that segmentation and centripetal backward flow are sufficient for prediction, simplifying previous multi-task objectives by eliminating redundant output modalities. Building on this output representation, we propose a simple, flow warping-based post-processing approach which produces more stable instance associations across time. Through this lightweight yet powerful design, PowerBEV outperforms state-of-the-art baselines on the NuScenes Dataset and poses an alternative paradigm for BEV instance prediction. We made our code publicly available at: https://github.com/EdwardLeeLPZ/PowerBEV.
Unmanned aerial vehicles (UAVs) are frequently used for aerial mapping and general monitoring tasks. Recent progress in deep learning enabled automated semantic segmentation of imagery to facilitate the interpretation of large-scale complex environments. Commonly used supervised deep learning for segmentation relies on large amounts of pixelwise labeled data, which is tedious and costly to annotate. The domain-specific visual appearance of aerial environments often prevents the usage of models pretrained on publicly available datasets. To address this, we propose a novel general planning framework for UAVs to autonomously acquire informative training images for model retraining. We leverage multiple acquisition functions and fuse them into probabilistic terrain maps. Our framework combines the mapped acquisition function information into the UAV's planning objectives. In this way, the UAV adaptively acquires informative aerial images to be manually labeled for model retraining. Experimental results on real-world data and in a photorealistic simulation show that our framework maximizes model performance and drastically reduces labeling efforts. Our map-based planners outperform state-of-the-art local planning.
The ability to detect loop closures plays an essential role in any SLAM system. Loop closures allow correcting the drifting pose estimates from a sensor odometry pipeline. In this paper, we address the problem of effectively detecting loop closures in LiDAR SLAM systems in various environments with longer lengths of sequences and agnostic of the scanning pattern of the sensor. While many approaches for loop closures using 3D LiDAR sensors rely on individual scans, we propose the usage of local maps generated from locally consistent odometry estimates. Several recent approaches compute the maximum elevation map on a bird’s eye view projection of point clouds to compute feature descriptors. In contrast, we use a density image bird’s eye view representation, which is robust to viewpoint changes. The utilization of dense local maps allows us to reduce the complexity of features describing these maps, as well as the size of the database required to store these features over a long sequence. This yields a real-time application of our approach for a typical robotic 3D LiDAR sensor. We perform extensive experiments to evaluate our approach against other state-of-the-art approaches and show the benefits of our proposed approach.
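The density image mentioned in the abstract above can be pictured as a bird's-eye-view grid that stores how many points fall into each cell, rather than the maximum height per cell. The following is a rough sketch under stated assumptions, not the authors' pipeline: the function name `density_image`, the cell size, the grid dimensions, and the normalisation by the densest cell are all illustrative choices.

```python
def density_image(points, cell_size=1.0, grid_dim=4):
    """Project 3D points onto a bird's-eye-view grid and count points per cell.

    points    : iterable of (x, y, z) tuples in the local map frame
    cell_size : edge length of one square cell in metres (assumed value)
    grid_dim  : number of cells per side; the grid is centred on the origin
    """
    half = grid_dim * cell_size / 2.0
    counts = [[0] * grid_dim for _ in range(grid_dim)]
    for x, y, _z in points:  # height is ignored, unlike in a max-elevation map
        col = int((x + half) // cell_size)
        row = int((y + half) // cell_size)
        if 0 <= row < grid_dim and 0 <= col < grid_dim:
            counts[row][col] += 1
    peak = max(max(r) for r in counts)
    if peak == 0:
        return [[0.0] * grid_dim for _ in range(grid_dim)]
    # Normalise by the densest cell so that all values lie in [0, 1].
    return [[c / peak for c in r] for r in counts]


# Tiny hypothetical point cloud; the last point lies outside the grid and is dropped.
cloud = [(0.2, 0.3, 1.0), (0.4, 0.1, 2.5), (-1.5, 0.5, 0.0), (9.0, 9.0, 0.0)]
image = density_image(cloud)
```

Feature descriptors would then be computed on `image`; that step is outside the scope of this sketch.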
We propose Social Diffusion, a novel method for short-term and long-term forecasting of the motion of multiple persons as well as their social interactions. Jointly forecasting motions for multiple persons involved in social activities is inherently a challenging problem due to the interdependencies between individuals. In this work, we leverage a diffusion model conditioned on motion histories and causal temporal convolutional networks to forecast individually and contextually plausible motions for all participants. The contextual plausibility is achieved via an order-invariant aggregation function. As a second contribution, we design a new evaluation protocol that measures the plausibility of social interactions which we evaluate on the Haggling dataset, which features a challenging social activity where people are actively taking turns to talk and switching their attention. We evaluate our approach on four datasets for multi-person forecasting where our approach outperforms the state-of-the-art in terms of motion realism and contextual plausibility.
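The abstract above does not say which order-invariant aggregation function is used. As one common example of the idea, the sketch below pools per-person feature vectors with an element-wise maximum, which gives the same result for any ordering of the participants. The function name `aggregate` and the toy feature values are hypothetical.

```python
def aggregate(person_features):
    """Element-wise max over per-person feature vectors (permutation-invariant)."""
    return [max(values) for values in zip(*person_features)]


# Three hypothetical participants, each described by a 3-dimensional feature vector.
features = [[0.1, 0.9, 0.3], [0.5, 0.2, 0.4], [0.0, 0.6, 0.8]]
context = aggregate(features)
```

Mean or sum pooling would be equally valid order-invariant choices. The maximum is used here only because it is simple to verify by hand.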
The Standard Model of particle physics, the theory of particles and interactions at the smallest scale, predicts that matter and antimatter interact differently due to violation of the combined symmetry of charge conjugation (C) and parity (P). Charge conjugation transforms particles into their antimatter particles, whereas the parity transformation inverts spatial coordinates. This prediction applies to both mesons, which consist of a quark and an antiquark, and baryons, which are composed of three quarks. However, despite having been discovered in various meson decays, CP violation has yet to be observed in baryons, the type of matter that makes up the observable Universe. Here we report a study of the decay of the beauty baryon $\Lambda_b^0$ to the $pK^{-}\pi^{+}\pi^{-}$ final state, which proceeds through b → u or b → s quark-level transitions, and its CP-conjugated process, using data collected by the Large Hadron Collider beauty experiment¹ at the European Organization for Nuclear Research (CERN). The results reveal significant asymmetries between the decay rates of the $\Lambda_b^0$ baryon and its CP-conjugated antibaryon, providing, to our knowledge, the first observation of CP violation in baryon decays and demonstrating the different behaviours of baryons and antibaryons. In the Standard Model, CP violation arises from the Cabibbo–Kobayashi–Maskawa mechanism², and new forces or particles beyond the Standard Model could provide further contributions. This discovery opens a new path in the search for physics beyond the Standard Model.