Institut de Robòtica i Informàtica Industrial
Facility: Barcelona, Catalonia, Spain
Research output, citation impact, and the most-cited recent papers from Institut de Robòtica i Informàtica Industrial (Spain). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Institut de Robòtica i Informàtica Industrial
Deep learning has revolutionized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on hand-crafted features, e.g., SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminant patch representations and in particular train a Siamese network with pairs of (non-)corresponding patches. We deal with the large number of potential pairs with the combination of a stochastic sampling of the training set and an aggressive mining strategy biased towards patches that are hard to classify. By using the L2 distance during both training and testing we develop 128-D descriptors whose Euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and show that the descriptors generalize well against scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute and amenable to modern GPUs, and are publicly available.
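The pairing of an L2 distance with mining of hard pairs, as described in the abstract above, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: the hinge-style pair loss, the margin value, the function names `pair_loss` and `mine_hard_pairs`, and the 2-D stand-in descriptors (the paper uses 128-D descriptors produced by a CNN).

```python
import math

def pair_loss(desc_a, desc_b, is_match, margin=4.0):
    """Hinge-style loss on the L2 distance between two descriptors.

    Matching pairs are penalized by their distance; non-matching pairs are
    penalized only when they are closer than the (assumed) margin.
    """
    d = math.dist(desc_a, desc_b)
    return d if is_match else max(0.0, margin - d)

def mine_hard_pairs(pairs, keep):
    """Keep the `keep` pairs with the largest loss (the hardest ones)."""
    losses = [pair_loss(a, b, m) for a, b, m in pairs]
    order = sorted(range(len(pairs)), key=lambda i: losses[i], reverse=True)
    return [pairs[i] for i in order[:keep]], [losses[i] for i in order[:keep]]

# Hypothetical 2-D descriptors: (descriptor_a, descriptor_b, is_match).
pairs = [
    ((0.0, 0.0), (0.0, 1.0), True),   # easy positive, loss 1
    ((0.0, 0.0), (3.0, 4.0), True),   # hard positive, loss 5
    ((0.0, 0.0), (3.0, 4.0), False),  # easy negative, loss 0
    ((0.0, 0.0), (0.0, 1.0), False),  # hard negative, loss 3
]
hard_pairs, hard_losses = mine_hard_pairs(pairs, keep=2)
```

In a training loop, only the retained pairs would contribute gradients, so the effort goes to the examples the current network handles worst.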
This paper reviews the state of the art in the field of lock-in time-of-flight (ToF) cameras, their advantages, their limitations, the existing calibration methods, and the way they are being used, sometimes in combination with other sensors. Even though lock-in ToF cameras provide neither higher resolution nor larger ambiguity-free range compared to other range map estimation systems, advantages such as registered depth and intensity data at a high frame rate, compact design, low weight, and reduced power consumption have motivated their increasing usage in several research areas, such as computer graphics, machine vision, and robotics.
Low-textured scenes are well known to be one of the main Achilles heels of geometric computer vision algorithms relying on point correspondences, and in particular of visual SLAM. Yet, there are many environments in which, despite being low textured, one can still reliably estimate line-based geometric primitives, for instance in city and indoor scenes, or in the so-called “Manhattan worlds”, where structured edges are predominant. In this paper we propose a solution to handle these situations. Specifically, we build upon ORB-SLAM, presumably the current state-of-the-art solution in terms of both accuracy and efficiency, and extend its formulation to simultaneously handle both point and line correspondences. We propose a solution that can work even when most of the points vanish from the input images, and, interestingly, it can be initialized solely from the detection of line correspondences in three consecutive frames. We thoroughly evaluate our approach and the new initialization strategy on the TUM RGB-D benchmark and demonstrate that the use of lines not only improves the performance of the original ORB-SLAM solution in poorly textured frames, but also systematically improves it in sequence frames combining points and lines, without compromising the efficiency.
Gait disorders can reduce the quality of life for people with neuromuscular impairments. Therefore, walking recovery is one of the main priorities for counteracting sedentary lifestyle, reducing secondary health conditions and restoring legged mobility. At present, wearable powered lower-limb exoskeletons are emerging as a revolutionary technology for robotic gait rehabilitation. This systematic review provides a comprehensive overview of wearable lower-limb exoskeletons for people with neuromuscular impairments, addressing the following three questions: (1) what is the current technological status of wearable lower-limb exoskeletons for gait rehabilitation?, (2) what is the methodology used in the clinical validations of wearable lower-limb exoskeletons?, and (3) what are the benefits and current evidence on clinical efficacy of wearable lower-limb exoskeletons? We analyzed 87 clinical studies focusing on both device technology (e.g., actuators, sensors, structure) and clinical aspects (e.g., training protocol, outcome measures, patient impairments), and make available the database with all the compiled information. The results of the literature survey reveal that wearable exoskeletons have potential for a number of applications including early rehabilitation, promoting physical exercise, and carrying out daily living activities both at home and in the community. Likewise, wearable exoskeletons may improve mobility and independence in non-ambulatory people, and may reduce secondary health conditions related to sedentariness, with all the advantages that this entails. However, the use of this technology is still limited by heavy and bulky devices, which require supervision and the use of walking aids. In addition, evidence supporting their benefits is still limited to short-intervention trials with few participants and diversity among their clinical protocols.
Wearable lower-limb exoskeletons for gait rehabilitation are still in their early stages of development, and randomized controlled trials are needed to demonstrate their clinical efficacy.
This paper addresses the problem of 3D human pose estimation from a single image. We follow a standard two-step pipeline by first detecting the 2D position of the N body joints, and then using these observations to infer 3D pose. For the first step, we use a recent CNN-based detector. For the second step, most existing approaches perform 2N-to-3N regression of the Cartesian joint coordinates. We show that more precise pose estimates can be obtained by representing both the 2D and 3D human poses using NxN distance matrices, and formulating the problem as a 2D-to-3D distance matrix regression. For learning such a regressor we leverage simple Neural Network architectures, which, by construction, enforce positivity and symmetry of the predicted matrices. The approach also has the advantage of naturally handling missing observations and allowing the position of non-observed joints to be hypothesized. Quantitative results on the HumanEva and Human3.6M datasets demonstrate consistent performance gains over the state of the art. Qualitative evaluation on the in-the-wild images of the LSP dataset, using the regressor learned on Human3.6M, reveals very promising generalization results.
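The NxN distance-matrix representation mentioned in the abstract above can be made concrete with a small sketch. It only builds the matrix of pairwise Euclidean distances from joint coordinates; the regression network itself is not shown. The function name `distance_matrix` and the sample joint coordinates are hypothetical.

```python
import math

def distance_matrix(joints):
    """Return the N x N matrix of pairwise Euclidean distances between joints.

    Works for 2D or 3D joint coordinates. The result is symmetric, has a zero
    diagonal and only non-negative entries.
    """
    n = len(joints)
    return [[math.dist(joints[i], joints[j]) for j in range(n)] for i in range(n)]

# Hypothetical 2D detections for three joints, in pixel coordinates.
joints_2d = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
edm_2d = distance_matrix(joints_2d)
```

A matrix built this way does not change when the whole pose is rotated or translated, which is one reason a distance-based representation can be attractive compared with raw Cartesian coordinates.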
This article is an exhaustive review of concepts and formulas related to quaternions and rotations in 3D space, and their proper use in estimation engines such as the error-state Kalman filter. The paper includes an in-depth study of the rotation group and its Lie structure, with formulations using both quaternions and rotation matrices. It pays special attention to the definition of rotation perturbations, derivatives and integrals. It provides numerous intuitions and geometrical interpretations to help the reader grasp the inner mechanisms of 3D rotation. The whole material is used to devise precise formulations for error-state Kalman filters suited for real applications using integration of signals from an inertial measurement unit (IMU).
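As a small illustration of the quaternion machinery the abstract above refers to, the sketch below rotates a vector with a unit quaternion. It assumes the Hamilton convention with components ordered (w, x, y, z); the function names are hypothetical, and none of the error-state Kalman filter material is covered.

```python
import math

def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), ax * s, ay * s, az * s)

def quat_mul(p, q):
    """Hamilton product p * q."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )

def quat_conj(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q via q * (0, v) * conj(q)."""
    qv = (0.0, v[0], v[1], v[2])
    return quat_mul(quat_mul(q, qv), quat_conj(q))[1:]

# A 90-degree rotation about the z axis maps the x axis onto the y axis.
q_z90 = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2.0)
rotated = rotate(q_z90, (1.0, 0.0, 0.0))
```

Composing two rotations is then a single `quat_mul` call, which is part of why quaternions are convenient inside estimators.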
Locating a robot from its distances, or range measurements, to three other known points or stations is a common operation, known as trilateration. This problem has been traditionally solved either by algebraic or numerical methods. An approach that avoids the direct algebrization of the problem is proposed here. Using constructive geometric arguments, a coordinate-free formula containing a small number of Cayley-Menger determinants is derived. This formulation accommodates a more thorough investigation of the effects caused by all possible sources of error, including round-off errors, for the first time in this context. New formulas for the variance and bias of the unknown robot location estimation, due to station location and range measurements errors, are derived and analyzed. They are proved to be more tractable compared with previous ones, because all their terms have geometric meaning, allowing a simple analysis of their asymptotic behavior near singularities.
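The Cayley-Menger determinants mentioned in the abstract above are built from squared inter-point distances only, which is what makes a coordinate-free formulation possible. The sketch below computes the standard bordered determinant and uses the textbook relation 288 V^2 = det for four points in 3D. It is not the paper's trilateration formula; the function names and the sample points are hypothetical, and the determinant routine is a plain Gaussian elimination written to keep the example dependency-free.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def squared_distance(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def cayley_menger(points):
    """Bordered determinant of the squared distances between the given points."""
    n = len(points)
    m = [[0.0] + [1.0] * n]
    for p in points:
        m.append([1.0] + [squared_distance(p, q) for q in points])
    return determinant(m)

def tetrahedron_volume(points):
    """Volume of the tetrahedron spanned by four 3D points (288 V^2 = det)."""
    return math.sqrt(max(cayley_menger(points), 0.0) / 288.0)

# Hypothetical layout: three stations in the plane z = 0 and a robot above them.
stations = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
robot = (0.0, 0.0, 1.0)
volume = tetrahedron_volume(stations + [robot])
```

When the robot lies in the plane of the three stations, the four-point determinant vanishes, which gives some geometric intuition for the singular configurations the abstract mentions.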
Robots are becoming safe and smart enough to work alongside people not only on manufacturing production lines, but also in spaces such as houses, museums, or hospitals. This can be significantly exploited in situations in which a human needs the help of another person to perform a task, because a robot may take the role of the helper. In this sense, a human and the robotic assistant may cooperatively carry out a variety of tasks, therefore requiring the robot to communicate with the person, understand his/her needs, and behave accordingly. To achieve this, we propose a framework for a user to teach a robot collaborative skills from demonstrations. We mainly focus on tasks involving physical contact with the user, in which not only position, but also force sensing and compliance become highly relevant. Specifically, we present an approach that combines probabilistic learning, dynamical systems, and stiffness estimation to encode the robot behavior along the task. Our method allows a robot to learn not only trajectory following skills, but also impedance behaviors. To show the functionality and flexibility of our approach, two different testbeds are used: a transportation task and a collaborative table assembly.
Electrodermal activity (EDA) is indicative of psychological processes related to human cognition and emotions. Previous research has studied many methods for extracting EDA features; however, their appropriateness for emotion recognition has been tested using a small number of distinct feature sets and on different, usually small, data sets. In the current research, we reviewed 25 studies and implemented 40 different EDA features across time, frequency and time-frequency domains on the publicly available AMIGOS dataset. We performed a systematic comparison of these EDA features using three feature selection methods, Joint Mutual Information (JMI), Conditional Mutual Information Maximization (CMIM) and Double Input Symmetrical Relevance (DISR) and machine learning techniques. We found that approximately the same numbers of features are required to obtain the optimal accuracy for the arousal recognition and the valence recognition. Also, the subject-dependent classification results were significantly higher than the subject-independent classification for both arousal and valence recognition. Statistical features related to the Mel-Frequency Cepstral Coefficients (MFCC) were explored for the first time for the emotion recognition from EDA signals and they outperformed all other feature groups, including the most commonly used Skin Conductance Response (SCR) related features.
This article summarizes new aerial robotic manipulation technologies and methods (aerial robotic manipulators with dual arms and multidirectional thrusters) developed in the AEROARMS project for outdoor industrial inspection and maintenance (I&M).
Accompanying humans is one of the core capabilities every service robot deployed in urban settings should have. We present a novel robot companion approach based on the so-called Social Force Model (SFM). A new model of robot-person interaction, suited to our robots Tibi and Dabo, is obtained using the SFM. Additionally, we propose an interactive scheme for the robot's human-aware navigation using the SFM and prediction information. Moreover, we present a new metric to evaluate the robot companion performance based on vital spaces and comfort criteria. Also, multimodal human feedback is proposed to enhance the behavior of the system. The model is validated through an extensive set of simulations and real-life experiments.
We present a method combining affinity prediction with region agglomeration, which improves significantly upon the state of the art of neuron segmentation from electron microscopy (EM) in accuracy and scalability. Our method consists of a 3D U-Net, trained to predict affinities between voxels, followed by iterative region agglomeration. We train using a structured loss based on Malis, encouraging topologically correct segmentations obtained from affinity thresholding. Our extension consists of two parts: First, we present a quasi-linear method to compute the loss gradient, improving over the original quadratic algorithm. Second, we compute the gradient in two separate passes to avoid spurious gradient contributions in early training stages. Our predictions are accurate enough that simple learning-free percentile-based agglomeration outperforms more involved methods used earlier on inferior predictions. We present results on three diverse EM datasets, achieving relative improvements over previous results of 27, 15, and 250 percent. Our findings suggest that a single method can be applied to both nearly isotropic block-face EM data and anisotropic serial sectioned EM data. The runtime of our method scales linearly with the size of the volume and achieves a throughput of approximately 2.6 seconds per megavoxel, qualifying our method for the processing of very large datasets.
In this paper, we analyze clothing fashion on a large social website. Our goal is to learn and predict how fashionable a person looks in a photograph and suggest subtle improvements the user could make to improve her/his appeal. We propose a Conditional Random Field model that jointly reasons about several fashionability factors such as the type of outfit and garments the user is wearing, the type of the user, the photograph's setting (e.g., the scenery behind the user), and the fashionability score. Importantly, our model is able to give rich feedback to the user, conveying which garments or even scenery she/he should change in order to improve fashionability. We demonstrate that our joint approach significantly outperforms a variety of intelligent baselines. We additionally collected a novel heterogeneous dataset with 144,169 user posts containing diverse image, textual and meta information which can be exploited for our task. We also provide a detailed analysis of the data, showing different outfit trends and fashionability scores across the globe and across a span of 6 years.
One of the factors that have hindered progress in the areas of sign language recognition, translation, and production is the absence of large annotated datasets. Towards this end, we introduce How2Sign, a multimodal and multiview continuous American Sign Language (ASL) dataset, consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities including speech, English transcripts, and depth. A three-hour subset was further recorded in the Panoptic studio enabling detailed 3D pose estimation. To evaluate the potential of How2Sign for real-world impact, we conduct a study with ASL signers and show that synthesized videos using our dataset can indeed be understood. The study further gives insights on challenges that computer vision should address in order to make progress in this field. Dataset website: http://how2sign.github.io/
The management of the urban water cycle (UWC) is a subject of increasing interest because of its social, economic, and environmental impact. The most important issues include the sustainable use of limited resources and the reliability of service to consumers with adequate quality and pressure levels, as well as the urban drainage management to prevent flooding and polluting discharges to the environment.
Snoring, a symptom which may indicate the presence of the obstructive sleep apnoea syndrome (OSA), is also common in the general population. Recent studies have suggested that the acoustic characteristics of snoring sound may differ between simple snorers and OSA patients. We have studied a small number of patients with simple snoring and OSA, analysing the acoustic characteristics of the snoring sound. Seventeen male patients, 10 with OSA (apnoea/hypopnoea index (AHI) 26.2 events·h⁻¹) and seven simple snorers (AHI 3.8 events·h⁻¹), were studied. Full night polysomnography was performed and the snoring sound power spectrum was analysed. Spectral analysis of snoring sound showed the existence of two different patterns. The first pattern was characterized by the presence of a fundamental frequency and several harmonics. The second pattern was characterized by a low frequency peak with the sound energy scattered on a narrower band of frequencies, but without clearly identified harmonics. The seven simple snorers and two of the 10 patients with OSA (AHI 13 and 14 events·h⁻¹, respectively) showed the first pattern. The rest of the OSA patients showed the second pattern. The peak frequency of snoring was significantly lower in OSA patients, with all but one OSA patient and only one simple snorer showing a peak frequency below 150 Hz. A significant negative correlation was found between AHI and peak and mean frequencies of the snoring power spectrum (p<0.0016 and p<0.0089, respectively). In conclusion, this study demonstrates significant differences in the sound power spectrum of snoring sound between subjects with simple snoring and obstructive sleep apnoea patients.
We propose a real-time, accurate, and outlier-robust solution to the Perspective-n-Point (PnP) problem. The main advantages of our solution are twofold: first, it integrates the outlier rejection within the pose estimation pipeline with a negligible computational overhead, and second, it scales to an arbitrarily large number of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system where the solution lies on its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space and are progressively detected by projecting them on an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting back all 3D points on the image plane at each step, we achieve speed gains of more than 100× compared to RANSAC strategies. An extensive experimental evaluation shows that our solution yields accurate results in situations with up to 50% of outliers, and can process more than 1000 correspondences in less than 5 ms.
Pose SLAM is the variant of simultaneous localization and map building (SLAM) in which only the robot trajectory is estimated and where landmarks are only used to produce relative constraints between robot poses. To reduce the computational cost of the information filter form of Pose SLAM and, at the same time, to delay inconsistency as much as possible, we introduce an approach that takes into account only highly informative loop-closure links and nonredundant poses. This approach includes constant time procedures to compute the distance between poses, the expected information gain for each potential link, and the exact marginal covariances while moving in open loop, as well as a procedure to recover the state after a loop closure that, in practical situations, scales linearly in terms of both time and memory. Using these procedures, the robot operates most of the time in open loop, and the cost of the loop closure is amortized over long trajectories. This way, the computational bottleneck shifts to data association, which is the search over the set of previously visited poses to determine good candidates for sensor registration. To speed up data association, we introduce a method to search for neighboring poses whose complexity ranges from logarithmic in the usual case to linear in degenerate situations. The method is based on organizing the pose information in a balanced tree whose internal levels are defined using interval arithmetic. The proposed Pose-SLAM approach is validated through simulations, real mapping sessions, and experiments using standard SLAM data sets.
Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), held virtually from Nashville, TN (United States), June 20 to 25, 2021.