State Key Laboratory of Virtual Reality Technology and Systems
Facility, Beijing, China
Research output, citation impact, and the most-cited recent papers from State Key Laboratory of Virtual Reality Technology and Systems. Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from State Key Laboratory of Virtual Reality Technology and Systems
Local binary pattern (LBP) is a nonparametric descriptor, which efficiently summarizes the local structures of images. In recent years, it has aroused increasing interest in many areas of image processing and computer vision and has shown its effectiveness in a number of applications, in particular for facial image analysis, including tasks as diverse as face detection, face recognition, facial expression analysis, and demographic classification. This paper presents a comprehensive survey of LBP methodology, including several more recent variations. As a typical application of the LBP approach, LBP-based facial image analysis is extensively reviewed, while its successful extensions, which deal with various tasks of facial image analysis, are also highlighted.
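The basic operator that this survey builds on can be sketched in a few lines. The following is a minimal illustration, assuming the classic 3x3 neighbourhood with a "greater than or equal to the centre" threshold; the function names and the clockwise bit order are choices made for this example, not taken from the paper, and the variations the survey covers differ in neighbourhood and encoding.

```python
from typing import List

# Offsets of the 8 neighbours, clockwise from the top-left pixel.
# The starting point and direction are an arbitrary (assumed) convention.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_codes(image: List[List[int]]) -> List[List[int]]:
    """Return the 8-bit LBP code of every interior pixel of a grayscale image."""
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(NEIGHBOURS):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(image: List[List[int]]) -> List[int]:
    """Summarize an image as a 256-bin histogram of its LBP codes."""
    hist = [0] * 256
    for row in lbp_codes(image):
        for code in row:
            hist[code] += 1
    return hist
```

The histogram, rather than the per-pixel codes, is what typically serves as the texture descriptor; in face analysis it is commonly computed per image region and the regional histograms are concatenated.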
Anomaly detection in photovoltaic (PV) cell electroluminescence (EL) images is of great significance for vision-based fault diagnosis. Many researchers are committed to solving this problem, but a large-scale open-world dataset is required to validate their novel ideas. We build a PV EL Anomaly Detection (PVEL-AD) dataset for polycrystalline solar cells, which contains 36 543 near-infrared images with various internal defects and heterogeneous backgrounds. This dataset contains anomaly-free images and anomalous images with ten different categories. Moreover, 37 380 ground truth bounding boxes are provided for eight types of defects. We also carry out a comprehensive evaluation of the state-of-the-art object detection methods based on deep learning. The evaluation results on this dataset provide the initial benchmark, which is convenient for follow-up researchers to conduct experimental comparisons. To the best of our knowledge, this is the first public dataset for PV solar cell anomaly detection that provides box-wise ground truth. Furthermore, this dataset can also be used for the evaluation of many computer vision tasks such as few-shot detection, one-class classification, and anomaly generation.
Here, the attitude control of a quadrotor aircraft subject to a class of disturbances is studied. Unlike disturbances mentioned in most of the existing literature, the disturbance considered here is time varying and non-vanished. An extended observer is designed to estimate the disturbance by treating it as a new unknown state. Based on the estimation, a feedback controller with a sliding mode term is designed to stabilise the attitude of the quadrotor. Furthermore, to avoid the discontinuity of the control law caused by the sliding mode term, a modified sliding mode term is designed. The resulting continuous feedback controller makes the attitude error uniformly ultimate bounded. Theoretical results are confirmed by numerical simulations.
Though image-level weakly supervised semantic segmentation (WSSS) has achieved great progress with Class Activation Maps (CAMs) as the cornerstone, the large supervision gap between classification and segmentation still hampers the model from generating more complete and precise pseudo masks for segmentation. In this study, we propose weakly-supervised pixel-to-prototype contrast that can provide pixel-level supervisory signals to narrow the gap. Guided by two intuitive priors, our method is executed across different views and within each single view of an image, aiming to impose cross-view feature semantic consistency regularization and facilitate intra(inter)-class compactness(dispersion) of the feature space. Our method can be seamlessly incorporated into existing WSSS models without any changes to the base networks and does not incur any extra inference burden. Extensive experiments show that our method consistently improves two strong baselines by large margins, demonstrating its effectiveness. Specifically, built on top of SEAM, we improve the initial seed mIoU on PASCAL VOC 2012 from 55.4% to 61.5%. Moreover, armed with our method, we increase the segmentation mIoU of EPS from 70.8% to 73.6%, achieving new state-of-the-art.
Human action recognition aims at classifying the category of human action from a segment of a video. Recently, researchers have focused on designing GCN-based models to extract features from skeletons for this task, because skeleton representations are much more efficient and robust than other modalities such as RGB frames. However, when employing the skeleton data, some important clues like related items are also discarded. As a result, some ambiguous actions are hard to distinguish and tend to be misclassified. To alleviate this problem, we propose an auxiliary feature refinement head (FR Head), which consists of spatial-temporal decoupling and contrastive feature refinement, to obtain discriminative representations of skeletons. Ambiguous samples are dynamically discovered and calibrated in the feature space. Furthermore, FR Head can be imposed on different stages of GCNs to build a multi-level refinement for stronger supervision. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets. Our proposed models obtain results competitive with state-of-the-art methods and can help to discriminate those ambiguous samples. Codes are available at https://github.com/zhysora/FR-Head.
Point cloud filtering is a fundamental problem in geometry modeling and processing. Despite significant advances in recent years, the existing methods still suffer from two issues: 1) they are either designed without preserving sharp features or less robust in feature preservation; and 2) they usually have many parameters and require tedious parameter tuning. In this article, we propose a novel deep learning approach that automatically and robustly filters point clouds by removing noise and preserving their sharp features. Our point-wise learning architecture consists of an encoder and a decoder. The encoder directly takes points (a point and its neighbors) as input, and learns a latent representation vector which goes through the decoder to relate the ground-truth position with a displacement vector. The trained neural network can automatically generate a set of clean points from a noisy input. Extensive experiments show that our approach outperforms the state-of-the-art deep learning techniques in terms of both visual quality and quantitative error metrics. The source code and dataset can be found at https://github.com/dongbo-BUAA-VR/Pointfilter.
Hyperspectral (HS) pansharpening intends to synthesize a HS image with a registered panchromatic image, to generate an enhanced image with simultaneous high spectral resolution and high spatial resolution. However, the spectral range gap between the two kinds of images and the need to resolve details for many continuous narrow bands make the technique prone to spectral distortion and spatial blurring. To mitigate the problems, we propose a new HS pansharpening framework via spectrally predictive convolutional neural networks (HyperPNN). In our proposed HyperPNN, spectrally predictive structure is introduced to strengthen the spectral prediction capability of a pansharpening network. Following the concept of the proposed HyperPNN, two specific pansharpening convolutional neural network (CNN) models, i.e., HyperPNN1 and HyperPNN2, are designed. Experimental results from three datasets suggest the excellent performance of our CNN-based HS pansharpening methods.
Microsoft Kinect uses a built-in RGB-D sensor and the skeleton tracking algorithm to capture 3-D movements of the human body. It also has the potential for assessing postural stability, which is fundamental for most motor activities. The aim of this paper was to investigate whether standing balance can be evaluated reliably and validly by this low-cost device. Nine healthy subjects were required to maintain balance during three standing positions (double limb stance with feet apart, double limb stance with feet together and single limb stance). The center of mass (COM) was calculated from the body's kinematic data acquired by the Kinect system and Optotrak Certus motion capture system. The position variability and average velocity of the COM in the horizontal plane were calculated and used to evaluate the subject's balance. These COM parameters from the two systems showed excellent and comparable test-retest reliability (as measured by the intraclass correlation coefficient). In addition, although the average velocity of the COM calculated from Kinect was significantly lower, each COM parameter showed excellent concurrent validity and a significant linear relationship existed between the two systems, which meant that biases may be corrected using linear calibration equations. Therefore, Kinect may be a valid, reliable, and convenient device for assessing standing balance when its measured COM parameters are properly calibrated.
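The two balance parameters named in this abstract can be illustrated with a short sketch. The abstract does not give their exact definitions, so the code below assumes common ones: position variability as the root-mean-square distance of the horizontal COM samples from their mean, and average velocity as total path length divided by recording duration. The function names and the uniform sampling rate are likewise assumptions for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # horizontal-plane COM sample (x, y)


def position_variability(com_xy: List[Point]) -> float:
    """RMS distance of the COM samples from their mean position (assumed definition)."""
    n = len(com_xy)
    mean_x = sum(x for x, _ in com_xy) / n
    mean_y = sum(y for _, y in com_xy) / n
    squared = sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in com_xy)
    return math.sqrt(squared / n)


def average_velocity(com_xy: List[Point], sample_rate_hz: float) -> float:
    """Total COM path length divided by duration, assuming uniform sampling."""
    path_length = sum(math.dist(a, b) for a, b in zip(com_xy, com_xy[1:]))
    duration_s = (len(com_xy) - 1) / sample_rate_hz
    return path_length / duration_s
```

Under these definitions, a noisier or lower-rate tracker changes the measured path length, which is one plausible reason two systems could disagree on average velocity while remaining linearly related.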
The simultaneous recognition of multiple objects in one image remains a challenging task, involving multiple difficulties in the recognition field such as various object scales, inconsistent appearances, and confused inter-class relationships. Recent research efforts mainly resort to statistical label co-occurrences and linguistic word embeddings to enhance the unclear semantics. Unlike these works, in this paper, we propose a novel Transformer-based Dual Relation learning framework, constructing complementary relationships by exploring two aspects of correlation, i.e., structural relation graph and semantic relation graph. The structural relation graph aims to capture long-range correlations from object context, by developing a cross-scale transformer-based architecture. The semantic graph dynamically models the semantic meanings of image objects with explicit semantic-aware constraints. In addition, we also incorporate the learnt structural relationship into the semantic graph, constructing a joint relation graph for robust representations. With the collaborative learning of these two effective relation graphs, our approach achieves new state-of-the-art on two popular multi-label recognition benchmarks, i.e., the MS-COCO and VOC 2007 datasets.
A major bottleneck of pedestrian detection lies in the sharp performance deterioration in the presence of small-size pedestrians that are relatively far from the camera. Motivated by the observation that pedestrians of disparate spatial scales exhibit distinct visual appearances, we propose in this paper an active pedestrian detector that explicitly operates over multiple-layer neuronal representations of the input still image. More specifically, convolutional neural nets, such as ResNet and Faster R-CNN, are exploited to provide a rich and discriminative hierarchy of feature representations, as well as initial pedestrian proposals. Here each pedestrian observation of distinct size could be best characterized in terms of the ResNet feature representation at a certain layer of the hierarchy. Meanwhile, initial pedestrian proposals are obtained by the Faster R-CNN techniques, i.e., a region proposal network and a follow-up region-of-interest pooling layer employed right after the specific ResNet convolutional layer of interest, to produce joint predictions on the bounding-box proposals' locations and categories (i.e., pedestrian or not). This is engaged as an input to our active detector, where for each initial pedestrian proposal, a sequence of coordinate transformation actions is carried out to determine its proper x-y 2D location and the layer of feature representation, or eventually terminated as being background. Empirically our approach is demonstrated to produce overall lower detection errors on widely used benchmarks, and it works particularly well with far-scale pedestrians. For example, compared with 60.51% log-average miss rate of the state-of-the-art MS-CNN for far-scale pedestrians (those below 80 pixels in bounding-box height) of the Caltech benchmark, the miss rate of our approach is 41.85%, with a notable reduction of 18.66%.
There are two sides to every story of visual saliency modeling in the frequency domain. On the one hand, image saliency can be effectively estimated by applying simple operations to the frequency spectrum. On the other hand, it is still unclear which part of the frequency spectrum contributes the most to popping-out targets and suppressing distractors. Toward this end, this paper tentatively explores the secret of image saliency in the frequency domain. From the results obtained in several qualitative and quantitative experiments, we find that the secret of visual saliency may mainly hide in the phases of intermediate frequencies. To explain this finding, we reinterpret the concept of discrete Fourier transform from the perspective of template-based contrast computation and thus develop several principles for designing the saliency detector in the frequency domain. Following these principles, we propose a novel approach to design the saliency detector under the assistance of prior knowledge obtained through both unsupervised and supervised learning processes. Experimental results on a public image benchmark show that the learned saliency detector outperforms 18 state-of-the-art approaches in predicting human fixations.
Pan-sharpening aims at producing a high-resolution (HR) multi-spectral (MS) image from a low-resolution (LR) multi-spectral (MS) image and its corresponding panchromatic (PAN) image acquired by the same satellite. Inspired by a recent trend in the deep learning community, we propose a novel Transformer based model for pan-sharpening. We explore the potential of Transformer in image feature extraction and fusion. Following the successful development of vision transformers, we design a two-stream network with self-attention to extract the modality-specific features from the PAN and MS modalities and apply a cross-attention module to merge the spectral and spatial features. The pan-sharpened image is produced from the enhanced fused features. Extensive experiments on GaoFen-2 and WorldView-3 images demonstrate that our Transformer based model achieves impressive results and outperforms many existing CNN based methods, which shows the great potential of introducing Transformer to the pan-sharpening task. Codes are available at https://github.com/zhysora/PanFormer.
The crucial problem in vehicle re-identification is to find the same vehicle identity when reviewing this object from cross-view cameras, which sets a higher demand for learning viewpoint-invariant representations. In this paper, we propose to solve this problem from two aspects: constructing robust feature representations and proposing camera-sensitive evaluations. We first propose a novel Heterogeneous Relational Complement Network (HRCN) by incorporating region-specific features and cross-level features as complements for the original high-level output. Considering the distributional differences and semantic misalignment, we propose graph-based relation modules to embed these heterogeneous features into one unified high-dimensional space. On the other hand, considering the deficiencies of cross-camera evaluations in existing measures (i.e., CMC and AP), we then propose a Cross-camera Generalization Measure (CGM) to improve the evaluations by introducing position-sensitivity and cross-camera generalization penalties. We further construct a new benchmark of existing models with our proposed CGM and experimental results reveal that our proposed HRCN model achieves new state-of-the-art in VeRi-776, VehicleID, and VERI-Wild.
In this paper, a competitive method for 3-D face recognition (FR) using spherical harmonic features (SHF) is proposed. With this solution, 3-D face models are characterized by the energies contained in spherical harmonics with different frequencies, thereby enabling the capture of both gross shape and fine surface details of a 3-D facial surface. This is in clear contrast to most 3-D FR techniques which are either holistic or feature based, using local features extracted from distinctive points. First, 3-D face models are represented in a canonical representation, namely, spherical depth map, by which SHF can be calculated. Then, considering the predictive contribution of each SHF feature, especially in the presence of facial expression and occlusion, feature selection methods are used to improve the predictive performance and provide faster and more cost-effective predictors. Experiments have been carried out on three public 3-D face datasets, SHREC2007, FRGC v2.0, and Bosphorus, with increasing difficulties in terms of facial expression, pose, and occlusion, and which demonstrate the effectiveness of the proposed method.
As the next generation standard of video coding, the High Efficiency Video Coding (HEVC) achieves significantly better coding efficiency than all existing video coding standards. A Coding Unit (CU) quad tree concept is introduced to HEVC to improve the coding efficiency. Each CU node in quad tree will be traversed by depth first search process to find the best Coding Tree Unit (CTU) partition. Although this quad tree search process can obtain the best CTU partition, it is very time consuming, especially in interframe coding. To alleviate the encoder computation load in interframe coding, a fast CU depth decision method is proposed by reducing the depth search range. Based on the depth information correlation between spatio-temporal adjacent CTUs and the current CTU, some depths can be adaptively excluded from the depth search process in advance. Experimental results show that the proposed scheme provides almost 30% encoder time savings on average compared to the default encoding scheme in HM8.0 with only 0.38% bit rate increment in coding performance.
In this paper, we propose a transformer based approach for visual grounding. Unlike existing proposal-and-rank frameworks that rely heavily on pretrained object detectors or proposal-free frameworks that upgrade an off-the-shelf one-stage detector by fusing textual embeddings, our approach is built on top of a transformer encoder-decoder and is independent of any pretrained detectors or word embedding models. Termed VGTR – Visual Grounding with TRansformers, our approach is designed to learn semantic-discriminative visual features under the guidance of the textual description without harming their location ability. This information flow gives our VGTR a strong capability in capturing context-level semantics of both vision and language modalities, enabling us to aggregate accurate visual clues implied by the description to locate the object instance of interest. Experiments show that our method outperforms state-of-the-art proposal-free approaches by a considerable margin on four benchmarks.
Human action recognition in videos has been extensively studied in recent years due to its wide range of applications. Instead of classifying video sequences into a number of action categories, in this paper, we focus on a particular problem of action similarity labeling (ASLAN), which aims at verifying whether a pair of videos contain the same type of action or not. To address this challenge, a novel approach called compressive sequential learning (CSL) is proposed by leveraging the compressive sensing theory and sequential learning. We first project data points to a low-dimensional space by effectively exploring an important property in compressive sensing: the restricted isometry property. In particular, a very sparse measurement matrix is adopted to reduce the dimensionality efficiently. We then learn an ensemble classifier for measuring similarities between pairwise videos by iteratively minimizing its empirical risk with the AdaBoost strategy on the training set. Unlike conventional AdaBoost, the weak learner for each iteration is not explicitly defined and its parameters are learned through greedy optimization. Furthermore, an alternative of CSL named compressive sequential encoding is developed as an encoding technique and followed by a linear classifier to address the similarity-labeling problem. Our method has been systematically evaluated on four action data sets: ASLAN, KTH, HMDB51, and Hollywood2, and the results show the effectiveness and superiority of our method for ASLAN.
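The dimensionality-reduction step described here can be illustrated with a very sparse random projection. The abstract does not say which sparse measurement matrix is used, so the sketch below assumes one well-known construction: entries drawn from {+1, 0, -1} with probabilities 1/(2s), 1 - 1/s, 1/(2s) and s = sqrt(d), scaled so that squared norms are preserved in expectation. The function names and the seed parameter are hypothetical.

```python
import math
import random
from typing import List


def sparse_measurement_matrix(k: int, d: int, seed: int = 0) -> List[List[float]]:
    """Build a k x d very sparse random matrix (assumed construction, s = sqrt(d))."""
    rng = random.Random(seed)
    s = math.sqrt(d)
    scale = math.sqrt(s / k)  # keeps E[||Rx||^2] equal to ||x||^2
    matrix = []
    for _ in range(k):
        row = []
        for _ in range(d):
            u = rng.random()
            if u < 1.0 / (2.0 * s):
                row.append(scale)       # +1 entry, probability 1/(2s)
            elif u < 1.0 / s:
                row.append(-scale)      # -1 entry, probability 1/(2s)
            else:
                row.append(0.0)         # zero entry, probability 1 - 1/s
        matrix.append(row)
    return matrix


def project(matrix: List[List[float]], x: List[float]) -> List[float]:
    """Map a d-dimensional feature vector to k dimensions, skipping zero entries."""
    return [sum(w * v for w, v in zip(row, x) if w != 0.0) for row in matrix]
```

Because only about a 1/sqrt(d) fraction of entries is nonzero, the projection touches few input coordinates per output dimension, which is the source of the efficiency the abstract refers to.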
Hua'er, a type of traditional oral performance, is one of the national intangible cultural heritages (ICH) in China. Experts have been trying to enhance public's awareness of Hua'er protection through digital documentation technology, but there's no efficacious means to attract interest and popularize knowledge yet. In this paper, we propose an interactive VR system engaging audiences to experience and understand the connotation of Hua'er performance. Based on an online survey with the public and in-depth interviews with Hua'er experts, we derive a set of design requirements for generating embodied storytelling and interactive experience in VR. Accordingly, we design “Hua'er and the Youth” (HY) that integrates three methods (virtual avatar, participatory performance, and game-based knowledge acquisition) and conduct a between-subjects user study to compare HY with the Comparison system. The results suggest that our methods significantly improved audience's interactive experience, knowledge level and the awareness of ICH safeguarding.
Personalized federated learning (PFL) reduces the impact of non-independent and identically distributed (non-IID) data among clients by allowing each client to train a personalized model when collaborating with others. A key question in PFL is to decide which parameters of a client should be localized or shared with others. In current mainstream approaches, all layers that are sensitive to non-IID data (such as classifier layers) are generally personalized. The reasoning behind this approach is understandable, as localizing parameters that are easily influenced by non-IID data can prevent the potential negative effect of collaboration. However, we believe that this approach is too conservative for collaboration. For example, for a certain client, even if its parameters are easily influenced by non-IID data, it can still benefit by sharing these parameters with clients having similar data distribution. This observation emphasizes the importance of considering not only the sensitivity to non-IID data but also the similarity of data distribution when determining which parameters should be localized in PFL. This paper introduces a novel guideline for client collaboration in PFL. Unlike existing approaches that prohibit all collaboration of sensitive parameters, our guideline allows clients to share more parameters with others, leading to improved model performance. Additionally, we propose a new PFL method named FedCAC, which employs a quantitative metric to evaluate each parameter’s sensitivity to non-IID data and carefully selects collaborators based on this evaluation. Experimental results demonstrate that FedCAC enables clients to share more parameters with others, resulting in superior performance compared to state-of-the-art methods, particularly in scenarios where clients have diverse distributions. The code is integrated into our FL training framework: https://github.com/kxzxvbk/Fling.
Seizure prediction of epileptic preictal period through electroencephalogram (EEG) signals is important for clinical epilepsy diagnosis. However, recent deep learning-based methods commonly employ intra-subject training strategy and need sufficient data, which are laborious and time-consuming for a practical system and pose a great challenge for seizure predicting. Besides, multi-domain characterizations, including spatio-temporal-spectral dependencies in an epileptic brain are generally neglected or not considered simultaneously in current approaches, and this insufficiency commonly leads to suboptimal seizure prediction performance. To tackle the above issues, in this paper, we propose Contrastive Learning for Epileptic seizure Prediction (CLEP) using a Spatio-Temporal-Spectral Network (STS-Net). Specifically, the CLEP learns intrinsic epileptic EEG patterns across subjects by contrastive learning. The STS-Net extracts multi-scale temporal and spectral representations under different rhythms from raw EEG signals. Then, a novel triple attention layer (TAL) is employed to construct inter-dimensional interaction among multi-domain features. Moreover, a spatio dynamic graph convolution network (sdGCN) is proposed to dynamically model the spatial relationships between electrodes and aggregate spatial information. The proposed CLEP-STS-Net achieves a sensitivity of 96.7% and a false prediction rate of 0.072/h on the CHB-MIT scalp EEG database. We also validate the proposed method on clinical intracranial EEG (iEEG) database from our Xuanwu Hospital of Capital Medical University, and the predicting system yielded a sensitivity of 95%, a false prediction rate of 0.087/h. The experimental results outperform the state-of-the-art studies which validate the efficacy of our method. Our code is available at https://github.com/LianghuiGuo/CLEP-STS-Net.