State Key Laboratory of Software Engineering
Facility, Hubei, China
Research output, citation impact, and the most-cited recent papers from State Key Laboratory of Software Engineering. Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from State Key Laboratory of Software Engineering
Deep networks have achieved excellent performance in learning representations from visual data. However, supervised deep models such as convolutional neural networks require large quantities of labeled data, which are very expensive to obtain. To solve this problem, this paper proposes an unsupervised deep network, called the stacked convolutional denoising auto-encoders, which can map images to hierarchical representations without any label information. The network, optimized by layer-wise training, is constructed by stacking layers of denoising auto-encoders in a convolutional way. In each layer, high dimensional feature maps are generated by convolving features of the lower layer with kernels learned by a denoising auto-encoder. The auto-encoder is trained on patches extracted from feature maps in the lower layer to learn robust feature detectors. To better train the large network, a layer-wise whitening technique is introduced into the model. Before each convolutional layer, a whitening layer is embedded to sphere the input data. By layers of mapping, raw images are transformed into high-level feature representations which would boost the performance of the subsequent support vector machine classifier. The proposed algorithm is evaluated by extensive experiments and demonstrates superior classification performance to state-of-the-art unsupervised networks.
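The whitening step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say which whitening transform is used, so ZCA whitening is assumed here, and the function name `zca_whiten` and the `eps` regularizer are hypothetical choices for illustration only.

```python
import numpy as np

def zca_whiten(patches, eps=1e-8):
    """Sphere row-vector patches so their covariance is close to identity.

    ZCA whitening is an assumption; the abstract only says the input is sphered.
    """
    X = np.asarray(patches, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center each feature
    cov = Xc.T @ Xc / (len(X) - 1)                 # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)         # cov is symmetric
    # Rotate, rescale each principal axis to unit variance, rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, mean, W
```

The returned `mean` and `W` would be reused to whiten unseen patches in the same way as the training patches.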
Context: Software defect prediction (SDP) is an important challenge in the field of software engineering, hence much research work has been conducted, most notably through the use of machine learning algorithms. However, class-imbalance typified by few defective components and many non-defective ones is a common occurrence causing difficulties for these methods. Imbalanced learning aims to deal with this problem and has recently been deployed by some researchers, unfortunately with inconsistent results. Objective: We conduct a comprehensive experiment to explore (a) the basic characteristics of this problem; (b) the effect of imbalanced learning and its interactions with (i) data imbalance, (ii) type of classifier, (iii) input metrics and (iv) imbalanced learning method. Method: We systematically evaluate 27 data sets, 7 classifiers, 7 types of input metrics and 17 imbalanced learning methods (including doing nothing) using an experimental design that enables exploration of interactions between these factors and individual imbalanced learning algorithms. This yields 27 × 7 × 7 × 17 = 22491 results. The Matthews correlation coefficient (MCC) is used as an unbiased performance measure (unlike the more widely used F1 and AUC measures). Results: (a) we found a large majority (87 percent) of 106 public domain data sets exhibit moderate or low level of imbalance (imbalance ratio < 10; median = 3.94); (b) anything other than low levels of imbalance clearly harms the performance of traditional learning for SDP; (c) imbalanced learning is more effective on the data sets with moderate or higher imbalance, however negative results are always possible; (d) type of classifier has most impact on the improvement in classification performance followed by the imbalanced learning method itself. Type of input metrics is not influential. (e) only 52% of the combinations of Imbalanced Learner and Classifier have a significant positive effect.
Conclusion: This paper offers two practical guidelines. First, imbalanced learning should only be considered for moderate or highly imbalanced SDP data sets. Second, the appropriate combination of imbalanced method and classifier needs to be carefully chosen to ameliorate the imbalanced learning problem for SDP. In contrast, the indiscriminate application of imbalanced learning can be harmful.
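As a rough illustration of the quantities named in the abstract above, the sketch below computes the size of the experimental design, an imbalance ratio, and the MCC from a confusion matrix. The imbalance ratio is assumed to be the number of non-defective modules divided by the number of defective ones, and returning 0.0 when the MCC denominator vanishes is a common convention rather than something the abstract states. The function names are hypothetical.

```python
import math

# Size of the full factorial design: data sets x classifiers x metrics x methods.
N_RESULTS = 27 * 7 * 7 * 17

def imbalance_ratio(n_non_defective, n_defective):
    """Assumed definition: majority-class count over minority-class count."""
    return n_non_defective / n_defective

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumption): report 0.0 when a row or column is empty.
    return numerator / denominator if denominator else 0.0
```

For example, a classifier that labels every module as non-defective on a data set with 394 clean and 100 defective modules is right about 80 percent of the time, yet `mcc(0, 0, 394, 100)` is 0.0, which is why a measure such as MCC is attractive under imbalance.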
Person re-identification has been widely studied due to its importance in surveillance and forensics applications. In practice, gallery images are high-resolution (HR) while probe images are usually low-resolution (LR) in the identification scenarios with large variation of illumination, weather or quality of cameras. Person re-identification in this kind of scenario, which we call super-resolution (SR) person re-identification, has not been well studied. In this paper, we propose a semi-coupled low-rank discriminant dictionary learning (SLD²L) approach for SR person re-identification. For the given training image set which consists of HR gallery and LR probe images, we aim to convert the features of LR images into discriminating HR features. Specifically, our approach learns a pair of HR and LR dictionaries and a mapping from the features of HR gallery images and LR probe images. To ensure that the converted features using the learned dictionaries and mapping have favorable discriminative capability, we design a discriminant term which requires that the converted HR features of LR probe images be close to the features of HR gallery images from the same person, but far away from the features of HR gallery images from different persons. In addition, we apply low-rank regularization in the dictionary learning procedure such that the learned dictionaries can well characterize the intrinsic feature space of HR and LR images. Experimental results on public datasets demonstrate the effectiveness of SLD²L.
Recent years have witnessed growing interest in hyperspectral image (HSI) processing. In practice, however, HSIs always suffer from huge data size and a mass of redundant information, which hinder their application in many cases. HSI compression is a straightforward way of relieving these problems. However, most conventional image encoding algorithms mainly focus on the spatial dimensions, and they do not consider the redundancy in the spectral dimension. In this paper, we propose a novel HSI compression and reconstruction algorithm via patch-based low-rank tensor decomposition (PLTD). Instead of processing the HSI separately by spectral channel or by pixel, we represent each local patch of the HSI as a third-order tensor. Then, the similar tensor patches are grouped by clustering to form a fourth-order tensor per cluster. Since the grouped tensor is assumed to be redundant, each cluster can be approximately decomposed into a coefficient tensor and three dictionary matrices, which leads to a low-rank tensor representation of both the spatial and spectral modes. The reconstructed HSI can then be simply obtained by the product of the coefficient tensor and dictionary matrices per cluster. In this way, the proposed PLTD algorithm simultaneously removes the redundancy in both the spatial and spectral domains in a unified framework. The extensive experimental results on various public HSI datasets demonstrate that the proposed method outperforms the traditional image compression approaches and other tensor-based methods.
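The decomposition of one cluster into a coefficient tensor and three dictionary matrices, and the reconstruction by their product, can be sketched as follows. This is a truncated higher-order SVD used as a stand-in; the abstract does not describe how PLTD actually fits the factors, and the axis layout (height, width, band, patch index) and all function names are assumptions.

```python
import numpy as np

def mode_basis(tensor, mode, rank):
    """Leading left singular vectors of the mode-`mode` unfolding."""
    unfolded = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return u[:, :rank]

def compress_cluster(cluster, ranks):
    """Factor a (height, width, band, patch) tensor along its first three modes."""
    u1, u2, u3 = (mode_basis(cluster, m, r) for m, r in enumerate(ranks))
    # Project onto the three dictionaries; the patch axis is left untouched.
    core = np.einsum("hwbn,hi,wj,bk->ijkn", cluster, u1, u2, u3)
    return core, (u1, u2, u3)

def reconstruct_cluster(core, dictionaries):
    """Multiply the coefficient tensor back by the three dictionaries."""
    u1, u2, u3 = dictionaries
    return np.einsum("ijkn,hi,wj,bk->hwbn", core, u1, u2, u3)
```

Storage drops from the full cluster to the core plus three thin matrices, so the chosen ranks directly trade reconstruction error against compression ratio.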
By exploiting multipath fading channels as a source of common randomness, physical layer (PHY) based key generation protocols allow two terminals with correlated observations to generate secret keys with information-theoretical security. The state of the art, however, still suffers from major limitations, e.g., low key generation rate, low entropy of key bits and a high reliance on node mobility. In this paper, a novel cooperative key generation protocol is developed to facilitate high-rate key generation in narrowband fading channels, where two keying nodes extract the phase randomness of the fading channel with the aid of relay node(s). For the first time, we explicitly consider the effect of estimation methods on the extraction of secret key bits from the underlying fading channels and focus on a popular statistical method, maximum likelihood estimation (MLE). The performance of the cooperative key generation scheme is extensively evaluated theoretically. We successfully establish both a theoretical upper bound on the maximum secret key rate from mutual information of correlated random sources and a more practical upper bound from the Cramér-Rao bound (CRB) in estimation theory. Numerical examples and simulation studies are also presented to demonstrate the performance of the cooperative key generation system. The results show that the key rate can be improved by a couple of orders of magnitude compared to the existing approaches.
Twitter has attracted millions of users to share and disseminate the most up-to-date information, resulting in large volumes of data produced every day. However, many applications in Information Retrieval (IR) and Natural Language Processing (NLP) suffer severely from the noisy and short nature of tweets. In this paper, we propose a novel framework for tweet segmentation in a batch mode, called HybridSeg. By splitting tweets into meaningful segments, the semantic or context information is well preserved and easily extracted by the downstream applications. HybridSeg finds the optimal segmentation of a tweet by maximizing the sum of the stickiness scores of its candidate segments. The stickiness score considers the probability of a segment being a phrase in English (i.e., global context) and the probability of a segment being a phrase within the batch of tweets (i.e., local context). For the latter, we propose and evaluate two models to derive local context by considering the linguistic features and term-dependency in a batch of tweets, respectively. HybridSeg is also designed to iteratively learn from confident segments as pseudo feedback. Experiments on two tweet data sets show that tweet segmentation quality is significantly improved by learning both global and local contexts compared with using global context alone. Through analysis and comparison, we show that local linguistic features are more reliable for learning local context compared with term-dependency. As an application, we show that high accuracy is achieved in named entity recognition by applying segment-based part-of-speech (POS) tagging.
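Maximizing the sum of stickiness scores over all ways of splitting a tweet is a natural fit for dynamic programming. The sketch below shows that idea only; the `stickiness` callable, the `max_len` cap on segment length, and the function name are hypothetical, and the abstract does not say how HybridSeg itself performs the search.

```python
def segment(tokens, stickiness, max_len=3):
    """Split `tokens` into consecutive segments with maximal total stickiness.

    `stickiness` maps a tuple of tokens to a score (hypothetical interface).
    """
    n = len(tokens)
    best = [float("-inf")] * (n + 1)   # best[i]: top score for tokens[:i]
    back = [0] * (n + 1)               # back[i]: start of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + stickiness(tuple(tokens[start:end]))
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the tweet to recover segments.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tokens[start:end])
        end = start
    return segments[::-1]
```

With a scoring function that rewards known phrases, a multi-word phrase is kept together whenever its score exceeds the combined score of its parts.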
In this paper, DMOEA-DD, which improves DMOEA by using a domain decomposition technique, is applied to the CEC 2009 MOEA competition test instances, which are multiobjective optimization problems (MOPs) with complicated Pareto set (PS) geometry. Performance is assessed using IGD as the metric.
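For readers unfamiliar with the metric, IGD (inverted generational distance) averages, over a set of reference points on the true Pareto front, the distance from each reference point to its closest obtained solution. The sketch below uses Euclidean distance in objective space; the function and argument names are hypothetical.

```python
import math

def igd(reference_front, approximation):
    """Mean distance from each reference point to its nearest obtained solution."""
    total = sum(
        min(math.dist(ref, sol) for sol in approximation)
        for ref in reference_front
    )
    return total / len(reference_front)
```

Lower values are better, and a value of zero means every reference point is matched exactly by some solution.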
Clustering is one of the research hotspots in the field of data mining and has extensive applications in practice. Recently, Rodriguez and Laio [1] published a clustering algorithm in Science that identifies the clustering centers in an intuitive way and clusters objects efficiently and effectively. However, the algorithm is sensitive to a preassigned parameter and has difficulty identifying the “ideal” number of clusters. To overcome these shortcomings, this paper proposes a new clustering algorithm that can detect the clustering centers automatically via statistical testing. Specifically, the proposed algorithm first defines a new metric to measure the density of an object that is more robust to the preassigned parameter, and then derives a metric to evaluate the centrality of each object. Afterwards, it identifies the objects with extremely large centrality metrics as the clustering centers via an outward statistical testing method. Finally, it groups the remaining objects into clusters containing their nearest neighbors with higher density. Extensive experiments are conducted over different kinds of clustering data sets to evaluate the performance of the proposed algorithm and compare it with the algorithm in Science. The results show the effectiveness and robustness of the proposed algorithm.
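The baseline that the abstract builds on can be sketched roughly as follows: compute a local density for each object, the distance to its nearest denser object, pick centers where both are large, and attach every other object to the cluster of its nearest denser neighbor. This is a simplified reading of the Rodriguez and Laio scheme, not the proposed method: the cutoff `d_c` and the number of centers are passed in by hand, which is exactly what the statistical test in the paper is meant to avoid. All names are hypothetical.

```python
import numpy as np

def density_peaks(points, d_c, n_centers):
    """Cluster by density peaks with a hard cutoff d_c (baseline sketch)."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (dist < d_c).sum(axis=1) - 1            # neighbors within d_c, minus self
    order = np.argsort(-rho, kind="stable")       # indices by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()        # densest point: use max distance
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]                     # points ranked denser than i
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j
    gamma = rho * delta                           # large for dense, isolated points
    gamma[order[0]] = np.inf                      # densest point is always a center
    centers = np.argsort(-gamma, kind="stable")[:n_centers]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_centers)
    for i in order:                               # denser points are labeled first
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, centers
```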
Nowadays, massive crowd-sourced data collected from mobile phone users have become widely available, which enables many important data mining applications that improve the quality of our daily lives. While providing tremendous benefits, the release of these data to the public will pose a considerable threat to mobile users' privacy. To solve this problem, the notion of differential privacy has been proposed to provide privacy with theoretical guarantee, and recently it has been applied in streaming data publishing. However, most of the existing literature focuses on either event-level privacy on infinite streams or user-level privacy on finite streams. In this paper, we investigate the problem of real-time spatiotemporal crowd-sourced data publishing with privacy preservation. Specifically, we consider continuous publication of population statistics for monitoring purposes and design RescueDP, an online aggregate monitoring scheme over infinite streams with privacy guarantee. RescueDP's key components include adaptive sampling, adaptive budget allocation, dynamic grouping, perturbation and filtering, which are seamlessly integrated as a whole to provide privacy-preserving statistics publishing on infinite time stamps. We show that RescueDP can achieve w-event privacy over data generated and published periodically by crowd users. We evaluate our scheme with real-world as well as synthetic datasets and compare it with two w-event privacy-assured representative benchmarks. Experimental results show that our solution outperforms the existing methods and improves the utility with strong privacy guarantee.
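To make the w-event privacy constraint concrete, the sketch below shows the simplest possible scheme rather than RescueDP itself: every timestamp gets an equal share ε/w of the budget and the count is perturbed with Laplace noise, so any window of w consecutive timestamps spends at most ε. The uniform split, the unit sensitivity, and all function names are assumptions; RescueDP's adaptive sampling, budget allocation, grouping, and filtering are not modeled.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def publish_uniform(counts, epsilon, w, sensitivity=1.0, seed=None):
    """Perturb each count using an equal budget share epsilon / w."""
    rng = random.Random(seed)
    eps_t = epsilon / w                    # budget spent at every timestamp
    scale = sensitivity / eps_t            # Laplace mechanism noise scale
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    return noisy, [eps_t] * len(counts)

def satisfies_w_event(budgets, epsilon, w, tol=1e-9):
    """Check that every window of w consecutive budgets sums to at most epsilon."""
    return all(sum(budgets[i:i + w]) <= epsilon + tol
               for i in range(len(budgets)))
```

An adaptive scheme would spend less budget when the statistics change slowly and more when they change quickly, while still passing the same window check.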
Reducing the energy consumption of large data centers is of critical significance, where a few percent in consumption reduction translates into million-dollar savings. This work studies energy conservation on emerging CPU-GPU hybrid clusters through dynamic voltage and frequency scaling (DVFS). We aim at minimizing the total energy consumption of processing a sequence of real-time tasks under deadline constraints. We compute the appropriate voltage/frequency setting for each task through mathematical optimization, and assign multiple tasks to the cluster with heuristic scheduling algorithms. In performance evaluation driven by real-world power measurement traces, our scheduling algorithm shows comparable energy savings to the theoretical upper bound. With a GPU scaling interval where analytically at most 38% of energy can be saved, we record 30-36% of energy savings. Our results are applicable to energy management on modern heterogeneous clusters. In particular, our model stresses the nonlinear relationship between task execution time and processor speed for GPU-accelerated applications, for more accurately capturing real-world GPU energy consumption.
Network Function Virtualization (NFV) is emerging as a new paradigm for providing elastic network functions through flexible virtual network function (VNF) instances executed on virtualized computing platforms exemplified by cloud datacenters. In the new NFV market, well-defined VNF instances each realize an atomic function that can be chained to meet user demands in practice. This work studies the dynamic market mechanism design for the transaction of VNF service chains in the NFV market, to help unleash the full power of NFV. Combining the techniques of primal-dual approximation algorithm design with Myerson's characterization of truthful mechanisms, we design a VNF chain auction that runs efficiently in polynomial time, guarantees truthfulness, and achieves near-optimal social welfare in the NFV eco-system. Extensive simulation studies verify the efficacy of our auction mechanism.
Video-based person re-identification (re-id) is an important application in practice. Since large variations exist between different pedestrian videos, as well as within each video, it's challenging to conduct re-identification between pedestrian videos. In this paper, we propose a simultaneous intra-video and inter-video distance learning (SI2DL) approach for video-based person re-id. Specifically, SI2DL simultaneously learns an intra-video distance metric and an inter-video distance metric from the training videos. The intra-video distance metric is used to make each video more compact, and the inter-video one is used to ensure that the distance between truly matching videos is smaller than that between wrongly matched videos. Considering that the goal of distance learning is to make truly matching video pairs from different persons well separated from each other, we also propose a pair separation based SI2DL (P-SI2DL). P-SI2DL aims to learn a pair of distance metrics, under which any two truly matching video pairs can be well separated. Experiments on four public pedestrian image sequence datasets show that our approaches achieve state-of-the-art performance.
Person re-identification, as an important task in video surveillance and forensics applications, has been widely studied. However, most previous approaches are based on the key assumption that images for comparison have the same resolution and a uniform scale. Some recent works investigate how to match low-resolution query images against high-resolution gallery images, but still assume that the low-resolution query images have the same scale. In real scenarios, person images may not only be low-resolution but also have different scales. Through investigating the distance variation behavior by changing image scales, we observe that scale-distance functions, generated by image pairs under different scales from the same person or different persons, are distinguishable and can be classified as feasible (for a pair of images from the same person) or infeasible (for a pair of images from different persons). The scale-distance functions are further represented by parameter vectors in the scale-distance function space. On this basis, we propose to learn a discriminating surface separating these feasible and infeasible functions in the scale-distance function space, and use it for re-identifying persons. Experimental results on two simulated datasets and one public dataset demonstrate the effectiveness of the proposed framework.
As defects in software modules may cause product failure and financial loss, it is critical to utilize defect prediction methods to effectively identify the potentially defective modules for a thorough inspection, especially in the early stage of software development lifecycle. For an upcoming version of a software project, it is practical to employ the historical labeled defect data of the prior versions within the same project to conduct defect prediction on the current version, i.e., Cross-Version Defect Prediction (CVDP). However, software development is a dynamic evolution process that may cause the data distribution (such as defect characteristics) to vary across versions. Furthermore, the raw features usually may not well reveal the intrinsic structure information behind the data. Therefore, it is challenging to perform effective CVDP. In this paper, we propose a two-phase CVDP framework that combines Hybrid Active Learning and Kernel PCA (HALKP) to address these two issues. In the first stage, HALKP uses a hybrid active learning method to select some informative and representative unlabeled modules from the current version for querying their labels, then merges them into the labeled modules of the prior version to form an enhanced training set. In the second stage, HALKP employs a non-linear mapping method, kernel PCA, to extract representative features by embedding the original data of two versions into a high-dimension space. We evaluate the HALKP framework on 31 versions of 10 projects with three prevalent performance indicators. The experimental results indicate that HALKP achieves encouraging results with average F-measure, g-mean and Balance of 0.480, 0.592 and 0.580, respectively and significantly outperforms nearly all baseline methods.
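The three indicators reported in the abstract above can be computed from a confusion matrix. The definitions below are the ones commonly used in defect prediction studies and are assumed here, since the abstract does not spell them out: F-measure is the harmonic mean of precision and recall, g-mean is the geometric mean of recall and specificity, and Balance is one minus the normalized distance from the ideal point of zero false alarms and full recall. The function name is hypothetical.

```python
import math

def sdp_indicators(tp, fp, tn, fn):
    """Return (F-measure, g-mean, Balance) under the assumed definitions."""
    recall = tp / (tp + fn) if tp + fn else 0.0          # probability of detection
    false_alarm = fp / (fp + tn) if fp + tn else 0.0     # probability of false alarm
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * (1 - false_alarm))
    # Distance from the ideal point (false_alarm = 0, recall = 1), scaled to [0, 1].
    balance = 1 - math.sqrt(false_alarm ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    return f_measure, g_mean, balance
```

All three indicators lie between 0 and 1 and reach 1 only for a perfect classifier, which puts the reported averages of 0.480, 0.592 and 0.580 in context.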
Classes are the basic modules in object-oriented (OO) software, which consist of attributes and methods. Thus, in an OO environment, cohesion mainly concerns how tight the attributes and methods of classes are. This paper discusses the relationships between attributes and attributes, attributes and methods, and methods and methods of a class based on dependence analysis. Then we discuss the properties of these relationships. According to these properties, this paper proposes a novel approach to measuring class cohesion. Our approach overcomes the limitations of previous class cohesion measures, which consider only one or two of the three relationships in a class. We also prove that this measure satisfies the properties that a good measurement should have.
Software architecture (SA) documentation provides a blueprint of a software-intensive system for the communication between stakeholders about the high-level design of the system. In open source software (OSS) development, a lack of SA documentation may hinder the use and further development of OSS, but how much "architecture" documentation is enough and appropriate is largely dependent on the contextual factors of development. In order to understand the state of the practice of SA documentation in OSS projects, we conducted a documentation-based survey to explore how SA is documented in OSS projects. Out of 2,000 OSS projects from four major OSS sources, we found that 108 projects have some SA documentation, which shows that SA documentation is scarce in OSS development. We analyzed these 108 projects to understand what SA information has been documented and how it has been described. We found that the frequently documented architectural information is model, system, and mission, and that natural language is the most frequently used architectural language for specifying architectural information in OSS SA documents. The results also show that the likelihood that an OSS project will document SA is increased when more developers are involved in the project, and industry and research OSS projects are more likely to create SA documents than freelance projects.
Extracting adverse drug events receives much research attention in the biomedical community. Previous work adopts pipeline models, firstly recognizing drug/disease entity mentions and then identifying adverse drug events from drug/disease pairs. In this paper, we investigate joint models for simultaneously extracting drugs, diseases and adverse drug events. Compared with pipeline models, joint models have two main advantages. First, they make use of information integration to facilitate performance improvement; second, they reduce error propagation in pipeline methods. We compare a discrete model and a deep neural model for extracting drugs, diseases and adverse drug events jointly. Experimental results on a standard ADE corpus show that the discrete joint model outperforms a state-of-the-art baseline pipeline significantly. In addition, when discrete features are replaced by neural features, the recall is further improved.
Software developers can search, share and learn development experience, solutions, bug fixes and open source projects in software information sites such as StackOverflow and Freecode. Many software information sites rely on tags to classify their contents, i.e. software objects, in order to improve the performance and accuracy of various operations on the sites. The quality of tags thus has a significant impact on the usefulness of these sites. High quality tags are expected to be concise and to describe the most important features of the software objects. Unfortunately, tagging is inherently an uncoordinated process. The choice of tags made by individual software developers is dependent not only on a developer's understanding of the software object but also on the developer's English skills and preferences. As a result, the number of different tags grows rapidly along with continuous addition of software objects. With thousands of different tags, many of which introduce noise, software objects become poorly classified. This phenomenon negatively affects the speed and accuracy of developers' queries. In this paper, we propose a tool called TagMulRec to automatically recommend tags and classify software objects in evolving large-scale software information sites. Given a new software object, TagMulRec locates the software objects that are semantically similar to the new one and exploits their tags. We have evaluated TagMulRec on four software information sites, StackOverflow, AskUbuntu, AskDifferent and Freecode. According to our empirical study, TagMulRec is not only accurate but also scalable, handling a large-scale software information site with millions of software objects and thousands of tags.
The smart grid is an indispensable part of the sustainable development of smart cities. Sensor technology in the smart grid enables interactive real-time data transmission between the cloud and the edge of the network. There are a number of research challenges in the design of smart grids. One of these research challenges is balancing customer privacy and the cloud-based power system's function optimization. Identity-based encryption with equality test (IBEET) has recently been identified as a viable solution, in which customers can delegate a trapdoor to the power system control server and the server then searches on the encrypted data to determine whether two different ciphertexts are encryptions of the same plaintext. Unfortunately, existing schemes are inefficient, and the trapdoor could be used to perform an equality test on any message, which leaks private information. In this paper, we propose an efficient IBEET scheme with bilinear pairing, which reduces the need for the time-consuming HashToPoint function and in which each trapdoor can only be used to perform the equality test on a particular keyword. We then prove the security of our scheme for one-way chosen-ciphertext security against a chosen identity (OW-ID-CCA) attack in the random oracle model (ROM). The performance evaluation of our scheme demonstrates that in comparison to the scheme of Ma (2016), our scheme achieves a reduction of 36.7 and 39.24 percent in computation costs during the encryption phase and test phase, respectively.
With the rapid development of service-oriented computing, a large number of software applications have been developed based on the services computing framework. It is well known that software engineering is a knowledge-intensive activity, and thus the effective management of service-related knowledge facilitates service-oriented software development. Although many methodologies have been proposed for service-oriented knowledge management, little attention has been paid to mining knowledge (especially domain-specific functionalities) from service resources. To address this issue, we propose an approach to mine domain knowledge on service goals (i.e., service functionalities) from textual descriptions of services. The approach consists of two components: service goal extraction from textual service descriptions based on linguistic analysis and domain service goal construction that merges semantically similar service goals within a domain. The effectiveness of the proposed approach is validated by a series of experiments conducted on a real-world dataset crawled from the ProgrammableWeb.