IBM Research - India
Facility: New Delhi, India
Research output, citation impact, and the most-cited recent papers from IBM Research - India (India). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from IBM Research - India
The UN COP26 2021 conference on climate change offers the chance for world leaders to take action and make urgent and meaningful commitments to reduce emissions and limit the rise in global temperatures to 1.5 °C above pre-industrial levels by 2050. Whilst the political aspects and subsequent ramifications of these fundamental and critical decisions cannot be underestimated, there exists a technical perspective where digital and IS technology has a role to play, not only in the monitoring of potential solutions but also as an integral element of climate change solutions. We explore these aspects in this editorial article, offering comprehensive opinion-based insight into a multitude of diverse viewpoints that look at the many challenges through a technology lens. It is widely recognized that technology, in all its forms, is an important and integral element of the solution, but industry and wider society also view technology as being part of the problem. Increasingly, researchers are referencing the importance of responsible digitalization to eliminate the significant levels of e-waste. The reality is that technology is an integral component of the global efforts to get to net zero; however, its adoption requires pragmatic tradeoffs as we transition from current behaviors to a more climate-friendly society.
In today's world, online social media plays a vital role during real-world events, especially crisis events. Social media coverage of events has both positive and negative effects: it can be used by authorities for effective disaster management or by malicious entities to spread rumors and fake news. The aim of this paper is to highlight the role of Twitter in spreading fake images about Hurricane Sandy (2012). We identified 10,350 unique tweets containing fake images that were circulated on Twitter during Hurricane Sandy. We performed a characterization analysis to understand the temporal, social reputation, and influence patterns for the spread of fake images. Eighty-six percent of tweets spreading the fake images were retweets; hence, very few were original tweets. Our results showed that the top thirty users out of 10,215 (0.3%) accounted for 90% of the retweets of fake images; also, network links such as Twitter follower relationships contributed very little (only 11%) to the spread of these fake photo URLs. Next, we used classification models to distinguish fake images from real images of Hurricane Sandy. The best results were obtained with a Decision Tree classifier, which achieved 97% accuracy in distinguishing fake images from real ones. Also, tweet-based features were very effective in distinguishing tweets with fake images from those with real ones, while the performance of user-based features was very poor. Our results showed that automated techniques can be used to identify real images from fake images posted on Twitter.
In optimization studies including multi-objective optimization, the main focus is placed on finding the global optimum or global Pareto-optimal solutions, representing the best possible objective values. However, in practice, users may not always be interested in finding the so-called global best solutions, particularly when these solutions are quite sensitive to the variable perturbations which cannot be avoided in practice. In such cases, practitioners are interested in finding the robust solutions which are less sensitive to small perturbations in variables. Although robust optimization is dealt with in detail in single-objective evolutionary optimization studies, in this paper, we present two different robust multi-objective optimization procedures, where the emphasis is to find a robust frontier, instead of the global Pareto-optimal frontier in a problem. The first procedure is a straightforward extension of a technique used for single-objective optimization and the second procedure is a more practical approach enabling a user to set the extent of robustness desired in a problem. To demonstrate the differences between global and robust multi-objective optimization principles and the differences between the two robust optimization procedures suggested here, we develop a number of constrained and unconstrained test problems having two and three objectives and show simulation results using an evolutionary multi-objective optimization (EMO) algorithm. Finally, we also apply both robust optimization methodologies to an engineering design problem.
This paper presents new algorithms, fuzzy c-medoids (FCMdd) and robust fuzzy c-medoids (RFCMdd), for fuzzy clustering of relational data. The objective functions are based on selecting c representative objects (medoids) from the data set in such a way that the total fuzzy dissimilarity within each cluster is minimized. A comparison of FCMdd with the well-known relational fuzzy c-means algorithm (RFCM) shows that FCMdd is more efficient. We present several applications of these algorithms to Web mining, including Web document clustering, snippet clustering, and Web access log analysis.
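As an illustration of the medoid-selection idea in the abstract above, here is a minimal Python sketch of a fuzzy c-medoids style iteration over a dissimilarity matrix. It is not the authors' implementation: the function names, the fuzzy c-means style membership formula, the fuzzifier default m = 2, and the caller-supplied initial medoids are all assumptions made for this example.

```python
def memberships(row, medoids, m=2.0):
    """Fuzzy memberships of one object, given its row of the dissimilarity matrix.

    Assumption: the usual fuzzy c-means style formula, where membership in a
    cluster is proportional to (1 / dissimilarity) ** (1 / (m - 1)).
    """
    dists = [row[v] for v in medoids]
    if 0 in dists:
        # The object coincides with a medoid: give it full membership there.
        hit = dists.index(0)
        return [1.0 if i == hit else 0.0 for i in range(len(medoids))]
    weights = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def fuzzy_c_medoids(dissim, init_medoids, m=2.0, max_iter=100):
    """Alternate membership and medoid updates until the medoids stop changing.

    dissim is a symmetric n x n list of lists; init_medoids holds object indices.
    Returns (medoid indices, membership rows).
    """
    n = len(dissim)
    medoids = list(init_medoids)
    for _ in range(max_iter):
        u = [memberships(dissim[j], medoids, m) for j in range(n)]
        # For each cluster, pick the object that minimizes the
        # membership-weighted dissimilarity to all other objects.
        updated = [
            min(range(n),
                key=lambda k, i=i: sum(u[j][i] ** m * dissim[k][j] for j in range(n)))
            for i in range(len(medoids))
        ]
        if updated == medoids:
            break
        medoids = updated
    u = [memberships(dissim[j], medoids, m) for j in range(n)]
    return medoids, u


# Toy relational data: pairwise distances between points on a line.
points = [0, 1, 2, 10, 11, 12]
dissim = [[abs(a - b) for b in points] for a in points]
medoids, u = fuzzy_c_medoids(dissim, init_medoids=[0, 5])
```

On this toy input the medoids settle on the middle object of each group. Like other alternating schemes of this kind, the result depends on the starting medoids, so a poor initialization can end in a weaker local minimum.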
In this paper we present the results of a field study of Avaaj Otalo (literally, "voice stoop"), an interactive voice application for small-scale farmers in Gujarat, India. Through usage data and interviews, we describe how 51 farmers used the system over a seven month pilot deployment. The most popular feature of Avaaj Otalo was a forum for asking questions and browsing others' questions and responses on a range of agricultural topics. The forum developed into a lively social space with the emergence of norms, persistent moderation, and a desire for both structured interaction with institutionally sanctioned authorities and open discussion with peers. For all 51 users this was the first experience participating in an online community of any sort. In terms of usability, simple menu-based navigation was readily learned, with users preferring numeric input over speech. We conclude by discussing implications of our findings for designing voice-based social media serving rural communities in India and elsewhere.
Server consolidation has emerged as a promising technique to reduce the energy costs of a data center. In this work, we present the first detailed analysis of an enterprise server workload from the perspective of finding characteristics for consolidation. We observe significant potential for power savings if consolidation is performed using off-peak values for application demand. However, these savings come with associated risks due to consolidation, particularly when the correlation between applications is not considered. We also investigate the stability in utilization trends for low-risk consolidation. Using the insights from the workload analysis, two new consolidation methods are designed that achieve significant power savings, while containing the performance risk of consolidation. We present an implementation of the methodologies in a consolidation planning tool and provide a comprehensive evaluation study of the proposed methodologies.
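To make the trade-off in the abstract above concrete, the following Python sketch compares sizing an application by its peak demand with sizing it by an off-peak percentile, and shows why correlation between co-located applications matters. It does not reproduce the paper's consolidation methods: the helper names, the nearest-rank 90th percentile, and the toy demand traces are all assumptions chosen for illustration.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a demand trace, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def pearson(xs, ys):
    """Pearson correlation of two equal-length traces (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def peak_of_sum(trace_a, trace_b):
    """Highest combined demand if both applications share one server."""
    return max(a + b for a, b in zip(trace_a, trace_b))


# Hypothetical CPU demand traces (arbitrary units), one value per time step.
app_a = [10, 10, 10, 10, 10, 10, 10, 10, 10, 50]
app_b_same_peak = list(app_a)                               # peaks together with app_a
app_b_other_peak = [50, 10, 10, 10, 10, 10, 10, 10, 10, 10]  # peaks at a different time

# Sizing by the 90th percentile reserves far less than sizing by the peak.
off_peak_size = percentile(app_a, 90)
peak_size = max(app_a)

# Strongly correlated applications hit their peaks together, so the
# combined demand reaches the sum of the individual peaks.
risk_correlated = peak_of_sum(app_a, app_b_same_peak)
risk_uncorrelated = peak_of_sum(app_a, app_b_other_peak)
```

In this toy setup the off-peak reservation is much smaller than the peak reservation, and the pair with correlated peaks reaches a higher combined demand than the pair whose peaks fall at different times, which is the kind of risk the abstract associates with ignoring correlation.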
Text messaging-based conversational agents (CAs), popularly called chatbots, received significant attention in the last two years. However, chatbots are still in their nascent stage: They have a low penetration rate as 84% of the Internet users have not used a chatbot yet. Hence, understanding the usage patterns of first-time users can potentially inform and guide the design of future chatbots. In this paper, we report the findings of a study with 16 first-time chatbot users interacting with eight chatbots over multiple sessions on the Facebook Messenger platform. Analysis of chat logs and user interviews revealed that users preferred chatbots that provided either a 'human-like' natural language conversation ability, or an engaging experience that exploited the benefits of the familiar turn-based messaging interface. We conclude with implications to evolve the design of chatbots, such as: clarify chatbot capabilities, sustain conversation context, handle dialog failures, and end conversations gracefully.
l1-minimization refers to finding the minimum l1-norm solution to an underdetermined linear system. Under certain conditions as described in compressive sensing theory, the minimum l1-norm solution is also the sparsest solution. In this paper, we study the speed and scalability of its algorithms. In particular, we focus on the numerical implementation of a sparsity-based classification framework in robust face recognition, where sparse representation is sought to recover human identities from high-dimensional facial images that may be corrupted by illumination, facial disguise, and pose variation. Although the underlying numerical problem is a linear program, traditional algorithms are known to suffer poor scalability for large-scale applications. We investigate a new solution based on a classical convex optimization framework, known as augmented Lagrangian methods. We conduct extensive experiments to validate and compare its performance against several popular l1-minimization solvers, including interior-point method, Homotopy, FISTA, SESOP-PCD, approximate message passing, and TFOCS. To aid peer evaluation, the code for all the algorithms has been made publicly available.
Most software engineering tasks require developers to understand parts of the source code. When faced with unfamiliar code, developers often rely on (internal or external) documentation to gain an overall understanding of the code and determine whether it is relevant for the current task. Unfortunately, the documentation is often absent or outdated. This paper presents a technique to automatically generate human readable summaries for Java classes, assuming no documentation exists. The summaries allow developers to understand the main goal and structure of the class. The focus of the summaries is on the content and responsibilities of the classes, rather than their relationships with other classes. The summarization tool determines the class and method stereotypes and uses them, in conjunction with heuristics, to select the information to be included in the summaries. Then it generates the summaries using existing lexicalization tools. A group of programmers judged a set of generated summaries for Java classes and determined that they are readable and understandable, they do not include extraneous information, and, in most cases, they are not missing essential information.
This paper describes the design and implementation of a network virtualization substrate (NVS) for effective virtualization of wireless resources in cellular networks. Virtualization fosters the realization of several interesting deployment scenarios such as customized virtual networks, virtual services, and wide-area corporate networks, with diverse performance objectives. In virtualizing a base station's uplink and downlink resources into slices, NVS meets three key requirements (isolation, customization, and efficient resource utilization) using two novel features: 1) NVS introduces a provably optimal slice scheduler that allows existence of slices with bandwidth-based and resource-based reservations simultaneously; and 2) NVS includes a generic framework for efficiently enabling customized flow scheduling within the base station on a per-slice basis. Through a prototype implementation and detailed evaluation on a WiMAX testbed, we demonstrate the efficacy of NVS. For instance, we show for both downlink and uplink directions that NVS can run different flow schedulers in different slices, run different slices simultaneously with different types of reservations, and perform slice-specific application optimizations for providing customized services.
Social Network Analysis has emerged as a key paradigm in modern sociology, technology, and information sciences. The paradigm stems from the view that the attributes of an individual in a network are less important than their ties (relationships) with other individuals in the network. Exploring the nature and strength of these ties can help understand the structure and dynamics of social networks and explain real-world phenomena, ranging from organizational efficiency to the spread of information and disease.
Unite neuroscience, supercomputing, and nanotechnology to discover, demonstrate, and deliver the brain's core algorithms.
Current day database applications, with large numbers of users, require fine-grained access control mechanisms, at the level of individual tuples, not just entire relations/views, to control which parts of the data can be accessed by each user. Fine-grained access control is often enforced in the application code, which has numerous drawbacks; these can be avoided by specifying/enforcing access control at the database level. We present a novel fine-grained access control model based on authorization views that allows "authorization-transparent" querying; that is, user queries can be phrased in terms of the database relations, and are valid if they can be answered using only the information contained in these authorization views. We extend earlier work on authorization-transparent querying by introducing a new notion of validity, conditional validity. We give a powerful set of inference rules to check for query validity. We demonstrate the practicality of our techniques by describing how an existing query optimizer can be extended to perform access control checks by incorporating these inference rules.
Detecting and analyzing dense groups or communities from social and information networks has attracted immense attention over the last decade due to its enormous applicability in different domains. Community detection is an ill-defined problem, as the nature of the communities is not known in advance. The problem has turned even more complicated due to the fact that communities emerge in the network in various forms such as disjoint, overlapping, and hierarchical. Various heuristics have been proposed to address these challenges, depending on the application in hand. All these heuristics have been materialized in the form of new metrics, which in most cases are used as optimization functions for detecting the community structure, or provide an indication of the goodness of detected communities during evaluation. Over the last decade, a large number of such metrics have been proposed. Thus, there arises a need for an organized and detailed survey of the metrics proposed for community detection and evaluation. Here, we present a survey of the state-of-the-art metrics used for the detection and the evaluation of community structure. We also conduct experiments on synthetic and real networks to present a comparative analysis of these metrics in measuring the goodness of the underlying community structure.
Understanding the network structure of white matter communication pathways is essential for unraveling the mysteries of the brain's function, organization, and evolution. To this end, we derive a unique network incorporating 410 anatomical tracing studies of the macaque brain from the Collation of Connectivity data on the Macaque brain (CoCoMac) neuroinformatic database. Our network consists of 383 hierarchically organized regions spanning cortex, thalamus, and basal ganglia; models the presence of 6,602 directed long-distance connections; is three times larger than any previously derived brain network; and contains subnetworks corresponding to classic corticocortical, corticosubcortical, and subcortico-subcortical fiber systems. We found that the empirical degree distribution of the network is consistent with the hypothesis of the maximum entropy exponential distribution and discovered two remarkable bridges between the brain's structure and function via network-theoretical analysis. First, prefrontal cortex contains a disproportionate share of topologically central regions. Second, there exists a tightly integrated core circuit, spanning parts of premotor cortex, prefrontal cortex, temporal lobe, parietal lobe, thalamus, basal ganglia, cingulate cortex, insula, and visual cortex, that includes much of the task-positive and task-negative networks and might play a special role in higher cognition and consciousness.
It is well understood from literature that the performance of a machine learning (ML) model is upper bounded by the quality of the data. While researchers and practitioners have focused on improving the quality of models (such as neural architecture search and automated feature selection), there are limited efforts towards improving the data quality. One of the crucial requirements before consuming datasets for any application is to understand the dataset at hand and failure to do so can result in inaccurate analytics and unreliable decisions. Assessing the quality of the data across intelligently designed metrics and developing corresponding transformation operations to address the quality gaps helps to reduce the effort of a data scientist for iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. This tutorial surveys all the important data quality related approaches discussed in literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrates their applicability to real-world problems. Finally we will discuss the interesting work IBM Research is doing in this space.
In this paper, we disprove the following conjecture due to Goemans (1997) and Linial (2002): "Every negative type metric embeds into l1 with constant distortion." We show that for every δ > 0, and for large enough n, there is an n-point negative type metric which requires distortion at least (log log n)^(1/6-δ) to embed into l1. Surprisingly, our construction is inspired by the Unique Games Conjecture (UGC) of Khot (2002), establishing a previously unsuspected connection between PCPs and the theory of metric embeddings. We first prove that the UGC implies super-constant hardness results for (non-uniform) sparsest cut and minimum uncut problems. It is already known that the UGC also implies an optimal hardness result for maximum cut (2004). Though these hardness results depend on the UGC, the integrality gap instances rely "only" on the PCP reductions for the respective problems. Towards this, we first construct an integrality gap instance for a natural SDP relaxation of unique games. Then, we "simulate" the PCP reduction and "translate" the integrality gap instance of unique games to integrality gap instances for the respective cut problems! This enables us to prove a (log log n)^(1/6-δ) integrality gap for (non-uniform) sparsest cut and minimum uncut, and an optimal integrality gap for maximum cut. All our SDP solutions satisfy the so-called "triangle inequality" constraints. This also shows, for the first time, that the triangle inequality constraints do not add any power to the Goemans-Williamson SDP relaxation of maximum cut. The integrality gap for sparsest cut immediately implies a lower bound for embedding negative type metrics into l1. It also disproves the non-uniform version of Arora, Rao and Vazirani's Conjecture (2004), asserting that the integrality gap of the sparsest cut SDP, with the triangle inequality constraints, is bounded from above by a constant.
A new smart meeting room system called EasyMeeting explores the use of multi-agent systems, Semantic Web ontologies, reasoning, and declarative policies for security and privacy. Building on an earlier pervasive computing system, EasyMeeting provides relevant services and information to meeting participants based on their situational needs. The system also exploits the context-aware support provided by the Context Broker Architecture (Cobra). Cobra's intelligent broker agent maintains a shared context model for all computing entities in the space and enforces user-defined privacy policies.
Power consumption on mobile phones is a painful obstacle towards adoption of continuous sensing driven applications, e.g., continuously inferring individual's locomotive activities (such as 'sit', 'stand' or 'walk') using the embedded accelerometer sensor. To reduce the energy overhead of such continuous activity sensing, we first investigate how the choice of accelerometer sampling frequency & classification features affects, separately for each activity, the "energy overhead" vs. "classification accuracy" tradeoff. We find that such tradeoff is activity specific. Based on this finding, we introduce an activity-sensitive strategy (dubbed "A3R" - Adaptive Accelerometer-based Activity Recognition) for continuous activity recognition, where the choice of both the accelerometer sampling frequency and the classification features are adapted in real-time, as an individual performs daily lifestyle-based activities. We evaluate the performance of A3R using longitudinal, multi-day observations of continuous activity traces. We also implement A3R for the Android platform and carry out evaluation of energy savings. We show that our strategy can achieve an energy savings of 50% under ideal conditions. For users running the A3R application on their Android phones, we achieve an overall energy savings of 20-25%.
We present the first linear time (1 + ε)-approximation algorithm for the k-means problem for fixed k and ε. Our algorithm runs in O(nd) time, which is linear in the size of the input. Another feature of our algorithm is its simplicity: the only technique involved is random sampling.