LinkedIn (United States)
Company · Sunnyvale, California, United States
Research output, citation impact, and the most-cited recent papers from LinkedIn (United States). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from LinkedIn (United States)
Deep learning has emerged as a powerful machine learning technique that learns multiple layers of representations or features of the data and produces state‐of‐the‐art prediction results. Along with its success in many application domains, deep learning has also been applied to sentiment analysis in recent years. This paper gives an overview of deep learning and then provides a comprehensive survey of its current applications in sentiment analysis. This article is categorized under: Fundamental Concepts of Data and Knowledge > Data Concepts; Algorithmic Development > Text Mining
In machine learning, a tradeoff must often be made between accuracy and intelligibility. More accurate models such as boosted trees, random forests, and neural nets usually are not intelligible, but more intelligible models such as logistic regression, naive-Bayes, and single decision trees often have significantly worse accuracy. This tradeoff sometimes limits the accuracy of models that can be applied in mission-critical applications such as healthcare where being able to understand, validate, edit, and trust a learned model is important. We present two case studies where high-performance generalized additive models with pairwise interactions (GA2Ms) are applied to real healthcare problems yielding intelligible models with state-of-the-art accuracy. In the pneumonia risk prediction case study, the intelligible model uncovers surprising patterns in the data that previously had prevented complex learned models from being fielded in this domain; because the model is intelligible and modular, these patterns can be recognized and removed. In the 30-day hospital readmission case study, we show that the same methods scale to large datasets containing hundreds of thousands of patients and thousands of attributes while remaining intelligible and providing accuracy comparable to the best (unintelligible) machine learning methods.
Recommender systems traditionally assume that user profiles and movie attributes are static. Temporal dynamics are purely reactive, that is, they are inferred after they are observed, e.g. after a user's taste has changed or based on hand-engineered temporal bias corrections for movies. We propose Recurrent Recommender Networks (RRN) that are able to predict future behavioral trajectories. This is achieved by endowing both users and movies with a Long Short-Term Memory (LSTM) autoregressive model that captures dynamics, in addition to a more traditional low-rank factorization. On multiple real-world datasets, our model offers excellent prediction accuracy and it is very compact, since we need not learn latent state but rather just the state transition function.
Previous work analyzing social networks has mainly focused on binary friendship relations. However, in online social networks the low cost of link formation can lead to networks with heterogeneous relationship strengths (e.g., acquaintances and best friends mixed together). In this case, the binary friendship indicator provides only a coarse representation of relationship information. In this work, we develop an unsupervised model to estimate relationship strength from interaction activity (e.g., communication, tagging) and user similarity. More specifically, we formulate a link-based latent variable model, along with a coordinate ascent optimization procedure for the inference. We evaluate our approach on real-world data from Facebook, showing that the estimated link weights result in higher autocorrelation and lead to improved classification accuracy.
Thanks to information explosion, data for the objects of interest can be collected from increasingly more sources. However, for the same object, there usually exist conflicts among the collected multi-source information. To tackle this challenge, truth discovery, which integrates multi-source noisy information by estimating the reliability of each source, has emerged as a hot topic. Several truth discovery methods have been proposed for various scenarios, and they have been successfully applied in diverse application domains. In this survey, we focus on providing a comprehensive overview of truth discovery methods, and summarizing them from different aspects. We also discuss some future directions of truth discovery research. We hope that this survey will promote a better understanding of the current progress on truth discovery, and offer some guidelines on how to apply these approaches in application domains.
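The core idea in the abstract above, estimating truths and source reliabilities together, can be illustrated with a minimal sketch. This is not any specific method from the survey: it assumes numeric claims, at least two sources, a weighted-mean truth estimate, and a log-ratio reliability update, and the function name `truth_discovery` is hypothetical.

```python
import math


def truth_discovery(claims, iterations=10):
    """Jointly estimate truths and source reliabilities (illustrative sketch).

    claims: dict mapping source -> {object: numeric claimed value}.
    Assumes at least two sources. Returns (truths, weights).
    """
    sources = list(claims)
    objects = sorted({obj for s in sources for obj in claims[s]})
    weights = {s: 1.0 for s in sources}  # start with equal trust in every source
    truths = {}
    for _ in range(iterations):
        # Step 1: each truth is the reliability-weighted mean of the claims about it.
        for obj in objects:
            voters = [s for s in sources if obj in claims[s]]
            total_weight = sum(weights[s] for s in voters)
            truths[obj] = sum(weights[s] * claims[s][obj] for s in voters) / total_weight
        # Step 2: a source whose claims sit far from the current truths loses weight.
        errors = {
            s: sum((value - truths[obj]) ** 2 for obj, value in claims[s].items()) + 1e-9
            for s in sources
        }
        total_error = sum(errors.values())
        # Assumed update rule: negative log of the source's share of the total error.
        weights = {s: -math.log(errors[s] / total_error) + 1e-9 for s in sources}
    return truths, weights
```

With two sources that roughly agree and one outlier, the outlier's weight shrinks across iterations and the estimates move toward the agreeing sources rather than toward the plain average.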
Groups of firms often achieve a competitive advantage through the formation of geo-industrial clusters. Although many exemplary clusters are the subjects of case studies, systematic approaches to identify and analyze the hierarchical structure of geo-industrial clusters at the global scale are scarce. In this work, we use LinkedIn’s employment history data from more than 500 million users over 25 years to construct a labor flow network of over 4 million firms across the world, from which we reveal hierarchical structure by applying network community detection. We show that the resulting geo-industrial clusters exhibit a stronger association between the influx of educated workers and financial performance, compared to traditional aggregation units. Furthermore, our analysis of the skills of educated workers reveals richer insights into the relationship between the labor flow of educated workers and productivity growth. We argue that geo-industrial clusters defined by labor flow provide useful insights into the growth of the economy.
Traffic forecasting is a challenging problem due to the complexity of jointly modeling spatio‐temporal dependencies at different scales. Recently, several hybrid deep learning models have been developed to capture such dependencies. These approaches typically utilize convolutional neural networks or graph neural networks (GNNs) to model spatial dependency and leverage recurrent neural networks (RNNs) to learn temporal dependency. However, RNNs are only able to capture sequential information in the time series, while being incapable of modeling their periodicity (e.g., weekly patterns). Moreover, RNNs are difficult to parallelize, making training and prediction less efficient. In this work we propose a novel deep learning architecture called Traffic Transformer to capture the continuity and periodicity of time series and to model spatial dependency. Our work takes inspiration from Google’s Transformer framework for machine translation. We conduct extensive experiments on two real‐world traffic data sets, and the results demonstrate that our model outperforms baseline models by a substantial margin.
The contiguous United States (CONUS), especially the West, faces challenges of increasing water stress and uncertain impacts of climate change. The historical information of surface water body distribution, variation, and multidecadal trends documented in remote-sensing images can aid in water-resource planning and management, yet is not well explored. Here, we detected open-surface water bodies in all Landsat 5, 7, and 8 images (∼370,000 images, >200 TB) of the CONUS and generated 30-meter annual water body frequency maps for 1984-2016. We analyzed the interannual variations and trends of year-long water body area, examined the impacts of climatic and anthropogenic drivers on water body area dynamics, and explored the relationships between water body area and land water storage (LWS). Generally, the western half of the United States is prone to water stress, with small water body area and large interannual variability. During 1984-2016, water-poor regions of the Southwest and Northwest had decreasing trends in water body area, while water-rich regions of the Southeast and far north Great Plains had increasing trends. These divergent trends, mainly driven by climate, enlarged water-resource gaps and are likely to continue according to climate projections. Water body area change is a good indicator of LWS dynamics in 58% of the CONUS. Following the 2012 prolonged drought, LWS in California and the southern Great Plains had a larger decrease than surface water body area, likely caused by massive groundwater withdrawals. Our findings provide valuable information for surface water-resource planning and management across the CONUS.
We present a framework for quantifying and mitigating algorithmic bias in mechanisms designed for ranking individuals, typically used as part of web-scale search and recommendation systems. We first propose complementary measures to quantify bias with respect to protected attributes such as gender and age. We then present algorithms for computing fairness-aware re-ranking of results. For a given search or recommendation task, our algorithms seek to achieve a desired distribution of top ranked results with respect to one or more protected attributes. We show that such a framework can be tailored to achieve fairness criteria such as equality of opportunity and demographic parity depending on the choice of the desired distribution. We evaluate the proposed algorithms via extensive simulations over different parameter choices, and study the effect of fairness-aware ranking on both bias and utility measures. We finally present the online A/B testing results from applying our framework towards representative ranking in LinkedIn Talent Search, and discuss the lessons learned in practice. Our approach resulted in tremendous improvement in the fairness metrics (a nearly threefold increase in the number of search queries with representative results) without affecting the business metrics, which paved the way for deployment to 100% of LinkedIn Recruiter users worldwide. Ours is the first large-scale deployed framework for ensuring fairness in the hiring domain, with potential positive impact for more than 630M LinkedIn members.
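As a rough illustration of re-ranking toward a desired distribution over a protected attribute, the sketch below uses a simple greedy rule. It is a simplified assumption and not necessarily one of the paper's algorithms: at each position, any group that has fallen below its minimum count `floor(p * position)` gets priority, and otherwise the best-scored remaining candidate is chosen. The function name `rerank` and the tuple layout are hypothetical.

```python
import math
from collections import defaultdict, deque


def rerank(candidates, desired, k):
    """Greedy fairness-aware re-ranking (illustrative sketch).

    candidates: list of (candidate_id, group, score) tuples.
    desired: dict mapping group -> target proportion in the top-k.
    Returns the re-ranked top-k list of candidate tuples.
    """
    # One queue per group, each ordered by descending score.
    queues = defaultdict(deque)
    for cand in sorted(candidates, key=lambda c: -c[2]):
        queues[cand[1]].append(cand)

    counts = defaultdict(int)
    ranked = []
    for position in range(1, k + 1):
        available = [g for g in queues if queues[g]]
        if not available:
            break
        # Groups currently under their minimum representation at this position.
        below_min = [
            g for g in available
            if counts[g] < math.floor(desired.get(g, 0.0) * position)
        ]
        pool = below_min or available
        # Within the eligible pool, take the group whose best candidate scores highest.
        best_group = max(pool, key=lambda g: queues[g][0][2])
        ranked.append(queues[best_group].popleft())
        counts[best_group] += 1
    return ranked
```

For example, with four high-scoring candidates in one group, two lower-scoring candidates in another, and a 50/50 target, the top four positions alternate between the groups instead of being filled by the first group alone.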
Blockchain is a shared distributed digital ledger technology that can better facilitate data management, provenance and security, and has the potential to transform healthcare. Importantly, blockchain represents a data architecture, whose application goes far beyond Bitcoin - the cryptocurrency that relies on blockchain and has popularized the technology. In the health sector, blockchain is being aggressively explored by various stakeholders to optimize business processes, lower costs, improve patient outcomes, enhance compliance, and enable better use of healthcare-related data. However, critical in assessing whether blockchain can fulfill the hype of a technology characterized as 'revolutionary' and 'disruptive', is the need to ensure that blockchain design elements consider actual healthcare needs from the diverse perspectives of consumers, patients, providers, and regulators. In addition to answering the real needs of healthcare stakeholders, blockchain approaches must also be responsive to the unique challenges faced in healthcare compared to other sectors of the economy. In this sense, ensuring that a health blockchain is 'fit-for-purpose' is pivotal. This concept forms the basis for this article, where we share views from a multidisciplinary group of practitioners at the forefront of blockchain conceptualization, development, and deployment.
Clickthrough and conversion rate estimation are two core prediction tasks in display advertising. We present in this article a machine learning framework based on logistic regression that is specifically designed to tackle the specifics of display advertising. The resulting system has the following characteristics: It is easy to implement and deploy, it is highly scalable (we have trained it on terabytes of data), and it provides models with state-of-the-art accuracy.
This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms. Demand for scaling up machine learning is task-specific: for some tasks it is driven by the enormous dataset sizes, for others by model complexity or by the requirement for real-time prediction. Selecting a task-appropriate parallelization platform and algorithm requires understanding their benefits, trade-offs and constraints. This tutorial focuses on providing an integrated overview of state-of-the-art platforms and algorithm choices. These span a range of hardware options (from FPGAs and GPUs to multi-core systems and commodity clusters), programming frameworks (including CUDA, MPI, MapReduce, and DryadLINQ), and learning settings (e.g., semi-supervised and online learning). The tutorial is example-driven, covering a number of popular algorithms (e.g., boosted trees, spectral clustering, belief propagation) and diverse applications (e.g., recommender systems and object recognition in vision).
On behalf of the organizing committee, it is our great pleasure to welcome you to the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013) in San Francisco! CIKM is a premier ACM conference in the areas of information retrieval, knowledge management and databases. Since 1992, it has successfully brought together leading researchers and developers from the three communities. The purpose of the conference is to identify challenging problems facing the development of future knowledge and information systems, and to shape future research directions through the publication of high quality applied and theoretical research findings. In CIKM 2013, we continue the tradition of promoting collaboration among multiple areas and providing a leading forum in which experts from academia, industry, and government gather to exchange ideas, research results, and technical developments in multidisciplinary research areas. As one of the world's most recognized conferences in the field, this year CIKM received 848 valid full paper submissions, 233 poster submissions, and 57 demonstration submissions. Among them, we accepted 143 full papers (16.86% acceptance rate), 107 short papers, 81 posters and 21 demos. In addition to regular research tracks, CIKM 2013 features 4 keynote speakers, a panel on Big Data, dedicated Industry events featuring 10 leading industrial practitioners, 10 tutorials from prestigious researchers and 14 workshops on cutting-edge areas of research. This is a great demonstration of the lively research areas that contribute to the CIKM area. We are proud of our final program and gratefully thank all authors, invited speakers and organizers who chose to contribute their time and research to CIKM 2013. We are honored to present four distinguished keynote speakers to attendees: Ronald Fagin, Lee Giles, Carlos Guestrin, and Alon Halevy.
Their valuable, insightful and interdisciplinary talks will guide us to a better understanding of the field.
There is increasing attention on next-item recommendation systems to infer the dynamic user preferences with sequential user interactions. While the semantics of an item can change over time and across users, the item correlations defined by user interactions in the short term can be distilled to capture such change, and help in uncovering the dynamic user preferences. Thus, we are motivated to develop a novel next-item recommendation framework empowered by sequential hypergraphs. Specifically, the framework: (i) adopts a hypergraph to represent the short-term item correlations and applies multiple convolutional layers to capture multi-order connections in the hypergraph; (ii) models the connections between different time periods with a residual gating layer; and (iii) is equipped with a fusion layer to incorporate both the dynamic item embedding and short-term user intent to the representation of each interaction before feeding it into the self-attention layer for dynamic user modeling. Through experiments on datasets from the ecommerce sites Amazon and Etsy and the information sharing platform Goodreads, the proposed model can significantly outperform the state-of-the-art in predicting the next interesting item for each user.
We present a large-scale study of gender bias in occupation classification, a task where the use of machine learning may lead to negative outcomes on people's lives. We analyze the potential allocation harms that can result from semantic representation bias. To do so, we study the impact on occupation classification of including explicit gender indicators, such as first names and pronouns, in different semantic representations of online biographies. Additionally, we quantify the bias that remains when these indicators are "scrubbed," and describe proxy behavior that occurs in the absence of explicit gender indicators. As we demonstrate, differences in true positive rates between genders are correlated with existing gender imbalances in occupations, which may compound these imbalances.
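The true-positive-rate comparison mentioned in the abstract above can be made concrete with a small sketch. The record layout and the function name `tpr_by_gender` are assumptions made for illustration, not the paper's code or data format.

```python
from collections import defaultdict


def tpr_by_gender(records, occupation):
    """Per-gender true positive rate for one occupation (illustrative sketch).

    records: iterable of (true_occupation, predicted_occupation, gender).
    Returns a dict mapping gender -> TPR among people truly in `occupation`.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_occ, predicted_occ, gender in records:
        if true_occ != occupation:
            continue  # TPR only looks at people who truly hold the occupation
        totals[gender] += 1
        if predicted_occ == occupation:
            hits[gender] += 1
    return {gender: hits[gender] / totals[gender] for gender in totals}


def tpr_gap(records, occupation, group_a, group_b):
    """TPR of group_a minus TPR of group_b for the given occupation."""
    rates = tpr_by_gender(records, occupation)
    return rates[group_a] - rates[group_b]
```

A negative gap for an occupation means the classifier recovers true members of `group_a` less often than those of `group_b`; the abstract reports that such gaps correlate with existing gender imbalances in the occupations.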
Distributed stream processing systems need to support stateful processing, recover quickly from failures to resume such processing, and reprocess an entire data stream quickly. We present Apache Samza, a distributed system for stateful and fault-tolerant stream processing. Samza utilizes a partitioned local state along with a low-overhead background changelog mechanism, allowing it to scale to massive state sizes (hundreds of TB) per application. Recovery from failures is sped up by re-scheduling based on Host Affinity. In addition to processing infinite streams of events, Samza supports processing a finite dataset as a stream, from either a streaming source (e.g., Kafka), a database snapshot (e.g., Databus), or a file system (e.g., HDFS), without having to change the application code (unlike the popular Lambda-based architectures which necessitate maintenance of separate code bases for batch and stream path processing). Samza is currently in use at LinkedIn by hundreds of production applications with more than 10,000 containers. Samza is an open-source Apache project adopted by many top-tier companies (e.g., LinkedIn, Uber, Netflix, TripAdvisor, etc.). Our experiments show that Samza: a) handles state efficiently, improving latency and throughput by more than 100X compared to using a remote storage; b) provides recovery time independent of state size; c) scales performance linearly with number of containers; and d) supports reprocessing of the data stream quickly and with minimal interference on real-time traffic.
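The combination of local state and a changelog described above can be sketched in a few lines. This is a toy model and not Samza's API: the class `ChangelogBackedStore` is hypothetical, the changelog is an in-memory list written synchronously, and a real deployment would write it in the background to a durable log.

```python
class ChangelogBackedStore:
    """Toy key-value state store that records every write in a changelog."""

    def __init__(self, changelog=None):
        # The changelog stands in for a durable, replayable log (an assumption).
        self.changelog = changelog if changelog is not None else []
        self.state = {}

    def put(self, key, value):
        """Update local state and append the change to the changelog."""
        self.state[key] = value
        self.changelog.append((key, value))

    def get(self, key, default=None):
        """Read from local state only; no remote lookup is involved."""
        return self.state.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state after a failure by replaying the changelog."""
        store = cls(list(changelog))
        for key, value in store.changelog:
            store.state[key] = value  # later entries overwrite earlier ones
        return store


def count_words(store, words):
    """Example stateful task: maintain a running count per word."""
    for word in words:
        store.put(word, store.get(word, 0) + 1)
```

After `count_words` has processed a stream, `ChangelogBackedStore.restore(store.changelog)` reproduces the same counts on a fresh instance, which is the recovery path the sketch is meant to illustrate.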
Getting numbers is easy; getting numbers you can trust is hard. This practical guide by experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to accelerate innovation using trustworthy online controlled experiments, or A/B tests. Based on practical experiences at companies that each run more than 20,000 controlled experiments a year, the authors share examples, pitfalls, and advice for students and industry professionals getting started with experiments, plus deeper dives into advanced topics for practitioners who want to improve the way they make data-driven decisions. Learn how to • Use the scientific method to evaluate hypotheses using controlled experiments • Define key metrics and ideally an Overall Evaluation Criterion • Test for trustworthiness of the results and alert experimenters to violated assumptions • Build a scalable platform that lowers the marginal cost of experiments close to zero • Avoid pitfalls like carryover effects and Twyman's law • Understand how statistical issues play out in practice.
Data and knowledge of the spatial-temporal dynamics of surface water area (SWA) and terrestrial water storage (TWS) in China are critical for sustainable management of water resources but remain very limited. Here we report annual maps of surface water bodies in China during 1989-2016 at 30 m spatial resolution. We find that SWA decreases in water-poor northern China but increases in water-rich southern China during 1989-2016. Our results also reveal the spatial-temporal divergence and consistency between TWS and SWA during 2002-2016. In North China, extensive and continued losses of TWS, together with small to moderate changes of SWA, indicate long-term water stress in the region. Approximately 569 million people live in those areas with decreasing SWA or TWS trends in 2015. Our data set and the findings from this study could be used to support the government and the public to address increasing challenges of water resources and security in China.
Web site owners, from small web sites to the largest properties that include Amazon, Facebook, Google, LinkedIn, Microsoft, and Yahoo, attempt to improve their web sites, optimizing for criteria ranging from repeat usage, time on site, to revenue. Having been involved in running thousands of controlled experiments at Amazon, Booking.com, LinkedIn, and multiple Microsoft properties, we share seven rules of thumb for experimenters, which we have generalized from these experiments and their results. These are principles that we believe have broad applicability in web optimization and analytics outside of controlled experiments, yet they are not provably correct, and in some cases exceptions are known.
A/B testing, also known as bucket testing, split testing, or controlled experiment, is a standard way to evaluate user engagement or satisfaction from a new service, feature, or product. It is widely used among online websites, including social network sites such as Facebook, LinkedIn, and Twitter to make data-driven decisions. At LinkedIn, we have seen tremendous growth of controlled experiments over time, with now over 400 concurrent experiments running per day. General A/B testing frameworks and methodologies, including challenges and pitfalls, have been discussed extensively in several previous KDD papers [7, 8, 9, 10]. In this paper, we describe in depth the experimentation platform we have built at LinkedIn and the challenges that arise particularly when running A/B tests at large scale in a social network setting. We start with an introduction of the experimentation platform and how it is built to handle each step of the A/B testing process at LinkedIn, from designing and deploying experiments to analyzing them. It is then followed by discussions on several more sophisticated A/B testing scenarios, such as running offline experiments and addressing the network effect, where one user's action can influence that of another. Lastly, we talk about features and processes that are crucial for building a strong experimentation culture.
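One common building block of bucket testing is deterministic assignment of users to variants. The sketch below shows a generic hash-based approach; it is an assumption for illustration and not a description of LinkedIn's platform. The function name `assign_variant` and its parameters are hypothetical.

```python
import hashlib


def assign_variant(experiment, member_id, treatment_pct=50):
    """Deterministically assign a member to 'treatment' or 'control'.

    Hashing the experiment name together with the member id gives each
    experiment its own stable, roughly uniform split into 100 buckets.
    """
    key = f"{experiment}:{member_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # bucket in 0..99
    return "treatment" if bucket < treatment_pct else "control"
```

Because the assignment depends only on the experiment name and the member id, a member sees the same variant on every visit, and different experiments split the population independently. This simple scheme does not address the network effect discussed in the abstract, where one user's treatment can influence another user's outcome.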