Statistical and Applied Mathematical Sciences Institute

Facility · Durham, United States

Research output, citation impact, and the most-cited recent papers from Statistical and Applied Mathematical Sciences Institute (United States). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works: 790
Citations: 36.7K
h-index: 99
i10-index: 512

Top-cited papers from Statistical and Applied Mathematical Sciences Institute

Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables
Nicholas G. Polson, James G. Scott, Jesse Windle
2013 · Journal of the American Statistical Association · 951 citations · doi:10.1080/01621459.2013.829001

We propose a new data-augmentation strategy for fully Bayesian inference in models with binomial likelihoods. The approach appeals to a new class of Pólya–Gamma distributions, which are constructed in detail. A variety of examples are presented to show the versatility of the method, including logistic regression, negative binomial regression, nonlinear mixed-effect models, and spatial models for count data. In each case, our data-augmentation strategy leads to simple, effective methods for posterior inference that (1) circumvent the need for analytic approximations, numerical integration, or Metropolis–Hastings; and (2) outperform other known data-augmentation strategies, both in ease of use and in computational efficiency. All methods, including an efficient sampler for the Pólya–Gamma distribution, are implemented in the R package BayesLogit. Supplementary materials for this article are available online.

The case for objective Bayesian analysis
James O. Berger
2006 · Bayesian Analysis · 741 citations · doi:10.1214/06-ba115

Bayesian statistical practice makes extensive use of versions of objective Bayesian analysis. We discuss why this is so, and address some of the criticisms that have been raised concerning objective Bayesian analysis. The dangers of treating the issue too casually are also considered. In particular, we suggest that the statistical community should accept formal objective Bayesian techniques with confidence, but should be more cautious about casual objective Bayesian techniques.

Failure to migrate: lack of tree range expansion in response to climate change
Kai Zhu, Christopher W. Woodall, James S. Clark
2011 · Global Change Biology · 721 citations · doi:10.1111/j.1365-2486.2011.02571.x

Tree species are expected to track warming climate by shifting their ranges to higher latitudes or elevations, but current evidence of latitudinal range shifts for suites of species is largely indirect. In response to global warming, offspring of trees are predicted to have ranges extend beyond adults at leading edges and the opposite relationship at trailing edges. Large‐scale forest inventory data provide an opportunity to compare present latitudes of seedlings and adult trees at their range limits. Using the USDA Forest Service's Forest Inventory and Analysis data, we directly compared seedling and tree 5th and 95th percentile latitudes for 92 species in 30 longitudinal bands for 43 334 plots across the eastern United States. We further compared these latitudes with 20th century temperature and precipitation change and functional traits, including seed size and seed spread rate. Results suggest that 58.7% of the tree species examined show the pattern expected for a population undergoing range contraction, rather than expansion, at both northern and southern boundaries. Fewer species show a pattern consistent with a northward shift (20.7%) and fewer still with a southward shift (16.3%). Only 4.3% are consistent with expansion at both range limits. When compared with the 20th century climate changes that have occurred at the range boundaries themselves, there is no consistent evidence that population spread is greatest in areas where climate has changed most; nor are patterns related to seed size or dispersal characteristics. The fact that the majority of seedling extreme latitudes are less than those for adult trees may emphasize the lack of evidence for climate‐mediated migration, and should increase concerns for the risks posed by climate change.

A comprehensive evaluation of predictive performance of 33 species distribution models at species and community levels
Anna Norberg, Nerea Abrego, F. Guillaume Blanchet, Frederick R. Adler +4 more
2019 · Ecological Monographs · 551 citations · doi:10.1002/ecm.1370

A large array of species distribution model (SDM) approaches has been developed for explaining and predicting the occurrences of individual species or species assemblages. Given the wealth of existing models, it is unclear which models perform best for interpolation or extrapolation of existing data sets, particularly when one is concerned with species assemblages. We compared the predictive performance of 33 variants of 15 widely applied and recently emerged SDMs in the context of multispecies data, including both joint SDMs that model multiple species together, and stacked SDMs that model each species individually combining the predictions afterward. We offer a comprehensive evaluation of these SDM approaches by examining their performance in predicting withheld empirical validation data of different sizes representing five different taxonomic groups, and for prediction tasks related to both interpolation and extrapolation. We measure predictive performance by 12 measures of accuracy, discrimination power, calibration, and precision of predictions, for the biological levels of species occurrence, species richness, and community composition. Our results show large variation among the models in their predictive performance, especially for communities comprising many species that are rare. The results do not reveal any major trade‐offs among measures of model performance; the same models performed generally well in terms of accuracy, discrimination, and calibration, and for the biological levels of individual species, species richness, and community composition. In contrast, the models that gave the most precise predictions were not well calibrated, suggesting that poorly performing models can make overconfident predictions. However, none of the models performed well for all prediction tasks. As a general strategy, we therefore propose that researchers fit a small set of models showing complementary performance, and then apply a cross‐validation procedure involving separate data to establish which of these models performs best for the goal of the study.

Genetic heterogeneity of diffuse large B-cell lymphoma
Jenny Zhang, Vladimir Grubor, Cassandra Love, Anjishnu Banerjee +4 more
2013 · Proceedings of the National Academy of Sciences · 534 citations · doi:10.1073/pnas.1205299110

Diffuse large B-cell lymphoma (DLBCL) is the most common form of lymphoma in adults. The disease exhibits a striking heterogeneity in gene expression profiles and clinical outcomes, but its genetic causes remain to be fully defined. Through whole genome and exome sequencing, we characterized the genetic diversity of DLBCL. In all, we sequenced 73 DLBCL primary tumors (34 with matched normal DNA). Separately, we sequenced the exomes of 21 DLBCL cell lines. We identified 322 DLBCL cancer genes that were recurrently mutated in primary DLBCLs. We identified recurrent mutations implicating a number of known and not previously identified genes and pathways in DLBCL including those related to chromatin modification (ARID1A and MEF2B), NF-κB (CARD11 and TNFAIP3), PI3 kinase (PIK3CD, PIK3R1, and MTOR), B-cell lineage (IRF8, POU2F2, and GNA13), and WNT signaling (WIF1). We also experimentally validated a mutation in PIK3CD, a gene not previously implicated in lymphomas. The patterns of mutation demonstrated a classic long tail distribution with substantial variation of mutated genes from patient to patient and also between published studies. Thus, our study reveals the tremendous genetic heterogeneity that underlies lymphomas and highlights the need for personalized medicine approaches to treating these patients.

The foundations of factor analysis
David J. Bartholomew
1984 · Biometrika · 419 citations · doi:10.1093/biomet/71.2.221

A new approach to factor analysis and related latent variable methods is proposed which is based on data reduction using the idea of Bayesian sufficiency. Considerations of symmetry, invariance and independence are used to determine an appropriate family of models. The results are expressed in terms of linear functions of the manifest variables after the manner of principal components analysis. The approach justifies some of the practices based on the normal theory factor model and lays a foundation for the treatment of nonnormal, including categorical, variables.

Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel
Jie Huang, Bryan Howie, Shane McCarthy, Yasin Memari +4 more
2015 · Nature Communications · 384 citations · doi:10.1038/ncomms9111

Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low depth (average 7x), aiming to exhaustively characterize genetic variation down to 0.1% minor allele frequency in the British population. Here we demonstrate the value of this resource for improving imputation accuracy at rare and low-frequency variants in both a UK and an Italian population. We show that large increases in imputation accuracy can be achieved by re-phasing WGS reference panels after initial genotype calling. We also present a method for combining WGS panels to improve variant coverage and downstream imputation accuracy, which we illustrate by integrating 7,562 WGS haplotypes from the UK10K project with 2,184 haplotypes from the 1000 Genomes Project. Finally, we introduce a novel approximation that maintains speed without sacrificing imputation accuracy for rare variants.

Innovative diagnostic tools for early detection of Alzheimer's disease
Christoph Laske, Hamid R. Sohrabi, Shaun Frost, Karmele López de Ipiña +4 more
2014 · Alzheimer's & Dementia · 315 citations · doi:10.1016/j.jalz.2014.06.004

Current state-of-the-art diagnostic measures of Alzheimer's disease (AD) are invasive (cerebrospinal fluid analysis), expensive (neuroimaging) and time-consuming (neuropsychological assessment) and thus have limited accessibility as frontline screening and diagnostic tools for AD. Thus, there is an increasing need for additional noninvasive and/or cost-effective tools, allowing identification of subjects in the preclinical or early clinical stages of AD who could be suitable for further cognitive evaluation and dementia diagnostics. Implementation of such tests may facilitate early and potentially more effective therapeutic and preventative strategies for AD. Before applying them in clinical practice, these tools should be examined in ongoing large clinical trials. This review will summarize and highlight the most promising screening tools including neuropsychometric, clinical, blood, and neurophysiological tests.

Efficacy of Endoscopic Ultrasound-guided Celiac Plexus Block and Celiac Plexus Neurolysis for Managing Abdominal Pain Associated With Chronic Pancreatitis and Pancreatic Cancer
Marina Kaufman, Gurpreet Singh, Sourish Das, Ronald Concha-Parra +3 more
2010 · Journal of Clinical Gastroenterology · 299 citations · doi:10.1097/mcg.0b013e3181bb854d

BACKGROUND/GOALS: Endoscopic ultrasound (EUS)-guided celiac plexus block (CPB) and celiac plexus neurolysis (CPN) have become important interventions in the management of pain due to chronic pancreatitis and pancreatic cancer. However, only a few well-structured studies have been performed to evaluate their efficacy. Given limited data, their use remains controversial. Herein, we evaluate the efficacy of EUS-guided CPB and CPN in alleviating chronic abdominal pain due to chronic pancreatitis and pancreatic cancer respectively. STUDY METHODS: Using Medline, Pubmed, and Embase databases from January 1966 through December 2007, a thorough search of the English literature for studies evaluating the efficacy of EUS-guided CPB and CPN for the management of chronic abdominal pain due to chronic pancreatitis and pancreatic cancer was conducted, along with a hand search of reference lists. Studies that involved less than 10 patients were excluded. Data on pain relief was extracted, pooled, and analyzed. RESULTS: A total of 9 studies were included in the final analysis. For chronic pancreatitis, 6 relevant studies were identified, comprising a total of 221 patients. EUS-guided CPB was effective in alleviating abdominal pain in 51.46% of patients. For pancreatic cancer, 5 relevant studies were identified with a total of 119 patients. EUS-guided CPN was effective in alleviating abdominal pain in 72.54% of patients. CONCLUSIONS: EUS-guided CPB was 51.46% effective in managing chronic abdominal pain in patients with chronic pancreatitis, but warrants improvement in patient selection and refinement of technique, whereas EUS-guided CPN was 72.54% effective in managing pain due to pancreatic cancer and is a reasonable option for patients with tolerance to narcotic analgesics.

Generalized joint attribute modeling for biodiversity analysis: median‐zero, multivariate, multifarious data
James S. Clark, Diana R. Nemergut, Bijan Seyednasrollah, Phillip J. Turner +1 more
2016 · Ecological Monographs · 286 citations · doi:10.1002/ecm.1241

Probabilistic forecasts of species distribution and abundance require models that accommodate the range of ecological data, including a joint distribution of multiple species based on combinations of continuous and discrete observations, mostly zeros. We develop a generalized joint attribute model (GJAM), a probabilistic framework that readily applies to data that are combinations of presence‐absence, ordinal, continuous, discrete, composition, zero‐inflated, and censored. It does so as a joint distribution over all species providing inference on sensitivity to input variables, correlations between species on the data scale, prediction, sensitivity analysis, definition of community structure, and missing data imputation. GJAM applications illustrate flexibility to the range of species‐abundance data. Applications to forest inventories demonstrate species relationships responding as a community to environmental variables. It shows that the environment can be inverse predicted from the joint distribution of species. Application to microbiome data demonstrates how inverse prediction in the GJAM framework accelerates variable selection, by isolating effects of each input variable's influence across all species.

Propensity score weighting with multilevel data
Fan Li, Alan M. Zaslavsky, Mary Beth Landrum
2013 · Statistics in Medicine · 264 citations · doi:10.1002/sim.5786

Propensity score methods are being increasingly used as a less parametric alternative to traditional regression to balance observed differences across groups in both descriptive and causal comparisons. Data collected in many disciplines often have analytically relevant multilevel or clustered structure. The propensity score, however, was developed and has been used primarily with unstructured data. We present and compare several propensity-score-weighted estimators for clustered data, including marginal, cluster-weighted, and doubly robust estimators. Using both analytical derivations and Monte Carlo simulations, we illustrate bias arising when the usual assumptions of propensity score analysis do not hold for multilevel data. We show that exploiting the multilevel structure, either parametrically or nonparametrically, in at least one stage of the propensity score analysis can greatly reduce these biases. We applied these methods to a study of racial disparities in breast cancer screening among beneficiaries of Medicare health plans.

Probability measures on the space of persistence diagrams
Yuriy Mileyko, Sayan Mukherjee, John Harer
2011 · Inverse Problems · 253 citations · doi:10.1088/0266-5611/27/12/124007

This paper shows that the space of persistence diagrams has properties that allow for the definition of probability measures which support expectations, variances, percentiles and conditional probabilities. This provides a theoretical basis for a statistical treatment of persistence diagrams, for example computing sample averages and sample variances of persistence diagrams. We first prove that the space of persistence diagrams with the Wasserstein metric is complete and separable. We then prove a simple criterion for compactness in this space. These facts allow us to show the existence of the standard statistical objects needed to extend the theory of topological persistence to a much larger set of applications.

Generalized Double Pareto Shrinkage
Artin Armagan, David B. Dunson, Jaeyong Lee
2013 · PubMed · 246 citations

-like tail behavior. Bayesian computation is straightforward via a simple Gibbs sampling algorithm. We investigate the properties of the maximum a posteriori estimator, as sparse estimation plays an important role in many problems, reveal connections with some well-established regularization procedures, and show some asymptotic results. The performance of the prior is tested through simulations and an application.

Similarity Coefficients for Binary Chemoinformatics Data: Overview and Extended Comparison Using Simulated and Real Data Sets
Roberto Todeschini, Viviana Consonni, Hua Xiang, John D. Holliday +2 more
2012 · Journal of Chemical Information and Modeling · 234 citations · doi:10.1021/ci300261r

This paper reports an analysis and comparison of the use of 51 different similarity coefficients for computing the similarities between binary fingerprints for both simulated and real chemical data sets. Five pairs and a triplet of coefficients were found to yield identical similarity values, leading to the elimination of seven of the coefficients. The remaining 44 coefficients were then compared in two ways: by their theoretical characteristics using simple descriptive statistics, correlation analysis, multidimensional scaling, Hasse diagrams, and the recently described atemporal target diffusion model; and by their effectiveness for similarity-based virtual screening using MDDR, WOMBAT, and MUV data. The comparisons demonstrate the general utility of the well-known Tanimoto method but also suggest other coefficients that may be worthy of further attention.

Using joint species distribution models for evaluating how species‐to‐species associations depend on the environmental context
Gleb Tikhonov, Nerea Abrego, David B. Dunson, Otso Ovaskainen
2017 · Methods in Ecology and Evolution · 225 citations · doi:10.1111/2041-210x.12723

Joint species distribution models (JSDM) are increasingly used to analyse community ecology data. Recent progress with JSDMs has provided ecologists with new tools for estimating species associations (residual co‐occurrence patterns after accounting for environmental niches) from large data sets, as well as for increasing the predictive power of species distribution models (SDMs) by accounting for such associations. Yet, one critical limitation of JSDMs developed thus far is that they assume constant species associations. However, in real ecological communities, the direction and strength of interspecific interactions are likely to be different under different environmental conditions. In this paper, we overcome the shortcoming of present JSDMs by allowing species associations to covary with measured environmental covariates. To estimate environmental‐dependent species associations, we utilize a latent variable structure, where the factor loadings are modelled as a linear regression to environmental covariates. We illustrate the performance of the statistical framework with both simulated and real data. Our results show that JSDMs perform substantially better in inferring environmental‐dependent species associations than single SDMs, especially with sparse data. Furthermore, JSDMs consistently overperform SDMs in terms of predictive power for generating predictions that account for environment‐dependent biotic associations. We implemented the statistical framework as a MATLAB package, which includes tools both for model parameterization as well as for post‐processing of results, particularly for addressing whether and how species associations depend on the environmental conditions. Our statistical framework provides a new tool for ecologists who wish to investigate from non‐manipulative observational community data the dependency of interspecific interactions on environmental context. Our method can be applied to answer the fundamental questions in community ecology about how species’ interactions shift in changing environmental conditions, as well as to predict future changes of species’ interactions in response to global change.

Nonparametric Bayes Modeling of Multivariate Categorical Data
David B. Dunson, Chuanhua Xing
2009 · Journal of the American Statistical Association · 218 citations · doi:10.1198/jasa.2009.tm08439

Modeling of multivariate unordered categorical (nominal) data is a challenging problem, particularly in high dimensions and cases in which one wishes to avoid strong assumptions about the dependence structure. Commonly used approaches rely on the incorporation of latent Gaussian random variables or parametric latent class models. The goal of this article is to develop a nonparametric Bayes approach, which defines a prior with full support on the space of distributions for multiple unordered categorical variables. This support condition ensures that we are not restricting the dependence structure a priori. We show this can be accomplished through a Dirichlet process mixture of product multinomial distributions, which is also a convenient form for posterior computation. Methods for nonparametric testing of violations of independence are proposed, and the methods are applied to model positional dependence within transcription factor binding motifs.

Causal Network Inference by Optimal Causation Entropy
Jie Sun, Dane Taylor, Erik M. Bollt
2015 · SIAM Journal on Applied Dynamical Systems · 215 citations · doi:10.1137/140956166

The broad abundance of time series data, which is in sharp contrast to limited knowledge of the underlying network dynamic processes that produce such observations, calls for a rigorous and efficient method of causal network inference. Here we develop mathematical theory of causation entropy, an information-theoretic statistic designed for model-free causality inference. For stationary Markov processes, we prove that for a given node in the network, its causal parents form the minimal set of nodes that maximizes causation entropy, a result we refer to as the optimal causation entropy principle. Furthermore, this principle guides us in developing computational and data efficient algorithms for causal network inference based on a two-step discovery and removal algorithm for time series data for a network-coupled dynamical system. Validation in terms of analytical and numerical results for Gaussian processes on large random networks highlights that inference by our algorithm outperforms previous leading methods, including conditional Granger causality and transfer entropy. Interestingly, our numerical results suggest that the number of samples required for accurate inference depends strongly on network characteristics such as the density of links and information diffusion rate and not necessarily on the number of nodes.

The Genetic Basis of Hepatosplenic T-cell Lymphoma
Matthew McKinney, Andrea B. Moffitt, Philippe Gaulard, Marion Travert +4 more
2017 · Cancer Discovery · 202 citations · doi:10.1158/2159-8290.cd-16-0330

Hepatosplenic T-cell lymphoma (HSTL) is a rare and lethal lymphoma; the genetic drivers of this disease are unknown. Through whole-exome sequencing of 68 HSTLs, we define recurrently mutated driver genes and copy-number alterations in the disease. Chromatin-modifying genes, including SETD2, INO80, and ARID1B, were commonly mutated in HSTL, affecting 62% of cases. HSTLs manifest frequent mutations in STAT5B (31%), STAT3 (9%), and PIK3CD (9%), for which there currently exist potential targeted therapies. In addition, we noted less frequent events in EZH2, KRAS, and TP53. SETD2 was the most frequently silenced gene in HSTL. We experimentally demonstrated that SETD2 acts as a tumor suppressor gene. In addition, we found that mutations in STAT5B and PIK3CD activate critical signaling pathways important to cell survival in HSTL. Our work thus defines the genetic landscape of HSTL and implicates gene mutations linked to HSTL pathogenesis and potential treatment targets. Significance: We report the first systematic application of whole-exome sequencing to define the genetic basis of HSTL, a rare but lethal disease. Our work defines SETD2 as a tumor suppressor gene in HSTL and implicates genes including INO80 and PIK3CD in the disease. Cancer Discov; 7(4); 369–79. ©2017 AACR. See related commentary by Yoshida and Weinstock, p. 352. This article is highlighted in the In This Issue feature, p. 339.

Major causes of death in preterm infants in selected hospitals in Ethiopia (SIP): a prospective, cross-sectional, observational study
Lulu Muhe, Elizabeth M. McClure, Assaye K. Nigussie, Amha Mekasha +4 more
2019 · The Lancet Global Health · 199 citations · doi:10.1016/s2214-109x(19)30220-7

BACKGROUND: Neonatal deaths now account for 47% of all deaths in children younger than 5 years globally. More than a third of newborn deaths are due to preterm birth complications, which is the leading cause of death. Understanding the causes and factors contributing to neonatal deaths is needed to identify interventions that will reduce mortality. We aimed to establish the major causes of preterm mortality in preterm infants in the first 28 days of life in Ethiopia. METHODS: We did a prospective, cross-sectional, observational study in five hospitals in Ethiopia. Study participants were preterm infants born in the study hospitals at younger than 37 gestational weeks. Infants whose gestational age could not be reliably estimated and those born as a result of induced abortion were excluded from the study. Data were collected on maternal and obstetric history, clinical maternal and neonatal conditions, and laboratory investigations. For neonates who died of those enrolled, consent was requested from parents for post-mortem examinations (both complete diagnostic autopsy and minimally invasive tissue sampling). An independent panel of experts established the primary and contributory causes of preterm mortality with available data. FINDINGS: Between July 1, 2016, and May 31, 2018, 4919 preterm infants were enrolled in the study and 3852 were admitted to neonatal intensive care units. By 28 days of post-natal age, 1109 (29%) of those admitted to the neonatal intensive care unit died. Complete diagnostic autopsy was done in 441 (40%) and minimally invasive tissue sampling in 126 (11%) of the neonatal intensive care unit deaths. The main primary causes of death in the 1109 infants were established as respiratory distress syndrome (502 [45%]); sepsis, pneumonia and meningitis (combined as neonatal infections; 331 [30%]), and asphyxia (151 [14%]). Hypothermia was the most common contributory cause of preterm mortality (770 [69%]). The highest mortality occurred in infants younger than 28 weeks of gestation (89 [86%] of 104), followed by infants aged 28-31 weeks (512 [54%] of 952), 32-34 weeks (349 [18%] of 1975), and 35-36 weeks (159 [8%] of 1888). INTERPRETATION: Three conditions accounted for 89% of all deaths among preterm infants in Ethiopia. Scale-up interventions are needed to prevent or treat these conditions. Further research is required to develop effective and affordable interventions to prevent and treat the major causes of preterm death. FUNDING: Bill & Melinda Gates Foundation.

Space‐Time Data Fusion Under Error in Computer Model Output: An Application to Modeling Air Quality
Veronica J. Berrocal, Alan E. Gelfand, David M. Holland
2011 · Biometrics · 194 citations · doi:10.1111/j.1541-0420.2011.01725.x

We provide methods that can be used to obtain more accurate environmental exposure assessment. In particular, we propose two modeling approaches to combine monitoring data at point level with numerical model output at grid cell level, yielding improved prediction of ambient exposure at point level. Extending our earlier downscaler model (Berrocal, V. J., Gelfand, A. E., and Holland, D. M. (2010b). A spatio-temporal downscaler for outputs from numerical models. Journal of Agricultural, Biological and Environmental Statistics 15, 176-197), these new models are intended to address two potential concerns with the model output. One recognizes that there may be useful information in the outputs for grid cells that are neighbors of the one in which the location lies. The second acknowledges potential spatial misalignment between a station and its putatively associated grid cell. The first model is a Gaussian Markov random field smoothed downscaler that relates monitoring station data and computer model output via the introduction of a latent Gaussian Markov random field linked to both sources of data. The second model is a smoothed downscaler with spatially varying random weights defined through a latent Gaussian process and an exponential kernel function, that yields, at each site, a new variable on which the monitoring station data is regressed with a spatial linear model. We applied both methods to daily ozone concentration data for the Eastern US during the summer months of June, July and August 2001, obtaining, respectively, a 5% and a 15% predictive gain in overall predictive mean square error over our earlier downscaler model (Berrocal et al., 2010b). Perhaps more importantly, the predictive gain is greater at hold-out sites that are far from monitoring sites.