Texas Advanced Computing Center
Facility, Austin, United States
Research output, citation impact, and the most-cited recent papers from Texas Advanced Computing Center. Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Texas Advanced Computing Center
We have optimized and extended the widely used annotation engine MAKER in order to better support plant genome annotation efforts. New features include better parallelization for large repeat-rich plant genomes, noncoding RNA annotation capabilities, and support for pseudogene identification. We have benchmarked the resulting software tool kit, MAKER-P, using the Arabidopsis (Arabidopsis thaliana) and maize (Zea mays) genomes. Here, we demonstrate the ability of the MAKER-P tool kit to automatically update, extend, and revise the Arabidopsis annotations in light of newly available data and to annotate pseudogenes and noncoding RNAs absent from The Arabidopsis Information Resource 10 build. Our results demonstrate that MAKER-P can be used to manage and improve the annotations of even Arabidopsis, perhaps the best-annotated plant genome. We have also installed and benchmarked MAKER-P on the Texas Advanced Computing Center. We show that this public resource can de novo annotate the entire Arabidopsis and maize genomes in less than 3 h and produce annotations of comparable quality to those of the current The Arabidopsis Information Resource 10 and maize V2 annotation builds.
We present the first ground-based detection of sodium absorption in the transmission spectrum of an extrasolar planet. Absorption due to the atmosphere of the extrasolar planet HD 189733b is detected in both lines of the Na I doublet. High spectral resolution observations were taken of 11 transits with the High Resolution Spectrograph (HRS) on the 9.2 m Hobby-Eberly Telescope (HET). The Na I absorption in the transmission spectrum due to HD 189733b is (−67.2 ± 20.7) × 10⁻⁵ deeper in the "narrow" spectral band that encompasses both lines relative to adjacent bands. The 1σ error includes both random and systematic errors, and the detection is >3σ. This amount of relative absorption in Na I for HD 189733b is ∼3 times larger than that detected for HD 209458b by Charbonneau et al. (2002) and indicates that these two hot Jupiters may have significantly different atmospheric properties.
High-performance execution in distributed computing environments often requires careful selection and configuration not only of computers, networks, and other resources but also of the protocols and algorithms used by applications. Selection and configuration in turn require access to accurate, up-to-date information on the structure and state of available resources. Unfortunately, no standard mechanism exists for organizing or accessing such information. Consequently, different tools and applications adopt ad hoc mechanisms, or they compromise their portability and performance by using default configurations. We propose a solution to this problem: a Metacomputing Directory Service that provides efficient and scalable access to diverse, dynamic, and distributed information about resource structure and state. We define an extensible data model to represent the information required for distributed computing, and we present a scalable, high-performance, distributed implementation. The dat...
Today most systems in high-performance computing (HPC) feature a hierarchical hardware design: Shared memory nodes with several multi-core CPUs are connected via a network infrastructure. Parallel programming must combine distributed memory parallelization on the node interconnect with shared memory parallelization inside each node. We describe potentials and challenges of the dominant programming models on hierarchically structured hardware: Pure MPI (Message Passing Interface), pure OpenMP (with distributed shared memory extensions) and hybrid MPI+OpenMP in several flavors. We pinpoint cases where a hybrid programming model can indeed be the superior solution because of reduced communication needs and memory consumption, or improved load balance. Furthermore we show that machine topology has a significant impact on performance for all parallelization strategies and that topology awareness should be built into all applications in the future. Finally we give an outlook on possible standardization goals and extensions that could make hybrid programming easier to do with performance in mind.
Short-term probabilistic forecasts of the trajectory of the COVID-19 pandemic in the United States have served as a visible and important communication channel between the scientific modeling community and both the general public and decision-makers. Forecasting models provide specific, quantitative, and evaluable predictions that inform short-term decisions such as healthcare staffing needs, school closures, and allocation of medical supplies. Starting in April 2020, the US COVID-19 Forecast Hub (https://covid19forecasthub.org/) collected, disseminated, and synthesized tens of millions of specific predictions from more than 90 different academic, industry, and independent research groups. A multimodel ensemble forecast that combined predictions from dozens of groups every week provided the most consistently accurate probabilistic forecasts of incident deaths due to COVID-19 at the state and national level from April 2020 through October 2021. The performance of 27 individual models that submitted complete forecasts of COVID-19 deaths consistently throughout this year showed high variability in forecast skill across time, geospatial units, and forecast horizons. Two-thirds of the models evaluated showed better accuracy than a naïve baseline model. Forecast accuracy degraded as models made predictions further into the future, with probabilistic error at a 20-wk horizon three to five times larger than when predicting at a 1-wk horizon. This project underscores the role that collaboration and active coordination between governmental public-health agencies, academic modeling teams, and industry partners can play in developing modern modeling capabilities to support local, state, and federal response to outbreaks.
Natural hazards engineering plays an important role in minimizing the effects of natural hazards on society through the design of resilient and sustainable infrastructure. The DesignSafe cyberinfrastructure has been developed to enable and facilitate transformative research in natural hazards engineering, which necessarily spans across multiple disciplines and can take advantage of advancements in computation, experimentation, and data analysis. DesignSafe allows researchers to more effectively share and find data using cloud services, perform numerical simulations using high performance computing, and integrate diverse datasets so that researchers can make discoveries that were previously unattainable. This paper describes the design principles used in the cyberinfrastructure development process, introduces the main components of the DesignSafe cyberinfrastructure, and illustrates the use of the DesignSafe cyberinfrastructure in research in natural hazards engineering through various examples.
DNA methylation is a chromatin modification that is frequently associated with epigenetic regulation in plants and mammals. However, genetic changes such as transposon insertions can also lead to changes in DNA methylation. Genome-wide profiles of DNA methylation for 20 maize (Zea mays) inbred lines were used to discover differentially methylated regions (DMRs). The methylation level for each of these DMRs was also assayed in 31 additional maize or teosinte genotypes, resulting in the discovery of 1966 common DMRs and 1754 rare DMRs. Analysis of recombinant inbred lines provides evidence that the majority of DMRs are heritable. A local association scan found that nearly half of the DMRs with common variation are significantly associated with single nucleotide polymorphisms found within or near the DMR. Many of the DMRs that are significantly associated with local genetic variation are found near transposable elements that may contribute to the variation in DNA methylation. Analysis of gene expression in the same samples used for DNA methylation profiling identified over 300 genes with expression patterns that are significantly associated with DNA methylation variation. Collectively, our results suggest that DNA methylation variation is influenced by genetic and epigenetic changes that are often stably inherited and can influence the expression of nearby genes.
The National Heart, Lung, and Blood Institute is funding an effort to create a molecular atlas of the developing lung (LungMAP) to serve as a research resource and public education tool. The lung is a complex organ with lengthy development time driven by interactive gene networks and dynamic cross talk among multiple cell types to control and coordinate lineage specification, cell proliferation, differentiation, migration, morphogenesis, and injury repair. A better understanding of the processes that regulate lung development, particularly alveologenesis, will have a significant impact on survival rates for premature infants born with incomplete lung development and will facilitate lung injury repair and regeneration in adults. A consortium of four research centers, a data coordinating center, and a human tissue repository provides high-quality molecular data of developing human and mouse lungs. LungMAP includes mouse and human data for cross correlation of developmental processes across species. LungMAP is generating foundational data and analysis, creating a web portal for presentation of results and public sharing of data sets, establishing a repository of young human lung tissues obtained through organ donor organizations, and developing a comprehensive lung ontology that incorporates the latest findings of the consortium. The LungMAP website (www.lungmap.net) currently contains more than 6,000 high-resolution lung images and transcriptomic, proteomic, and lipidomic human and mouse data and provides scientific information to stimulate interest in research careers for young audiences. This paper presents a brief description of research conducted by the consortium, database, and portal development and upcoming features that will enhance the LungMAP experience for a community of users.
We introduce a fingerprint representation of molecules based on a Fourier series of atomic radial distribution functions. This fingerprint is unique (except for chirality), continuous, and differentiable with respect to atomic coordinates and nuclear charges. It is invariant with respect to translation, rotation, and nuclear permutation, and requires no preconceived knowledge about chemical bonding, topology, or electronic orbitals. As such, it meets many important criteria for a good molecular representation, suggesting its usefulness for machine learning models of molecular properties trained across chemical compound space. To assess the performance of this new descriptor, we have trained machine learning models of molecular enthalpies of atomization for training sets with up to 10 k organic molecules, drawn at random from a published set of 134 k organic molecules with an average atomization enthalpy of over 1770 kcal/mol. We validate the descriptor on all remaining molecules of the 134 k set. For a training set of 10 k molecules, the fingerprint descriptor achieves a mean absolute error of 8.0 kcal/mol. This is slightly worse than the performance attained using the Coulomb matrix, another popular alternative, reaching 6.2 kcal/mol for the same training and test sets. © 2015 Wiley Periodicals, Inc.
The Sloan Digital Sky Survey III's Apache Point Observatory Galactic Evolution Experiment (APOGEE) is a high-resolution near-infrared spectroscopic survey covering all of the major components of the Galaxy, including the dust-obscured regions of the inner Milky Way disk and bulge. Here we present a sample of 10,341 likely red-clump stars (RC) from the first two years of APOGEE operations, selected based on their position in color-metallicity-surface-gravity-effective-temperature space using a new method calibrated using stellar evolution models and high-quality asteroseismology data. The narrowness of the RC locus in color-metallicity-luminosity space allows us to assign distances to the stars with an accuracy of 5%-10%. The sample extends to typical distances of about 3 kpc from the Sun, with some stars out to 8 kpc, and spans a volume of approximately 100 kpc³ over 5 kpc ≲ R ≲ 14 kpc, |Z| ≲ 2 kpc, and −15° ≲ Galactocentric azimuth ≲ 30°. The APOGEE red-clump (APOGEE-RC) catalog contains photometry from the Two Micron All Sky Survey, reddening estimates, distances, line-of-sight velocities, stellar parameters and elemental abundances determined from the high-resolution APOGEE spectra, and matches to major proper motion catalogs. We determine the survey selection function for this data set and discuss how the RC selection samples the underlying stellar populations. We use this sample to limit any azimuthal variations in the median metallicity within the ≈45° azimuthal region covered by the current sample to be ≤0.02 dex, which is more than an order of magnitude smaller than the radial metallicity gradient. This result constrains coherent non-axisymmetric flows within a few kiloparsecs from the Sun.
The Arabidopsis Information Portal (https://www.araport.org) is a new online resource for plant biology research. It houses the Arabidopsis thaliana genome sequence and associated annotation. It was conceived as a framework that allows the research community to develop and release 'modules' that integrate, analyze and visualize Arabidopsis data that may reside at remote sites. The current implementation provides an indexed database of core genomic information. These data are made available through feature-rich web applications that provide search, data mining, and genome browser functionality, and also by bulk download and web services. Araport uses software from the InterMine and JBrowse projects to expose curated data from TAIR, GO, BAR, EBI, UniProt, PubMed and EPIC CoGe. The site also hosts 'science apps,' developed as prototypes for community modules that use dynamic web pages to present data obtained on-demand from third-party servers via RESTful web services. Designed for sustainability, the Arabidopsis Information Portal strategy exploits existing scientific computing infrastructure, adopts a practical mixture of data integration technologies and encourages collaborative enhancement of the resource by its user community.
Imprinting describes the differential expression of alleles based on their parent of origin. Deep sequencing of RNAs from maize (Zea mays) endosperm and embryo tissue 14 d after pollination was used to identify imprinted genes among a set of ~12,000 genes that were expressed and contained sequence polymorphisms between the B73 and Mo17 genotypes. The analysis of parent-of-origin patterns of expression resulted in the identification of 100 putative imprinted genes in maize endosperm, including 54 maternally expressed genes (MEGs) and 46 paternally expressed genes (PEGs). Three of these genes have been previously identified as imprinted, while the remaining 97 genes represent novel imprinted maize genes. A genome-wide analysis of DNA methylation identified regions with reduced endosperm DNA methylation in, or near, 19 of the 100 imprinted genes. The reduced levels of DNA methylation in endosperm are caused by hypomethylation of the maternal allele for both MEGs and PEGs in all cases tested. Many of the imprinted genes with reduced DNA methylation levels also show endosperm-specific expression patterns. The imprinted maize genes were compared with imprinted genes identified in genome-wide screens of rice (Oryza sativa) and Arabidopsis thaliana, and at least 10 examples of conserved imprinting between maize and each of the other species were identified.
Effective virtual screening relies on our ability to make accurate prediction of protein-ligand binding, which remains a great challenge. In this work, utilizing the molecular-mechanics Poisson-Boltzmann (or Generalized Born) surface area approach, we have evaluated the binding affinity of a set of 156 ligands to seven families of proteins, trypsin β, thrombin α, cyclin-dependent kinase (CDK), cAMP-dependent kinase (PKA), urokinase-type plasminogen activator, β-glucosidase A, and coagulation factor Xa. The effect of protein dielectric constant in the implicit-solvent model on the binding free energy calculation is shown to be important. The statistical correlations between the binding energy calculated from the implicit-solvent approach and experimental free energy are in the range of 0.56-0.79 across all the families. This performance is better than that of typical docking programs especially given that the latter is directly trained using known binding data whereas the molecular mechanics is based on general physical parameters. Estimation of entropic contribution remains the barrier to accurate free energy calculation. We show that the traditional rigid rotor harmonic oscillator approximation is unable to improve the binding free energy prediction. Inclusion of conformational restriction seems to be promising but requires further investigation. On the other hand, our preliminary study suggests that implicit-solvent based alchemical perturbation, which offers explicit sampling of configuration entropy, can be a viable approach to significantly improve the prediction of binding free energy. Overall, the molecular mechanics approach has the potential for medium to high-throughput computational drug discovery.
DNA methylation can play important roles in the regulation of transposable elements and genes. A collection of mutant alleles for 11 maize (Zea mays) genes predicted to play roles in controlling DNA methylation were isolated through forward- or reverse-genetic approaches. Low-coverage whole-genome bisulfite sequencing and high-coverage sequence-capture bisulfite sequencing were applied to mutant lines to determine context- and locus-specific effects of these mutations on DNA methylation profiles. Plants containing mutant alleles for components of the RNA-directed DNA methylation pathway exhibit loss of CHH methylation at many loci as well as CG and CHG methylation at a small number of loci. Plants containing loss-of-function alleles for chromomethylase (CMT) genes exhibit strong genome-wide reductions in CHG methylation and some locus-specific loss of CHH methylation. In an attempt to identify stocks with stronger reductions in DNA methylation levels than provided by single gene mutations, we performed crosses to create double mutants for the maize CMT3 orthologs, Zmet2 and Zmet5, and for the maize DDM1 orthologs, Chr101 and Chr106. While loss-of-function alleles are viable as single gene mutants, the double mutants were not recovered, suggesting that severe perturbations of the maize methylome may have stronger deleterious phenotypic effects than in Arabidopsis thaliana.
PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
We develop a generalizable AI-driven workflow that leverages heterogeneous HPC resources to explore the time-dependent dynamics of molecular systems. We use this workflow to investigate the mechanisms of infectivity of the SARS-CoV-2 spike protein, the main viral infection machinery. Our workflow enables more efficient investigation of spike dynamics in a variety of complex environments, including within a complete SARS-CoV-2 viral envelope simulation, which contains 305 million atoms and shows strong scaling on ORNL Summit using NAMD. We present several novel scientific discoveries, including the elucidation of the spike's full glycan shield, the role of spike glycans in modulating the infectivity of the virus, and the characterization of the flexible interactions between the spike and the human ACE2 receptor. We also demonstrate how AI can accelerate conformational sampling across different systems and pave the way for the future application of such methods to additional studies in SARS-CoV-2 and other molecular systems.
As part of the NSF’s cyberinfrastructure vision for a robust mix of high capability and capacity HPC systems, Frontera represents the most recent evolution of trans-petascale resources available to all open science research projects in the U.S. Debuting as the fifth largest supercomputer in the world, Frontera represents a robust and well-balanced HPC system designed to enable large-scale, productive science on day one of operations. The system provides a primary compute capability of nearly 39 PF, delivered completely via more than 8,000 dual-socket servers with conventional Intel 8280 (“Cascade Lake”) processors. A unique configuration of both desktop GPUs and advanced floating units from NVIDIA enables both machine learning and scientific workloads, and the system delivers nearly 2 TB/s of total filesystem bandwidth with 55 PB of usable Lustre disk-based storage and 3 PB of all-flash Lustre storage. A Mellanox InfiniBand (IB) interconnect provides very low latency with 100 Gbps to each node, and 200 Gbps between switches in a fat tree topology with minimal oversubscription for efficient communication, even in jobs that use the full system with complex communication patterns. The system hardware is complemented by a robust set of software services, including Application Programmer Interfaces (APIs) to support an evolving user base that increasingly demands productive access via science gateways and automated workflows, as well as a first-of-its-kind partnership with the three major cloud service providers to create a bridge between “traditional” HPC and the cloud infrastructure upon which research increasingly depends.
Scientific data is continually increasing in complexity, variety and size, making efficient visualization and specifically rendering an ongoing challenge. Traditional rasterization-based visualization approaches encounter performance and quality limitations, particularly in HPC environments without dedicated rendering hardware. In this paper, we present OSPRay, a turn-key CPU ray tracing framework oriented towards production-use scientific visualization which can utilize varying SIMD widths and multiple device backends found across diverse HPC resources. This framework provides a high-quality, efficient CPU-based solution for typical visualization workloads, which has already been integrated into several prevalent visualization packages. We show that this system delivers the performance, high-level API simplicity, and modular device support needed to provide a compelling new rendering framework for implementing efficient scientific visualization workflows.
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (i) tackle new tasks, like protein-ligand complex structure prediction, (ii) investigate the process by which the model learns, which remains poorly understood, and (iii) assess the model’s generalization capacity to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient, and trainable implementation of AlphaFold2. We train OpenFold from scratch, fully matching the accuracy of AlphaFold2. Having established parity, we assess OpenFold’s capacity to generalize across fold space by retraining it using carefully designed datasets. We find that OpenFold is remarkably robust at generalizing despite extreme reductions in training set size and diversity, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced by OpenFold during training, we also gain surprising insights into the manner in which the model learns to fold proteins, discovering that spatial dimensions are learned sequentially. Taken together, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial new resource for the protein modeling community.
We present the H-band spectral line lists adopted by the Apache Point Observatory Galactic Evolution Experiment (APOGEE). The APOGEE line lists comprise astrophysical, theoretical, and laboratory sources from the literature, as well as newly evaluated astrophysical oscillator strengths and damping parameters. We discuss the construction of the APOGEE line list, which is one of the critical inputs for the APOGEE Stellar Parameters and Chemical Abundances Pipeline, and present three different versions that have been used at various stages of the project. The methodology for the newly calculated astrophysical line lists is reviewed. The largest of these three line lists contains 134,457 molecular and atomic transitions. In addition to the format adopted to store the data, the line lists are available in MOOG, Synspec and Turbospectrum formats. We also present a list of H-band spectral features that are either poorly represented or completely missing in our line list. This list is based on the average of a large number of spectral fit residuals for APOGEE observations spanning a wide range of stellar parameters.