NobleBlocks
Broad Institute logo

Broad Institute

nonprofitCambridge, Massachusetts, United States

Research output, citation impact, and the most-cited recent papers from Broad Institute (United States). Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works
62.0K
Citations
34.5M
h-index
2592
i10-index
77.1K
Also known as
Broad InstituteBroad Institute of MIT and HarvardEl Instituto BroadEli and Edythe L. Broad Institute of MIT and Harvard

Top-cited papers from Broad Institute

The Sequence Alignment/Map format and SAMtools
Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell +4 more
2009· Bioinformatics67.0Kdoi:10.1093/bioinformatics/btp352

SUMMARY: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. AVAILABILITY: http://samtools.sourceforge.net.

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
Aaron McKenna, Matthew G. Hanna, Eric Banks, Andrey Sivachenko +4 more
2010· Genome Research29.8Kdoi:10.1101/gr.107524.110

Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS--the 1000 Genome pilot alone includes nearly five terabases--make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

A global reference for human genetic variation
Corresponding authors, Adam Auton, Gonçalo R. Abecasis, David M. Altshuler +4 more
2015· Nature19.8Kdoi:10.1038/nature15393

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies. Results for the final phase of the 1000 Genomes Project are presented including whole-genome sequencing, targeted exome sequencing, and genotyping on high-density SNP arrays for 2,504 individuals across 26 populations, providing a global reference data set to support biomedical genetics. The 1000 Genomes Project has sought to comprehensively catalogue human genetic variation across populations, providing a valuable public genomic resource. The data obtained so far have found applications ranging from association studies and fine mapping studies to the filtering of likely neutral variants in rare-disease cohorts. The authors now report on the final phase of the project, phase 3, which covers previously uncharacterized areas of human genetic diversity in terms of the populations sampled and categories of characterized variation. The sample now includes more than 2,500 individuals from 26 global populations, with low coverage whole-genome and deep exome sequencing, as well as dense microarray genotyping. They find that while most common variants are shared across populations, rarer variants are often restricted to closely related populations. The authors also demonstrate the use of the phase 3 dataset as a reference panel for imputation to improve the resolution in genetic association studies.

Model-based Analysis of ChIP-Seq (MACS)
Yong Zhang, Tao Liu, Clifford A. Meyer, Jérôme Eeckhoute +4 more
2008· Genome biology19.8Kdoi:10.1186/gb-2008-9-9-r137

We present Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short read sequencers such as Solexa's Genome Analyzer. MACS empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, and is freely available.

The variant call format and VCFtools
Petr Danecek, Adam Auton, Gonçalo R. Abecasis, Cornelis A. Albers +4 more
2011· Bioinformatics17.6Kdoi:10.1093/bioinformatics/btr330

SUMMARY: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. AVAILABILITY: http://vcftools.sourceforge.net

Minimap2: pairwise alignment for nucleotide sequences
Heng Li
2018· Bioinformatics16.5Kdoi:10.1093/bioinformatics/bty191

Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms. Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation: https://github.com/lh3/minimap2. Supplementary information: Supplementary data are available at Bioinformatics online.

Metagenomic biomarker discovery and explanation
Nicola Segata, Jacques Izard, Levi Waldron, Dirk Gevers +3 more
2011· Genome biology16.5Kdoi:10.1186/gb-2011-12-6-r60

This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/.

UCHIME improves sensitivity and speed of chimera detection
R. C. Edgar, Brian J. Haas, José C. Clemente, Christopher Quince +1 more
2011· Bioinformatics15.5Kdoi:10.1093/bioinformatics/btr381

MOTIVATION: Chimeric DNA sequences often form during polymerase chain reaction amplification, especially when sequencing single regions (e.g. 16S rRNA or fungal Internal Transcribed Spacer) to assess diversity or compare populations. Undetected chimeras may be misinterpreted as novel species, causing inflated estimates of diversity and spurious inferences of differences between populations. Detection and removal of chimeras is therefore of critical importance in such experiments. RESULTS: We describe UCHIME, a new program that detects chimeric sequences with two or more segments. UCHIME either uses a database of chimera-free sequences or detects chimeras de novo by exploiting abundance data. UCHIME has better sensitivity than ChimeraSlayer (previously the most sensitive database method), especially with short, noisy sequences. In testing on artificial bacterial communities with known composition, UCHIME de novo sensitivity is shown to be comparable to Perseus. UCHIME is >100× faster than Perseus and >1000× faster than ChimeraSlayer. CONTACT: robert@drive5.com AVAILABILITY: Source, binaries and data: http://drive5.com/uchime. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Haploview: analysis and visualization of LD and haplotype maps
Jeffrey C. Barrett, Ben Fry, Julian Maller, Mark Daly
2004· Bioinformatics14.7Kdoi:10.1093/bioinformatics/bth457

UNLABELLED: Research over the last few years has revealed significant haplotype structure in the human genome. The characterization of these patterns, particularly in the context of medical genetic association studies, is becoming a routine research activity. Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface. AVAILABILITY: http://www.broad.mit.edu/mpg/haploview/ CONTACT: jcbarret@broad.mit.edu

Second-generation PLINK: rising to the challenge of larger and richer datasets
Christopher Chang, Carson C. Chow, Laurent CAM Tellier, Shashaank Vattikuti +2 more
2015· GigaScience13.8Kdoi:10.1186/s13742-015-0047-8

BACKGROUND: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. FINDINGS: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, [Formula: see text]-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). CONCLUSIONS: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel +2 more
2013· Genome biology13.6Kdoi:10.1186/gb-2013-14-4-r36

TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.

Structure, function and diversity of the healthy human microbiome
Curtis Huttenhower, J. Fah Sathirapongsasuti, Nicola Segata,  Curtis Huttenhower +4 more
2012· Nature11.9Kdoi:10.1038/nature11234

Studies of the human microbiome have revealed that even healthy individuals differ remarkably in the microbes that occupy habitats such as the gut, skin and vagina. Much of this diversity remains unexplained, although diet, environment, host genetics and early microbial exposure have all been implicated. Accordingly, to characterize the ecology of human-associated microbial communities, the Human Microbiome Project has analysed the largest cohort and set of distinct, clinically relevant body habitats so far. We found the diversity and abundance of each habitat’s signature microbes to vary widely even among healthy subjects, with strong niche specialization both within and among individuals. The project encountered an estimated 81–99% of the genera, enzyme families and community configurations occupied by the healthy Western microbiome. Metagenomic carriage of metabolic pathways was stable among individuals despite variation in community structure, and ethnic/racial background proved to be one of the strongest associations of both pathways and microbes with clinical metadata. These results thus delineate the range of structural and functional configurations normal in the microbial communities of a healthy population, enabling future characterization of the epidemiology, ecology and translational applications of the human microbiome. The Human Microbiome Project Consortium reports the first results of their analysis of microbial communities from distinct, clinically relevant body habitats in a human cohort; the insights into the microbial communities of a healthy population lay foundations for future exploration of the epidemiology, ecology and translational applications of the human microbiome. The Human Microbiome Project (HMP), supported by the National Institutes of Health Common Fund, has the goal of characterizing the microbial communities that inhabit and interact with the human body in sickness and in health. In two Articles in this issue of Nature, the HMP Consortium presents the first population-scale details of the organismal and functional composition of the microbiota across five areas of the body. An associated News & Views discusses the initial results — which, along with those of a series of co-publications, already constitute the most extensive catalogue of organisms and genes related to the human microbiome yet published — and highlights some of the major questions that the project will tackle in the next few years.

Jalview Version 2—a multiple sequence alignment editor and analysis workbench
Andrew Waterhouse, James B Procter, David Martin, Michèle Clamp +1 more
2009· Bioinformatics10.8Kdoi:10.1093/bioinformatics/btp033

UNLABELLED: Jalview Version 2 is a system for interactive WYSIWYG editing, analysis and annotation of multiple sequence alignments. Core features include keyboard and mouse-based editing, multiple views and alignment overviews, and linked structure display with Jmol. Jalview 2 is available in two forms: a lightweight Java applet for use in web applications, and a powerful desktop application that employs web services for sequence alignment, secondary structure prediction and the retrieval of alignments, sequences, annotation and structures from public databases and any DAS 1.53 compliant sequence or annotation server. AVAILABILITY: The Jalview 2 Desktop application and JalviewLite applet are made freely available under the GPL, and can be downloaded from www.jalview.org.

Inferring tumour purity and stromal and immune cell admixture from expression data
Kosuke Yoshihara, Maria Shahmoradgoli, Emmanuel Martínez, Rahulsimham Vegesna +4 more
2013· Nature Communications10.7Kdoi:10.1038/ncomms3612

Infiltrating stromal and immune cells form the major fraction of normal cells in tumour tissue and not only perturb the tumour signal in molecular studies but also have an important role in cancer biology. Here we describe ‘Estimation of STromal and Immune cells in MAlignant Tumours using Expression data’ (ESTIMATE)—a method that uses gene expression signatures to infer the fraction of stromal and immune cells in tumour samples. ESTIMATE scores correlate with DNA copy number-based tumour purity across samples from 11 different tumour types, profiled on Agilent, Affymetrix platforms or based on RNA sequencing and available through The Cancer Genome Atlas. The prediction accuracy is further corroborated using 3,809 transcriptional profiles available elsewhere in the public domain. The ESTIMATE method allows consideration of tumour-associated normal cells in genomic and transcriptomic studies. An R-library is available on https://sourceforge.net/projects/estimateproject/ . Tumour biopsies contain contaminating normal cells and these can influence the analysis of tumour samples. In this study, Yoshihara et al.develop an algorithm based on gene expression profiles from The Cancer Genome Atlas to estimate the number of contaminating normal cells in tumour samples.

Analysis of protein-coding genetic variation in 60,706 humans
Monkol Lek, Konrad J. Karczewski, Eric Vallabh Minikel, Kaitlin E. Samocha +4 more
2016· Nature10.3Kdoi:10.1038/nature19057

Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.

Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement
Bruce J. Walker, Thomas Abeel, Terrance Shea, Margaret Priest +4 more
2014· PLoS ONE10.1Kdoi:10.1371/journal.pone.0112963

Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and calling sequence variants of multiple sizes, including very large insertions and deletions. Pilon works with many types of sequence data, but is particularly strong when supplied with paired end data from two Illumina libraries with small e.g., 180 bp and large e.g., 3-5 Kb inserts. Pilon significantly improves draft genome assemblies by correcting bases, fixing mis-assemblies and filling gaps. For both haploid and diploid genomes, Pilon produces more contiguous genomes with fewer errors, enabling identification of more biologically relevant genes. Furthermore, Pilon identifies small variants with high accuracy as compared to state-of-the-art tools and is unique in its ability to accurately identify large sequence variants including duplications and resolve large insertions. Pilon is being used to improve the assemblies of thousands of new genomes and to identify variants from thousands of clinically relevant bacterial strains. Pilon is freely available as open source software.

The mutational constraint spectrum quantified from variation in 141,456 humans
Konrad J. Karczewski, Laurent C. Francioli, Grace Tiao, Beryl B. Cummings +4 more
2020· Nature10.0Kdoi:10.1038/s41586-020-2308-7

Abstract Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes 1 . Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.

Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration
Helga Thorvaldsdóttir, James Robinson, Jill P. Mesirov
2012· Briefings in Bioinformatics9.7Kdoi:10.1093/bib/bbs017

Data visualization is an essential component of genomic data analysis. However, the size and diversity of the data sets produced by today's sequencing and array-based profiling methods present major challenges to visualization tools. The Integrative Genomics Viewer (IGV) is a high-performance viewer that efficiently handles large heterogeneous data sets, while providing a smooth and intuitive user experience at all levels of genome resolution. A key characteristic of IGV is its focus on the integrative nature of genomic studies, with support for both array-based and next-generation sequencing data, and the integration of clinical and phenotypic data. Although IGV is often used to view genomic data from public sources, its primary emphasis is to support researchers who wish to visualize and explore their own data sets or those from colleagues. To that end, IGV supports flexible loading of local and remote data sets, and is optimized to provide high-performance data visualization and exploration on standard desktop systems. IGV is freely available for download from http://www.broadinstitute.org/igv, under a GNU LGPL open-source license.

Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome
Erez Lieberman-Aiden, Nynke L. van Berkum, Louise Williams, Maxim Imakaev +4 more
2009· Science9.6Kdoi:10.1126/science.1181369

We describe Hi-C, a method that probes the three-dimensional architecture of whole genomes by coupling proximity-based ligation with massively parallel sequencing. We constructed spatial proximity maps of the human genome with Hi-C at a resolution of 1 megabase. These maps confirm the presence of chromosome territories and the spatial proximity of small, gene-rich chromosomes. We identified an additional level of genome organization that is characterized by the spatial segregation of open and closed chromatin to form two genome-wide compartments. At the megabase scale, the chromatin conformation is consistent with a fractal globule, a knot-free, polymer conformation that enables maximally dense packing while preserving the ability to easily fold and unfold any genomic locus. The fractal globule is distinct from the more commonly used globular equilibrium model. Our results demonstrate the power of Hi-C to map the dynamic conformations of whole genomes.

<i>EGFR</i> Mutations in Lung Cancer: Correlation with Clinical Response to Gefitinib Therapy
J. Guillermo Paez, Pasi A. Jänne, Jeffrey C. Lee, Sean Tracy +4 more
2004· Science9.4Kdoi:10.1126/science.1099314

Receptor tyrosine kinase genes were sequenced in non-small cell lung cancer (NSCLC) and matched normal tissue. Somatic mutations of the epidermal growth factor receptor gene EGFR were found in 15of 58 unselected tumors from Japan and 1 of 61 from the United States. Treatment with the EGFR kinase inhibitor gefitinib (Iressa) causes tumor regression in some patients with NSCLC, more frequently in Japan. EGFR mutations were found in additional lung cancer samples from U.S. patients who responded to gefitinib therapy and in a lung adenocarcinoma cell line that was hypersensitive to growth inhibition by gefitinib, but not in gefitinib-insensitive tumors or cell lines. These results suggest that EGFR mutations may predict sensitivity to gefitinib.