Chan Zuckerberg Biohub San Francisco
Facility, San Francisco, United States
Research output, citation impact, and the most-cited recent papers from Chan Zuckerberg Biohub San Francisco (United States). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from Chan Zuckerberg Biohub San Francisco
The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These prediction and generation capabilities span molecular to genomic scales of complexity, advancing our understanding and control of biology.
Tumor evolution is driven by the progressive acquisition of genetic and epigenetic alterations that enable uncontrolled growth and expansion to neighboring and distal tissues. The study of phylogenetic relationships between cancer cells provides key insights into these processes. Here, we introduced an evolving lineage-tracing system with a single-cell RNA-seq readout into a mouse model of Kras;Trp53(KP)-driven lung adenocarcinoma and tracked tumor evolution from single-transformed cells to metastatic tumors at unprecedented resolution. We found that the loss of the initial, stable alveolar-type2-like state was accompanied by a transient increase in plasticity. This was followed by the adoption of distinct transcriptional programs that enable rapid expansion and, ultimately, clonal sweep of stable subclones capable of metastasizing. Finally, tumors develop through stereotypical evolutionary trajectories, and perturbing additional tumor suppressors accelerates progression by creating novel trajectories. Our study elucidates the hierarchical nature of tumor evolution and, more broadly, enables in-depth studies of tumor progression.
Adoptive cell therapy using engineered T cell receptors (TCRs) is a promising approach for targeting cancer antigens, but tumor-reactive TCRs are often weakly responsive to their target ligands, peptide-major histocompatibility complexes (pMHCs). Affinity-matured TCRs can enhance the efficacy of TCR-T cell therapy but can also cross-react with off-target antigens, resulting in organ immunopathology. We developed an alternative strategy to isolate TCR mutants that exhibited high activation signals coupled with low-affinity pMHC binding through the acquisition of catch bonds. Engineered analogs of a tumor antigen MAGE-A3-specific TCR maintained physiological affinities while exhibiting enhanced target killing potency and undetectable cross-reactivity, compared with a high-affinity clinically tested TCR that exhibited lethal cross-reactivity with a cardiac antigen. Catch bond engineering is a biophysically based strategy to tune high-sensitivity TCRs for T cell therapy with reduced potential for adverse cross-reactivity.
Fluorescence microscopy is a key driver of discoveries in the life sciences, with observable phenomena being limited by the optics of the microscope, the chemistry of the fluorophores, and the maximum photon exposure tolerated by the sample. These limits necessitate trade-offs between imaging speed, spatial resolution, light exposure, and imaging depth. In this work we show how image restoration based on deep learning extends the range of biological phenomena observable by microscopy. On seven concrete examples we demonstrate how microscopy images can be restored even if 60-fold fewer photons are used during acquisition, how near isotropic resolution can be achieved with up to 10-fold under-sampling along the axial direction, and how tubular and granular structures smaller than the diffraction limit can be resolved at 20-times higher frame-rates compared to state-of-the-art methods. All developed image restoration methods are freely available as open source software in Python, Fiji, and KNIME.
Developing a universal representation space for cells which encompasses the tremendous molecular diversity of cell types within the human body and more generally, across species, would be transformative for cell biology. Recent work using single-cell transcriptomic approaches to create molecular definitions of cell types in the form of cell atlases has provided the necessary data for such an endeavor. Here, we present the Universal Cell Embedding (UCE) foundation model. UCE was trained on a corpus of cell atlas data from human and other species in a completely self-supervised way without any data annotations. UCE's modeling approach is to create a unified biological latent space that can represent cells across diverse tissues and species. This universal cell embedding captures important biological variation despite the presence of experimental noise across diverse datasets. An important aspect of UCE's universality is that new cells can be mapped to this embedding space with no additional data labeling, model training or fine-tuning. We applied UCE to create the Integrated Mega-scale Atlas, embedding 36 million cells, with more than 1,000 uniquely named cell types, from hundreds of experiments, dozens of tissues and eight species. We uncovered new insights about the organization of cell types and tissues within this universal cell embedding space, and leveraged it to infer function of newly discovered cell types. UCE's embedding space exhibits emergent behavior, uncovering biology that it was never explicitly trained for, such as identifying developmental lineages and embedding data from novel species not included in the training set. Overall, by enabling a universal representation for every cell state and type, UCE provides a valuable tool for analysis, annotation and hypothesis generation over single cell data.
SARS-CoV-2 infection primarily targets the lung but may also damage other organs, including the brain, heart, kidney, and intestine. Central nervous system (CNS) pathologies include loss of smell and taste, headache, delirium, acute psychosis, seizures, and stroke. Pathological loss of gray matter occurs in SARS-CoV-2 infection, but it is unclear whether this is due to direct viral infection, indirect effects associated with systemic inflammation, or both. Here, we used induced pluripotent stem cell (iPSC)-derived brain organoids and primary human astrocytes from the cerebral cortex to study direct SARS-CoV-2 infection. Our findings support a model where SARS-CoV-2 infection of astrocytes produces a panoply of changes in the expression of genes regulating innate immune signaling and inflammatory responses. The deregulation of these genes in astrocytes produces a microenvironment within the CNS that ultimately disrupts normal neuron function, promoting neuronal cell death and CNS deficits.
Mammalian genomes have multiple enhancers spanning an ultralong distance (>megabases) to modulate important genes, but it is unclear how these enhancers coordinate to achieve this task. We combine multiplexed CRISPRi screening with machine learning to define quantitative enhancer-enhancer interactions. We find that the ultralong distance enhancer network has a nested multilayer architecture that confers functional robustness of gene expression. Experimental characterization reveals that enhancer epistasis is maintained by three-dimensional chromosomal interactions and BRD4 condensation. Machine learning prediction of synergistic enhancers provides an effective strategy to identify noncoding variant pairs associated with pathogenic genes in diseases beyond genome-wide association studies analysis. Our work unveils nested epistasis enhancer networks, which can better explain enhancer functions within cells and in diseases.
RNA interference, which involves the delivery of small interfering RNA (siRNA), has been used to validate target genes, to understand and control cellular metabolic pathways, and as a “green” alternative to confer pest tolerance in crops. Conventional siRNA delivery methods such as viruses and Agrobacterium-mediated delivery exhibit plant species range limitations and uncontrolled DNA integration into the plant genome. Here, we synthesize polyethylenimine-functionalized gold nanoclusters (PEI-AuNCs) to mediate siRNA delivery into intact plants and show that these nanoclusters enable efficient gene knockdown. We further demonstrate that PEI-AuNCs protect siRNA from RNase degradation while the complex is small enough to bypass the plant cell wall. Consequently, AuNCs enable gene knockdown with efficiencies of up to 76.5 ± 5.9% and 76.1 ± 9.5% for GFP and ROQ1, respectively, with no observable toxicity. Our data suggest that AuNCs can deliver siRNA into intact plant cells for broad applications in plant biotechnology.
OBJECTIVES: The Stopping Cavities Trial investigated effectiveness and safety of 38% silver diamine fluoride in arresting caries lesions. MATERIALS AND METHODS: The study was a double-blind randomized placebo-controlled superiority trial with 2 parallel groups. The sites were Oregon preschools. Sixty-six preschool children with ≥1 lesion were enrolled. Silver diamine fluoride (38%) or placebo (blue-tinted water) was applied topically to the lesion. The primary endpoint was caries arrest (lesion inactivity, Nyvad criteria) 14-21 days post intervention. Dental plaque was collected from all children, and microbial composition was assessed by RNA sequencing from 2 lesions and 1 unaffected surface before treatment and at follow-up for 3 children from each group. RESULTS AND CONCLUSION: Average proportion of arrested caries lesions in the silver diamine fluoride group was higher (0.72; 95% CI: 0.55, 0.84) than in the placebo group (0.05; 95% CI: 0.00, 0.16). Confirmatory analysis using generalized estimating equation log-linear regression, based on the number of arrested lesions and accounting for the number of treated surfaces and length of follow-up, indicates the risk of arrested caries was significantly higher in the treatment group (relative risk, 17.3; 95% CI: 4.3 to 69.4). No harms were observed. RNA sequencing analysis identified no consistent changes in relative abundance of caries-associated microbes, nor emergence of antibiotic or metal resistance gene expression. Topical 38% silver diamine fluoride is effective and safe in arresting cavities in preschool children. CLINICAL SIGNIFICANCE: The treatment is applicable to primary care practice and may reduce the burden of untreated tooth decay in the population.
Single-cell ATAC sequencing (scATAC-seq) is a powerful and increasingly popular technique to explore the regulatory landscape of heterogeneous cellular populations. However, the high noise levels, degree of sparsity, and scale of the generated data make its analysis challenging. Here, we present PeakVI, a probabilistic framework that leverages deep neural networks to analyze scATAC-seq data. PeakVI fits an informative latent space that preserves biological heterogeneity while correcting batch effects and accounting for technical effects, such as library size and region-specific biases. In addition, PeakVI provides a technique for identifying differential accessibility at a single-region resolution, which can be used for cell-type annotation as well as identification of key cis-regulatory elements. We use public datasets to demonstrate that PeakVI is scalable, stable, robust to low-quality data, and outperforms current analysis methods on a range of critical analysis tasks. PeakVI is publicly available and implemented in the scvi-tools framework.
Biomedical repositories such as the UK Biobank provide increasing access to prospectively collected cardiac imaging; however, these data are unlabeled, which creates barriers to their use in supervised machine learning. We develop a weakly supervised deep learning model for classification of aortic valve malformations using up to 4,000 unlabeled cardiac MRI sequences. Instead of requiring highly curated training data, weak supervision relies on noisy heuristics defined by domain experts to programmatically generate large-scale, imperfect training labels. For aortic valve classification, models trained with imperfect labels substantially outperform a supervised model trained on hand-labeled MRIs. In an orthogonal validation experiment using health outcomes data, our model identifies individuals with a 1.8-fold increase in risk of a major adverse cardiac event. This work formalizes a deep learning baseline for aortic valve classification and outlines a general strategy for using weak supervision to train machine learning models using unlabeled medical images at scale.
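The weak-supervision idea in the abstract above (expert heuristics that programmatically emit noisy labels) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the feature names `circularity` and `eccentricity`, the thresholds, and the majority-vote combiner are hypothetical and are not taken from the paper, which does not describe its heuristics or label model in this abstract.

```python
# Minimal weak-supervision sketch: hypothetical heuristics vote on a label.
# Feature names and thresholds are illustrative assumptions, not the authors' method.
ABSTAIN, NORMAL, MALFORMED = -1, 0, 1


def lf_low_circularity(seq):
    """Heuristic: a clearly non-circular valve opening suggests malformation."""
    return MALFORMED if seq["circularity"] < 0.6 else ABSTAIN


def lf_high_circularity(seq):
    """Heuristic: a nearly circular valve opening suggests a normal valve."""
    return NORMAL if seq["circularity"] > 0.85 else ABSTAIN


def lf_high_eccentricity(seq):
    """Heuristic: a strongly elongated opening suggests malformation."""
    return MALFORMED if seq["eccentricity"] > 0.5 else ABSTAIN


LABELING_FUNCTIONS = [lf_low_circularity, lf_high_circularity, lf_high_eccentricity]


def weak_label(seq):
    """Combine heuristic votes by simple majority; abstain on ties or no votes."""
    votes = [lf(seq) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    malformed = sum(v == MALFORMED for v in votes)
    normal = len(votes) - malformed
    if malformed == normal:
        return ABSTAIN
    return MALFORMED if malformed > normal else NORMAL


if __name__ == "__main__":
    example = {"circularity": 0.5, "eccentricity": 0.7}
    print(weak_label(example))
```

Examples on which the heuristics abstain or disagree evenly receive no label and would simply be left out of training; a learned label model that weights heuristics by estimated accuracy is a common alternative to the plain vote used here.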
Contact guidance is a powerful topographical cue that induces persistent directional cell migration. Healthy tissue stroma is characterized by a meshwork of wavy extracellular matrix (ECM) fiber bundles, whereas metastasis-prone stroma exhibits less wavy, more linear fibers. The latter topography correlates with poor prognosis, whereas more wavy bundles correlate with benign tumors. We designed nanotopographic ECM-coated substrates that mimic collagen fibril waveforms seen in tumors and healthy tissues to determine how these nanotopographies may regulate cancer cell polarization and migration machineries. Cell polarization and directional migration were inhibited by fibril-like wave substrates above a threshold amplitude. Although polarity signals and actin nucleation factors were required for polarization and migration on low-amplitude wave substrates, they did not localize to cell leading edges. Instead, these factors localized to wave peaks, creating multiple "cryptic leading edges" within cells. On high-amplitude wave substrates, retrograde flow from large cryptic leading edges depolarized stress fibers and focal adhesions and inhibited cell migration. On low-amplitude wave substrates, actomyosin contractility overrode the small cryptic leading edges and drove stress fiber and focal adhesion orientation along the wave axis to mediate directional migration. Cancer cells of different intrinsic contractility depolarized at different wave amplitudes, and cell polarization response to wavy substrates could be tuned by manipulating contractility. We propose that ECM fibril waveforms with sufficiently high amplitude around tumors may serve as "cell polarization barriers," decreasing directional migration of tumor cells, which could be overcome by up-regulation of tumor cell contractility.
The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can also predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. Advances in multi-modal and multi-scale learning with Evo provide a promising path toward improving our understanding and control of biology across multiple levels of complexity.
Biology has become a data-intensive science. Recent technological advances in single-cell genomics have enabled the measurement of multiple facets of cellular state, producing datasets with millions of single-cell observations. While these data hold great promise for understanding molecular mechanisms in health and disease, analysis challenges arising from sparsity, technical and biological variability, and high dimensionality of the data hinder the derivation of such mechanistic insights. To promote the innovation of algorithms for analysis of multimodal single-cell data, we organized a competition at NeurIPS 2021 applying the Common Task Framework to multimodal single-cell data integration. For this competition we generated the first multimodal benchmarking dataset for single-cell biology and defined three tasks in this domain: prediction of missing modalities, aligning modalities, and learning a joint representation across modalities. We further specified evaluation metrics and developed a cloud-based algorithm evaluation pipeline. Using this setup, 280 competitors submitted over 2600 proposed solutions within a 3-month period, showcasing substantial innovation especially in the modality alignment task. Here, we present the results, describe trends of well-performing approaches, and discuss challenges associated with running the competition.
In the past five years, droplet microfluidic techniques have unlocked new opportunities for the high-throughput genome-wide analysis of single cells, transforming our understanding of cellular diversity and function. However, the field lacks an accessible method to screen and sort droplets based on cellular phenotype upstream of genetic analysis, particularly for large and complex cells. To meet this need, we developed Dropception, a robust, easy-to-use workflow for precise single-cell encapsulation into picoliter-scale double emulsion droplets compatible with high-throughput screening via fluorescence-activated cell sorting (FACS). We demonstrate the capabilities of this method by encapsulating five standardized mammalian cell lines of varying sizes and morphologies as well as a heterogeneous cell mixture of a whole dissociated flatworm (5–25 μm in diameter) within highly monodisperse double emulsions (35 μm in diameter). We optimize for preferential encapsulation of single cells with extremely low multiple-cell loading events (<2% of cell-containing droplets), thereby allowing direct linkage of cellular phenotype to genotype. Across all cell lines, cell loading efficiency approaches the theoretical limit with no observable bias by cell size. FACS measurements reveal the ability to discriminate empty droplets from those containing cells with good agreement to single-cell occupancies quantified via microscopy, establishing robust droplet screening at single-cell resolution. High-throughput FACS screening of cellular picoreactors has the potential to shift the landscape of single-cell droplet microfluidics by expanding the repertoire of current nucleic acid droplet assays to include functional phenotyping.
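The abstract above refers to a theoretical limit on cell loading without naming it. A common model for random encapsulation in droplets is Poisson loading, and the sketch below assumes that model purely for illustration; the function names and the example mean occupancy `lam = 0.04` are hypothetical and are not values reported by the authors.

```python
import math


def poisson_pmf(k, lam):
    """Probability that a droplet holds exactly k cells under Poisson loading."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def loading_stats(lam):
    """Summarize droplet occupancy for a mean of `lam` cells per droplet.

    Assumes cells arrive independently (Poisson statistics); this is an
    illustrative model, not a description of the Dropception analysis.
    """
    p_empty = poisson_pmf(0, lam)
    p_single = poisson_pmf(1, lam)
    p_occupied = 1.0 - p_empty
    p_multi = p_occupied - p_single
    return {
        "occupied": p_occupied,                        # droplets with >= 1 cell
        "single": p_single,                            # droplets with exactly 1 cell
        "multi_among_occupied": p_multi / p_occupied,  # share of occupied droplets with >= 2 cells
    }


if __name__ == "__main__":
    for name, value in loading_stats(0.04).items():
        print(f"{name}: {value:.4f}")
```

Under this model the share of multiply loaded droplets among occupied ones is roughly `lam / 2` for small `lam`, which is why keeping multiplets rare comes at the cost of leaving most droplets empty.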
Jointly profiling the transcriptional and chromatin accessibility landscapes of single cells is a powerful technique to characterize cellular populations. Here we present MultiVI, a probabilistic model to analyze such multiomic data and integrate it with single modality datasets. MultiVI creates a joint representation that accurately reflects both chromatin and transcriptional properties of the cells even when one modality is missing. It also imputes missing data, corrects for batch effects, and is available in the scvi-tools framework: https://docs.scvi-tools.org/.
Henipaviruses include highly pathogenic emerging zoonotic viruses, derived from bat, rodent, and shrew reservoirs. Bat-borne Hendra (HeV) and Nipah (NiV) are the most well-known henipaviruses, for which no effective antivirals or vaccines for humans have been described. Here, we report the discovery and characterization of a novel henipavirus, Angavokely virus (AngV), isolated from wild fruit bats in Madagascar. Genomic characterization of AngV reveals all major features associated with pathogenicity in other henipaviruses, suggesting that AngV could be pathogenic following spillover to human hosts. Our work suggests that AngV is an ancestral bat henipavirus that likely uses viral entry pathways distinct from those previously described for HeV and NiV. In Madagascar, bats are consumed as a source of human food, presenting opportunities for cross-species transmission. Characterization of novel henipaviruses and documentation of their pathogenic and zoonotic potential are essential to predicting and preventing the emergence of future zoonoses that cause pandemics.
Skeletal stem and progenitor cell populations are crucial for bone physiology. Characterization of these cell types remains restricted to heterogeneous bulk populations with limited information on whether they are unique or overlap with previously characterized cell types. Here we show, through comprehensive functional and single-cell transcriptomic analyses, that postnatal long bones of mice contain at least two types of bone progenitors with bona fide skeletal stem cell (SSC) characteristics. An early osteochondral SSC (ocSSC) facilitates long bone growth and repair, while a second type, a perivascular SSC (pvSSC), co-emerges with long bone marrow and contributes to shaping the hematopoietic stem cell niche and regenerative demand. We establish that pvSSCs, but not ocSSCs, are the origin of bone marrow adipose tissue. Lastly, we also provide insight into residual SSC heterogeneity as well as potential crosstalk between the two spatially distinct cell populations. These findings comprehensively address previously unappreciated shortcomings of SSC research.
Cassava (Manihot esculenta) is a starchy root crop that supports over a billion people in tropical and subtropical regions of the world. This staple, however, produces the neurotoxin cyanide and requires processing for safe consumption. Excessive consumption of insufficiently processed cassava, in combination with protein-poor diets, can have neurodegenerative impacts. This problem is further exacerbated by drought conditions which increase this toxin in the plant. To reduce cyanide levels in cassava, we used CRISPR-mediated mutagenesis to disrupt the cytochrome P450 genes CYP79D1 and CYP79D2 whose protein products catalyze the first step in cyanogenic glucoside biosynthesis. Knockout of both genes eliminated cyanide in leaves and storage roots of cassava accession 60444; the West African, farmer-preferred cultivar TME 419; and the improved variety TMS 91/02324. Although knockout of CYP79D2 alone resulted in significant reduction of cyanide, mutagenesis of CYP79D1 did not, indicating these paralogs have diverged in their function. The congruence of results across accessions indicates that our approach could readily be extended to other preferred or improved cultivars. This work demonstrates cassava genome editing for enhanced food safety and reduced processing burden, against the backdrop of a changing climate.
The origin of the pentaradial body plan of echinoderms from a bilateral ancestor is one of the most enduring zoological puzzles [1,2]. Because echinoderms are defined by morphological novelty, even the most basic axial comparisons with their bilaterian relatives are problematic. To revisit this classical question, we used conserved anteroposterior axial molecular markers to determine whether the highly derived adult body plan of echinoderms masks underlying patterning similarities with other deuterostomes. We investigated the expression of a suite of conserved transcription factors with well-established roles in the establishment of anteroposterior polarity in deuterostomes [3–5] and other bilaterians [6–8] using RNA tomography and in situ hybridization in the sea star Patiria miniata. The relative spatial expression of these markers in P. miniata ambulacral ectoderm shows similarity with other deuterostomes, with the midline of each ray representing the most anterior territory and the most lateral parts exhibiting a more posterior identity. Strikingly, there is no ectodermal territory in the sea star that expresses the characteristic bilaterian trunk genetic patterning programme. This finding suggests that from the perspective of ectoderm patterning, echinoderms are mostly head-like animals and provides a developmental rationale for the re-evaluation of the events that led to the evolution of the derived adult body plan of echinoderms. RNA tomography and in situ hybridization in echinoderms suggest a new ambulacral-anterior model to relate echinoderm pentaradial symmetry to the ancestral bilateral symmetry.