NobleBlocks

The JAX Cancer Center

Hospital / health system · Bar Harbor, United States

Research output, citation impact, and the most-cited recent papers from The JAX Cancer Center. Aggregated across the NobleBlocks index of 300M+ scholarly works.

Total works: 54
Citations: 2.8K
h-index: 26
i10-index: 34
Also known as: The JAX Cancer Center; The Jackson Laboratory Cancer Center
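
The page does not say how its h-index and i10-index are computed. The sketch below assumes the standard definitions: the h-index is the largest h such that h papers have at least h citations each, and the i10-index is the number of papers with at least 10 citations. The input is the citation counts of the 20 papers listed on this page only, so the result is not expected to reproduce the institution-wide values of 26 and 34, which cover all 54 works. The function names are illustrative.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for count in citations if count >= 10)


# Citation counts of the 20 papers listed on this page (a subset of all 54 works).
listed_counts = [221, 133, 121, 88, 68, 49, 48, 43, 23, 17,
                 15, 13, 11, 8, 7, 7, 6, 6, 6, 5]

print(h_index(listed_counts))    # 12 for this subset
print(i10_index(listed_counts))  # 13 for this subset
```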

Top-cited papers from The JAX Cancer Center

Human iPSC-derived microglia assume a primary microglia-like state after transplantation into the neonatal mouse brain
Devon S. Svoboda, M. Inmaculada Barrasa, Jian Shu, Rosalie Rietjens +4 more
2019 · Proceedings of the National Academy of Sciences · 221 citations · doi:10.1073/pnas.1913541116

Microglia are essential for maintenance of normal brain function, with dysregulation contributing to numerous neurological diseases. Protocols have been developed to derive microglia-like cells from human induced pluripotent stem cells (hiPSCs). However, primary microglia display major differences in morphology and gene expression when grown in culture, including down-regulation of signature microglial genes. Thus, in vitro differentiated microglia may not accurately represent resting primary microglia. To address this issue, we transplanted microglial precursors derived in vitro from hiPSCs into neonatal mouse brains and found that the cells acquired characteristic microglial morphology and gene expression signatures that closely resembled primary human microglia. Single-cell RNA-sequencing analysis of transplanted microglia showed similar cellular heterogeneity as primary human cells. Thus, hiPSCs-derived microglia transplanted into the neonatal mouse brain assume a phenotype and gene expression signature resembling that of resting microglia residing in the human brain, making chimeras a superior tool to study microglia in human disease.

CUP-AI-Dx: A tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence
Yue Zhao, Ziwei Pan, Sandeep Namburi, Andrew Pattison +4 more
2020 · EBioMedicine · 133 citations · doi:10.1016/j.ebiom.2020.103030

BACKGROUND: Cancer of unknown primary (CUP), representing approximately 3-5% of all malignancies, is defined as metastatic cancer where a primary site of origin cannot be found despite a standard diagnostic workup. Because knowledge of a patient's primary cancer remains fundamental to their treatment, CUP patients are significantly disadvantaged and most have a poor survival outcome. Developing robust and accessible diagnostic methods for resolving cancer tissue of origin, therefore, has significant value for CUP patients.
METHODS: We developed an RNA-based classifier called CUP-AI-Dx that utilizes a 1D Inception convolutional neural network (1D-Inception) model to infer a tumor's primary tissue of origin. CUP-AI-Dx was trained using the transcriptional profiles of 18,217 primary tumours representing 32 cancer types from The Cancer Genome Atlas project (TCGA) and International Cancer Genome Consortium (ICGC). Gene expression data was ordered by gene chromosomal coordinates as input to the 1D-CNN model, and the model utilizes multiple convolutional kernels with different configurations simultaneously to improve generality. The model was optimized through extensive hyperparameter tuning, including different max-pooling layers and dropout settings. For 11 tumour types, we also developed a random forest model that can classify the tumour's molecular subtype according to prior TCGA studies. The optimised CUP-AI-Dx tissue of origin classifier was tested on 394 metastatic samples from 11 tumour types from TCGA and 92 formalin-fixed paraffin-embedded (FFPE) samples representing 18 cancer types from two clinical laboratories. The CUP-AI-Dx molecular subtype classifier was also tested on independent ovarian and breast cancer microarray datasets.
FINDINGS: CUP-AI-Dx identifies the primary site with an overall top-1-accuracy of 98.54% in cross-validation and 96.70% on a test dataset. When applied to two independent clinical-grade RNA-seq datasets generated from two different institutes from the US and Australia, our model predicted the primary site with a top-1-accuracy of 86.96% and 72.46% respectively.
INTERPRETATION: The CUP-AI-Dx predicts tumour primary site and molecular subtype with high accuracy and therefore can be used to assist the diagnostic work-up of cancers of unknown primary or uncertain origin using a common and accessible genomics platform.
FUNDING: NIH R35 GM133562, NCI P30 CA034196, Victorian Cancer Agency Australia.
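
The abstract states that gene expression values were ordered by chromosomal coordinates before being fed to the 1D CNN. The following is a minimal sketch of that ordering step only, not the authors' code. The gene names, coordinates, and expression values are made-up placeholders, and the rule for ranking non-numeric chromosomes (X, Y, M after the autosomes) is an assumption.

```python
# Assumed ranking for non-numeric chromosomes; autosomes sort by their number.
SPECIAL_CHROMS = {"X": 23, "Y": 24, "M": 25, "MT": 25}


def chrom_rank(chrom):
    """Map a chromosome name such as 'chr2' or 'X' to a sortable integer."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return int(name)
    return SPECIAL_CHROMS.get(name, 99)  # unknown contigs go last


def order_by_coordinates(expression, coords):
    """Return gene names and expression values sorted by (chromosome, start)."""
    genes = sorted(
        expression,
        key=lambda g: (chrom_rank(coords[g][0]), coords[g][1]),
    )
    return genes, [expression[g] for g in genes]


# Toy placeholder data: gene -> (chromosome, start position) and gene -> expression.
coords = {
    "GENE_A": ("chr2", 5000),
    "GENE_B": ("chr1", 900),
    "GENE_C": ("chrX", 100),
    "GENE_D": ("chr1", 150),
}
expression = {"GENE_A": 3.2, "GENE_B": 0.0, "GENE_C": 7.5, "GENE_D": 1.1}

ordered_genes, vector = order_by_coordinates(expression, coords)
print(ordered_genes)  # ['GENE_D', 'GENE_B', 'GENE_A', 'GENE_C']
print(vector)         # [1.1, 0.0, 3.2, 7.5]
```

Ordering the input this way places genes that are neighbours on the genome next to each other in the vector, so a 1D convolution sees locally adjacent genes within each kernel window.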

CRISPR artificial splicing factors
Menghan Du, Nathaniel Jillette, Jacqueline Jufen Zhu, Sheng Li +1 more
2020 · Nature Communications · 121 citations · doi:10.1038/s41467-020-16806-4

Alternative splicing allows expression of mRNA isoforms from a single gene, expanding the diversity of the proteome. Its prevalence in normal biological and disease processes warrants precise tools for modulation. Here we report the engineering of CRISPR Artificial Splicing Factors (CASFx) based on RNA-targeting CRISPR-Cas systems. We show that simultaneous exon inclusion and exclusion can be induced at distinct targets by differential positioning of CASFx. We also create inducible CASFx (iCASFx) using the FKBP-FRB chemical-inducible dimerization domain, allowing small molecule control of alternative splicing. Finally, we demonstrate the activation of SMN2 exon 7 splicing in spinal muscular atrophy (SMA) patient fibroblasts, suggesting a potential application of the CASFx system.

Toward a comprehensive view of cancer immune responsiveness: a synopsis from the SITC workshop
Society for Immunotherapy of Cancer (SITC) Cancer Immune Responsiveness Task Force and Working Groups, Davide Bedognetti, Michele Ceccarelli, Lorenzo Galluzzi +4 more
2019 · Journal for ImmunoTherapy of Cancer · 88 citations · doi:10.1186/s40425-019-0602-4

Tumor immunology has changed the landscape of cancer treatment. Yet, not all patients benefit as cancer immune responsiveness (CIR) remains a limitation in a considerable proportion of cases. The multifactorial determinants of CIR include the genetic makeup of the patient, the genomic instability central to cancer development, the evolutionary emergence of cancer phenotypes under the influence of immune editing, and external modifiers such as demographics, environment, treatment potency, co-morbidities and cancer-independent alterations including immune homeostasis and polymorphisms in the major and minor histocompatibility molecules, cytokines, and chemokines. Based on the premise that cancer is fundamentally a disorder of the genes arising within a cell biologic process, whose deviations from normality determine the rules of engagement with the host's response, the Society for Immunotherapy of Cancer (SITC) convened a task force of experts from various disciplines including immunology, oncology, biophysics, structural biology, molecular and cellular biology, genetics, and bioinformatics to address the complexity of CIR from a holistic view. The task force was launched by a workshop held in San Francisco on May 14-15, 2018 aimed at two preeminent goals: 1) to identify the fundamental questions related to CIR and 2) to create an interactive community of experts that could guide scientific and research priorities by forming a logical progression supported by multiple perspectives to uncover mechanisms of CIR. This workshop was a first step toward a second meeting where the focus would be to address the actionability of some of the questions identified by working groups.
In this event, five working groups aimed at defining a path to test hypotheses according to their relevance to human cancer and identifying experimental models closest to human biology, which include: 1) Germline-Genetic, 2) Somatic-Genetic and 3) Genomic-Transcriptional contributions to CIR, 4) Determinant(s) of Immunogenic Cell Death that modulate CIR, and 5) Experimental Models that best represent CIR and its conversion to an immune responsive state. This manuscript summarizes the contributions from each group and should be considered as a first milestone in the path toward a more contemporary understanding of CIR. We appreciate that this effort is far from comprehensive and that other relevant aspects related to CIR such as the microbiome, the individual's recombined T cell and B cell receptors, and the metabolic status of cancer and immune cells were not fully included. These and other important factors will be included in future activities of the taskforce. The taskforce will focus on prioritization and specific actionable approach to answer the identified questions and implementing the collaborations in the follow-up workshop, which will be held in Houston on September 4-5, 2019.

Graph embedding and unsupervised learning predict genomic sub-compartments from HiC chromatin interaction data
Haitham Ashoor, Xiaowen Chen, Wojciech Rosikiewicz, Jiahui Wang +4 more
2020 · Nature Communications · 68 citations · doi:10.1038/s41467-020-14974-x

Chromatin interaction studies can reveal how the genome is organized into spatially confined sub-compartments in the nucleus. However, accurately identifying sub-compartments from chromatin interaction data remains a challenge in computational biology. Here, we present Sub-Compartment Identifier (SCI), an algorithm that uses graph embedding followed by unsupervised learning to predict sub-compartments using Hi-C chromatin interaction data. We find that the network topological centrality and clustering performance of SCI sub-compartment predictions are superior to those of hidden Markov model (HMM) sub-compartment predictions. Moreover, using orthogonal Chromatin Interaction Analysis by in-situ Paired-End Tag Sequencing (ChIA-PET) data, we confirmed that SCI sub-compartment prediction outperforms HMM. We show that SCI-predicted sub-compartments have distinct epigenetic marks, transcriptional activities, and transcription factor enrichment. Moreover, we present a deep neural network to predict sub-compartments using epigenome, replication timing, and sequence data. Our neural network predicts more accurate sub-compartment predictions when SCI-determined sub-compartments are used as labels for training.

ChIA-PIPE: A fully automated pipeline for comprehensive ChIA-PET data analysis and visualization
Byoungkoo Lee, Jiahui Wang, Liuyang Cai, Minji Kim +4 more
2020 · Science Advances · 49 citations · doi:10.1126/sciadv.aay2078

ChIA-PET (chromatin interaction analysis with paired-end tags) enables genome-wide discovery of chromatin interactions involving specific protein factors, with base pair resolution. Interpretation of ChIA-PET data requires a robust analytic pipeline. Here, we introduce ChIA-PIPE, a fully automated pipeline for ChIA-PET data processing, quality assessment, visualization, and analysis. ChIA-PIPE performs linker filtering, read mapping, peak calling, and loop calling and automates quality control assessment for each dataset. To enable visualization, ChIA-PIPE generates input files for two-dimensional contact map viewing with Juicebox and HiGlass and provides a new dockerized visualization tool for high-resolution, browser-based exploration of peaks and loops. To enable structural interpretation, ChIA-PIPE calls chromatin contact domains, resolves allele-specific peaks and loops, and annotates enhancer-promoter loops. ChIA-PIPE also supports the analysis of other related chromatin-mapping data types.

TET2 deficiency reprograms the germinal center B cell epigenome and silences genes linked to lymphomagenesis
Wojciech Rosikiewicz, Xiaowen Chen, Pilar M. Domínguez, Hussein Ghamlouch +4 more
2020 · Science Advances · 48 citations · doi:10.1126/sciadv.aay5872

The TET2 DNA hydroxymethyltransferase is frequently disrupted by somatic mutations in diffuse large B cell lymphomas (DLBCLs), a tumor that originates from germinal center (GC) B cells. Here, we show that TET2 deficiency leads to DNA hypermethylation of regulatory elements in GC B cells, associated with silencing of the respective genes. This hypermethylation affects the binding of transcription factors including those involved in exit from the GC reaction and involves pathways such as B cell receptor, antigen presentation, CD40, and others. Normal GC B cells manifest a typical hypomethylation signature, which is caused by AID, the enzyme that mediates somatic hypermutation. However, AID-induced demethylation is markedly impaired in TET2-deficient GC B cells, suggesting that AID epigenetic effects are partially dependent on TET2. Last, we find that TET2 mutant DLBCLs also manifest the aberrant TET2-deficient GC DNA methylation signature, suggesting that this epigenetic pattern is maintained during and contributes to lymphomagenesis.

C11orf95-RELA reprograms 3D epigenome in supratentorial ependymoma
Jacqueline Jufen Zhu, Nathaniel Jillette, Xiao‐Nan Li, Albert W. Cheng +1 more
2020 · Acta Neuropathologica · 43 citations · doi:10.1007/s00401-020-02225-8

Supratentorial ependymoma (ST-EPN) is a type of malignant brain tumor mainly seen in children. Since 2014, it has been known that an intrachromosomal fusion C11orf95-RELA is an oncogenic driver in ST-EPN [Parker et al. Nature 506:451-455 (2014); Pietsch et al. Acta Neuropathol 127:609-611 (2014)] but the molecular mechanisms of oncogenesis are unclear. Here we show that the C11orf95 component of the fusion protein dictates DNA binding activity while the RELA component is required for driving the expression of ependymoma-associated genes. Epigenomic characterizations using ChIP-seq and HiChIP approaches reveal that C11orf95-RELA modulates chromatin states and mediates chromatin interactions, leading to transcriptional reprogramming in ependymoma cells. Our findings provide important characterization of the molecular underpinning of C11orf95-RELA fusion and shed light on potential therapeutic targets for C11orf95-RELA subtype ependymoma.

DNA methylation calling tools for Oxford Nanopore sequencing: a survey and human epigenome-wide evaluation
Yang Liu, Wojciech Rosikiewicz, Ziwei Pan, Nathaniel Jillette +4 more
2021 · bioRxiv (Cold Spring Harbor Laboratory) · 23 citations · doi:10.1101/2021.05.05.442849

Background: Nanopore long-read sequencing technology greatly expands the capacity of long-range single-molecule DNA-modification detection. A growing number of analytical tools have been actively developed to detect DNA methylation from Nanopore sequencing reads. Here, we examine the performance of different methylation calling tools to provide a systematic evaluation to guide practitioners for human epigenome-wide research.
Results: We compare five analytic frameworks for detecting DNA modification from Nanopore long-read sequencing data. We evaluate the association between genomic context, CpG methylation-detection accuracy, CpG sites coverage, and running time using Nanopore sequencing data from natural human DNA. Furthermore, we provide an online DNA methylation database (https://nanome.jax.org) with which to display genomic regions that exhibit differences in DNA-modification detection power among different methylation calling algorithms for nanopore sequencing data.
Conclusions: Our study is the first benchmark of computational methods for mammalian whole genome DNA-modification detection in Nanopore sequencing. We provide a broad foundation for cross-platform standardization, and an evaluation of analytical tools designed for genome-scale modified-base detection using Nanopore sequencing.

EphB4/EphrinB2 therapeutics in Rhabdomyosarcoma
Matthew E. Randolph, Megan M. Cleary, Zia Bajwa, Matthew N. Svalina +4 more
2017 · PLoS ONE · 17 citations · doi:10.1371/journal.pone.0183161

Rhabdomyosarcoma (RMS) is the most common soft tissue sarcoma affecting children and is often diagnosed with concurrent metastases. Unfortunately, few effective therapies have been discovered that improve the long-term survival rate for children with metastatic disease. Here we determined effectiveness of targeting the receptor tyrosine kinase, EphB4, in both alveolar and embryonal RMS either directly through the inhibitory antibody, VasG3, or indirectly by blocking both forward and reverse signaling of EphB4 binding to EphrinB2, cognate ligand of EphB4. Clinically, EphB4 expression in eRMS was correlated with longer survival. Experimentally, inhibition of EphB4 with VasG3 in both aRMS and eRMS orthotopic xenograft and allograft models failed to alter tumor progression. Inhibition of EphB4 forward signaling using soluble EphB4 protein fused with murine serum albumin failed to affect eRMS model tumor progression, but did moderately slow progression in murine aRMS. We conclude that inhibition of EphB4 signaling with these agents is not a viable monotherapy for rhabdomyosarcoma.

Nanopore detection of bacterial DNA base modifications
Alexa B. R. McIntyre, Noah Alexander, Aaron S. Burton, Sarah L. Castro-Wallace +4 more
2017 · bioRxiv (Cold Spring Harbor Laboratory) · 15 citations · doi:10.1101/127100

The common bacterial base modification N6-methyladenine (m6A) is involved in many pathways related to an organism’s ability to survive and interact with its environment. Recent research has shown that nanopore sequencing can detect m5C with per-read accuracy of upwards of 80% but m6A with significantly lower accuracy. Here we use a binary classifier to improve m6A classification by marking adenines as methylated or unmethylated based on differences between measured and expected current values as each adenine travels through the nanopore. We also illustrate the importance of read quality for base modification detection and compare to PacBio methylation calls. With recent demonstrations of nanopore sequencing in Antarctica and onboard the International Space Station, the ability to reliably characterize m6A presents an opportunity to further examine the role of methylation in bacterial adaptation to extreme or very remote environments.

An Artifact in Intracellular Cytokine Staining for Studying T Cell Responses and Its Alleviation
Zheng Gong, Qing Li, Jiayuan Shi, Guangwen Ren
2022 · Frontiers in Immunology · 13 citations · doi:10.3389/fimmu.2022.759188

Intracellular cytokine staining (ICS) is a widely employed ex vivo method for quantitative determination of the activation status of immune cells, most often applied to T cells. ICS test samples are commonly prepared from animal or human tissues as unpurified cell mixtures, and cell-specific cytokine signals are subsequently discriminated by gating strategies using flow cytometry. Here, we show that when ICS samples contain Ly6G+ neutrophils, neutrophils are ex vivo activated by an ICS reagent – phorbol myristate acetate (PMA) – which leads to hydrogen peroxide (H2O2) release and death of cytokine-expressing T cells. This artifact is likely to result in overinterpretation of the degree of T cell suppression, misleading immunological research related to cancer, infection, and inflammation. We accordingly devised easily implementable improvements to the ICS method and propose alternative methods for assessing or confirming cellular cytokine expression.

epihet for intra-tumoral epigenetic heterogeneity analysis and visualization
Xiaowen Chen, Haitham Ashoor, Ryan J. Musich, Jiahui Wang +4 more
2021 · Scientific Reports · 11 citations · doi:10.1038/s41598-020-79627-x

Intra-tumoral epigenetic heterogeneity is an indicator of tumor population fitness and is linked to the deregulation of transcription. However, there is no published computational tool to automate the measurement of intra-tumoral epigenetic allelic heterogeneity. We developed an R/Bioconductor package, epihet, to calculate the intra-tumoral epigenetic heterogeneity and to perform differential epigenetic heterogeneity analysis. Furthermore, epihet can implement a biological network analysis workflow for transforming cancer-specific differential epigenetic heterogeneity loci into cancer-related biological function and clinical biomarkers. Finally, we demonstrated epihet utility on acute myeloid leukemia. We found statistically significant differential epigenetic heterogeneity (DEH) loci compared to normal controls and constructed co-epigenetic heterogeneity network and modules. epihet is available at https://bioconductor.org/packages/release/bioc/html/epihet.html .

Simultaneous multifunctional transcriptome engineering by CRISPR RNA scaffold
Zukai Liu, Nathaniel Jillette, Paul Robson, Albert W. Cheng
2023 · Nucleic Acids Research · 8 citations · doi:10.1093/nar/gkad547

RNA processing and metabolism are subjected to precise regulation in the cell to ensure integrity and functions of RNA. Though targeted RNA engineering has become feasible with the discovery and engineering of the CRISPR-Cas13 system, simultaneous modulation of different RNA processing steps remains unavailable. In addition, off-target events resulting from effectors fused with dCas13 limit its application. Here we developed a novel platform, Combinatorial RNA Engineering via Scaffold Tagged gRNA (CREST), which can simultaneously execute multiple RNA modulation functions on different RNA targets. In CREST, RNA scaffolds are appended to the 3' end of Cas13 gRNA and their cognate RNA binding proteins are fused with enzymatic domains for manipulation. Taking RNA alternative splicing, A-to-G and C-to-U base editing as examples, we developed bifunctional and tri-functional CREST systems for simultaneous RNA manipulation. Furthermore, by fusing two split fragments of the deaminase domain of ADAR2 to dCas13 and/or PUFc respectively, we reconstituted its enzyme activity at target sites. This split design can reduce nearly 99% of off-target events otherwise induced by a full-length effector. The flexibility of the CREST framework will enrich the transcriptome engineering toolbox for the study of RNA biology.

CRISPR-mediated Multiplexed Live Cell Imaging of Nonrepetitive Genomic Loci
Patricia A. Clow, Menghan Du, Nathaniel Jillette, Aziz Taghbalout +2 more
2020 · bioRxiv (Cold Spring Harbor Laboratory) · 7 citations · doi:10.1101/2020.03.03.974923

Three-dimensional (3D) structures of the genome are dynamic, heterogeneous and functionally important. Live cell imaging has become the leading method for chromatin dynamics tracking. However, existing CRISPR- and TALE-based genomic labeling techniques have been hampered by laborious protocols and are ineffective in labeling non-repetitive sequences. Here, we report a versatile CRISPR/Casilio-based imaging method that allows for a nonrepetitive genomic locus to be labeled using one guide RNA. We construct Casilio dual-color probes to visualize the dynamic interactions of DNA elements in single live cells in the presence or absence of the cohesin subunit RAD21. Using a three-color palette, we track the dynamic 3D locations of multiple reference points along a chromatin loop. Casilio imaging reveals intercellular heterogeneity and interallelic asynchrony in chromatin interaction dynamics, underscoring the importance of studying genome structures in 4D.

Integration of EpiSign, facial phenotyping, and likelihood ratio interpretation of clinical abnormalities in the re‐classification of an ARID1B missense variant
Caitlin Forwood, Katie A. Ashton, Ying Zhu, Futao Zhang +4 more
2023 · American Journal of Medical Genetics Part C Seminars in Medical Genetics · 7 citations · doi:10.1002/ajmg.c.32056

Heterozygous ARID1B variants result in Coffin-Siris syndrome. Features may include hypoplastic nails, slow growth, characteristic facial features, hypotonia, hypertrichosis, and sparse scalp hair. Most reported cases are due to ARID1B loss of function variants. We report a boy with developmental delay, feeding difficulties, aspiration, recurrent respiratory infections, slow growth, and hypotonia without a clinical diagnosis, where a previously unreported ARID1B missense variant was classified as a variant of uncertain significance. The pathogenicity of this variant was refined through combined methodologies including genome-wide methylation signature analysis (EpiSign), Machine Learning (ML) facial phenotyping, and LIRICAL. Trio exome sequencing and EpiSign were performed. ML facial phenotyping compared facial images using FaceMatch and GestaltMatcher to syndrome-specific libraries to prioritize the trio exome bioinformatic pipeline gene list output. Phenotype-driven variant prioritization was performed with LIRICAL. A de novo heterozygous missense variant, ARID1B p.(Tyr1268His), was reported as a variant of uncertain significance. The ACMG classification was refined to likely pathogenic by a supportive methylation signature, ML facial phenotyping, and prioritization through LIRICAL. The ARID1B genotype-phenotype has been expanded through an extended analysis of missense variation through genome-wide methylation signatures, ML facial phenotyping, and likelihood-ratio gene prioritization.

Quantifying interpretation reproducibility in Vision Transformer models with TAVAC
Yue Zhao, Dylan Agyemang, Yang Liu, J. Matthew Mahoney +1 more
2024 · Science Advances · 6 citations · doi:10.1126/sciadv.abg0264

Deep learning algorithms can extract meaningful diagnostic features from biomedical images, promising improved patient care in digital pathology. Vision Transformer (ViT) models capture long-range spatial relationships and offer robust prediction power and better interpretability for image classification tasks than convolutional neural network models. However, limited annotated biomedical imaging datasets can cause ViT models to overfit, leading to false predictions due to random noise. To address this, we introduce Training Attention and Validation Attention Consistency (TAVAC), a metric for evaluating ViT model overfitting and quantifying interpretation reproducibility. By comparing high-attention regions between training and testing, we tested TAVAC on four public image classification datasets and two independent breast cancer histological image datasets. Overfitted models showed significantly lower TAVAC scores. TAVAC also distinguishes off-target from on-target attentions and measures interpretation generalization at a fine-grained cellular level. Beyond diagnostics, TAVAC enhances interpretative reproducibility in basic research, revealing critical spatial patterns and cellular structures of biomedical and other general nonbiomedical images.

Correction to: Toward a comprehensive view of cancer immune responsiveness: a synopsis from the SITC workshop
Society for Immunotherapy of Cancer (SITC) Cancer Immune Responsiveness Task Force and Working Groups, Davide Bedognetti, Michele Ceccarelli, Lorenzo Galluzzi +4 more
2019 · Journal for ImmunoTherapy of Cancer · 6 citations · doi:10.1186/s40425-019-0640-y

Following publication of the original article [1], the author reported that an author name, Roberta Zappasodi, was missed in the authorship list. Davide Bedognetti, Rongze Lu, Josue Samayoa, Stefani Spranger and Sarah Warren contributed equally to this work. The original article can be found online at 10.1186/s40425-019-0602-4

Personalized Medicine: Does the Molecular Suit Fit?
Edison T. Liu, Patrick G. Johnston
2013 · The Oncologist · 6 citations · doi:10.1634/theoncologist.2013-0191

Editor's Note: Accompanying this article is an online video roundtable (http://sto-online.org/european_perspectives) in which the authors highlight the successes but emphasize the challenges that the oncology community faces in delivering on the promise of personalized medicine.

Pan-cancer machine learning predictors of primary site of origin and molecular subtype
William F. Flynn, Sandeep Namburi, Carolyn A. Paisie, Honey V. Reddi +3 more
2018 · bioRxiv (Cold Spring Harbor Laboratory) · 5 citations · doi:10.1101/333914

Background: It is estimated by the American Cancer Society that approximately 5% of all metastatic tumors have no defined primary site (tissue) of origin and are classified as cancers of unknown primary (CUPs). The current standard of care for CUP patients depends on immunohistochemistry (IHC) based approaches to identify the primary site. The addition of post-mortem evaluation to IHC based tests helps to reveal the identity of the primary site for only 25% of the CUPs, emphasizing the acute need for better methods of determination of the site of origin. CUP patients are therefore given generic chemotherapeutic agents resulting in poor prognosis. When the tissue of origin is known, patients can be given site specific therapy with significant improvement in clinical outcome. Similarly, identifying the primary site of origin of metastatic cancer is of great importance for designing treatment. Identification of the primary site of origin is an important first step but may not be sufficient information for optimal treatment of the patient. Recent studies, primarily from The Cancer Genome Atlas (TCGA) project, and others, have revealed molecular subtypes in several cancer types with distinct clinical outcome. The molecular subtype captures the fundamental mechanisms driving the cancer and provides information that is essential for the optimal treatment of a cancer. Thus, along with primary site of origin, molecular subtype of a tumor is emerging as a criterion for personalized medicine and patient entry into clinical trials. However, there is no comprehensive toolset available for precise identification of tissue of origin or molecular subtype for precision medicine and translational research.
Methods and Findings: We posited that metastatic tumors will harbor the gene expression profiles of the primary site of origin of the cancer. Therefore, we decided to learn the molecular characteristics of the primary tumors using the large number of cancer genome profiles available from the TCGA project. Our predictors were trained for 33 cancer types and for the 11 cancers where there are established molecular subtypes. We estimated the accuracy of several machine learning models using cross-validation methods. The extensive testing using independent test sets revealed that the predictors had a median sensitivity and specificity of 97.2% and 99.9% respectively without losing classification of any tumor. Subtype classifiers achieved median sensitivity of 87.7% and specificity of 94.5% via cross validation and presented median sensitivity of 79.6% and specificity of 94.6% in two external datasets of 1,999 total samples. Importantly, these external data show that our classifiers can robustly predict the primary site of origin from external microarray data, metastatic cancer data, and patient-derived xenograft (PDX) data.
Conclusion: We have demonstrated the utility of gene expression profiles to solve the important clinical challenge of identifying the primary site of origin and the molecular subtype of cancers based on machine learning algorithms. We show, for the first time to our knowledge, that our pan-cancer classifiers can predict multiple cancers’ primary site of origin from metastatic samples. The predictors will be made available as open source software, freely available for academic non-commercial use.
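
The abstract reports median sensitivity and specificity across cancer types. As a rough illustration of how such figures can be computed for a multi-class predictor, the sketch below treats each class one-vs-rest and takes the median over classes. This is an assumed evaluation scheme, not necessarily the one the authors used, and the labels are toy placeholders rather than results from the paper.

```python
from statistics import median


def per_class_sens_spec(y_true, y_pred):
    """One-vs-rest sensitivity and specificity for each label in y_true."""
    results = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        tn = sum(1 for t, p in pairs if t != label and p != label)
        sensitivity = tp / (tp + fn)  # tp + fn > 0 because label occurs in y_true
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")
        results[label] = (sensitivity, specificity)
    return results


# Toy placeholder labels, not data from the paper.
y_true = ["lung", "lung", "breast", "breast", "colon", "colon"]
y_pred = ["lung", "breast", "breast", "breast", "colon", "colon"]

results = per_class_sens_spec(y_true, y_pred)
median_sensitivity = median(s for s, _ in results.values())
median_specificity = median(s for _, s in results.values())

print(results["lung"])     # (0.5, 1.0)
print(results["breast"])   # (1.0, 0.75)
print(median_sensitivity)  # 1.0
print(median_specificity)  # 1.0
```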