European Molecular Biology Laboratory
Government · Hamburg, Hamburg, Germany
Research output, citation impact, and the most-cited recent papers from European Molecular Biology Laboratory (Germany). Aggregated across the NobleBlocks index of 300M+ scholarly works.
Top-cited papers from European Molecular Biology Laboratory
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.
Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.
High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.
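To make the error model above concrete, here is a minimal sketch of a negative binomial distribution parametrized by its mean and a dispersion value. The parametrization `variance = mu + dispersion * mu**2` and the function names are assumptions for illustration; the local regression that links variance and mean in the described method is not reproduced here, and this is not the DESeq implementation.

```python
import math

def nb_variance(mu, dispersion):
    """Variance under the assumed parametrization: Poisson noise plus
    an extra term that grows with the square of the mean."""
    return mu + dispersion * mu ** 2

def nb_log_pmf(k, mu, dispersion):
    """Log-probability of observing count k for mean mu and dispersion > 0."""
    r = 1.0 / dispersion          # "size" parameter
    p = r / (r + mu)              # success probability
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))

# Numerical check of the moments for one hypothetical gene.
mu, dispersion = 10.0, 0.5
probs = [math.exp(nb_log_pmf(k, mu, dispersion)) for k in range(2000)]
total = sum(probs)
mean = sum(k * q for k, q in enumerate(probs))
var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
```

With a dispersion of zero the variance would reduce to the Poisson case; a positive dispersion is what lets the model absorb the extra variability between replicates.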
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and which delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
A new software suite, called Crystallography & NMR System (CNS), has been developed for macromolecular structure determination by X-ray crystallography or solution nuclear magnetic resonance (NMR) spectroscopy. In contrast to existing structure-determination programs, the architecture of CNS is highly flexible, allowing for extension to other structure-determination methods, such as electron microscopy and solid-state NMR spectroscopy. CNS has a hierarchical structure: a high-level hypertext markup language (HTML) user interface, task-oriented user input files, module files, a symbolic structure-determination language (CNS language), and low-level source code. Each layer is accessible to the user. The novice user may just use the HTML interface, while the more advanced user may use any of the other layers. The source code will be distributed, thus source-code modification is possible. The CNS language is sufficiently powerful and flexible that many new algorithms can be easily implemented in the CNS language without changes to the source code. The CNS language allows the user to perform operations on data structures, such as structure factors, electron-density maps, and atomic properties. The power of the CNS language has been demonstrated by the implementation of a comprehensive set of crystallographic procedures for phasing, density modification and refinement. User-friendly task-oriented input files are available for nearly all aspects of macromolecular structure determination by X-ray crystallography and solution NMR.
The many functional partnerships and interactions that occur between proteins are at the core of cellular processing and their systematic characterization helps to provide context in molecular systems biology. However, known and predicted interactions are scattered over multiple resources, and the available data exhibit notable differences in terms of quality and completeness. The STRING database (http://string-db.org) aims to provide a critical assessment and integration of protein-protein interactions, including direct (physical) as well as indirect (functional) associations. The new version 10.0 of STRING covers more than 2000 organisms, which has necessitated novel, scalable algorithms for transferring interaction information between organisms. For this purpose, we have introduced hierarchical and self-consistent orthology annotations for all interacting proteins, grouping the proteins into families at various levels of phylogenetic resolution. Further improvements in version 10.0 include a completely redesigned prediction pipeline for inferring protein-protein associations from co-expression data, an API interface for the R computing environment and improved statistical analysis for enrichment tests in user-provided networks.
Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein-protein interactions, both physical interactions and functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins, (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web-interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now fully available also for user-submitted genomes.
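The abstract above mentions programmatic access. As a hedged sketch, the snippet below only builds a request URL and does not contact the server. The path layout (`/api/<format>/network`), the parameter names `identifiers` and `species`, and the carriage-return separator are assumptions that should be checked against the STRING API documentation before use; the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base path; verify against the current STRING API documentation.
STRING_API_BASE = "https://string-db.org/api"

def build_network_url(identifiers, species, output_format="tsv"):
    """Build (but do not send) a query URL for an interaction network.

    identifiers: protein names, joined here with a carriage return,
                 which is the separator assumed for multi-protein queries.
    species:     NCBI taxonomy identifier (9606 is human).
    """
    params = {"identifiers": "\r".join(identifiers), "species": species}
    return f"{STRING_API_BASE}/{output_format}/network?{urlencode(params)}"

url = build_network_url(["TP53", "MDM2"], species=9606)
```

Keeping URL construction separate from the network call makes the query easy to inspect and log before any request is sent.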
Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein-protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.
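The idea of scoring evidence from several sources and integrating it into one confidence value can be sketched as follows. This is a simplified illustration that assumes the evidence channels are independent and combines them as one minus the product of the complements; any corrections the actual resource applies are not represented, and the channel names and scores are made up for the example.

```python
from math import prod

def combine_scores(channel_scores):
    """Combine per-channel confidences in [0, 1], assuming independence:
    the association is missed only if every channel misses it."""
    for score in channel_scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return 1.0 - prod(1.0 - score for score in channel_scores)

# Hypothetical evidence for one protein pair.
evidence = {"textmining": 0.5, "experiments": 0.5, "coexpression": 0.0}
combined = combine_scores(evidence.values())
```

Under this rule, two moderately confident channels reinforce each other, while a channel with no evidence leaves the combined value unchanged.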
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research. This issue of Nature contains the first publication from The 1000 Genomes Project, an international collaboration that will produce an extensive public catalogue of human genetic variation.
The plan, in fact, is to sequence about 2,000 unidentified individuals from 20 populations around the world. This first paper presents the results from the project's pilot phase, testing three different strategies for genome-wide sequencing with high-throughput platforms: low-coverage whole-genome sequencing of 179 individuals in three population groups, high-coverage sequencing of two mother–father–child trios, and exon-targeted sequencing of 697 individuals from seven populations. The goal of the 1000 Genomes Project is to provide in-depth information on variation in human genome sequences. In the pilot phase reported here, different strategies for genome-wide sequencing, using high-throughput sequencing platforms, were developed and compared. The resulting data set includes more than 95% of the currently accessible variants found in any individual, and can be used to inform association and functional studies.
A system-wide understanding of cellular function requires knowledge of all functional interactions between the expressed proteins. The STRING database aims to collect and integrate this information, by consolidating known and predicted protein-protein association data for a large number of organisms. The associations in STRING include direct (physical) interactions, as well as indirect (functional) interactions, as long as both are specific and biologically meaningful. Apart from collecting and reassessing available experimental data on protein-protein interactions, and importing known pathways and protein complexes from curated databases, interaction predictions are derived from the following sources: (i) systematic co-expression analysis, (ii) detection of shared selective signals across genomes, (iii) automated text-mining of the scientific literature and (iv) computational transfer of interaction knowledge between organisms based on gene orthology. In the latest version 10.5 of STRING, the biggest changes are concerned with data dissemination: the web frontend has been completely redesigned to reduce dependency on outdated browser technologies, and the database can now also be queried from inside the popular Cytoscape software framework. Further improvements include automated background analysis of user inputs for functional enrichments, and streamlined download options. The STRING resource is available online, at http://string-db.org/.
Authors: Daniel J Klionsky, Kotb Abdelmohsen, Akihisa Abe, Md Joynal Abedin, Hagai Abeliovich, Abraham Acevedo Arozena, Hiroaki Adachi, Christopher M Adams, Peter D Adams, Khosrow Adeli, Peter J Adhihetty, Sharon G Adler, Galila Agam, Rajesh Agarwal, Manish K Aghi, Maria Agnello, Patrizia Agostinis, Patricia V Aguilar, Julio Aguirre-Ghiso, Edoardo M Airoldi, Slimane Ait-Si-Ali, Takahiko Akematsu, Emmanuel T Akporiaye, Mohamed Al-Rubeai, Guillermo M Albaiceta, Chris Albanese, Diego Albani, Matthew L Albert, Jesus Aldudo, Hana Algül, Mehrdad Alirezaei, Iraide Alloza, Alexandru Almasan, Maylin Almonte-Beceril, Emad S Alnemri, Covadonga Alonso, Nihal Altan-Bonnet, Dario C Altieri, Silvia Alvarez, Lydia Alvarez-Erviti, Sandro Alves, Giuseppina Amadoro, Atsuo Amano, Consuelo Amantini, Santiago Ambrosio, Ivano Amelio, Amal O Amer, Mohamed Amessou, Angelika Amon, Zhenyi An, Frank A Anania, Stig U Andersen, Usha P Andley, Catherine K Andreadi, Nathalie Andrieu-Abadie, Alberto Anel, David K Ann, Shailendra Anoopkumar-Dukie, Manuela Antonioli, Hiroshi Aoki, Nadezda Apostolova, Saveria Aquila, Katia Aquilano, Koichi Araki, Eli Arama, Agustin Aranda, Jun Araya, Alexandre Arcaro, Esperanza Arias, Hirokazu Arimoto, Aileen R Ariosa, Jane L Armstrong, Thierry Arnould, Ivica Arsov, Katsuhiko Asanuma, Valerie Askanas, Eric Asselin, Ryuichiro Atarashi, Sally S Atherton, Julie D Atkin, Laura D Attardi, Patrick Auberger, Georg Auburger, Laure Aurelian, Riccardo Autelli, Laura Avagliano, Maria Laura Avantaggiati, Limor Avrahami, Suresh Awale, Neelam Azad, Tiziana Bachetti, Jonathan M Backer, Dong-Hun Bae, Jae-sung Bae, Ok-Nam Bae, Soo Han Bae, Eric H Baehrecke, Seung-Hoon Baek, Stephen Baghdiguian, Agnieszka Bagniewska-Zadworna, Hua Bai, Jie Bai, Xue-Yuan Bai, Yannick Bailly, Kithiganahalli Narayanaswamy Balaji, Walter Balduini, Andrea Ballabio, Rena Balzan, Rajkumar Banerjee, Gábor Bánhegyi, Haijun Bao, Benoit Barbeau, Maria D Barrachina, Esther Barreiro, Bonnie Bartel, Alberto Bartolomé, Diane C Bassham, Maria Teresa Bassi, Robert C Bast Jr, Alakananda Basu, Maria Teresa Batista, Henri Batoko, Maurizio Battino, Kyle Bauckman, Bradley L Baumgarner, K Ulrich Bayer, Rupert Beale, Jean-François Beaulieu, George R Beck Jr, Christoph Becker, J David Beckham, Pierre-André Bédard, Patrick J Bednarski, Thomas J Begley, Christian Behl, Christian Behrends, Georg MN Behrens, Kevin E Behrns, Eloy Bejarano, Amine Belaid, Francesca Belleudi, Giovanni Bénard, Guy Berchem, Daniele Bergamaschi, Matteo Bergami, Ben Berkhout, Laura Berliocchi, Amélie Bernard, Monique Bernard, Francesca Bernassola, Anne Bertolotti, Amanda S Bess, Sébastien Besteiro, Saverio Bettuzzi, Savita Bhalla, Shalmoli Bhattacharyya, Sujit K Bhutia, Caroline Biagosch, Michele Wolfe Bianchi, Martine Biard-Piechaczyk, Viktor Billes, Claudia Bincoletto, Baris Bingol, Sara W Bird, Marc Bitoun, Ivana Bjedov, Craig Blackstone, Lionel Blanc, Guillermo A Blanco, Heidi Kiil Blomhoff, Emilio Boada-Romero, Stefan Böckler, Marianne Boes, Kathleen Boesze-Battaglia, Lawrence H Boise, Alessandra Bolino, Andrea Boman, Paolo Bonaldo, Matteo Bordi, Jürgen Bosch, Luis M Botana, Joelle Botti, German Bou, Marina Bouché, Marion Bouchecareilh, Marie-Josée Boucher, Michael E Boulton, Sebastien G Bouret, Patricia Boya, Michaël Boyer-Guittaut, Peter V Bozhkov, Nathan Brady, Vania MM Braga, Claudio Brancolini, Gerhard H Braus, José M Bravo-San Pedro, Lisa A Brennan, Emery H Bresnick, Patrick Brest, Dave Bridges, Marie-Agnès Bringer, Marisa Brini, Glauber C Brito, Bertha Brodin, Paul S Brookes, Eric J Brown, Karen Brown, Hal E Broxmeyer, Alain Bruhat, Patricia Chakur Brum, John H Brumell, Nicola Brunetti-Pierri, Robert J Bryson-Richardson, Shilpa Buch, Alastair M Buchan, Hikmet Budak, Dmitry V Bulavin, Scott J Bultman, Geert Bultynck, Vladimir Bumbasirevic, Yan Burelle, Robert E Burke, Margit Burmeister, Peter Bütikofer, Laura Caberlotto, Ken Cadwell, Monika Cahova, Dongsheng Cai, Jingjing Cai, Qian Cai, Sara Calatayud, Nadine Camougrand, Michelangelo Campanella, Grant R Campbell, Matthew Campbell, Silvia Campello, Robin Candau, Isabella Caniggia, Lavinia Cantoni, Lizhi Cao, Allan B Caplan, Michele Caraglia, Claudio Cardinali, Sandra Morais Cardoso, Jennifer S Carew, Laura A Carleton, Cathleen R Carlin, Silvia Carloni, Sven R Carlsson, Didac Carmona-Gutierrez, Leticia AM Carneiro, Oliana Carnevali, Serena Carra, Alice Carrier, Bernadette Carroll, Caty Casas, Josefina Casas, Giuliana Cassinelli, Perrine Castets, Susana Castro-Obregon, Gabriella Cavallini, Isabella Ceccherini, Francesco Cecconi, Arthur I Cederbaum, Valentín Ceña, Simone Cenci, Claudia Cerella, Davide Cervia, Silvia Cetrullo, Hassan Chaachouay, Han-Jung Chae, Andrei S Chagin, Chee-Yin Chai, Gopal Chakrabarti, Georgios Chamilos, Edmond YW Chan, Matthew TV Chan, Dhyan Chandra, Pallavi Chandra, Chih-Peng Chang, Raymond Chuen-Chung Chang, Ta Yuan Chang, John C Chatham, Saurabh Chatterjee, Santosh Chauhan, Yongsheng Che, Michael E Cheetham, Rajkumar Cheluvappa, Chun-Jung Chen, Gang Chen, Guang-Chao Chen, Guoqiang Chen, Hongzhuan Chen, Jeff W Chen, Jian-Kang Chen, Min Chen, Mingzhou Chen, Peiwen Chen, Qi Chen, Quan Chen, Shang-Der Chen, Si Chen, Steve S-L Chen, Wei Chen, Wei-Jung Chen, Wen Qiang Chen, Wenli Chen, Xiangmei Chen, Yau-Hung Chen, Ye-Guang Chen, Yin Chen, Yingyu Chen, Yongshun Chen, Yu-Jen Chen, Yue-Qin Chen, Yujie Chen, Zhen Chen, Zhong Chen, Alan Cheng, Christopher HK Cheng, Hua Cheng, Heesun Cheong, Sara Cherry, Jason Chesney, Chun Hei Antonio Cheung, Eric Chevet, Hsiang Cheng Chi, Sung-Gil Chi, Fulvio Chiacchiera, Hui-Ling Chiang, Roberto Chiarelli, Mario Chiariello, Marcello Chieppa, Lih-Shen Chin, Mario Chiong, Gigi NC Chiu, Dong-Hyung Cho, Ssang-Goo Cho, William C Cho, Yong-Yeon Cho, Young-Seok Cho, Augustine MK Choi, Eui-Ju Choi, Eun-Kyoung Choi, Jayoung Choi, Mary E Choi, Seung-Il Choi, Tsui-Fen Chou, Salem Chouaib, Divaker Choubey, Vinay Choubey, Kuan-Chih Chow, Kamal Chowdhury, Charleen T Chu, Tsung-Hsien Chuang, Taehoon Chun, Hyewon Chung, Taijoon Chung, Yuen-Li Chung, Yong-Joon Chwae, Valentina Cianfanelli, Roberto Ciarcia, Iwona A Ciechomska, Maria Rosa Ciriolo, Mara Cirone, Sofie Claerhout, Michael J Clague, Joan Clària, Peter GH Clarke, Robert Clarke, Emilio Clementi, Cédric Cleyrat, Miriam Cnop, Eliana M Coccia, Tiziana Cocco, Patrice Codogno, Jörn Coers, Ezra EW Cohen, David Colecchia, Luisa Coletto, Núria S Coll, Emma Colucci-Guyon, Sergio Comincini, Maria Condello, Katherine L Cook, Graham H Coombs, Cynthia D Cooper, J Mark Cooper, Isabelle Coppens, Maria Tiziana Corasaniti, Marco Corazzari, Ramon Corbalan, Elisabeth Corcelle-Termeau, Mario D Cordero, Cristina Corral-Ramos, Olga Corti, Andrea Cossarizza, Paola Costelli, Safia Costes, Susan L Cotman, Ana Coto-Montes, Sandra Cottet, Eduardo Couve, Lori R Covey, L Ashley Cowart, Jeffery S Cox, Fraser P Coxon, Carolyn B Coyne, Mark S Cragg, Rolf J Craven, Tiziana Crepaldi, Jose L Crespo, Alfredo Criollo, Valeria Crippa, Maria Teresa Cruz, Ana Maria Cuervo, Jose M Cuezva, Taixing Cui, Pedro R Cutillas, Mark J Czaja, Maria F Czyzyk-Krzeska, Ruben K Dagda, Uta Dahmen, Chunsun Dai, Wenjie Dai, Yun Dai, Kevin N Dalby, Luisa Dalla Valle, Guillaume Dalmasso, Marcello D’Amelio, Markus Damme, Arlette Darfeuille-Michaud, Catherine Dargemont, Victor M Darley-Usmar, Srinivasan Dasarathy, Biplab Dasgupta, Srikanta Dash, Crispin R Dass, Hazel Marie Davey, Lester M Davids, David Dávila, Roger J Davis, Ted M Dawson, Valina L Dawson, Paula Daza, Jackie de Belleroche, Paul de Figueiredo, Regina Celia Bressan Queiroz de Figueiredo, José de la Fuente, Luisa De Martino, Antonella De Matteis, Guido RY De Meyer, Angelo De Milito, Mauro De Santi, …
Here, we describe the third major release of RELION. CPU-based vector acceleration has been added in addition to GPU support, which provides flexibility in use of resources and avoids memory limitations. Reference-free autopicking with Laplacian-of-Gaussian filtering and execution of jobs from python allows non-interactive processing during acquisition, including 2D-classification, de novo model generation and 3D-classification. Per-particle refinement of CTF parameters and correction of estimated beam tilt provides higher resolution reconstructions when particles are at different heights in the ice, and/or coma-free alignment has not been optimal. Ewald sphere curvature correction improves resolution for large particles. We illustrate these developments with publicly available data sets: together with a Bayesian approach to beam-induced motion correction it leads to resolution improvements of 0.2–0.7 Å compared to previous RELION versions.
Interactive Tree Of Life (http://itol.embl.de) is a web-based tool for the display, manipulation and annotation of phylogenetic trees. It is freely available and open to everyone. The current version was completely redesigned and rewritten, utilizing current web technologies for speedy and streamlined processing. Numerous new features were introduced and several new data types are now supported. Trees with up to 100,000 leaves can now be efficiently displayed. Full interactive control over precise positioning of various annotation features and an unlimited number of datasets allow the easy creation of complex tree visualizations. iTOL 3 is the first tool which supports direct visualization of the recently proposed phylogenetic placements format. Finally, iTOL's account system has been redesigned to simplify the management of trees in user-defined workspaces and projects, as it is heavily used and already handles more than 500,000 trees from more than 10,000 individual users.
eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes, as well as 477 eukaryotic organisms and 2502 viral proteomes that were selected for diversity and filtered by genome quality. In total, 4.4M orthologous groups (OGs) distributed across 379 taxonomic levels were computed together with their associated sequence alignments, phylogenies, HMM models and functional descriptors. Precomputed evolutionary analysis provides fine-grained resolution of duplication/speciation events within each OG. Our benchmarks show that, despite doubling the number of genomes, the quality of orthology assignments and functional annotations (80% coverage) has persisted without significant changes across this update. Finally, we improved eggNOG online services for fast functional annotation and orthology prediction of custom genomics or metagenomics datasets. All precomputed data are publicly available for downloading or via API queries at http://eggnog.embl.de.
The Clustal series of programs is widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).
The potential of the diverse chemistries present in natural products (NP) for biotechnology and medicine remains untapped because NP databases are not searchable with raw data and the NP community has no way to share data other than in published papers. Although mass spectrometry (MS) techniques are well-suited to high-throughput characterization of NP, there is a pressing need for an infrastructure to enable sharing and curation of data. We present Global Natural Products Social Molecular Networking (GNPS; http://gnps.ucsd.edu), an open-access knowledge base for community-wide organization and sharing of raw, processed or identified tandem mass (MS/MS) spectrometry data. In GNPS, crowdsourced curation of freely available community-wide reference MS libraries will underpin improved annotations. Data-driven social-networking should facilitate identification of spectra and foster collaborations. We also introduce the concept of 'living data' through continuous reanalysis of deposited data.
Even though automated functional annotation of genes represents a fundamental step in most genomic and metagenomic workflows, it remains challenging at large scales. Here, we describe a major upgrade to eggNOG-mapper, a tool for functional annotation based on precomputed orthology assignments, now optimized for vast (meta)genomic data sets. Improvements in version 2 include a full update of both the genomes and functional databases to those from eggNOG v5, as well as several efficiency enhancements and new features. Most notably, eggNOG-mapper v2 now allows for: 1) de novo gene prediction from raw contigs, 2) built-in pairwise orthology prediction, 3) fast protein domain discovery, and 4) automated GFF decoration. eggNOG-mapper v2 is available as a standalone tool or as an online service at http://eggnog-mapper.embl.de.
Accurate multiple alignments of 86 domains that occur in signaling proteins have been constructed and used to provide a Web-based tool (SMART: simple modular architecture research tool) that allows rapid identification and annotation of signaling domain sequences. The majority of signaling proteins are multidomain in character with a considerable variety of domain combinations known. Comparison with established databases showed that 25% of our domain set could not be deduced from SwissProt and 41% could not be annotated by Pfam. SMART is able to determine the modular architectures of single sequences or genomes; application to the entire yeast genome revealed that at least 6.7% of its genes contain one or more signaling domains, approximately 350 more than previously annotated. The process of constructing SMART predicted (i) novel domain homologues in unexpected locations such as band 4.1-homologous domains in focal adhesion kinases; (ii) previously unknown domain families, including a citron-homology domain; (iii) putative functions of domain families after identification of additional family members, for example, a ubiquitin-binding role for ubiquitin-associated domains (UBA); (iv) cellular roles for proteins, such as predicted DEATH domains in netrin receptors further implicating these molecules in axonal guidance; (v) signaling domains in known disease genes such as SPRY domains in both marenostrin/pyrin and Midline 1; (vi) domains in unexpected phylogenetic contexts such as diacylglycerol kinase homologues in yeast and bacteria; and (vii) likely protein misclassifications exemplified by a predicted pleckstrin homology domain in a Candida albicans protein, previously described as an integrin.
PAL2NAL is a web server that constructs a multiple codon alignment from the corresponding aligned protein sequences. Such codon alignments can be used to evaluate the type and rate of nucleotide substitutions in coding DNA for a wide range of evolutionary analyses, such as the identification of levels of selective constraint acting on genes, or to perform DNA-based phylogenetic studies. The server takes a protein sequence alignment and the corresponding DNA sequences as input. In contrast to other existing applications, this server is able to construct codon alignments even if the input DNA sequence has mismatches with the input protein sequence, or contains untranslated regions and polyA tails. The server can also deal with frame shifts and inframe stop codons in the input models, and is thus suitable for the analysis of pseudogenes. Another distinct feature is that the user can specify a subregion of the input alignment in order to specifically analyze functional domains or exons of interest. The PAL2NAL server is available at http://www.bork.embl.de/pal2nal.
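The core conversion described above, from an aligned protein sequence plus its coding DNA to a codon alignment, can be sketched in a few lines. This is a deliberately minimal illustration and not the PAL2NAL implementation: it assumes the DNA is an exact, in-frame coding sequence with one codon per residue, and it does not handle the mismatches, untranslated regions, frame shifts or in-frame stop codons that the server is said to tolerate. The function name is hypothetical.

```python
def codon_alignment(aligned_protein, dna, gap="-"):
    """Thread codons onto an aligned protein sequence.

    Each residue consumes the next codon from `dna`; each gap column
    in the protein alignment becomes a three-character gap.
    """
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if residues != len(codons):
        raise ValueError("number of codons does not match number of residues")
    remaining = iter(codons)
    return "".join(gap * 3 if aa == gap else next(remaining)
                   for aa in aligned_protein)

# Two aligned proteins and their coding sequences.
row1 = codon_alignment("M-K", "ATGAAA")
row2 = codon_alignment("MGK", "ATGGGTAAG")
```

Because gaps are expanded to whole codons, every row of the output has the same length and the reading frame is preserved column by column.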
Microbes are dominant drivers of biogeochemical processes, yet drawing a global picture of functional diversity, microbial community structure, and their ecological determinants remains a grand challenge. We analyzed 7.2 terabases of metagenomic data from 243 Tara Oceans samples from 68 locations in epipelagic and mesopelagic waters across the globe to generate an ocean microbial reference gene catalog with >40 million nonredundant, mostly novel sequences from viruses, prokaryotes, and picoeukaryotes. Using 139 prokaryote-enriched samples, containing >35,000 species, we show vertical stratification with epipelagic community composition mostly driven by temperature rather than other environmental factors or geography. We identify ocean microbial core functionality and reveal that >73% of its abundance is shared with the human gut microbiome despite the physicochemical differences between these two ecosystems.