Christian von Mering

Here are all the papers by Christian von Mering that you can download and read on OA.mg.

DOI: 10.1093/nar/gky1131
2018
Cited 12,420 times
STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.
DOI: 10.1093/nar/gku1003
2014
Cited 8,479 times
STRING v10: protein–protein interaction networks, integrated over the tree of life
The many functional partnerships and interactions that occur between proteins are at the core of cellular processing and their systematic characterization helps to provide context in molecular systems biology. However, known and predicted interactions are scattered over multiple resources, and the available data exhibit notable differences in terms of quality and completeness. The STRING database (http://string-db.org) aims to provide a critical assessment and integration of protein-protein interactions, including direct (physical) as well as indirect (functional) associations. The new version 10.0 of STRING covers more than 2000 organisms, which has necessitated novel, scalable algorithms for transferring interaction information between organisms. For this purpose, we have introduced hierarchical and self-consistent orthology annotations for all interacting proteins, grouping the proteins into families at various levels of phylogenetic resolution. Further improvements in version 10.0 include a completely redesigned prediction pipeline for inferring protein-protein associations from co-expression data, an API interface for the R computing environment and improved statistical analysis for enrichment tests in user-provided networks.
DOI: 10.1093/nar/gkw937
2016
Cited 5,889 times
The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible
A system-wide understanding of cellular function requires knowledge of all functional interactions between the expressed proteins. The STRING database aims to collect and integrate this information, by consolidating known and predicted protein-protein association data for a large number of organisms. The associations in STRING include direct (physical) interactions, as well as indirect (functional) interactions, as long as both are specific and biologically meaningful. Apart from collecting and reassessing available experimental data on protein-protein interactions, and importing known pathways and protein complexes from curated databases, interaction predictions are derived from the following sources: (i) systematic co-expression analysis, (ii) detection of shared selective signals across genomes, (iii) automated text-mining of the scientific literature and (iv) computational transfer of interaction knowledge between organisms based on gene orthology. In the latest version 10.5 of STRING, the biggest changes are concerned with data dissemination: the web frontend has been completely redesigned to reduce dependency on outdated browser technologies, and the database can now also be queried from inside the popular Cytoscape software framework. Further improvements include automated background analysis of user inputs for functional enrichments, and streamlined download options. The STRING resource is available online, at http://string-db.org/.
DOI: 10.1093/nar/gkaa1074
2020
Cited 4,636 times
The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets
Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein–protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.
DOI: 10.1093/nar/gks1094
2012
Cited 3,839 times
STRING v9.1: protein-protein interaction networks, with increased coverage and integration
Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made—particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database (http://string-db.org/) aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements: (i) we extend the automated mining of scientific texts for interaction information, to now also include full-text articles; (ii) we entirely re-designed the algorithm for transferring interactions from one model organism to the other; and (iii) we provide users with statistical information on any functional enrichment observed in their networks.
DOI: 10.1093/nar/gkq973
2010
Cited 3,097 times
The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored
An essential prerequisite for any systems-level understanding of cellular functions is to correctly uncover and annotate all functional interactions among proteins in the cell. Toward this goal, remarkable progress has been made in recent years, both in terms of experimental measurements and computational prediction techniques. However, public efforts to collect and present protein interaction information have struggled to keep up with the pace of interaction discovery, partly because protein–protein interaction information can be error-prone and require considerable effort to annotate. Here, we present an update on the online database resource Search Tool for the Retrieval of Interacting Genes (STRING); it provides uniquely comprehensive coverage and ease of access to both experimental as well as predicted interaction information. Interactions in STRING are provided with a confidence score, and accessory information such as protein domains and 3D structures is made available, all within a stable and consistent identifier space. New features in STRING include an interactive network viewer that can cluster networks on demand, updated on-screen previews of structural information including homology models, extensive data updates and strongly improved connectivity and integration with third-party resources. Version 9.0 of STRING covers more than 1100 completely sequenced organisms; the resource can be reached at http://string-db.org.
DOI: 10.1093/nar/gky1085
2018
Cited 2,636 times
eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses
eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes, as well as 477 eukaryotic organisms and 2502 viral proteomes that were selected for diversity and filtered by genome quality. In total, 4.4M orthologous groups (OGs) distributed across 379 taxonomic levels were computed together with their associated sequence alignments, phylogenies, HMM models and functional descriptors. Precomputed evolutionary analysis provides fine-grained resolution of duplication/speciation events within each OG. Our benchmarks show that, despite doubling the amount of genomes, the quality of orthology assignments and functional annotations (80% coverage) has persisted without significant changes across this update. Finally, we improved eggNOG online services for fast functional annotation and orthology prediction of custom genomics or metagenomics datasets. All precomputed data are publicly available for downloading or via API queries at http://eggnog.embl.de
DOI: 10.1093/nar/gkn760
2009
Cited 2,239 times
STRING 8--a global view on proteins and their functional interactions in 630 organisms
Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein–protein interactions currently available. STRING can be reached at http://string-db.org/.
DOI: 10.1038/nature750
2002
Cited 2,225 times
Comparative assessment of large-scale data sets of protein–protein interactions
DOI: 10.1093/molbev/msx148
2017
Cited 2,052 times
Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper
Orthology assignment is ideally suited for functional inference. However, because predicting orthology is computationally intensive at large scale, and most pipelines are relatively inaccessible (e.g., new assignments only available through database updates), less precise homology-based functional transfer is still the default for (meta-)genome annotation. We, therefore, developed eggNOG-mapper, a tool for functional annotation of large sets of sequences based on fast orthology assignments using precomputed clusters and phylogenies from the eggNOG database. To validate our method, we benchmarked Gene Ontology (GO) predictions against two widely used homology-based approaches: BLAST and InterProScan. Orthology filters applied to BLAST results reduced the rate of false positive assignments by 11%, and increased the ratio of experimentally validated terms recovered over all terms assigned per protein by 15%. Compared with InterProScan, eggNOG-mapper achieved similar proteome coverage and precision while predicting, on average, 41 more terms per protein and increasing the rate of experimentally validated terms recovered over total term assignments per protein by 35%. EggNOG-mapper predictions scored within the top-5 methods in the three GO categories using the CAFA2 NK-partial benchmark. Finally, we evaluated eggNOG-mapper for functional annotation of metagenomics data, yielding better performance than InterProScan. eggNOG-mapper runs ∼15× faster than BLAST and at least 2.5× faster than InterProScan. The tool is available standalone and as an online service at http://eggnog-mapper.embl.de.
DOI: 10.1093/nar/gkg034
2003
Cited 1,974 times
STRING: a database of predicted functional associations between proteins
Functional links between proteins can often be inferred from genomic associations between the genes that encode them: groups of genes that are required for the same function tend to show similar species coverage, are often located in close proximity on the genome (in prokaryotes), and tend to be involved in gene-fusion events. The database STRING is a precomputed global resource for the exploration and analysis of these associations. Since the three types of evidence differ conceptually, and the number of predicted interactions is very large, it is essential to be able to assess and compare the significance of individual predictions. Thus, STRING contains a unique scoring-framework based on benchmarks of the different types of associations against a common reference set, integrated in a single confidence score per prediction. The graphical representation of the network of inferred, weighted protein interactions provides a high-level view of functional linkage, facilitating the analysis of modularity in biological processes. STRING is updated continuously, and currently contains 261 033 orthologs in 89 fully sequenced genomes. The database predicts functional interactions at an expected level of accuracy of at least 80% for more than half of the genes; it
DOI: 10.1093/nar/gkv1248
2015
Cited 1,693 times
eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences
eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de.
DOI: 10.1126/science.1107851
2005
Cited 1,534 times
Comparative Metagenomics of Microbial Communities
The species complexity of microbial communities and challenges in culturing representative isolates make it difficult to obtain assembled genomes. Here we characterize and compare the metabolic capabilities of terrestrial and marine microbial communities using largely unassembled sequence data obtained by shotgun sequencing DNA isolated from the various environments. Quantitative gene content analysis reveals habitat-specific fingerprints that reflect known characteristics of the sampled environments. The identification of environment-specific genes through a gene-centric comparative analysis presents new opportunities for interpreting and diagnosing environments.
DOI: 10.1126/science.1123061
2006
Cited 1,494 times
Toward Automatic Reconstruction of a Highly Resolved Tree of Life
We have developed an automatable procedure for reconstructing the tree of life with branch lengths comparable across all three domains. The tree has its basis in a concatenation of 31 orthologs occurring in 191 species with sequenced genomes. It revealed interdomain discrepancies in taxonomic classification. Systematic detection and subsequent exclusion of products of horizontal gene transfer increased phylogenetic resolution, allowing us to confirm accepted relationships and resolve disputed and preliminary classifications. For example, we place the phylum Acidobacteria as a sister group of δ-Proteobacteria, support a Gram-positive origin of Bacteria, and suggest a thermophilic last universal common ancestor.
DOI: 10.1093/nar/gki005
2004
Cited 1,412 times
STRING: known and predicted protein-protein associations, integrated and transferred across organisms
A full description of a protein's function requires knowledge of all partner proteins with which it specifically associates. From a functional perspective, 'association' can mean direct physical binding, but can also mean indirect interaction such as participation in the same metabolic pathway or cellular process. Currently, information about protein association is scattered over a wide variety of resources and model organisms. STRING aims to simplify access to this information by providing a comprehensive, yet quality-controlled collection of protein-protein associations for a large number of organisms. The associations are derived from high-throughput experimental data, from the mining of databases and literature, and from predictions based on genomic context analysis. STRING integrates and ranks these associations by benchmarking them against a common reference set, and presents evidence in a consistent and intuitive web interface. Importantly, the associations are extended beyond the organism in which they were originally described, by automatic transfer to orthologous protein pairs in other organisms, where applicable. STRING currently holds 730,000 proteins in 180 fully sequenced organisms, and is available at http://string.embl.de/.
DOI: 10.1093/nar/gkac1000
2022
Cited 1,087 times
The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest
Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein-protein interactions, both physical interactions as well as functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins, (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web-interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now fully available also for user-submitted genomes.
DOI: 10.1093/nar/gkv1277
2015
Cited 1,077 times
STITCH 5: augmenting protein–chemical interaction networks with tissue and affinity data
Interactions between proteins and small molecules are an integral part of biological processes in living organisms. Information on these interactions is dispersed over many databases, texts and prediction methods, which makes it difficult to get a comprehensive overview of the available evidence. To address this, we have developed STITCH ('Search Tool for Interacting Chemicals') that integrates these disparate data sources for 430 000 chemicals into a single, easy-to-use resource. In addition to the increased scope of the database, we have implemented a new network view that gives the user the ability to view binding affinities of chemicals in the interaction network. This enables the user to get a quick overview of the potential effects of the chemical on its interaction partners. For each organism, STITCH provides a global network; however, not all proteins have the same pattern of spatial expression. Therefore, only a certain subset of interactions can occur simultaneously. In the new, fifth release of STITCH, we have implemented functionality to filter out the proteins and chemicals not associated with a given tissue. The STITCH database can be downloaded in full, accessed programmatically via an extensive API, or searched via a redesigned web interface at http://stitch.embl.de.
DOI: 10.1371/journal.pbio.0050244
2007
Cited 935 times
Salmonella enterica Serovar Typhimurium Exploits Inflammation to Compete with the Intestinal Microbiota
Most mucosal surfaces of the mammalian body are colonized by microbial communities (“microbiota”). A high density of commensal microbiota inhabits the intestine and shields from infection (“colonization resistance”). The virulence strategies allowing enteropathogenic bacteria to successfully compete with the microbiota and overcome colonization resistance are poorly understood. Here, we investigated manipulation of the intestinal microbiota by the enteropathogenic bacterium Salmonella enterica subspecies 1 serovar Typhimurium (S. Tm) in a mouse colitis model: we found that inflammatory host responses induced by S. Tm changed microbiota composition and suppressed its growth. In contrast to wild-type S. Tm, an avirulent invGsseD mutant failing to trigger colitis was outcompeted by the microbiota. This competitive defect was reverted if inflammation was provided concomitantly by mixed infection with wild-type S. Tm or in mice (IL10−/−, VILLIN-HACL4-CD8) with inflammatory bowel disease. Thus, inflammation is necessary and sufficient for overcoming colonization resistance. This reveals a new concept in infectious disease: in contrast to current thinking, inflammation is not always detrimental for the pathogen. Triggering the host's immune defence can shift the balance between the protective microbiota and the pathogen in favour of the pathogen.
DOI: 10.1126/science.1077136
2002
Cited 879 times
Immunity-Related Genes and Gene Families in Anopheles gambiae
We have identified 242 Anopheles gambiae genes from 18 gene families implicated in innate immunity and have detected marked diversification relative to Drosophila melanogaster. Immune-related gene families involved in recognition, signal modulation, and effector systems show a marked deficit of orthologs and excessive gene expansions, possibly reflecting selection pressures from different pathogens encountered in these insects' very different life-styles. In contrast, the multifunctional Toll signal transduction pathway is substantially conserved, presumably because of counterselection for developmental stability. Representative expression profiles confirm that sequence diversification is accompanied by specific responses to different immune challenges. Alternative RNA splicing may also contribute to expansion of the immune repertoire.
DOI: 10.1073/pnas.0905240106
2009
Cited 753 times
Community proteogenomics reveals insights into the physiology of phyllosphere bacteria
Aerial plant surfaces represent the largest biological interface on Earth and provide essential services as sites of carbon dioxide fixation, molecular oxygen release, and primary biomass production. Rather than existing as axenic organisms, plants are colonized by microorganisms that affect both their health and growth. To gain insight into the physiology of phyllosphere bacteria under in situ conditions, we performed a culture-independent analysis of the microbiota associated with leaves of soybean, clover, and Arabidopsis thaliana plants using a metaproteogenomic approach. We found a high consistency of the communities on the 3 different plant species, both with respect to the predominant community members (including the alphaproteobacterial genera Sphingomonas and Methylobacterium) and with respect to their proteomes. Observed known proteins of Methylobacterium were to a large extent related to the ability of these bacteria to use methanol as a source of carbon and energy. A remarkably high expression of various TonB-dependent receptors was observed for Sphingomonas. Because these outer membrane proteins are involved in transport processes of various carbohydrates, a particularly large substrate utilization pattern for Sphingomonads can be assumed to occur in the phyllosphere. These adaptations at the genus level can be expected to contribute to the success and coexistence of these 2 taxa on plant leaves. We anticipate that our results will form the basis for the identification of unique traits of phyllosphere bacteria, and for uncovering previously unrecorded mechanisms of bacteria-plant and bacteria-bacteria relationships.
DOI: 10.1093/nar/gkm795
2007
Cited 694 times
STITCH: interaction networks of chemicals and proteins
The knowledge about interactions between proteins and small molecules is essential for the understanding of molecular and cellular functions. However, information on such interactions is widely dispersed across numerous databases and the literature. To facilitate access to this data, STITCH ('search tool for interactions of chemicals') integrates information about interactions from metabolic pathways, crystal structures, binding experiments and drug-target relationships. Inferred information from phenotypic effects, text mining and chemical structure similarity is used to predict relations between chemicals. STITCH further allows exploring the network of chemical relations, also in the context of associated binding proteins. Each proposed interaction can be traced back to the original data sources. Our database contains interaction information for over 68,000 different chemicals, including 2200 drugs, and connects them to 1.5 million genes across 373 genomes and their interactions contained in the STRING database. STITCH is available at http://stitch.embl.de/.
DOI: 10.1093/nar/gkl825
2007
Cited 610 times
STRING 7--recent developments in the integration and prediction of protein interactions
Information on protein-protein interactions is still mostly limited to a small number of model organisms, and originates from a wide variety of experimental and computational techniques. The database and online resource STRING generalizes access to protein interaction data, by integrating known and predicted interactions from a variety of sources. The underlying infrastructure includes a consistent body of completely sequenced genomes and exhaustive orthology classifications, based on which interaction evidence is transferred between organisms. Although primarily developed for protein interaction analysis, the resource has also been successfully applied to comparative genomics, phylogenetics and network studies, which are all facilitated by programmatic access to the database backend and the availability of compact download files. As of release 7, STRING has almost doubled to 373 distinct organisms, and contains more than 1.5 million proteins for which associations have been pre-computed. Novel features include AJAX-based web-navigation, inclusion of additional resources such as BioGRID, and detailed protein domain annotation. STRING is available at http://string.embl.de/
DOI: 10.1038/nbt926
2004
Cited 599 times
The HUPO PSI's Molecular Interaction format—a community standard for the representation of protein interaction data
DOI: 10.1038/ismej.2011.192
2011
Cited 597 times
Metaproteogenomic analysis of microbial communities in the phyllosphere and rhizosphere of rice
The above- and below-ground parts of rice plants create specific habitats for various microorganisms. In this study, we characterized the phyllosphere and rhizosphere microbiota of rice cultivars using a metaproteogenomic approach to get insight into the physiology of the bacteria and archaea that live in association with rice. The metaproteomic datasets gave rise to a total of about 4600 identified proteins and indicated the presence of one-carbon conversion processes in the rhizosphere as well as in the phyllosphere. Proteins involved in methanogenesis and methanotrophy were found in the rhizosphere, whereas methanol-based methylotrophy linked to the genus Methylobacterium dominated within the protein repertoire of the phyllosphere microbiota. Further, physiological traits of differential importance in phyllosphere versus rhizosphere bacteria included transport processes and stress responses, which were more conspicuous in the phyllosphere samples. In contrast, dinitrogenase reductase was exclusively identified in the rhizosphere, despite the presence of nifH genes also in diverse phyllosphere bacteria.
DOI: 10.1126/science.1077061
2002
Cited 530 times
Comparative Genome and Proteome Analysis of Anopheles gambiae and Drosophila melanogaster
Comparison of the genomes and proteomes of the two diptera Anopheles gambiae and Drosophila melanogaster, which diverged about 250 million years ago, reveals considerable similarities. However, numerous differences are also observed; some of these must reflect the selection and subsequent adaptation associated with different ecologies and life strategies. Almost half of the genes in both genomes are interpreted as orthologs and show an average sequence identity of about 56%, which is slightly lower than that observed between the orthologs of the pufferfish and human (diverged about 450 million years ago). This indicates that these two insects diverged considerably faster than vertebrates. Aligned sequences reveal that orthologous genes have retained only half of their intron/exon structure, indicating that intron gains or losses have occurred at a rate of about one per gene per 125 million years. Chromosomal arms exhibit significant remnants of homology between the two species, although only 34% of the genes colocalize in small "microsyntenic" clusters, and major interarm transfers as well as intra-arm shuffling of gene order are detected.
DOI: 10.1093/nar/gkt1253
2013
Cited 511 times
eggNOG v4.0: nested orthology inference across 3686 organisms
With the increasing availability of various 'omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.
DOI: 10.1016/s0092-8674(00)81572-x
1998
Cited 509 times
Expression of Amino-Terminally Truncated PrP in the Mouse Leading to Ataxia and Specific Cerebellar Lesions
The physiological role of prion protein (PrP) remains unknown. Mice devoid of PrP develop normally but are resistant to scrapie; introduction of a PrP transgene restores susceptibility to the disease. To identify the regions of PrP necessary for this activity, we prepared PrP knockout mice expressing PrPs with amino-proximal deletions. Surprisingly, PrP lacking residues 32-121 or 32-134, but not with shorter deletions, caused severe ataxia and neuronal death limited to the granular layer of the cerebellum as early as 1-3 months after birth. The defect was completely abolished by introducing one copy of a wild-type PrP gene. We speculate that these truncated PrPs may be nonfunctional and compete with some other molecule with a PrP-like function for a common ligand.
DOI: 10.1002/pmic.201400441
2015
Cited 501 times
Version 4.0 of PaxDb: Protein abundance data, integrated across model organisms, tissues, and cell‐lines
Protein quantification at proteome-wide scale is an important aim, enabling insights into fundamental cellular biology and serving to constrain experiments and theoretical models. While proteome-wide quantification is not yet fully routine, many datasets approaching proteome-wide coverage are becoming available through biophysical and MS techniques. Data of this type can be accessed via a variety of sources, including publication supplements and online data repositories. However, access to the data is still fragmentary, and comparisons across experiments and organisms are not straightforward. Here, we describe recent updates to our database resource "PaxDb" (Protein Abundances Across Organisms). PaxDb focuses on protein abundance information at proteome-wide scope, irrespective of the underlying measurement technique. Quantification data is reprocessed, unified, and quality-scored, and then integrated to build a meta-resource. PaxDb also allows evolutionary comparisons through precomputed gene orthology relations. Recently, we have expanded the scope of the database to include cell-line samples, and more systematically scan the literature for suitable datasets. We report that a significant fraction of published experiments cannot readily be accessed and/or parsed for quantitative information, requiring additional steps and efforts. The current update brings PaxDb to 414 datasets in 53 organisms, with (semi-) quantitative abundance information covering more than 300,000 proteins.
DOI: 10.1038/ng.2906
2014
Cited 489 times
Pathogens and host immunity in the ancient human oral cavity
Calcified dental plaque (dental calculus) preserves for millennia and entraps biomolecules from all domains of life and viruses. We report the first, to our knowledge, high-resolution taxonomic and protein functional characterization of the ancient oral microbiome and demonstrate that the oral cavity has long served as a reservoir for bacteria implicated in both local and systemic disease. We characterize (i) the ancient oral microbiome in a diseased state, (ii) 40 opportunistic pathogens, (iii) ancient human-associated putative antibiotic resistance genes, (iv) a genome reconstruction of the periodontal pathogen Tannerella forsythia, (v) 239 bacterial and 43 human proteins, allowing confirmation of a long-term association between host immune factors, 'red complex' pathogens and periodontal disease, and (vi) DNA sequences matching dietary sources. Directly datable and nearly ubiquitous, dental calculus permits the simultaneous investigation of pathogen activity, host immunity and diet, thereby extending direct investigation of common diseases into the human evolutionary past.
DOI: 10.1093/nar/gkr1060
2011
Cited 469 times
eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges
Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721 801 orthologous groups, encompassing a total of 4 396 591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101 208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees and broad functional descriptions are provided for 450 904 orthologous groups (62.5%).
DOI: 10.1093/nar/gkm796
2007
Cited 445 times
eggNOG: automated construction and annotation of orthologous groups of genes
The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annotation of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database ('evolutionary genealogy of genes: Non-supervised Orthologous Groups'), which contains orthologous groups constructed from Smith-Waterman alignments through identification of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43 582 coarse-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19 914 mammalian orthologous groups. We automatically annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains. The orthologous groups in eggNOG contain 1 241 751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at http://eggnog.embl.de.
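The abstract above names two steps, reciprocal best matches and triangular linkage, that a toy example may make more concrete. The sketch below is not the eggNOG pipeline: the gene identifiers, the `genome|gene` naming scheme, the similarity scores and all function names are hypothetical. It only shows how mutual best hits between genomes could be found and how three such pairs close into a triangle.

```python
from itertools import combinations

def genome_of(gene):
    # Hypothetical naming scheme: "genome|gene".
    return gene.split("|", 1)[0]

def best_hits(scores):
    """For each (query gene, target genome), keep the highest-scoring subject."""
    best = {}  # (query, target genome) -> (score, subject)
    for (query, subject), score in scores.items():
        if genome_of(query) == genome_of(subject):
            continue  # only compare genes from different genomes
        key = (query, genome_of(subject))
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)
    return {key: subject for key, (_, subject) in best.items()}

def reciprocal_best_matches(scores):
    """Return unordered gene pairs that are each other's best hit."""
    best = best_hits(scores)
    pairs = set()
    for (query, _), subject in best.items():
        if best.get((subject, genome_of(query))) == query:
            pairs.add(frozenset((query, subject)))
    return pairs

def triangles(pairs):
    """Return gene triples in which all three pairs are reciprocal best matches."""
    genes = sorted({g for pair in pairs for g in pair})
    return [t for t in combinations(genes, 3)
            if all(frozenset(edge) in pairs for edge in combinations(t, 2))]

# Made-up alignment scores, directed from query to subject.
scores = {
    ("A|g1", "B|g1"): 200, ("B|g1", "A|g1"): 200,
    ("A|g1", "C|g1"): 180, ("C|g1", "A|g1"): 180,
    ("B|g1", "C|g1"): 190, ("C|g1", "B|g1"): 190,
    ("A|g2", "B|g1"): 90,  ("B|g1", "A|g2"): 90,
}
pairs = reciprocal_best_matches(scores)
seed_groups = triangles(pairs)
```

In this toy data, `A|g2` has `B|g1` as its best hit, but `B|g1` prefers `A|g1`, so `A|g2` is left out of every pair and of the resulting triangle. A real pipeline would additionally merge overlapping triangles into larger groups and handle paralogs, which this sketch does not attempt.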
DOI: 10.1038/s41586-020-1965-x
2020
Cited 432 times
Analyses of non-coding somatic drivers in 2,658 cancer whole genomes
The discovery of drivers of cancer has traditionally focused on protein-coding genes. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers, raise doubts about others and identify novel candidates, including point mutations in the 5′ region of TP53, in the 3′ untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.
DOI: 10.1074/mcp.o111.014704
2012
Cited 430 times
PaxDb, a Database of Protein Abundance Averages Across All Three Domains of Life
Although protein expression is regulated both temporally and spatially, most proteins have an intrinsic, "typical" range of functionally effective abundance levels. These extend from a few molecules per cell for signaling proteins, to millions of molecules for structural proteins. When addressing fundamental questions related to protein evolution, translation and folding, but also in routine laboratory work, a simple rough estimate of the average wild type abundance of each detectable protein in an organism is often desirable. Here, we introduce a meta-resource dedicated to integrating information on absolute protein abundance levels; we place particular emphasis on deep coverage, consistent post-processing and comparability across different organisms. Publicly available experimental data are mapped onto a common namespace and, in the case of tandem mass spectrometry data, re-processed using a standardized spectral counting pipeline. By aggregating and averaging over the various samples, conditions and cell-types, the resulting integrated data set achieves increased coverage and a high dynamic range. We score and rank each contributing, individual data set by assessing its consistency against externally provided protein-network information, and demonstrate that our weighted integration exhibits more consistency than the data sets individually. The current PaxDb-release 2.1 (at http://pax-db.org/) presents whole-organism data as well as tissue-resolved data, and covers 85,000 proteins in 12 model organisms. All values can be seamlessly compared across organisms via pre-computed orthology relationships.
DOI: 10.1101/gr.104521.109
2010
Cited 424 times
A global network of coexisting microbes from environmental and whole-genome sequence data
Microbes are the most abundant and diverse organisms on Earth. In contrast to macroscopic organisms, their environmental preferences and ecological interdependencies remain difficult to assess, requiring laborious molecular surveys at diverse sampling sites. Here, we present a global meta-analysis of previously sampled microbial lineages in the environment. We grouped publicly available 16S ribosomal RNA sequences into operational taxonomic units at various levels of resolution and systematically searched these for co-occurrence across environments. Naturally occurring microbes, indeed, exhibited numerous, significant interlineage associations. These ranged from relatively specific groupings encompassing only a few lineages, to larger assemblages of microbes with shared habitat preferences. Many of the coexisting lineages were phylogenetically closely related, but a significant number of distant associations were observed as well. The increased availability of completely sequenced genomes allowed us, for the first time, to search for genomic correlates of such ecological associations. Genomes from coexisting microbes tended to be more similar than expected by chance, both with respect to pathway content and genome size, and outliers from these trends are discussed. We hypothesize that groupings of lineages are often ancient, and that they may have significantly impacted on genome evolution.
DOI: 10.1093/nar/gkt1207
2013
Cited 373 times
STITCH 4: integration of protein–chemical interactions with user data
STITCH is a database of protein-chemical interactions that integrates many sources of experimental and manually curated evidence with text-mining information and interaction predictions. Available at http://stitch.embl.de, the resulting interaction network includes 390 000 chemicals and 3.6 million proteins from 1133 organisms. Compared with the previous version, the number of high-confidence protein-chemical interactions in human has increased by 45%, to 367 000. In this version, we added features for users to upload their own data to STITCH in the form of internal identifiers, chemical structures or quantitative data. For example, a user can now upload a spreadsheet with screening hits to easily check which interactions are already known. To increase the coverage of STITCH, we expanded the text mining to include full-text articles and added a prediction method based on chemical structures. We further changed our scheme for transferring interactions between species to rely on orthology rather than protein similarity. This improves the performance within protein families, where scores are now transferred only to orthologous proteins, but not to paralogous proteins. STITCH can be accessed with a web-interface, an API and downloadable files.
DOI: 10.1371/journal.ppat.1000711
2010
Cited 370 times
Like Will to Like: Abundances of Closely Related Species Can Predict Susceptibility to Intestinal Colonization by Pathogenic and Commensal Bacteria
The intestinal ecosystem is formed by a complex, yet highly characteristic microbial community. The parameters defining whether this community permits invasion of a new bacterial species are unclear. In particular, inhibition of enteropathogen infection by the gut microbiota ( = colonization resistance) is poorly understood. To analyze the mechanisms of microbiota-mediated protection from Salmonella enterica induced enterocolitis, we used a mouse infection model and large scale high-throughput pyrosequencing. In contrast to conventional mice (CON), mice with a gut microbiota of low complexity (LCM) were highly susceptible to S. enterica induced colonization and enterocolitis. Colonization resistance was partially restored in LCM-animals by co-housing with conventional mice for 21 days (LCM(con21)). 16S rRNA sequence analysis comparing LCM, LCM(con21) and CON gut microbiota revealed that gut microbiota complexity increased upon conventionalization and correlated with increased resistance to S. enterica infection. Comparative microbiota analysis of mice with varying degrees of colonization resistance allowed us to identify intestinal ecosystem characteristics associated with susceptibility to S. enterica infection. Moreover, this system enabled us to gain further insights into the general principles of gut ecosystem invasion by non-pathogenic, commensal bacteria. Mice harboring high commensal E. coli densities were more susceptible to S. enterica induced gut inflammation. Similarly, mice with high titers of Lactobacilli were more efficiently colonized by a commensal Lactobacillus reuteri(RR) strain after oral inoculation. Upon examination of 16S rRNA sequence data from 9 CON mice we found that closely related phylotypes generally display significantly correlated abundances (co-occurrence), more so than distantly related phylotypes. 
Thus, in essence, the presence of closely related species can increase the chance of invasion of newly incoming species into the gut ecosystem. We provide evidence that this principle might be of general validity for invasion of bacteria in preformed gut ecosystems. This might be of relevance for human enteropathogen infections as well as therapeutic use of probiotic commensal bacteria.
DOI: 10.1126/science.aai7825
2017
Cited 329 times
Cell-wide analysis of protein thermal unfolding reveals determinants of thermostability
How proteomes take the heat. Living organisms are very sensitive to temperature, and much of this is attributed to its effect on the structure and function of proteins. Leuenberger et al. explored thermostability on a proteome-wide scale in bacteria, yeast, and human cells by using a combination of limited proteolysis and mass spectrometry (see the Perspective by Vogel). Their results suggest that temperature-induced cell death is caused by the loss of a subset of proteins with key functions. The study also provides insight into the molecular and evolutionary bases of protein and proteome stability. Science, this issue p. eaai7825; see also p. 794
DOI: 10.1126/science.1133420
2007
Cited 327 times
Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments
The taxonomic composition of environmental communities is an important indicator of their ecology and function. We used a set of protein-coding marker genes, extracted from large-scale environmental shotgun sequencing data, to provide a more direct, quantitative, and accurate picture of community composition than that provided by traditional ribosomal RNA-based approaches depending on the polymerase chain reaction. Mapping marker genes from four diverse environmental data sets onto a reference species phylogeny shows that certain communities evolve faster than others. The method also enables determination of preferred habitats for entire microbial clades and provides evidence that such habitat preferences are often remarkably stable over time.
DOI: 10.1016/j.sbi.2004.05.003
2004
Cited 326 times
Protein interaction networks from yeast to human
Protein interaction networks summarize large amounts of protein–protein interaction data, both from individual, small-scale experiments and from automated high-throughput screens. The past year has seen a flood of new experimental data, especially on metazoans, as well as an increasing number of analyses designed to reveal aspects of network topology, modularity and evolution. As only minimal progress has been made in mapping the human proteome using high-throughput screens, the transfer of interaction information within and across species has become increasingly important. With more and more heterogeneous raw data becoming available, proper data integration and quality control have become essential for reliable protein network reconstruction, and will be especially important for reconstructing the human protein interaction network.
DOI: 10.1371/journal.ppat.1001097
2010
Cited 303 times
The Microbiota Mediates Pathogen Clearance from the Gut Lumen after Non-Typhoidal Salmonella Diarrhea
Many enteropathogenic bacteria target the mammalian gut. The mechanisms protecting the host from infection are poorly understood. We have studied the protective functions of secretory antibodies (sIgA) and the microbiota, using a mouse model for S. typhimurium diarrhea. This pathogen is a common cause of diarrhea in humans world-wide. S. typhimurium (S. tmatt, sseD) causes a self-limiting gut infection in streptomycin-treated mice. After 40 days, all animals had overcome the disease, developed a sIgA response, and most had cleared the pathogen from the gut lumen. sIgA limited pathogen access to the mucosal surface and protected from gut inflammation in challenge infections. This protection was O-antigen specific, as demonstrated with pathogens lacking the S. typhimurium O-antigen (wbaP, S. enteritidis) and sIgA-deficient mice (TCRβ−/−δ−/−, JH−/−, IgA−/−, pIgR−/−). Surprisingly, sIgA-deficiency did not affect the kinetics of pathogen clearance from the gut lumen. Instead, this was mediated by the microbiota. This was confirmed using 'L-mice' which harbor a low complexity gut flora, lack colonization resistance and develop a normal sIgA response, but fail to clear S. tmatt from the gut lumen. In these mice, pathogen clearance was achieved by transferring a normal complex microbiota. Thus, besides colonization resistance (= pathogen blockage by an intact microbiota), the microbiota mediates a second, novel protective function, i.e. pathogen clearance. Here, the normal microbiota re-grows from a state of depletion and disturbed composition and gradually clears even very high pathogen loads from the gut lumen, a site inaccessible to most "classical" immune effector mechanisms. In conclusion, sIgA and microbiota serve complementary protective functions. The microbiota confers colonization resistance and mediates pathogen clearance in primary infections, while sIgA protects from disease if the host re-encounters the same pathogen. This has implications for curing S. typhimurium diarrhea and for preventing transmission.
DOI: 10.1016/j.cub.2010.01.051
2010
Cited 283 times
Arabidopsis Female Gametophyte Gene Expression Map Reveals Similarities between Plant and Animal Gametes
The development of multicellular organisms is controlled by differential gene expression whereby cells adopt distinct fates. A spatially resolved view of gene expression allows the elucidation of transcriptional networks that are linked to cellular identity and function. The haploid female gametophyte of flowering plants is a highly reduced organism: at maturity, it often consists of as few as three cell types derived from a common precursor [1, 2]. However, because of its inaccessibility and small size, we know little about the molecular basis of cell specification and differentiation in the female gametophyte. Here we report expression profiles of all cell types in the mature Arabidopsis female gametophyte. Differentially expressed posttranscriptional regulatory modules and metabolic pathways characterize the distinct cell types. Several transcription factor families are overrepresented in the female gametophyte in comparison to other plant tissues, e.g., type I MADS domain, RWP-RK, and reproductive meristem transcription factors. PAZ/Piwi-domain encoding genes are upregulated in the egg, indicating a role of epigenetic regulation through small RNA pathways, a feature paralleled in the germline of animals [3]. A comparison of human and Arabidopsis egg cells for enrichment of functional groups identified several similarities that may represent a consequence of coevolution or ancestral gametic features.
DOI: 10.1126/scisignal.2001182
2010
Cited 275 times
Phosphoproteomic Analysis Reveals Interconnected System-Wide Responses to Perturbations of Kinases and Phosphatases in Yeast
A system-wide analysis of protein phosphorylation in yeast reveals robustness in the network of kinases and phosphatases.
DOI: 10.1186/gb-2007-8-1-r10
2007
Cited 270 times
Prediction of effective genome size in metagenomic samples.
We introduce a novel computational approach to predict effective genome size (EGS; a measure that includes multiple plasmid copies, inserted sequences, and associated phages and viruses) from short sequencing reads of environmental genomics (or metagenomics) projects. We observe considerable EGS differences between environments and link this with ecologic complexity as well as species composition (for instance, the presence of eukaryotes). For example, we estimate EGS in a complex, organism-dense farm soil sample at about 6.3 megabases (Mb) whereas that of the bacteria therein is only 4.7 Mb; for bacteria in a nutrient-poor, organism-sparse ocean surface water sample, EGS is as low as 1.6 Mb. The method also permits evaluation of completion status and assembly bias in single-genome sequencing projects.
DOI: 10.1038/nrd.2018.14
2018
Cited 262 times
Unexplored therapeutic opportunities in the human genome
In 2014, the Illuminating the Druggable Genome programme was launched to promote the exploration of currently understudied but potentially druggable proteins. This article discusses how the systematic collection and processing of a wide array of biological and chemical data as part of this programme has enabled the development of evidence-based criteria for tracking the target development level of human proteins, which indicates a substantial knowledge deficit for approximately one out of three proteins in the human proteome. It also highlights the nature of the unexplored therapeutic opportunities for major protein families. A large proportion of biomedical research and the development of therapeutics is focused on a small fraction of the human genome. In a strategic effort to map the knowledge gaps around proteins encoded by the human genome and to promote the exploration of currently understudied, but potentially druggable, proteins, the US National Institutes of Health launched the Illuminating the Druggable Genome (IDG) initiative in 2014. In this article, we discuss how the systematic collection and processing of a wide array of genomic, proteomic, chemical and disease-related resource data by the IDG Knowledge Management Center have enabled the development of evidence-based criteria for tracking the target development level (TDL) of human proteins, which indicates a substantial knowledge deficit for approximately one out of three proteins in the human proteome. We then present spotlights on the TDL categories as well as key drug target classes, including G protein-coupled receptors, protein kinases and ion channels, which illustrate the nature of the unexplored opportunities for biomedical research and therapeutic development.
DOI: 10.1093/nar/gkr1011
2011
Cited 251 times
STITCH 3: zooming in on protein-chemical interactions
To facilitate the study of interactions between proteins and chemicals, we have created STITCH, an aggregated database of interactions connecting over 300,000 chemicals and 2.6 million proteins from 1133 organisms. Compared to the previous version, the number of chemicals with interactions and the number of high-confidence interactions both increase 4-fold. The database can be accessed interactively through a web interface, displaying interactions in an integrated network view. It is also available for computational studies through downloadable files and an API. As an extension in the current version, we offer the option to switch between two levels of detail, namely whether stereoisomers of a given compound are shown as a merged entity or as separate entities. Separate display of stereoisomers is necessary, for example, for carbohydrates and chiral drugs. Combining the isomers increases the coverage, as interaction databases and publications found through text mining will often refer to compounds without specifying the stereoisomer. The database is accessible at http://stitch.embl.de/.
DOI: 10.1093/nar/gkab835
2021
Cited 240 times
Correction to ‘The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets’
DOI: 10.1038/nmeth.3830
2016
Cited 194 times
Standardized benchmarking in the quest for orthologs
Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.
DOI: 10.1038/nrd.2018.52
2018
Cited 188 times
Unexplored therapeutic opportunities in the human genome.
DOI: 10.1038/s42003-019-0741-7
2020
Cited 146 times
Cancer LncRNA Census reveals evidence for deep functional conservation of long noncoding RNAs in tumorigenesis
Long non-coding RNAs (lncRNAs) are a growing focus of cancer genomics studies, creating the need for a resource of lncRNAs with validated cancer roles. Furthermore, it remains debated whether mutated lncRNAs can drive tumorigenesis, and whether such functions could be conserved during evolution. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, we introduce the Cancer LncRNA Census (CLC), a compilation of 122 GENCODE lncRNAs with causal roles in cancer phenotypes. In contrast to existing databases, CLC requires strong functional or genetic evidence. CLC genes are enriched amongst driver genes predicted from somatic mutations, and display characteristic genomic features. Strikingly, CLC genes are enriched for driver mutations from unbiased, genome-wide transposon-mutagenesis screens in mice. We identified 10 tumour-causing mutations in orthologues of 8 lncRNAs, including LINC-PINT and NEAT1, but not MALAT1. Thus CLC represents a dataset of high-confidence cancer lncRNAs. Mutagenesis maps are a novel means for identifying deeply-conserved roles of lncRNAs in tumorigenesis.
DOI: 10.1093/nar/gkac1022
2022
Cited 67 times
eggNOG 6.0: enabling comparative genomics across 12 535 organisms
The eggNOG (evolutionary gene genealogy Non-supervised Orthologous Groups) database is a bioinformatics resource providing orthology data and comprehensive functional information for organisms from all domains of life. Here, we present a major update of the database and website (version 6.0), which increases the number of covered organisms to 12 535 reference species, expands functional annotations, and implements new functionality. In total, eggNOG 6.0 provides a hierarchy of over 17M orthologous groups (OGs) computed at 1601 taxonomic levels, spanning 10 756 bacterial, 457 archaeal and 1322 eukaryotic organisms. OGs have been thoroughly annotated using recent knowledge from functional databases, including KEGG, Gene Ontology, UniProtKB, BiGG, CAZy, CARD, PFAM and SMART. eggNOG also offers phylogenetic trees for all OGs, maximising utility and versatility for end users while allowing researchers to investigate the evolutionary history of speciation and duplication events as well as the phylogenetic distribution of functional terms within each OG. Furthermore, the eggNOG 6.0 website contains new functionality to mine orthology and functional data with ease, including the possibility of generating phylogenetic profiles for multiple OGs across species or identifying single-copy OGs at custom taxonomic levels. eggNOG 6.0 is available at http://eggnog6.embl.de.
2004
Cited 277 times
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.
We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome--composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes--provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.
DOI: 10.1038/sj.embor.7400538
2005
Cited 270 times
Environments shape the nucleotide composition of genomes
Scientific Report, 1 December 2005, free access. Konrad U Foerstner (1), Christian von Mering (1), Sean D Hooper (1) and Peer Bork (1,2; corresponding author). (1) European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany. (2) Max-Delbrück Centre for Molecular Medicine, Robert-Rössle-Strasse 10, 13092 Berlin, Germany.
Tel: +49 6221 387 8526; Fax: +49 6221 387 517. EMBO Reports (2005) 6: 1208-1213. https://doi.org/10.1038/sj.embor.7400538
To test the impact of environments on genome evolution, we analysed the relative abundance of the nucleotides guanine and cytosine (‘GC content’) of large numbers of sequences from four distinct environmental samples (ocean surface water, farm soil, an acidophilic mine drainage biofilm and deep-sea whale carcasses). We show that the GC content of complex microbial communities seems to be globally and actively influenced by the environment. The observed nucleotide compositions cannot be easily explained by distinct phylogenetic origins of the species in the environments; the genomic GC content may change faster than was previously thought, and is also reflected in the amino-acid composition of the proteins in these habitats. Introduction The relative abundance of the nucleotides guanine and cytosine (‘GC content’) varies widely between genomes of different species and even between entire phyla (Sueoka, 1962). However, it is unclear whether this is due to intrinsic, organism-specific mechanisms or external factors, and whether it is the result of neutral processes or selection. Several hypotheses have been put forward to explain variations in the GC content of organisms, some of which are controversial (discussed by Bentley & Parkhill, 2004). These hypotheses are often based on observed, simple correlations of GC content with another (intrinsic or extrinsic) measure. One of the intrinsic correlations is a tendency of large genomes to be GC rich and small genomes to be GC poor (Heddi et al, 1998; Moran, 2002; Rocha & Danchin, 2002). 
Because large genomes are presumably found in more complex, variable environments, there could be an indirect link between GC content and niche complexity. One possible reason for this is the higher cost of synthesis of ATP than of UDP (in complex environments, growth and ATP synthesis are presumed to be slower). The need for being able to quickly mobilize ATP may also have a role in the case of small genomes (Rocha & Danchin, 2002). As random mutations of DNA are mainly the conversion from C to T and from G to A, the lack of repair mechanisms in reduced genomes could also be a reason for small genomes being AT rich (Glass et al, 2000). Another factor could be the preferred growth temperature of an organism, which has been proposed to correlate with GC content (Musto et al, 2004), but this is under debate (Marashi & Ghalanbor, 2004; Musto et al, 2005). Growth temperature is known to correlate with polypurine (AG) tracts in messenger RNAs (Lobry & Chessel, 2003; Paz et al, 2004). Although this alone does not preclude a correlation with GC, it would disfavour extreme GC levels in thermophilic organisms. It has been observed that genomes of some nitrogen-fixing organisms contain a higher fraction of guanine and cytosine than the genomes of nonfixing species of the same genus (McEwan et al, 1998). Likewise, Naya et al (2002) put forward a connection between an aerobic lifestyle and an increased GC content. However, most of the above correlations are not very strong, and could obviously be merely indirect consequences of other, as yet unknown, factors that influence genomic GC content more directly. Another complication is that, so far, the field has focused on available genome sequences, which are derived from single isolates from a wide variety of environments. This has precluded the analysis of community effects (in natural settings, microbes may live in large communities of hundreds or thousands of different species), and of global influences of the environment. 
In addition, it neglects the large fraction of environmental microbes that resist cultivation in the laboratory (Staley & Konopka, 1985). Only recently, random shotgun sequence data from environmental DNA preparations have become available, allowing an unbiased view on the genomic characteristics of an entire environmental community. Here we show, using the large-scale data from Sargasso Sea surface water (Venter et al, 2004), from a biofilm in an underground acid drainage mine (Tyson et al, 2004) as well as from farm soil and deep-sea whale carcasses (Tringe et al, 2005), that the environment indeed has a considerable impact on GC content and implicitly also on the amino-acid composition of the proteins in a habitat. Results Unexpected GC-content distributions in environments To obtain a representative, quantitative estimate of the environmental GC-content distribution, raw sequencing reads were analysed (not the assembled contigs). However, analysis of raw sequencing reads may generate some inaccuracies, as they can contain regions of poor sequencing quality. Therefore, consistency checks of increasing stringency were executed, invariably confirming the initial GC-content distributions, whether by limiting the analysis to open reading frames with clear homology, or even to a restricted set of translation-related marker genes (see Methods for details). Owing to the large amounts of DNA (numerous independent reads totalling more than 100 Mbp for each of the four habitats), the GC-content patterns are very robust, and (sub)samples from similar environments tend to have similar GC-content patterns (Fig 1A). Surprisingly, the samples from farm soil and ocean surface water—both of which contain DNA from more than 1,000 diverse, non-abundant species (Venter et al, 2004; Tringe et al, 2005)—are very different, with the surface water sample having a GC-content median of around 34% and the soil sample around 61%. 
To test whether these differences are simply the result of distinct phylogenetic compositions of the samples, we estimated the GC-content distribution that the environments were expected to have, on the basis of the known abundances of the various phyla and the GC content of previously known genomes from these phyla. Both water and soil samples deviated strongly from expectations (Fig 1B; supplementary Fig 1 online; expected distributions were estimated by re-creating the communities from known genomes and matching the reported phylogenetic compositions). Strikingly, the GC content in these two complex environments is more narrowly distributed than that of most bacterial phyla, which is unexpected as the environments contain species from many phyla and should therefore have an even broader distribution than the 162 completely sequenced genomes known today (see bottom of Fig 1A for comparison). In addition, we observe that GC-content differences exist even for closely related sequences (Fig 2B), suggesting an active, continuing process. Figure 1.Guanine and cytosine content of environmental sequences. Guanine and cytosine content distributions and predicted frequencies of amino acids in four environments (eight subsamples in total, all containing >90% prokaryotic species), compared with completely sequenced prokaryotic genomes grouped into phyla and subphyla. The trees depict the relationships between the samples (Tringe et al, 2005), and between phyla and subphyla to which the genomes belong. The number of sequenced genomes available for each taxonomic group is given in parentheses. Only phyla with at least three completely sequenced genomes have been included, and only those environmental sequence fragments that contain at least one predicted open reading frame with significant similarity to a known gene (60 bits or better) are shown. (A) Relative distributions of Guanine and cytosine (GC) content values, averaged over individual sequence reads. 
For comparability, virtual reads were generated for completely sequenced genomes. The darker the colour, the higher the number of reads with the respective GC content. Vertical dashed lines denote the average value of each sample/group. (B) Comparison of the GC distribution of Sargasso Sea reads (subsamples #2–#4) with (i) a subset that contains only translation genes occurring once per genome and (ii) with a simulated sample derived from completely sequenced genomes and selected to contain the same distribution of phyla. Translation genes show a distribution similar to the whole set, indicating that no bias is introduced by gene content (larger genomes may contain many genes with unusual GC content); the deviation from the simulated sample shows that GC content is apparently not always a simple function of the broad phylogenetic distribution of the species in an environment. (C) Frequencies of the amino acids lysine and alanine among encoded proteins. Notice the dependency on GC content (for other amino acids, as well as a compound index, see supplementary Table 1 online). Figure 2. Guanine and cytosine content analysis of open reading frames. (A) Deviation from expectation. Guanine and cytosine (GC) content distributions are shown for each environmental sample, separately for each codon position. The curves are compared with the expected distributions; the latter were derived from known genomes by sampling their DNA in amounts matching the overall phylogenetic compositions reported for the samples. (B) GC-content differences for paired open reading frames (ORFs) of high sequence similarity (that is, recent divergence). ORFs were paired on the basis of reciprocal best matches in BLAST searches (see supplementary Figure 3 online for more details). Error bars denote 90% confidence intervals of the mean. (C) Phylogenetic distributions of organisms, as reported from 16S ribosomal RNA analysis, for two principal samples. 
Note the wide range of phyla present. ac, Actinobacteria; ad, Acidobacteria; ap, α-Proteobacteria; ba, Bacteroidetes; bp, β-Proteobacteria; cb, Chlorobi; ch, Chloroflexi; cr, Crenarchaeota; cy, Cyanobacteria; de, Deinococcus-Thermus; dp, δ-Proteobacteria; ep, ε-Proteobacteria; er, Euryarchaeota; fi, Firmicutes; fu, Fusobacteria; ge, Gemmatimonadetes; gp, γ-Proteobacteria; ni, Nitrospira; ot, others; pl, Planctomycetes; sp, Spirochaetes. The above trends are weaker for the acidic biofilm and the whale carcasses, but these environments are much younger (far less than 100 years old; Tyson et al, 2004; Tringe et al, 2005), and seem to contain only a few species. Unconstrained nucleotides show the largest differences To avoid possible biases due to habitat-specific, perhaps unusual, features of non-coding DNA and to measure functional constraints, we restricted the analysis to the open reading frames themselves (of length 150 codons or longer; Fickett, 1995), and analysed the GC-content distribution separately for each of the three codon positions. We found that the third codon position is even more extreme with respect to GC distribution than the average of all three positions (Fig 2, the median in farm soil is 74%, versus 24% in the ocean surface water). The third codon position is relatively free to evolve (owing to the degeneracy of the genetic code), and its extreme GC-content distribution suggests that the process that drives GC-content changes is (at least to some extent) kept in check by coding requirements. Global differences in amino-acid usage in proteins The overall frequencies of the various amino acids in encoded proteins are known to vary with changes in overall GC content in microbial genomes (Sueoka, 1961). 
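The codon-position analysis described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `gc_by_codon_position` and the `min_codons` parameter are hypothetical, and the sketch assumes that each ORF is already in frame and starts at codon position 1. The 150-codon cutoff is the one quoted in the text.

```python
def gc_by_codon_position(orf, min_codons=150):
    """Return the GC fraction at codon positions 1, 2 and 3 of an in-frame ORF.

    Returns None for ORFs shorter than `min_codons` codons (the text uses
    ORFs of 150 codons or longer). Trailing bases that do not form a
    complete codon are ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3          # drop an incomplete last codon
    if usable // 3 < min_codons:
        return None
    fractions = []
    for pos in range(3):
        bases = orf[pos:usable:3]             # every third base from offset pos
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases))
    return fractions


# Toy ORF (codons ATG GCC AAG TTT GGC); cutoff lowered for the illustration.
print(gc_by_codon_position("ATGGCCAAGTTTGGC", min_codons=1))  # [0.4, 0.4, 0.8]
```

Collecting the third value over all ORFs of a sample and taking the median would give a per-sample figure comparable in kind to the third-position medians quoted above.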
To confirm and assess this dependency in the case of environmental communities, we globally counted amino acids in predicted proteins, and computed the relative fraction of each amino acid in the various samples (Fig 1C; supplementary Table 1 online). The following amino acids are encoded by AT-rich codons, and are thus expected to be over-represented in low-GC environments: F, Y, M, I, N and K. Conversely, the following amino acids are expected to be over-represented in high-GC environments: G, A, R and P. The abundance ratio of the two groups (the so-called ‘FYMINK/GARP’ index; Foster et al, 1997) correlates inversely with overall GC content, as expected (supplementary Table 1 online). Discussion Environmental microbial communities seem to show distinct, and unexpectedly narrow, GC-content distributions. The observed GC patterns are not simply a result of differing species compositions in each environment, as simulations of these compositions using sequenced genomes with the same phylogenetic distribution results in distinct GC patterns (see Fig 1B for a striking example; also see supplementary Fig 1 online). Even closely related sequences, when they are from different environments, show marked differences in GC content, more so than when they are from the same environment (Fig 2B). We can exclude an impact of certain enriched gene families, because the differences remain when the analysis is restricted to a set of essential genes that occur only once per genome and are present in each environment (Fig 1B; supplementary Fig 1 online). However, we cannot completely rule out effects due to differences in experimental protocols (such as DNA preparation or cloning). A weak correlation between genome size and GC content (Moran, 2002; supplementary Fig 2 online) might reflect one possible environmental impact: genomes in ocean surface water are smaller than in soil (Venter et al, 2004; Tringe et al, 2005). 
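The ‘FYMINK/GARP’ index introduced in the Results above is a simple ratio of residue counts, and a short Python sketch may make the definition concrete. The function name `fymink_garp_index` is hypothetical, and the sketch assumes the index is computed from pooled counts over all predicted proteins of a sample; the text does not say whether the original analysis pooled counts or averaged per protein.

```python
from collections import Counter

AT_RICH_RESIDUES = set("FYMINK")   # amino acids encoded by AT-rich codons
GC_RICH_RESIDUES = set("GARP")     # amino acids encoded by GC-rich codons


def fymink_garp_index(proteins):
    """Ratio of pooled F, Y, M, I, N, K counts to pooled G, A, R, P counts."""
    counts = Counter()
    for seq in proteins:
        counts.update(seq.upper())
    at_rich = sum(counts[aa] for aa in AT_RICH_RESIDUES)
    gc_rich = sum(counts[aa] for aa in GC_RICH_RESIDUES)
    if gc_rich == 0:
        raise ValueError("no G, A, R or P residues in the input")
    return at_rich / gc_rich


# Toy input: six FYMINK residues and four GARP residues.
print(fymink_garp_index(["MKKNIF", "GAAR"]))  # 1.5
```

A sample with a higher index is richer in residues encoded by AT-rich codons, which is the direction the text reports for low-GC environments.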
In any case, the narrow distributions of the GC content in complex habitats indicate that mainly external environmental factors influence the GC nucleotide composition of a community, either selectively or by causing a directed, mechanistic mutational bias. These factors have to be more global than the previously suggested lifestyle influences (Bentley & Parkhill, 2004), such as the use of oxygen as an energy source (Naya et al, 2002), the ability to fix nitrogen (McEwan et al, 1998) or differences in effective population size (Moran, 1996; also see supplementary Fig 4 online). One possibility would be ultraviolet irradiation, which is particularly high in surface water, to the extent that it influences bacterioplankton productivity (Herndl et al, 1993). Whatever is causing the differences in GC content, it could either actively change the GC content of the existing organisms in an environment, or alternatively, it could limit the type of microbes that can successfully populate an environment in the first place. Genome-wide changes of GC content are thought to occur on relatively slow timescales: a 1% change in GC content is projected to require about 3 million years (Haywood-Farmer & Otto, 2003). In contrast, microbial communities are presumably broken up and re-assembled on much shorter timescales (open oceans, for example, have strong water currents, and global ocean mixing occurs fast, in only a few centuries; Stuiver et al, 1984). This would argue that community GC-content patterns originate at the time of community assembly, by selective pressures restricting the set of appropriate organisms from a larger pool of available organisms. Supporting this, we observe (in all environments tested) that the distribution of GC content is much narrower than the GC content of a simple, unbiased mix of all prokaryotes known at present (Fig 1A). 
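The timescale argument above can be made concrete with a rough calculation. The Python sketch below is only an illustration under stated assumptions: it reads "1% of change" as one percentage point, extrapolates the quoted rate (about 3 million years per point) linearly, uses the read-level medians from the Results (about 34% for surface water and 61% for soil), and takes 1,000 years as a generous stand-in for "a few centuries". None of these simplifications come from the paper itself, and a difference between two communities is not the same thing as a change within one lineage.

```python
YEARS_PER_GC_POINT = 3_000_000      # quoted rate: ~3 million years per 1% GC change


def years_for_gc_shift(points):
    """Years needed for a GC shift of `points` percentage points (linear assumption)."""
    return points * YEARS_PER_GC_POINT


def gc_shift_after(years):
    """GC shift in percentage points expected after `years` (linear assumption)."""
    return years / YEARS_PER_GC_POINT


soil_median, ocean_median = 61, 34  # approximate medians from the Results
gap = soil_median - ocean_median    # 27 percentage points

print(years_for_gc_shift(gap))      # 81000000 years at the quoted rate
print(gc_shift_after(1_000))        # about 0.0003 points within 1,000 years
```

Under these assumptions, in-place drift during the lifetime of a community would account for only a tiny fraction of the observed gap, which is in line with the argument that the GC patterns arise when the community is assembled.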
The observed GC-content differences have a direct impact on the amino-acid composition of proteins in the respective environments (Fig 1C), a correlation (Sueoka, 1961) that is well established for individual genomes (Bharanidharan et al, 2004), and that can now be extended to the genetic material of whole communities. GC-rich communities contain more amino acids encoded by GC-rich codons, whereas the opposite is true for GC-poor communities (Fig 1C; supplementary Table 1 online). Considering the relatively young age of any given microbial community, it seems that the local amino-acid usage fluctuates rapidly, complementary to a drift at evolutionary timescales that has been observed recently (Jordan et al, 2005). Methods Data. At the time of this study, four distinct environments had been analysed through cultivation-independent, large-scale DNA shotgun sequencing (‘large scale’ being arbitrarily defined as more than 100 Mbp of raw sequence): surface sea water from the Sargasso Sea (Venter et al, 2004); a pair of deep-sea whale carcasses (‘whale fall’) from distinct geographic locations (Tringe et al, 2005); an acidophilic biofilm from an underground mine drainage flow (Tyson et al, 2004); and agricultural surface soil from a farm in Minnesota (Tringe et al, 2005). Collectively, more than 2 Gbp of sequence data are available, and they provide the first opportunity for an unbiased assessment of the nucleotide composition of community DNA, because previous DNA collections (PCR based or cultivation dependent) can be assumed to have substantial experimental bias (Suzuki & Giovannoni, 1996). For all four environments, most of the sequences found (>90%) were from prokaryotic organisms, together with an unknown fraction of associated bacteriophages (but phage DNA did not influence the results; see below for specific tests). Sargasso Sea surface ocean water. 
For this environment, a total of 1,986,782 raw sequencing reads are available (Venter et al, 2004) from seven different water samples (∼2 Gbp of raw sequence). We chose to limit the analysis to samples #2–#4, constituting about 51% of the data, for two reasons: sample #1 is somewhat controversial (Delong, 2005), being the only sample that contains several dominating species—large fractions of their complete genomes could actually be assembled from the data. These dominating species showed a suspiciously low number of polymorphisms, and were not re-discovered in an independent sample from the same site. Therefore, it cannot be excluded that sample #1 has a certain fraction of clonally expanded, contaminating microbes—which is why it was omitted here. Samples #5–#7 were omitted because they had undergone various changes in filtering regimes (some selecting for large particle sizes only), and because they were not used for the assembly in the original publication. Minnesota farm surface soil. This data set consists of 198,529 raw sequencing reads (220 Mbp). However, the library preparation procedure that was applied to this sample included an amplification step, resulting in several clones with identical inserts. After removal of this redundancy, 149,139 sequencing reads remained, which were used for the present analysis. Acidic mine drainage biofilm. In all, 124,805 raw sequencing reads have been generated for this sample (Tyson et al, 2004), totalling about 124 Mbp of sequence. The original publication focused mainly on those reads that contributed to genome assembly, but for this study all reads were considered, independent of assembly. Deep-sea whale carcasses (‘whale fall’). Three subsamples have been analysed (Tringe et al, 2005), from two distinct carcasses, generating a total of 116,464 raw sequencing reads. The two carcasses are from distinct geographic locations, several thousand miles apart. 
All four environments vary with respect to the relative abundance and diversity of the bacterial species they contain. This leads to marked differences in the extent to which the raw reads could be assembled into larger contigs. The most extensive assembly was reported for the acid mine drainage community—here, more than two-thirds of the sequencing reads could be assembled into contigs (enabling the almost complete assembly of five genomes). At the other extreme, less than 1% of the soil sequences could be assembled (arguing for a very large diversity of species in the soil). The other two environments were between these two extremes: assembly rates were about 60% and 45% for the Sargasso Sea data and the whale-fall data, respectively. GC-content distributions. Generally, GC content was measured separately for each read, and all the values for an entire sample were then binned and plotted as a relative distribution of GC content. This indicates that the ‘window size’ of the GC-content measurement was equivalent to the average read length (between 900 and 1,100 bp, depending on sample). As a first consistency check, the analysis was limited to reads that showed an unequivocal homology to a known protein (scoring at least 60 bits in BLAST searches), or had been properly assembled into a longer contig that showed such homology (Fig 1A,B; supplementary Fig 1 online). This procedure filtered out reads of overall poor quality. As a second check, the analysis was further restricted to sequences that were clearly homologous to a set of 61 marker genes known to be present in all prokaryotic genomes studied so far, usually as single-copy genes (Fig 1B; supplementary Fig 1 online). This ensured that the result was not influenced by gene families of unknown or peripheral function that are potentially more amenable to horizontal transfer. 
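The per-read measurement described under "GC-content distributions" can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function names and the number of bins are hypothetical, and the sketch assumes that ambiguous bases such as N are simply left out of the denominator, which the text does not specify.

```python
def gc_fraction(read):
    """GC fraction of one read, counting only unambiguous A, C, G and T bases."""
    read = read.upper()
    gc = read.count("G") + read.count("C")
    acgt = gc + read.count("A") + read.count("T")
    return gc / acgt if acgt else None        # None: no usable bases in the read


def gc_distribution(reads, n_bins=20):
    """Bin per-read GC fractions and return the relative frequency of each bin."""
    counts = [0] * n_bins
    total = 0
    for read in reads:
        gc = gc_fraction(read)
        if gc is None:
            continue
        index = min(int(gc * n_bins), n_bins - 1)   # a GC fraction of 1.0 goes in the last bin
        counts[index] += 1
        total += 1
    return [count / total for count in counts] if total else [0.0] * n_bins


# Toy reads; real reads in the study were roughly 900 to 1,100 bp long.
reads = ["GGCC", "ATAT", "ATGC", "NNNN"]
print(gc_distribution(reads, n_bins=4))
```

Because each read contributes exactly one value, the read length acts as the window size of the measurement, as the text notes.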
The check also excluded any influence of bacteriophages, because the set of 61 marker genes (mainly ribosomal and translation-related genes) is usually absent from phages and viruses. Expected GC-content distributions. For each environmental data set, the approximate phylogenetic distribution of organisms was known (from marker genes or ribosomal RNA sequences). This allowed the computation of an expected GC-content distribution on the basis of traditional genome sequences, as follows: expected distributions were generated by sampling DNA fragments of lengths comparable with raw sequencing reads from the 163 complete prokaryotic genome sequences in the STRING database (von Mering et al, 2005); a further two recent genomes were included to cover phyla that are not yet represented in STRING. The various phyla to be sampled were weighted to match the phylum distribution of the environmental sample studied (within each phylum, genomes were sampled evenly). From the sampled reads, the GC-content distributions were derived exactly in the same way as for the environments (Fig 1B; supplementary Fig 1 online). Supplementary information is available at EMBO reports online (http://www.nature.com/embor/journal/vaop/ncurrent/extref/7400538-s1.pdf).
Acknowledgements This work was supported by the European Union, grant no. LSHG-CT-2003-503265. S.D.H. was supported by the Knut and Alice Wallenberg foundation.
References
Bentley SD, Parkhill J (2004) Comparative genomic structure of prokaryotes. Annu Rev Genet 38: 771–792
Bharanidharan D, Bhargavi GR, Uthanumallian K, Gautham N (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem Biophys Res Commun 315: 1097–1103
Delong EF (2005) Microbial community genomics in the ocean. Nat Rev Microbiol 6: 459–469
Fickett JW (1995) ORFs and genes: how strong a connection? J Comput Biol 2: 117–123
Foster PG, Jermiin LS, Hickey DA (1997) Nucleotide composition bias affects amino acid content in proteins coded by animal mitochondria. J Mol Evol 44: 282–288
Glass JI, Lefkowitz EJ, Glass JS, Heiner CR, Chen EY, Cassell GH (2000) The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature 407: 757–762
Haywood-Farmer E, Otto SP (2003) The evolution of genomic base composition in bacteria. Evol Int J Org Evol 57: 1783–1792
Heddi A, Charles H, Khatchadourian C, Bonnot G, Nardon P (1998) Molecular characterization of the principal symbiotic bacteria of the weevil Sitophilus oryzae: a peculiar G+C content of an endocytobiotic DNA. J Mol Evol 47: 52–61
Herndl GJ, Mullerniklaus G, Frick J (1993) Major role of ultraviolet-B in controlling bacterioplankton growth in the surface-layer of the ocean. Nature 361: 717–719
Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, Kondrashov AS, Sunyaev S (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433: 633–638
Lobry JR, Chessel D (2003) Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria. J Appl Genet 44: 235–261
Marashi SA, Ghalanbor Z (2004) Correlations between genomic GC levels and optimal growth temperatures are not ‘robust’. Biochem Biophys Res Commun 325: 381–383
McEwan CE, Gatherer D, McEwan NR (1998) Nitrogen-fixing aerobic bacteria have higher genomic GC content than non-fixing species within the same genus. Hereditas 128: 173–178
Moran NA (1996) Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc Natl Acad Sci USA 93: 2873–2878
Moran NA (2002) Microbial minimalism: genome reduction in bacterial pathogens. Cell 108: 583–586
Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F, Bernardi G (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett 573: 73–77
Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F, Bernardi G (2005) The correlation between genomic G+C and optimal growth temperature of prokaryotes is robust: a reply to Marashi and Ghalanbor. Biochem Biophys Res Commun 330: 357–360
Naya H, Romero H, Zavala A, Alvarez B, Musto H (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J Mol Evol 55: 260–264
Paz A, Mester D, Baca I, Nevo E, Korol A (2004) Adaptive role of increased frequency of polypurine tracts in mRNA sequences of thermophilic prokaryotes. Proc Natl Acad Sci USA 101: 2951–2956
Rocha EP, Danchin A (2002) Base composition bias might result from competition for metabolic resources. Trends Genet 18: 291–294
Staley JT, Konopka A (1985) Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annu Rev Microbiol 39: 321–346
Stuiver M, Quay PD, Ostlund HG (1984) Abyssal water carbon-14 distribution and the age of the world oceans. Science 219: 849–851
Sueoka N (1961) Correlation between base composition of deoxyribonucleic acid and amino acid composition of protein. Proc Natl Acad Sci USA 47: 1141–1149
Sueoka N (1962) On the genetic basis of variation and heterogeneity of DNA base composition. Proc Natl Acad Sci USA 48: 582–592
Suzuki MT, Giovannoni SJ (1996) Bias caused by template annealing in the amplification of mixtures of 16S rRNA genes by PCR. Appl Environ Microbiol 62: 625–630
Tringe SG et al (2005) Comparative metagenomics of microbial communities. Science 308: 554–557
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43
Venter JC et al (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74
von Mering C, Jensen LJ, Snel B, Hooper SD, Krupp M, Foglierini M, Jouffre N, Huynen MA, Bork P (2005) STRING: known and predicted protein–protein associations, integrated and transferred across organisms. Nucleic Acids Res 33: D433–D437
DOI: 10.1016/s0896-6273(00)00046-5
2000
Cited 249 times
Prion Protein Devoid of the Octapeptide Repeat Region Restores Susceptibility to Scrapie in PrP Knockout Mice
Mice devoid of PrP are resistant to scrapie and fail to replicate the agent. Introduction of transgenes expressing PrP into such mice restores susceptibility to scrapie. We find that truncated PrP devoid of the five copper binding octarepeats still sustains scrapie infection; however, incubation times are longer and prion titers and protease-resistant PrP are about 30-fold lower than in wild-type mice. Surprisingly, brains of terminally ill animals show no histopathology typical for scrapie. However, in the spinal cord, infectivity, gliosis, and motor neuron loss are as in scrapie-infected wild-type controls. Thus, while the region comprising the octarepeats is not essential for mediating pathogenesis and prion replication, it modulates the extent of these events and of disease presentation.
DOI: 10.1093/nar/gkp937
2009
Cited 226 times
STITCH 2: an interaction network database for small molecules and proteins
Over the last years, the publicly available knowledge on interactions between small molecules and proteins has been steadily increasing. To create a network of interactions, STITCH aims to integrate the data dispersed over the literature and various databases of biological pathways, drug-target relationships and binding affinities. In STITCH 2, the number of relevant interactions is increased by incorporation of BindingDB, PharmGKB and the Comparative Toxicogenomics Database. The resulting network can be explored interactively or used as the basis for large-scale analyses. To facilitate links to other chemical databases, we adopt InChIKeys that allow identification of chemicals with a short, checksum-like string. STITCH 2.0 connects proteins from 630 organisms to over 74,000 different chemicals, including 2200 drugs. STITCH can be accessed at http://stitch.embl.de/.
DOI: 10.1371/journal.pbio.1000048
2009
Cited 219 times
Comparative Functional Analysis of the Caenorhabditis elegans and Drosophila melanogaster Proteomes
The nematode Caenorhabditis elegans is a popular model system in genetics, not least because a majority of human disease genes are conserved in C. elegans. To generate a comprehensive inventory of its expressed proteome, we performed extensive shotgun proteomics and identified more than half of all predicted C. elegans proteins. This allowed us to confirm and extend genome annotations, characterize the role of operons in C. elegans, and semiquantitatively infer abundance levels for thousands of proteins. Furthermore, for the first time to our knowledge, we were able to compare two animal proteomes (C. elegans and Drosophila melanogaster). We found that the abundances of orthologous proteins in metazoans correlate remarkably well, better than protein abundance versus transcript abundance within each organism or transcript abundances across organisms; this suggests that changes in transcript abundance may have been partially offset during evolution by opposing changes in protein abundance.
DOI: 10.1093/nar/gkp951
2009
Cited 217 times
eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations
The identification of orthologous relationships forms the basis for most comparative genomics studies. Here, we present the second version of the eggNOG database, which contains orthologous groups (OGs) constructed through identification of reciprocal best BLAST matches and triangular linkage clustering. We applied this procedure to 630 complete genomes (529 bacteria, 46 archaea and 55 eukaryotes), which is a 2-fold increase relative to the previous version. The pipeline yielded 224,847 OGs, including 9724 extended versions of the original COG and KOG. We computed OGs for different levels of the tree of life; in addition to the species groups included in our first release (i.e. fungi, metazoa, insects, vertebrates and mammals), we have now constructed OGs for archaea, fishes, rodents and primates. We automatically annotate the non-supervised orthologous groups (NOGs) with functional descriptions, protein domains, and functional categories as defined initially for the COG/KOG database. In-depth analysis is facilitated by precomputed high-quality multiple sequence alignments and maximum-likelihood trees for each of the available OGs. Altogether, eggNOG covers 2,242,035 proteins (built from 2,590,259 proteins) and provides a broad functional description for at least 1,966,709 (88%) of them. Users can access the complete set of orthologous groups via a web interface at: http://eggnog.embl.de.
DOI: 10.1038/nbt988
2004
Cited 177 times
Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs
DOI: 10.1016/j.chom.2013.11.002
2013
Cited 148 times
Microbiota-Derived Hydrogen Fuels Salmonella Typhimurium Invasion of the Gut Ecosystem
The intestinal microbiota features intricate metabolic interactions involving the breakdown and reuse of host- and diet-derived nutrients. The competition for these resources can limit pathogen growth. Nevertheless, some enteropathogenic bacteria can invade this niche through mechanisms that remain largely unclear. Using a mouse model for Salmonella diarrhea and a transposon mutant screen, we discovered that initial growth of Salmonella Typhimurium (S. Tm) in the unperturbed gut is powered by S. Tm hyb hydrogenase, which facilitates consumption of hydrogen (H2), a central intermediate of microbiota metabolism. In competitive infection experiments, a hyb mutant exhibited reduced growth early in infection compared to wild-type S. Tm, but these differences were lost upon antibiotic-mediated disruption of the host microbiota. Additionally, introducing H2-consuming bacteria into the microbiota interfered with hyb-dependent S. Tm growth. Thus, H2 is an Achilles' heel of microbiota metabolism that can be subverted by pathogens and might offer opportunities to prevent infection.
DOI: 10.1038/msb.2008.35
2008
Cited 136 times
Millimeter‐scale genetic gradients and community‐level molecular convergence in a hypersaline microbial mat
Report, 3 June 2008, Open Access
Victor Kunin1, Jeroen Raes2, J Kirk Harris3, John R Spear4, Jeffrey J Walker5, Natalia Ivanova6, Christian von Mering7, Brad M Bebout8, Norman R Pace5, Peer Bork2 and Philip Hugenholtz1*
1 Microbial Ecology Program, DOE Joint Genome Institute, Walnut Creek, CA, USA
2 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
3 Department of Pediatrics, University of Colorado Denver, Aurora, CO, USA
4 Division of Environmental Science and Engineering, Colorado School of Mines, Golden, CO, USA
5 Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder, CO, USA
6 Genome Biology Program, DOE Joint Genome Institute, Walnut Creek, CA, USA
7 Institute of Molecular Biology and Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
8 Microbial Ecology/Biogeochemistry Research Laboratory, Exobiology branch, NASA Ames Research Center, Moffett Field, CA, USA
* Corresponding author. Microbial Ecology Program, DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA. Tel.: +1 925 296 5725; Fax: +1 925 296 5720; E-mail: [email protected]
Molecular Systems Biology (2008) 4: 198, https://doi.org/10.1038/msb.2008.35
To investigate the extent of genetic stratification in structured microbial communities, we compared the metagenomes of 10 successive layers of a phylogenetically complex hypersaline mat from Guerrero Negro, Mexico. We found pronounced millimeter-scale genetic gradients that were consistent with the physicochemical profile of the mat. Despite these gradients, all layers displayed near-identical and acid-shifted isoelectric point profiles due to a molecular convergence of amino-acid usage, indicating that hypersalinity enforces an overriding selective pressure on the mat community.
Introduction
Ecosystems often exhibit distinct gradients.
Physicochemical gradients have long been documented, but only recently has environmental shotgun sequencing allowed the associated functional (gene-based) gradients of an ecosystem's biota to be addressed. Macroscale functional gradients have been inferred from oceanic metagenomic data sets, both horizontally (Venter et al, 2004; Johnson et al, 2006; Rusch et al, 2007) and vertically (DeLong et al, 2006). Many structured microbial communities have been shown to produce steep physicochemical gradients on the scale of millimeters (Jorgensen et al, 1979; Schmitt-Wagner and Brune, 1999; Ludemann et al, 2000; Ley et al, 2006), but associated community-level functional gradients have not been demonstrated to date. Here, we investigate a complex, stratified, hypersaline microbial mat from Guerrero Negro, Baja California Sur, Mexico, as a model for fine-scale functional variation (Ley et al, 2006). The dense, tofu-like texture of this mat allows intact cross-sections to be obtained down to ∼1 mm thickness. The mat shows pronounced physicochemical variation both in space and time: oxygen is detected routinely in the top 2 mm during the day (up to 700 μM), and the mat is completely anoxic during the night. The permanently anoxic lower layers are characterized by micromolar levels of sulfide that increase with depth. The mat, dominated by bacteria, was reported to be one of the world's richest and most diverse microbial communities, comprising at least 752 observed species from 42 bacterial phyla, including 15 novel candidate phyla (Ley et al, 2006). As the mat grows in hypersaline waters (∼3 × the salinity of sea water), we were also interested to look for evidence of molecular adaptations to hypersalinity in the mat community.
Results and discussion
To investigate millimeter-scale genetic and associated functional stratification, we performed a metagenomic analysis of 10 spatially successive layers of the Guerrero Negro mat.
Mat core samples were collected during the day (Supplementary Table S1) and upper layers were sectioned at a finer scale (1 mm slices) than the lower layers (4–15 mm slices) to capture variation associated with the steep oxygen gradient in the upper millimeters of the mat (Supplementary Table S2). DNA from each layer was cloned and shotgun-sequenced using capillary sequencing with an average of ∼13 000 reads per layer. No significant assembly of the reads was possible, even when all data were combined (largest contig was 8.4 kb from a combined assembly). We therefore chose to analyze only the unassembled data (average trimmed (Chou and Holmes, 2001) read length 808 bp) to avoid chimerism that has been reported to be frequent in contigs <10 kb (Mavromatis et al, 2007). Genes were predicted on vector and quality-trimmed reads with fgenesb (http://www.softberry.com/) using a generic bacterial model, resulting in an average of 13 600 genes per layer (Supplementary Table S2). These data have been deposited in GenBank under accession numbers ABPP00000000 through ABPY00000000 and are available through the IMG/M system (Markowitz et al, 2006) (see SOM for access information). Using both bulk similarity matches and phylogenetic mapping of conserved marker genes (von Mering et al, 2007a), we found strong phylogenetic variation between layers. Cyanobacteria and Alphaproteobacteria were the most abundant lineages in the top two layers (Supplementary Figure S1). Below the upper 2 mm, Proteobacteria, Bacteroidetes, Chloroflexi and Planctomycetes were the most represented phyla, with a notable peak in Bacteroidetes at 3 mm (Supplementary Figure S1). Numerous traces of other bacterial phyla as well as some archaea and eukaryotes were also identified. A large fraction of predicted proteins in layers below 2 mm did not have significant sequence similarity to any protein in public databases, reflecting the high degree of phylum-level novelty in the mat community (Ley et al, 2006).
These metagenome-based findings are in broad agreement with single-marker gene surveys of the mat (Spear et al, 2003; Ley et al, 2006). A rough measure of functional potential per organism can be made by estimating the average effective genome size (Raes et al, 2007). Using this method, we predicted an increased average bacterial genome size at the border of the oxic and anoxic zones (1–2 mm depth): 6 Mb at the border versus 3–3.5 Mb for the rest of the mat (Supplementary Figure S2). This may reflect an increased functional complexity needed for survival in the constantly fluctuating conditions at this depth, as was recently observed in the genome of a marine Beggiatoa occupying a similar niche (Mussmann et al, 2007). To investigate genetic gradients through the mat, we determined the relative abundances of individual gene families and metabolic pathways between mat layers, and compared the mat data with external data sets for reference. Many gene families were highly abundant in the mat despite high overall functional diversity (Supplementary Figure S3) and very low sequence coverage of individual species. Indeed, the mat data set roughly doubled existing inventories for some of the gene families described below (Table I). This implies that multiple species and likely higher-level taxa contribute representatives of these families, and suggests that there has been a strong selection for a limited number of common functionalities in the mat.
Table 1.
Most prominent gene families and domains in the Guerrero Negro hypersaline mat core relative to other sequenced microbiome samples (a)
Gene family or domain (b) | Annotation | Mat | AMD | Soil | Whalefall | Gutless worm | Sludge | IMG
COG3119 | Arylsulfatase A and related enzymes | 640 (640) | 0 | 195 (145) | 46 (165) | 16 (77) | 32 (127) | 1154 (55)
COG5598 | Trimethylamine:corrinoid methyltransferase | 112 (112) | 0 | 16 (12) | 5 (18) | 52 (249) | 3 (12) | 114 (5)
COG1148 | Heterodisulfide reductase, subunit A and related polyferredoxins | 172 (172) | 0 | 16 (12) | 5 (18) | 40 (192) | 0 | 185 (9)
COG2414 | Aldehyde:ferredoxin oxidoreductase | 110 (110) | 0 | 20 (15) | 4 (14) | 39 (187) | 5 (20) | 225 (11)
Pfam05685 | DUF820 domain | 142 (142) | 3 (32) | 63 (47) | 0 | 8 (38) | 10 (40) | 825 (40)
Numbers represent raw counts and numbers in parentheses are normalized for mat data set size.
(a) Mat (combined data from all layers; present study), AMD (acid mine drainage biofilm; Tyson et al, 2004), soil (Tringe et al, 2005), whalefall (sample 3; Tringe et al, 2005), gutless worm (Woyke et al, 2006), sludge (US; Garcia Martin et al, 2006), IMG (version 2.20, combined data from 728 microbial genomes; Markowitz et al, 2006).
(b) COG: cluster of orthologous genes (Tatusov et al, 1997); pfam (Bateman et al, 2002).
The key aspect of this study was to use the metagenomic data to determine what, if any, millimeter-scale genetic gradients are detectable in this very complex and structured ecosystem. Several gene families and pathways either directly (Figure 1A) or inversely (Figure 1B) tracked the steep oxygen gradient in the top 2 mm of the mat and sulfide gradient below 2 mm. Genes directly involved in photosynthesis (KEGG map 00195) were statistically over-represented in the top two layers relative to lower layers.
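A resampling test of the kind the paper uses for its gradient plots (1000 bootstrap resamplings of predicted proteins, with non-overlapping error bars treated as significant) can be sketched as follows. This is only an illustrative sketch, not the authors' procedure: the function names, the toy layer data and the within-layer fraction used as "relative abundance" are assumptions, and the paper's exact normalization is described in its Supplementary information.

```python
import random
from statistics import mean, stdev


def bootstrap_abundance(assignments, family, n_boot=1000, seed=0):
    """Bootstrap mean and standard deviation of the fraction of genes
    in one layer that are assigned to `family`.

    `assignments` is a list with one family label per predicted gene
    (a hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)
    n = len(assignments)
    fractions = []
    for _ in range(n_boot):
        # Resample the layer's genes with replacement, keeping its size.
        resample = rng.choices(assignments, k=n)
        fractions.append(resample.count(family) / n)
    return mean(fractions), stdev(fractions)


def error_bars_overlap(a, b):
    """True if mean +/- one standard deviation intervals overlap."""
    (mean_a, sd_a), (mean_b, sd_b) = a, b
    return abs(mean_a - mean_b) <= sd_a + sd_b


# Toy layers (invented numbers, not data from the study).
top_layer = ["photosynthesis"] * 60 + ["other"] * 940
deep_layer = ["photosynthesis"] * 10 + ["other"] * 990

top = bootstrap_abundance(top_layer, "photosynthesis")
deep = bootstrap_abundance(deep_layer, "photosynthesis")
print("top:", top, "deep:", deep)
print("error bars overlap:", error_bars_overlap(top, deep))
```

Fixing the random seed keeps the sketch reproducible; in a real analysis the number of resamplings and the overlap criterion would follow the published methods.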
In addition, an uncharacterized protein domain (pfam05685) highly paralogous in phototrophic lineages (most cyanobacterial and some Chloroflexi genomes) showed a steep declining gradient in the top 6 mm (Figure 1A) consistent with dominance of phototrophs in the same region. Chaperones similarly tracked the oxygen gradient when all gene families with chaperone activity are combined together. The over-representation of chaperones in the top 2 mm relative to the rest of the mat may not be associated with oxygen concentration, but rather with heat stress caused by direct exposure to sunlight.
Figure 1. Vertical gradients of gene families or groups of functionally related gene families enriched in the oxic zone (A), anoxic (high H2S) zone (B) and varying across the oxic–anoxic border (low H2S) zone (C). Relative abundance is normalized by the average number of genes in a layer. In most cases, these genes and groups of genes were over-represented relative to other metagenomic data sets (Table I). Error bars denote standard deviations calculated from 1000 bootstrap resamplings of predicted proteins, and points with non-overlapping error bars are treated as significantly different. Lists of gene families used in each group (Photosynthesis-related proteins, Chaperones, Ferredoxins and associated proteins, Sugar degradation pathways, Chemotaxis and Flagella) as well as details of the resampling procedure are given in Supplementary information.
Gene families and pathways that tracked inversely with oxygen concentration included ferredoxins, trimethylamine methyltransferase (Mttb), sulfatases and sugar degradation pathways (Figure 1B). Ferredoxins and associated proteins show a four-fold increase from the top layer down to a depth of 4 mm and thereafter are uniformly over-represented.
Two COG families are chiefly responsible for this trend: COG1148 (heterodisulfide reductase, subunit A and related polyferredoxins) and COG2414 (aldehyde:ferredoxin oxidoreductase). The expansion of ferredoxins in the anoxic layers likely reflects the diversification of redox reactions required for anaerobic respiration. Mttb (pfam06253, COG5598) methyltransferase does not become significantly over-represented until at least 7 mm into the mat (Figure 1B), well below the anoxic boundary. Mttb was initially identified as a protein facilitating the first step of methanogenesis from trimethylamine in Methanosarcinaceae (Paul et al, 2000). However, this gene family is also found in methylotrophic bacteria (e.g. in Rhodobacteraceae and Rhizobiaceae), suggesting a more generalized role in C1 metabolism. One of the most pronounced inverse gradients is observed for sulfatases (COG3119) that are involved in the hydrolysis of sulfated organic compounds (Figure 1B). As sulfatases can function in the presence of oxygen, the gradient is presumably a reflection of the availability of sulfated compounds in the mat. Although the concentration gradient of sulfated compounds is not known in the mat, they are produced by phototrophs (Kates, 1986) and are widespread in marine environments (Glockner et al, 2003). Sulfatase genes obtained from the mat exhibited extensive sequence divergence, suggesting that a corresponding wide variety of sulfated organic substrates are present in the mat, with the highest concentrations below 2 mm. The over-representation of this gene family may in part be due to an expansion of sulfatase genes in the genomes of Planctomycetes, suggested to be involved particularly in the hydrolysis of sulfated glycopolymers (Glockner et al, 2003). Sugar degradation pathways (glycolysis and pentose and uronic acid degradation) show a two-fold increase with depth through the top 3 mm and maintain high relative representation in the anoxic lower layers (Figure 1B). 
This suggests that heterotrophic metabolism of sugars, particularly pentoses and uronic acids, is important in the lower layers. Organisms living at the boundary between the oxic and anoxic zones could potentially accumulate substrates with high reductive potential in the anoxic zone, and then move to the oxic zone to harvest this potential by oxidation (Mussmann et al, 2007). This would require boundary zone organisms to be motile and chemotactic. Indeed, we find that chemotaxis signature genes peak sharply at the oxic–anoxic boundary (Figure 1C). Flagella appear not to be the dominant source of motility in these chemotactic organisms as flagellar genes actually dip in this region (Figure 1C). Chemotactic gliding bacteria have been observed in fresh mat cores (Garcia-Pichel et al, 1994; Kruschel and Castenholz, 1998) and our molecular data suggest they are most abundant in the boundary zone, bridging the oxic and anoxic layers. Despite the pronounced phylogenetic and functional gradients in the mat, hypersalinity is a selective pressure common to the whole community. A known adaptation to hypersalinity is enrichment of proteins with acidic amino acids, allowing proteins to function in high cytoplasmic salt concentrations (Soppa, 2006). The resulting acid-shifted protein isoelectric points have been documented in the genomes of only two lineages, the archaeal class Halobacteria (Kennedy et al, 2001; Soppa, 2006) and the bacterial species Salinibacter ruber (Oren and Mana, 2002; Mongodin et al, 2005), so it is unclear how widespread this mechanism is in halophilic communities. The average isoelectric points of the mat layer communities are conspicuously acid-shifted when compared with most bacteria and microbiomes that are non-halophilic (Figure 2A). We determined this to be due primarily to an enrichment in the acidic amino acid aspartate (Figure 2B). 
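The isoelectric point and aspartate comparisons described above can be illustrated with a small calculation. The sketch below is not the authors' code (the paper reports using Perl scripts detailed in its Supplementary information). It assumes an approximate, textbook-style set of pKa values and finds the pH of zero net charge by bisection; the function names and the two toy peptides are hypothetical.

```python
# Approximate pKa values (an assumption for this sketch; published
# pKa scales differ slightly and give slightly different results).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 2.0}


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    # One free amino terminus and one free carboxyl terminus per chain.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["N_TERM"]))
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["C_TERM"] - ph))
    for residue in sequence:
        if residue in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[residue]))
        elif residue in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[residue] - ph))
    return charge


def isoelectric_point(sequence, tol=0.001):
    """pH at which the net charge is zero, found by bisection.

    Net charge falls monotonically as pH rises, so bisection on the
    interval 0 to 14 converges to the single zero crossing.
    """
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def aspartate_fraction(sequence):
    """Fraction of residues that are aspartate (D)."""
    return sequence.count("D") / len(sequence)


# Toy peptides (invented for illustration only).
acidic = "MDDEDDLDEA"
basic = "MKKRKKLKRA"
for name, seq in (("acidic", acidic), ("basic", basic)):
    print(name, round(isoelectric_point(seq), 2), aspartate_fraction(seq))
```

Averaging `isoelectric_point` over all predicted proteins of a layer would give a per-layer value comparable in spirit to the averages the paper plots, although the absolute numbers depend on the pKa set chosen.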
Furthermore, the isoelectric profiles of all 10 layers converge on a common acid-shifted profile (Figure 3A) despite a significant variation in GC content between layers (Figure 3B), reflecting differing phylogenetic composition. The latter is consistent with aspartate usage being GC-independent as it can be encoded by both GC-rich and GC-poor codons (GAC and GAT, respectively). As each metagenomic read pair is likely derived from different species and no single species dominates the mat community, we conclude that a significant fraction of the community has converged on the enrichment of low isoelectric point proteins.
Figure 2. Average isoelectric point (A) and aspartate content (B) of all predicted proteins in the mat layer communities and reference bacteria, archaea, phages and microbiomes available through IMG/M (Markowitz et al, 2006). Genomic average was computed for each genome or microbiome, with 10 layers of the mat treated separately. These values were rounded up to the next (larger value) bin in increments of 0.2 and 0.5 in (A) and (B), respectively, and the distribution of the bins was plotted as a fraction of each data set.
Figure 3. Isoelectric point profiles of predicted proteins (A) and GC content profiles of reads (B) for mat layer communities. In (A), isoelectric point profiles for selected reference genomes are added to highlight the highly overlapping and acid-shifted mat layer profiles.
In summary, this study demonstrates that millimeter-scale genetic gradients can be readily discerned through a vertical cross-section of a highly structured and complex microbial community using low sequence coverage. Furthermore, we could directly and inversely correlate many of the genetic gradients to the physicochemical profile of the mat.
Microbial biofilms are important in many habitats, including our own bodies (Kroes et al, 1999; Eckburg et al, 2005), and often display physicochemical gradients at millimeter to centimeter scales. However, few biofilms are as robust as microbial mats and methods may need to be adapted to preserve spatial structure (Webster et al, 2006) and allow the relevant fine-scale genetic gradients to be resolved. Surprisingly, we found that adaptation to hypersalinity by enriching proteins with acidic amino acids is more widespread than previously appreciated. Although this is the first example of species-independent molecular convergence in a microbial community, we predict that similar convergence patterns will be observed in other communities adapted to similar or different environmental conditions, such as temperature (Gianese et al, 2001) or pressure (Simonato et al, 2006; Lauro and Bartlett, 2008).
Materials and methods
Mat core samples were collected around 1400 hours from pond 4 near pond 5 at the Exportadora de Sal saltworks, Guerrero Negro, Baja California Sur, Mexico. The salinity of the bulk water above the mat was ∼9% (∼3 × the salinity of sea water). Other metadata for the sample can be found in Supplementary Table S1. Four replicate cores were collected, sectioned into layers with sterile scalpels and DNA extracted, normalized, pooled and sequenced as described in Supplementary information. Metagenome sequence data are available under the following GenBank accession numbers: ABPP00000000, ABPQ00000000, ABPR00000000, ABPS00000000, ABPT00000000, ABPU00000000, ABPV00000000, ABPW00000000, ABPX00000000, ABPY00000000.
Community composition analysis was performed using the consensus of (i) best BlastP hits (Altschul et al, 1997) to the IMG/M database (Markowitz et al, 2006) and (ii) phylogenetic mapping of signature genes on a phylogenetic tree (von Mering et al, 2007a). See Supplementary information for details.
Gene-based functional gradients were calculated as follows: genes were assigned to their COG families (Tatusov et al, 1997) and pfam domains (Bateman et al, 2002) based on rpsBLAST (Altschul et al, 1997). The gradients were examined for possible over-representation of groups or individual families or domains, and 1000 bootstrap iterations were used to assess the significance of over-representation. The described gradients were independently confirmed using two databases: IMG/M (Markowitz et al, 2006) and the STRING database (von Mering et al, 2007b). Further details as well as groupings of families/domains are described in Supplementary information. Isoelectric point distributions, amino-acid composition and GC content were computed using appropriate Perl scripts and modules as described in Supplementary information.
Acknowledgements
We thank Amber Hartman for fruitful discussions and the Exportadora de Sal saltworks in Guerrero Negro, Baja California Sur, for access and assistance with the field site. We also thank the NASA-funded researchers at NASA Ames who assisted with the field work: David Des Marais, Moira Doty, Tori Hoehler, Mary Hogan and Kendra Turk. Sequencing was provided by the JGI Community Sequencing Program. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and the University of California, Lawrence Livermore National Laboratory under contract no. W-7405-Eng-48, Lawrence Berkeley National Laboratory under contract no. DE-AC02-05CH11231 and Los Alamos National Laboratory under contract no. DE-AC02-06NA25396. JR and PB are supported by the European Union 6th Framework Program (contract no. LSHG-CT-2004-503567).
Supporting Information
Supplementary Material (PDF document, 1.7 MB)
Supplementary Information (application/xls, 205.5 KB)
References
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30: 276–280
Chou HH, Holmes MH (2001) DNA sequence quality trimming and vector removal. Bioinformatics 17: 1093–1104
DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard NU, Martinez A, Sullivan MB, Edwards R, Brito BR, Chisholm SW, Karl DM (2006) Community genomics among stratified microbial assemblages in the ocean's interior. Science 311: 496–503
Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA (2005) Diversity of the human intestinal microbial flora. Science 308: 1635–1638
Garcia Martin H, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, Yeates C, He S, Salamov AA, Szeto E, Dalin E, Putnam NH, Shapiro HJ, Pangilinan JL, Rigoutsos I, Kyrpides NC, Blackall LL, McMahon KD, Hugenholtz P (2006) Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol 24: 1263–1269
Garcia-Pichel F, Mechling M, Castenholz RW (1994) Diel migrations of microorganisms within a benthic, hypersaline mat community. Appl Environ Microbiol 60: 1500–1511
Gianese G, Argos P, Pascarella S (2001) Structural adaptation of enzymes to low temperatures. Protein Eng 14: 141–148
Glockner FO, Kube M, Bauer M, Teeling H, Lombardot T, Ludwig W, Gade D, Beck A, Borzym K, Heitmann K, Rabus R, Schlesner H, Amann R, Reinhardt R (2003) Complete genome sequence of the marine planctomycete Pirellula sp. strain 1. Proc Natl Acad Sci USA 100: 8298–8303
SoftBerry fgenesb. http://www.softberry.com/
Johnson ZI, Zinser ER, Coe A, McNulty NP, Woodward EM, Chisholm SW (2006) Niche partitioning among Prochlorococcus ecotypes along ocean-scale environmental gradients. Science 311: 1737–1740
Jorgensen BB, Revsbech NP, Blackburn TH, Cohen Y (1979) Diurnal cycle of oxygen and sulfide microgradients and microbial photosynthesis in a cyanobacterial mat sediment. Appl Environ Microbiol 38: 46–58
Kates M (1986) Techniques of Lipidology: Isolation, Analysis, and Identification of Lipids, 2nd revised edn. Amsterdam, New York: Elsevier
Kennedy SP, Ng WV, Salzberg SL, Hood L, DasSarma S (2001) Understanding the adaptation of Halobacterium species NRC-1 to its extreme environment through computational analysis of its genome sequence. Genome Res 11: 1641–1650
Kroes I, Lepp PW, Relman DA (1999) Bacterial diversity within the human subgingival crevice. Proc Natl Acad Sci USA 96: 14547–14552
Kruschel C, Castenholz R (1998) The effect of solar UV and visible irradiance on the vertical movements of cyanobacteria in microbial mats of hypersaline waters. FEMS Microbiol Ecol 27: 53–72
Lauro FM, Bartlett DH (2008) Prokaryotic lifestyles in deep sea habitats. Extremophiles 12: 15–25
Ley RE, Harris JK, Wilcox J, Spear JR, Miller SR, Bebout BM, Maresca JA, Bryant DA, Sogin ML, Pace NR (2006) Unexpected diversity and complexity of the Guerrero Negro hypersaline microbial mat. Appl Environ Microbiol 72: 3685–3695
Ludemann H, Arth I, Liesack W (2000) Spatial changes in the bacterial community structure along a vertical oxygen gradient in flooded paddy soil cores. Appl Environ Microbiol 66: 754–762
Markowitz VM, Ivanova N, Palaniappan K, Szeto E, Korzeniewski F, Lykidis A, Anderson I, Mavrommatis K, Kunin V, Garcia Martin H, Dubchak I, Hugenholtz P, Kyrpides NC (2006) An experimental metagenome data management and analysis system. Bioinformatics 22: e359–e367
Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC (2007) Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods 4: 495–500
Mongodin EF, Nelson KE, Daugherty S, Deboy RT, Wister J, Khouri H, Weidman J, Walsh DA, Papke RT, Sanchez Perez G, Sharma AK, Nesbo CL, MacLeod D, Bapteste E, Doolittle WF, Charlebois RL, Legault B, Rodriguez-Valera F (2005) The genome of Salinibacter ruber: convergence and gene exchange among hyperhalophilic bacteria and archaea. Proc Natl Acad Sci USA 102: 18147–18152
Mussmann M, Hu FZ, Richter M, de Beer D, Preisler A, Jorgensen BB, Huntemann M, Glockner FO, Amann R, Koopman WJ, Lasken RS, Janto B, Hogg J, Stoodley P, Boissy R, Ehrlich GD (2007) Insights into the genome of large sulfur bacteria revealed by analysis of single filaments. PLoS Biol 5: e230
Oren A, Mana L (2002) Amino acid composition of bulk protein and salt relationships of selected enzymes of Salinibacter ruber, an extremely halophilic bacterium. Extremophiles 6: 217–223
Paul L, Ferguson DJ, Krzycki JA (2000) The trimethylamine methyltransferase gene and multiple dimethylamine methyltransferase genes of Methanosarcina barkeri contain in-frame and read-through amber codons. J Bacteriol 182: 2520–2529
Raes J, Korbel JO, Lercher MJ, von Mering C, Bork P (2007) Prediction of effective genome size in metagenomic samples. Genome Biol 8: R10
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K et al (2007) The Sorcerer II global ocean sampling expedition: northwest Atlantic through eastern tropical pacific. PLoS Biol 5: e77
Schmitt-Wagner D, Brune A (1999) Hydrogen profiles and localization of methanogenic activities in the highly compartmentalized hindgut of soil-feeding higher termites (Cubitermes spp). Appl Environ Microbiol 65: 4490–4496
Simonato F, Campanaro S, Lauro FM, Vezzi A, D'Angelo M, Vitulo N, Valle G, Bartlett DH (2006) Piezophilic adaptation: a genomic point of view. J Biotechnol 126: 11–25
Soppa J (2006) From genomes to function: haloarchaea as model organisms. Microbiology 152: 585–590
Spear JR, Ley RE, Berger AB, Pace NR (2003) Complexity in natural microbial ecosystems: the Guerrero Negro experience.
Biol Bull 204: 168–173CrossrefCASPubMedWeb of Science®Google Scholar Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science (New York, NY) 278: 631–637CrossrefCASPubMedWeb of Science®Google Scholar Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM (2005) Comparative metagenomics of microbial communities. Science (New York, NY) 308: 554–557Google Scholar Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43CrossrefCASPubMedWeb of Science®Google Scholar Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H et al (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science (New York, NY) 304: 66–74Google Scholar von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P (2007a) Quantitative phylogenetic assessment of microbial communities in diverse environments. Science (New York, NY) 315: 1126–1130Google Scholar von Mering C, Jensen LJ, Kuhn M, Chaffron S, Doerks T, Kruger B, Snel B, Bork P (2007b) STRING 7—recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 35: D358–D362CrossrefCASPubMedWeb of Science®Google Scholar Webster P, Wu S, Gomez G, Apicella M, Plaut AG, St Geme JW (2006) Distribution of bacterial proteins in biofilms formed by non-typeable Haemophilus influenzae. 
J Histochem Cytochem 54: 829–842CrossrefCASPubMedWeb of Science®Google Scholar Woyke T, Teeling H, Ivanova NN, Huntemann M, Richter M, Gloeckner FO, Boffelli D, Anderson IJ, Barry KW, Shapiro HJ, Szeto E, Kyrpides NC, Mussmann M, Amann R, Bergin C, Ruehland C, Rubin EM, Dubilier N (2006) Symbiosis insights through metagenomic analysis of a microbial consortium. Nature 443: 950–955CrossrefCASPubMedWeb of Science®Google Scholar Previous ArticleNext Article Volume 4Issue 11 January 2008In this issue FiguresReferencesRelatedDetailsLoading ...
DOI: 10.1186/1471-2164-11-461
2010
Cited 113 times
MLTreeMap - accurate Maximum Likelihood placement of environmental DNA sequences into taxonomic and functional reference phylogenies
Background: Shotgun sequencing of environmental DNA is an essential technique for characterizing uncultivated microbes in situ. However, the taxonomic and functional assignment of the obtained sequence fragments remains a pressing problem. Results: Existing algorithms are largely optimized for speed and coverage; in contrast, we present here a software framework that focuses on a restricted set of informative gene families, using Maximum Likelihood to assign these with the best possible accuracy. This framework ('MLTreeMap'; http://mltreemap.org/) uses raw nucleotide sequences as input, and includes hand-curated, extensible reference information. Conclusions: We discuss how we validated our pipeline using complete genomes as well as simulated and actual environmental sequences.
DOI: 10.3390/v10100519
2018
Cited 108 times
Viruses.STRING: A Virus-Host Protein-Protein Interaction Database
As viruses continue to pose risks to global health, having a better understanding of virus–host protein–protein interactions aids in the development of treatments and vaccines. Here, we introduce Viruses.STRING, a protein–protein interaction database specifically catering to virus–virus and virus–host interactions. This database combines evidence from experimental and text-mining channels to provide combined probabilities for interactions between viral and host proteins. The database contains 177,425 interactions between 239 viruses and 319 hosts. The database is publicly available at viruses.string-db.org, and the interaction data can also be accessed through the latest version of the Cytoscape STRING app.
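The idea of merging several evidence channels into one combined probability can be illustrated with a minimal sketch. It assumes that the channels are independent and that each channel score is already a probability; the abstract does not state the exact scoring scheme, so the formula, the function name `combine_channels` and the example scores are illustrative assumptions rather than the database's actual method.

```python
def combine_channels(scores):
    """Combine per-channel probabilities under an independence assumption.

    Each score is the probability that one evidence channel supports the
    interaction; the result is the probability that at least one does.
    """
    no_support = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        no_support *= 1.0 - score
    return 1.0 - no_support


# Hypothetical channel scores for one virus-host protein pair.
channel_scores = {"experiments": 0.6, "textmining": 0.5}
combined = combine_channels(channel_scores.values())
print(f"combined probability: {combined:.2f}")  # combined probability: 0.80
```

Under this assumption, two moderately confident channels reinforce each other, and a single channel with a score of 1.0 forces the combined value to 1.0.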
DOI: 10.1111/1462-2920.12610
2014
Cited 107 times
Limits to robustness and reproducibility in the demarcation of operational taxonomic units
The demarcation of operational taxonomic units (OTUs) from complex sequence data sets is a key step in contemporary studies of microbial ecology. However, as biologically motivated 'optimal' OTU-binning algorithms remain elusive, many conceptually distinct approaches continue to be used. Using a global data set of 887 870 bacterial 16S rRNA gene sequences, we objectively quantified biases introduced by several widely employed sequence clustering algorithms. We found that OTU-binning methods often provided surprisingly non-equivalent partitions of identical data sets, notably when clustering to the same nominal similarity thresholds; and we quantified the resulting impact on ecological data description for a well-defined human skin microbiome data set. We observed that some methods were very robust to varying clustering thresholds, while others were found to be highly susceptible even to slight threshold variations. Moreover, we comprehensively quantified the impact of the choice of 16S rRNA gene subregion, as well as of data set scope and context on algorithm performance. Our findings may contribute to an enhanced comparability of results across sequence-processing pipelines, and we arrive at recommendations towards higher levels of standardization in established workflows.
DOI: 10.1093/bioinformatics/btx517
2017
Cited 104 times
MAPseq: highly efficient k-mer search with confidence estimates, for rRNA sequence analysis
Ribosomal RNA profiling has become crucial to studying microbial communities, but meaningful taxonomic analysis and inter-comparison of such data are still hampered by technical limitations, between-study design variability and inconsistencies between taxonomies used. Here we present MAPseq, a framework for reference-based rRNA sequence analysis that is up to 30% more accurate (F½ score) and up to one hundred times faster than existing solutions, providing in a single run multiple taxonomy classifications and hierarchical operational taxonomic unit mappings, for rRNA sequences in both amplicon and shotgun sequencing strategies, and for datasets of virtually any size. Source code and binaries are freely available at https://github.com/jfmrod/mapseq. Contact: mering@imls.uzh.ch. Supplementary data are available at Bioinformatics online.
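As a rough illustration of what a k-mer search against reference sequences involves, the sketch below indexes every k-mer of a small reference set and ranks references by the number of k-mers they share with a query. This is a generic toy version, not MAPseq's algorithm: the function names, the toy sequences and the choice of k = 4 are assumptions, and the confidence estimation described in the abstract is not attempted.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the names of the references that contain it."""
    index = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def rank_references(query, index, k):
    """Rank references by the number of distinct k-mers shared with query."""
    hits = {}
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            hits[name] = hits.get(name, 0) + 1
    # Highest shared-k-mer count first; ties broken by name.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Toy reference set (hypothetical sequences, far shorter than real rRNA genes).
references = {"ref_A": "ACGTACGTGG", "ref_B": "TTTTACGTAA"}
index = build_index(references, k=4)
print(rank_references("ACGTACG", index, k=4))  # [('ref_A', 4), ('ref_B', 3)]
```

The index is built once and reused for every query, which is the main reason k-mer lookups scale well to large read sets.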
DOI: 10.1186/s40168-017-0234-1
2017
Cited 98 times
Sputum DNA sequencing in cystic fibrosis: non-invasive access to the lung microbiome and to pathogen details
Cystic fibrosis (CF) is a life-threatening genetic disorder, characterized by chronic microbial lung infections due to abnormally viscous mucus secretions within airways. The clinical management of CF typically involves regular respiratory-tract cultures in order to identify pathogens and to guide treatment. However, culture-based methods can miss atypical or slow-growing microbes. Furthermore, the isolated microbes are often not classified at the strain level due to limited taxonomic resolution. Here, we show that untargeted metagenomic sequencing of sputum DNA can provide valuable information beyond the possibilities of culture-based diagnosis. We sequenced the sputum of six CF patients and eleven control samples (including healthy subjects and chronic obstructive pulmonary disease patients) without prior depletion of human DNA or cell size selection, thus obtaining the most unbiased and comprehensive characterization of CF respiratory tract microbes to date. We present detailed descriptions of the CF and healthy lung microbiome, reconstruct near-complete pathogen genomes, and confirm that the CF lungs consistently exhibit reduced microbial diversity. Crucially, the obtained genomic sequences enabled a detailed identification of the exact pathogen strain types, when analyzed in conjunction with existing multi-locus sequence typing databases. We also detected putative pathogenicity islands and indicators of antibiotic resistance, in good agreement with independent clinical tests. Unbiased sputum metagenomics provides an in-depth profile of the lung pathogen microbiome, which is complementary to and more detailed than standard culture-based reporting. Furthermore, functional and taxonomic features of the dominant pathogens, including antibiotic resistances, can be deduced, supporting accurate and non-invasive clinical diagnosis.
DOI: 10.1016/j.cels.2019.08.002
2019
Cited 97 times
Rapid Inference of Direct Interactions in Large-Scale Ecological Networks from Heterogeneous Microbial Sequencing Data
The availability of large-scale metagenomic sequencing data can facilitate the understanding of microbial ecosystems in unprecedented detail. However, current computational methods for predicting ecological interactions are hampered by insufficient statistical resolution and limited computational scalability. They also do not integrate metadata, which can reduce the interpretability of predicted ecological patterns. Here, we present FlashWeave, a computational approach based on a flexible Probabilistic Graphical Model framework that integrates metadata and predicts direct microbial interactions from heterogeneous microbial abundance data sets with hundreds of thousands of samples. FlashWeave outperforms state-of-the-art methods on diverse benchmarking challenges in terms of runtime and accuracy. We use FlashWeave to analyze a cross-study data set of 69,818 publicly available human gut samples and produce, to the best of our knowledge, the largest and most diverse network of predicted, direct gastrointestinal microbial interactions to date. FlashWeave is freely available for download here: https://github.com/meringlab/FlashWeave.jl.
DOI: 10.1186/1471-2164-15-82
2014
Cited 95 times
Chromothripsis-like patterns are recurring but heterogeneously distributed features in a survey of 22,347 cancer genome screens
Chromothripsis is a recently discovered phenomenon of genomic rearrangement, possibly arising during a single genome-shattering event. This could provide an alternative paradigm in cancer development, replacing the gradual accumulation of genomic changes with a "one-off" catastrophic event. However, the term has been used with varying operational definitions, with the minimal consensus being a large number of locally clustered copy number aberrations. The mechanisms underlying these chromothripsis-like patterns (CTLP) and their specific impact on tumorigenesis are still poorly understood. Here, we identified CTLP in 918 cancer samples, from a dataset of more than 22,000 oncogenomic arrays covering 132 cancer types. Fragmentation hotspots were found to be located on chromosomes 8, 11, 12 and 17. Among the various cancer types, soft-tissue tumors exhibited particularly high CTLP frequencies. Genomic context analysis revealed that CTLP rearrangements frequently occurred in genomes that additionally harbored multiple copy number aberrations (CNAs). An investigation into the affected chromosomal regions showed a large proportion of arm-level pulverization and telomere-related events, which would be compatible with a number of underlying mechanisms. We also report evidence that these genomic events may be correlated with patient age, stage and survival rate. Through a large-scale analysis of oncogenomic array data sets, this study characterized features associated with genomic aberration patterns, compatible with the spectrum of "chromothripsis" definitions as previously used. While quantifying clustered genomic copy number aberrations in cancer samples, our data indicate an underlying biological heterogeneity behind these chromothripsis-like patterns, beyond a well-defined "chromothripsis" phenomenon.
DOI: 10.1038/s41587-020-0434-2
2020
Cited 92 times
ChromID identifies the protein interactome at chromatin marks
Chromatin modifications regulate genome function by recruiting proteins to the genome. However, the protein composition at distinct chromatin modifications has yet to be fully characterized. In this study, we used natural protein domains as modular building blocks to develop engineered chromatin readers (eCRs) selective for DNA methylation and histone tri-methylation at H3K4, H3K9 and H3K27 residues. We first demonstrated their utility as selective chromatin binders in living cells by stably expressing eCRs in mouse embryonic stem cells and measuring their subnuclear localization, genomic distribution and histone-modification-binding preference. By fusing eCRs to the biotin ligase BASU, we established ChromID, a method for identifying the chromatin-dependent protein interactome on the basis of proximity biotinylation, and applied it to distinct chromatin modifications in mouse stem cells. Using a synthetic dual-modification reader, we also uncovered the protein composition at bivalently modified promoters marked by H3K4me3 and H3K27me3. These results highlight the ability of ChromID to obtain a detailed view of protein interaction networks on chromatin.
DOI: 10.1093/bioinformatics/btv696
2015
Cited 89 times
SVD-phy: improved prediction of protein functional associations through singular value decomposition of phylogenetic profiles
Summary: A successful approach for predicting functional associations between non-homologous genes is to compare their phylogenetic distributions. We have devised a phylogenetic profiling algorithm, SVD-Phy, which uses truncated singular value decomposition to address the problem of uninformative profiles giving rise to false positive predictions. Benchmarking the algorithm against the KEGG pathway database, we found that it has substantially improved performance over existing phylogenetic profiling methods. Availability and implementation: The software is available under the open-source BSD license at https://bitbucket.org/andrea/svd-phy Contact: lars.juhl.jensen@cpr.ku.dk Supplementary information: Supplementary data are available at Bioinformatics online.
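The role of truncated singular value decomposition in phylogenetic profiling can be sketched with NumPy on a toy presence/absence matrix. This is a simplified illustration and not the published SVD-Phy implementation: the binary profiles, the unit-length normalization, the Euclidean distance and the helper names `reduce_profiles` and `profile_distance` are all assumptions made for the example.

```python
import numpy as np


def reduce_profiles(profiles, n_components):
    """Project gene profiles onto the leading singular components.

    profiles: 2-D array, one row per gene and one column per genome.
    Returns unit-length rows in the reduced space.
    """
    u, s, _ = np.linalg.svd(profiles, full_matrices=False)
    reduced = u[:, :n_components] * s[:n_components]
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # leave all-zero rows untouched
    return reduced / norms


def profile_distance(reduced, i, j):
    """Euclidean distance between two genes in the reduced space."""
    return float(np.linalg.norm(reduced[i] - reduced[j]))


# Toy matrix (hypothetical): genes 0 and 1 share a distribution except for
# one extra hit of gene 1 in the last genome; genes 2 and 3 share another.
profiles = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 1, 0],
], dtype=float)

reduced = reduce_profiles(profiles, n_components=2)
print(round(profile_distance(reduced, 0, 1), 3))  # same pattern: close to 0
print(round(profile_distance(reduced, 0, 2), 3))  # different pattern: close to 1.414
```

Keeping only two components discards the weak third component that carries the single stray hit, so genes 0 and 1 coincide in the reduced space while genes 0 and 2 stay far apart.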
DOI: 10.1016/j.molcel.2019.03.041
2019
Cited 86 times
Cross-Regulation between TDP-43 and Paraspeckles Promotes Pluripotency-Differentiation Transition
RNA-binding proteins (RBPs) and long non-coding RNAs (lncRNAs) are key regulators of gene expression, but their joint functions in coordinating cell fate decisions are poorly understood. Here we show that the expression and activity of the RBP TDP-43 and the long isoform of the lncRNA Neat1, the scaffold of the nuclear compartment "paraspeckles," are reciprocal in pluripotent and differentiated cells because of their cross-regulation. In pluripotent cells, TDP-43 represses the formation of paraspeckles by enhancing the polyadenylated short isoform of Neat1. TDP-43 also promotes pluripotency by regulating alternative polyadenylation of transcripts encoding pluripotency factors, including Sox2, which partially protects its 3' UTR from miR-21-mediated degradation. Conversely, paraspeckles sequester TDP-43 and other RBPs from mRNAs and promote exit from pluripotency and embryonic patterning in the mouse. We demonstrate that cross-regulation between TDP-43 and Neat1 is essential for their efficient regulation of a broad network of genes and, therefore, of pluripotency and differentiation.
DOI: 10.1016/j.celrep.2017.04.028
2017
Cited 84 times
High-Resolution RNA Maps Suggest Common Principles of Splicing and Polyadenylation Regulation by TDP-43
Summary: Many RNA-binding proteins (RBPs) regulate both alternative exons and poly(A) site selection. To understand their regulatory principles, we developed expressRNA, a web platform encompassing computational tools for integration of iCLIP and RNA motif analyses with RNA-seq and 3′ mRNA sequencing. This reveals at nucleotide resolution the "RNA maps" describing how the RNA binding positions of RBPs relate to their regulatory functions. We use this approach to examine how TDP-43, an RBP involved in several neurodegenerative diseases, binds around its regulated poly(A) sites. Binding close to the poly(A) site generally represses, whereas binding further downstream enhances use of the site, which is similar to TDP-43 binding around regulated exons. Our RNAmotifs2 software also identifies sequence motifs that cluster together with the binding motifs of TDP-43. We conclude that TDP-43 directly regulates diverse types of pre-mRNA processing according to common position-dependent principles.
DOI: 10.1371/journal.pcbi.1003594
2014
Cited 82 times
Ecological Consistency of SSU rRNA-Based Operational Taxonomic Units at a Global Scale
Operational Taxonomic Units (OTUs), usually defined as clusters of similar 16S/18S rRNA sequences, are the most widely used basic diversity units in large-scale characterizations of microbial communities. However, it remains unclear how well the various proposed OTU clustering algorithms approximate 'true' microbial taxa. Here, we explore the ecological consistency of OTUs – based on the assumption that, like true microbial taxa, they should show measurable habitat preferences (niche conservatism). In a global and comprehensive survey of available microbial sequence data, we systematically parse sequence annotations to obtain broad ecological descriptions of sampling sites. Based on these, we observe that sequence-based microbial OTUs generally show high levels of ecological consistency. However, different OTU clustering methods result in marked differences in the strength of this signal. Assuming that ecological consistency can serve as an objective external benchmark for cluster quality, we conclude that hierarchical complete linkage clustering, which provided the most ecologically consistent partitions, should be the default choice for OTU clustering. To our knowledge, this is the first approach to assess cluster quality using an external, biologically meaningful parameter as a benchmark, on a global scale.
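Hierarchical complete linkage clustering, the method the abstract recommends as a default, can be sketched in a few lines. The version below is a naive illustration under stated assumptions: sequences are pre-aligned to equal length, distance is the fraction of mismatching positions, and the function names and toy sequences are hypothetical. It is not the pipeline used in the study and would be far too slow for real data sets.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Fraction of differing positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def complete_linkage_otus(seqs, threshold):
    """Greedy agglomerative clustering with complete linkage.

    Two clusters are merged only if their most distant pair of members is
    still within the threshold, so every pair of members of an OTU ends up
    within the threshold.
    """
    clusters = [[i] for i in range(len(seqs))]

    def linkage(c1, c2):
        return max(mismatch_fraction(seqs[i], seqs[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        (a, b), dist = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy example: sequence 2 is close to 1 but too far from 0 to join its OTU.
seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC", "GGGGGGGGGG"]
print(complete_linkage_otus(seqs, threshold=0.1))  # [[0, 1], [2], [3]]
```

With single linkage, sequence 2 would chain onto the first OTU through sequence 1; complete linkage keeps it separate because its distance to sequence 0 exceeds the threshold. A distance threshold of 0.03 corresponds to the commonly used 97% similarity level.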
DOI: 10.1038/s41467-020-14367-0
2020
Cited 75 times
Pathway and network analysis of more than 2500 whole cancer genomes
The catalog of cancer driver mutations in protein-coding genes has greatly expanded in the past decade. However, non-coding cancer driver mutations are less well-characterized and only a handful of recurrent non-coding mutations, most notably TERT promoter mutations, have been reported. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we perform multi-faceted pathway and network analyses of non-coding mutations across 2583 whole cancer genomes from 27 tumor types compiled by the ICGC/TCGA PCAWG project that was motivated by the success of pathway and network analyses in prioritizing rare mutations in protein-coding genes. While few non-coding genomic elements are recurrently mutated in this cohort, we identify 93 genes harboring non-coding mutations that cluster into several modules of interacting proteins. Among these are promoter mutations associated with reduced mRNA expression in TP53, TLE4, and TCF4. We find that biological processes had variable proportions of coding and non-coding mutations, with chromatin remodeling and proliferation pathways altered primarily by coding mutations, while developmental pathways, including Wnt and Notch, were altered by both coding and non-coding mutations. RNA splicing is primarily altered by non-coding mutations in this cohort, and samples containing non-coding mutations in well-known RNA splicing factors exhibit similar gene expression signatures as samples with coding mutations in these genes. These analyses contribute a new repertoire of possible cancer genes and mechanisms that are altered by non-coding mutations and offer insights into additional cancer vulnerabilities that can be investigated for potential therapeutic treatments.
DOI: 10.1038/s41396-020-0600-z
2020
Cited 73 times
Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity
Microbial organisms inhabit virtually all environments and encompass a vast biological diversity. The pangenome concept aims to facilitate an understanding of diversity within defined phylogenetic groups. Hence, pangenomes are increasingly used to characterize the strain diversity of prokaryotic species. To understand the interdependence of pangenome features (such as the number of core and accessory genes) and to study the impact of environmental and phylogenetic constraints on the evolution of conspecific strains, we computed pangenomes for 155 phylogenetically diverse species (from ten phyla) using 7,000 high-quality genomes to each of which the respective habitats were assigned. Species habitat ubiquity was associated with several pangenome features. In particular, core-genome size was more important for ubiquity than accessory genome size. In general, environmental preferences had a stronger impact on pangenome evolution than phylogenetic inertia. Environmental preferences explained up to 49% of the variance for pangenome features, compared with 18% by phylogenetic inertia. This observation was robust when the dataset was extended to 10,100 species (59 phyla). The importance of environmental preferences was further accentuated by convergent evolution of pangenome features in a given habitat type across different phylogenetic clades. For example, the soil environment promotes expansion of pangenome size, while host-associated habitats lead to its reduction. Taken together, we explored the global principles of pangenome evolution, quantified the influence of habitat, and phylogenetic inertia on the evolution of pangenomes and identified criteria governing species ubiquity and habitat specificity.
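The pangenome features mentioned above (core and accessory gene counts) reduce to simple set operations once every strain is described by the gene families it carries. The sketch below assumes that gene families have already been assigned and uses a strict definition of "core" (present in every strain); the strain names, gene labels and the function name `pangenome_features` are hypothetical, and the study's actual pipeline is not reproduced.

```python
def pangenome_features(strain_genes):
    """Summarise a pangenome from a mapping of strain name -> set of gene families."""
    if not strain_genes:
        raise ValueError("at least one strain is required")
    gene_sets = [set(genes) for genes in strain_genes.values()]
    pan = set().union(*gene_sets)         # every family seen in any strain
    core = set.intersection(*gene_sets)   # families shared by all strains
    accessory = pan - core                # families missing from at least one strain
    return {
        "pan_size": len(pan),
        "core_size": len(core),
        "accessory_size": len(accessory),
        "core": core,
        "accessory": accessory,
    }


# Hypothetical three-strain species.
strains = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD"},
    "strain_3": {"geneA", "geneB"},
}
features = pangenome_features(strains)
print(features["pan_size"], features["core_size"], features["accessory_size"])  # 4 2 2
```

Real analyses often relax the core definition, for example to families present in nearly all genomes, to tolerate incomplete assemblies; that variant would replace the strict intersection with a frequency cutoff.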
DOI: 10.1038/nrd.2018.52
2018
Cited 68 times
Erratum: Unexplored therapeutic opportunities in the human genome
Nature Reviews Drug Discovery (2018); 10.1038/nrd.2018.14 In the version of this article that was originally published online, an older version of the data set categorizing proteins into target development levels was used to create Figure 1 than the version used to create Table 1, and data from Figure 1 were referred to at several points in the text of the article.
DOI: 10.1021/acs.jproteome.2c00651
2022
Cited 37 times
Cytoscape stringApp 2.0: Analysis and Visualization of Heterogeneous Biological Networks
Biological networks are often used to represent complex biological systems, which can contain several types of entities. Analysis and visualization of such networks is supported by the Cytoscape software tool and its many apps. While earlier versions of stringApp focused on providing intraspecies protein–protein interactions from the STRING database, the new stringApp 2.0 greatly improves the support for heterogeneous networks. Here, we highlight new functionality that makes it possible to create networks that contain proteins and interactions from STRING as well as other biological entities and associations from other sources. We exemplify this by complementing a published SARS-CoV-2 interactome with interactions from STRING. We have also extended stringApp with new data and query functionality for protein–protein interactions between eukaryotic parasites and their hosts. We show how this can be used to retrieve and visualize a cross-species network for a malaria parasite, its host, and its vector. Finally, the latest stringApp version has an improved user interface, allows retrieval of both functional associations and physical interactions, and supports group-wise enrichment analysis of different parts of a network to aid biological interpretation. stringApp is freely available at https://apps.cytoscape.org/apps/stringapp.
DOI: 10.1038/s41564-022-01286-7
2023
Cited 14 times
Chemotaxis and autoinducer-2 signalling mediate colonization and contribute to co-existence of Escherichia coli strains in the murine gut
Bacteria communicate and coordinate their behaviour at the intra- and interspecies levels by producing and sensing diverse extracellular small molecules called autoinducers. Autoinducer 2 (AI-2) is produced and detected by a variety of bacteria and thus plays an important role in interspecies communication and chemotaxis. Although AI-2 is a major autoinducer molecule present in the mammalian gut and can influence the composition of the murine gut microbiota, its role in bacteria-bacteria and bacteria-host interactions during gut colonization remains unclear. Combining competitive infections in C57BL/6 mice with microscopy and bioinformatic approaches, we show that chemotaxis (cheY) and AI-2 signalling (via lsrB) promote gut colonization by Escherichia coli, which is in turn connected to the ability of the bacteria to utilize fructoselysine (frl operon). We further show that the genomic diversity of E. coli strains with respect to AI-2 signalling allows ecological niche segregation and stable co-existence of different E. coli strains in the mammalian gut.
DOI: 10.1073/pnas.2136809100
2003
Cited 147 times
Genome evolution reveals biochemical networks and functional modules
The analysis of completely sequenced genomes uncovers an astonishing variability between species in terms of gene content and order. During genome history, the genes are frequently rearranged, duplicated, lost, or transferred horizontally between genomes. These events appear to be stochastic, yet they are under selective constraints resulting from the functional interactions between genes. These genomic constraints form the basis for a variety of techniques that employ systematic genome comparisons to predict functional associations among genes. The most powerful techniques to date are based on conserved gene neighborhood, gene fusion events, and common phylogenetic distributions of gene families. Here we show that these techniques, if integrated quantitatively and applied to a sufficiently large number of genomes, have reached a resolution which allows the characterization of function at a higher level than that of the individual gene: global modularity becomes detectable in a functional protein network. In Escherichia coli, the predicted modules can be benchmarked by comparison to known metabolic pathways. We found as many as 74% of the known metabolic enzymes clustering together in modules, with an average pathway specificity of at least 84%. The modules extend beyond metabolism, and have led to hundreds of reliable functional predictions both at the protein and pathway level. The results indicate that modularity in protein networks is intrinsically encoded in present-day genomes.
DOI: 10.1016/s0955-0674(03)00009-7
2003
Cited 133 times
Function prediction and protein networks
In the genomics era, the interactions between proteins are at the center of attention. Genomic-context methods used to predict these interactions have been put on a quantitative basis, revealing that they are at least on an equal footing with genomics experimental data. A survey of experimentally confirmed predictions proves the applicability of these methods, and new concepts to predict protein interactions in eukaryotes have been described. Finally, the interaction networks that can be obtained by combining the predicted pair-wise interactions have enough internal structure to detect higher levels of organization, such as 'functional modules'.
DOI: 10.1101/gr.3266405
2005
Cited 114 times
Complex genomic rearrangements lead to novel primate gene function
Orthologous genes that maintain a single-copy status in a broad range of species may indicate a selection against gene duplication. If this is the case, then duplicates of such genes that do survive may have escaped the dosage control by rapid and sizable changes in their function. To test this hypothesis and to develop a strategy for the identification of novel gene functions, we have analyzed 22 primate-specific intrachromosomal duplications of genes with a single-copy ortholog in all other completely sequenced metazoans. When comparing this set to genes not exposed to the single-copy status constraint, we observed a higher tendency of the former to modify their gene structure, often through complex genomic rearrangements. The analysis of the most dramatic of these duplications, affecting approximately 10% of human Chromosome 2, enabled a detailed reconstruction of the events leading to the appearance of a novel gene family. The eight members of this family originated from the highly conserved nucleoporin RanBP2 by several genetic rearrangements such as segmental duplications, inversions, translocations, exon loss, and domain accretion. We have experimentally verified that at least one of the newly formed proteins has a cellular localization different from RanBP2's, and we show that positive selection did act on specific domains during evolution.
DOI: 10.1038/msb.2011.7
2011
Cited 82 times
RNAi screen of Salmonella invasion shows role of COPI in membrane targeting of cholesterol and Cdc42
The pathogen Salmonella Typhimurium is a common cause of diarrhea and invades the gut tissue by injecting a cocktail of virulence factors into epithelial cells, triggering actin rearrangements, membrane ruffling and pathogen entry. One of these factors is SopE, a G-nucleotide exchange factor for the host cellular Rho GTPases Rac1 and Cdc42. How SopE mediates cellular invasion is incompletely understood. Using genome-scale RNAi screening we identified 72 known and novel host cell proteins affecting SopE-mediated entry. Follow-up assays assigned these 'hits' to particular steps of the invasion process; i.e., binding, effector injection, membrane ruffling, membrane closure and maturation of the Salmonella-containing vacuole. Depletion of the COPI complex revealed a unique effect on virulence factor injection and membrane ruffling. Both effects are attributable to mislocalization of cholesterol, sphingolipids, Rac1 and Cdc42 away from the plasma membrane into a large intracellular compartment. Equivalent results were obtained with the vesicular stomatitis virus. Therefore, COPI-facilitated maintenance of lipids may represent a novel, unifying mechanism essential for a wide range of pathogens, offering opportunities for designing new drugs.
DOI: 10.1111/j.1758-2229.2011.00323.x
2012
Cited 79 times
Bacterial anoxygenic photosynthesis on plant leaf surfaces
The aerial surface of plants, the phyllosphere, is colonized by numerous bacteria displaying diverse metabolic properties that enable their survival in this specific habitat. Recently, we reported on the presence of microbial rhodopsin-harbouring bacteria on the top of leaf surfaces. Here, we report on the presence of additional bacterial populations capable of harvesting light as a means of supplementing their metabolic requirements. An analysis of six phyllosphere metagenomes revealed the presence of a diverse community of anoxygenic phototrophic bacteria, including the previously reported methylobacteria, as well as other known and unknown phototrophs. The presence of anoxygenic phototrophic bacteria was also confirmed in situ by infrared epifluorescence microscopy. The microscopic enumeration correlated with estimates based on metagenomic analyses, confirming both the presence and high abundance of these microorganisms in the phyllosphere. Our data suggest that the phyllosphere contains a phylogenetically diverse assemblage of phototrophic species, including some yet undescribed bacterial clades that appear to be phyllosphere‐unique.
DOI: 10.1111/j.1462-2920.2011.02554.x
2011
Cited 78 times
Microbial rhodopsins on leaf surfaces of terrestrial plants
The above-ground surfaces of terrestrial plants, the phyllosphere, comprise the main interface between the terrestrial biosphere and solar radiation. It is estimated to host up to 10^26 microbial cells that may intercept part of the photon flux impinging on the leaves. Based on 454-pyrosequencing-generated metagenome data, we report on the existence of diverse microbial rhodopsins in five distinct phyllospheres from tamarisk (Tamarix nilotica), soybean (Glycine max), Arabidopsis (Arabidopsis thaliana), clover (Trifolium repens) and rice (Oryza sativa). Our findings, for the first time describing microbial rhodopsins from non-aquatic habitats, point towards the potential coexistence of microbial rhodopsin-based phototrophy and plant chlorophyll-based photosynthesis, with the different pigments absorbing non-overlapping fractions of the light spectrum.
DOI: 10.1038/nmeth.3101
2014
Cited 70 times
A sentinel protein assay for simultaneously quantifying cellular processes
DOI: 10.1073/pnas.1402353111
2014
Cited 64 times
Specific inhibition of diverse pathogens in human cells by synthetic microRNA-like oligonucleotides inferred from RNAi screens
Significance: Pathogens can enter human cells using a variety of specific mechanisms, often hitchhiking on naturally existing transport pathways. To uncover parts of the host machinery that are required for entry, scientists conduct infection screens in cultured cells. In these screens, human genes are systematically inactivated by short RNA oligos, designed to bind and inactivate mRNA molecules. Here, we show that many of these oligos also bind unintended mRNA targets, and that this effect overall dominates and complicates such screens. Focusing on the strong “off-target” signal, we design novel oligos that no longer bind any one gene specifically but nevertheless strongly and reproducibly block pathogen entry, pointing to pathogen/host interactions at a higher-order, pathway level.
DOI: 10.1093/database/baw167
2017
Cited 54 times
RAIN: RNA–protein Association and Interaction Networks
Protein association networks can be inferred from a range of resources including experimental data, literature mining and computational predictions. These types of evidence are emerging for non-coding RNAs (ncRNAs) as well. However, integration of ncRNAs into protein association networks is challenging due to data heterogeneity. Here, we present a database of ncRNA-RNA and ncRNA-protein interactions and its integration with the STRING database of protein-protein interactions. These ncRNA associations cover four organisms and have been established from curated examples, experimental data, interaction predictions and automatic literature mining. RAIN uses an integrative scoring scheme to assign a confidence score to each interaction. We demonstrate that RAIN outperforms the underlying microRNA-target predictions in inferring ncRNA interactions. RAIN can be operated through an easily accessible web interface and all interaction data can be downloaded. Database URL: http://rth.dk/resources/rain.
DOI: 10.1093/bioinformatics/btz959
2020
Cited 46 times
The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences
Supplementary information: Supplementary data are available at Bioinformatics online.
DOI: 10.1093/nar/gkac1078
2022
Cited 21 times
proGenomes3: approaching one million accurately and consistently annotated high-quality prokaryotic genomes
The interpretation of genomic, transcriptomic and other microbial ‘omics data is highly dependent on the availability of well-annotated genomes. As the number of publicly available microbial genomes continues to increase exponentially, the need for quality control and consistent annotation is becoming critical. We present proGenomes3, a database of 907 388 high-quality genomes containing 4 billion genes that passed stringent criteria and have been consistently annotated using multiple functional and taxonomic databases including mobile genetic elements and biosynthetic gene clusters. proGenomes3 encompasses 41 171 species-level clusters, defined based on universal single copy marker genes, for which pan-genomes and contextual habitat annotations are provided. The database is available at http://progenomes.embl.de/
DOI: 10.3389/fmicb.2022.715637
2022
Cited 17 times
The Evolution of Ecological Diversity in Acidobacteria
Acidobacteria occur in a large variety of ecosystems worldwide and are particularly abundant and highly diverse in soils. In spite of their diversity, only a few species have been characterized to date, which makes Acidobacteria one of the most poorly understood phyla in the domain Bacteria. We used a culture-independent niche modeling approach to elucidate ecological adaptations and their evolution for 4,154 operational taxonomic units (OTUs) of Acidobacteria across 150 different, comprehensively characterized grassland soils in Germany. Using the relative abundances of their 16S rRNA gene transcripts, the responses of active OTUs along gradients of 41 environmental variables were modeled using hierarchical logistic regression (HOF), which allowed us to determine values for optimum activity for each variable (niche optima). By linking 16S rRNA transcripts to the phylogeny of full 16S rRNA gene sequences, we could trace the evolution of the different ecological adaptations during the diversification of Acidobacteria. This approach revealed a pronounced ecological diversification even among acidobacterial sister clades. Although the evolution of habitat adaptation was mainly cladogenic, it was disrupted by recurrent events of convergent evolution that resulted in frequent habitat switching within individual clades. Our findings indicate that the high diversity of soil acidobacterial communities is largely sustained by differential habitat adaptation even at the level of closely related species. A comparison of niche optima of individual OTUs with the phenotypic properties of their cultivated representatives showed that our niche modeling approach (1) correctly predicts those physiological properties that have been determined for cultivated species of Acidobacteria but (2) also provides ample information on ecological adaptations that cannot be inferred from standard taxonomic descriptions of bacterial isolates. This novel information on specific adaptations of not-yet-cultivated Acidobacteria can therefore guide future cultivation trials and will likely increase their cultivation success.
DOI: 10.1016/j.mcpro.2023.100640
2023
Cited 8 times
PaxDb 5.0: Curated Protein Quantification Data Suggests Adaptive Proteome Changes in Yeasts
The "Protein Abundances Across Organisms" database (PaxDb) is an integrative metaresource dedicated to protein abundance levels, in tissue-specific or whole-organism proteomes. PaxDb focuses on computing best-estimate abundances for proteins in normal/healthy contexts and expresses abundance values for each protein in "parts per million" in relation to all other protein molecules in the cell. The uniform data reprocessing, quality scoring, and integrated orthology relations have made PaxDb one of the preferred tools for comparisons between individual datasets, tissues, or organisms. In describing the latest version 5.0 of PaxDb, we particularly emphasize the data integration from various types of raw data and how we expanded the number of organisms and tissue groups as well as the proteome coverage. The current collection of PaxDb includes 831 original datasets from 170 species, including 22 Archaea, 81 Bacteria, and 67 Eukaryota. Apart from detailing the data update, we also present a comparative analysis of the human proteome subset of PaxDb against the two most widely used human proteome data resources: Human Protein Atlas and Genotype-Tissue Expression. Lastly, through our protein abundance data, we reveal an evolutionary trend in the usage of sulfur-containing amino acids in the proteomes of Fungi.
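The "parts per million" unit described above can be illustrated with a minimal Python sketch. It only shows the rescaling step, under the assumption that raw abundance estimates for all proteins in a sample are already available; the function name `to_ppm` and the example protein names and values are hypothetical, and this is not PaxDb's actual reprocessing or quality-scoring pipeline.

```python
def to_ppm(raw_abundances):
    """Rescale raw abundance estimates so that they sum to one million.

    raw_abundances: dict mapping a protein identifier to a non-negative
    raw abundance estimate (hypothetical input format for this sketch).
    """
    total = sum(raw_abundances.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    # Each protein's share of all protein molecules, expressed per million.
    return {protein: value / total * 1_000_000
            for protein, value in raw_abundances.items()}


# Hypothetical example values, for illustration only.
example = to_ppm({"proteinA": 1.0, "proteinB": 3.0})
```

Because every value is expressed relative to the same total, ppm values from datasets of different sampling depth become directly comparable, which is the property the abstract relies on for cross-dataset comparisons.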
DOI: 10.1073/pnas.0702636104
2007
Cited 74 times
Quantitative assessment of protein function prediction from metagenomics shotgun sequences
To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.
DOI: 10.1371/journal.pbio.1000146
2009
Cited 67 times
The Hedgehog Signaling Pathway: Where Did It Come From?
The Hedgehog signaling pathway plays a crucial role in development and disease. Its putative origins in an ancient system involved in regulating bacterial lipid transport and homeostasis offer clues about how the pathway might work today.
DOI: 10.1002/pmic.200900414
2010
Cited 59 times
Shotgun proteomics data from multiple organisms reveals remarkable quantitative conservation of the eukaryotic core proteome
Genome-wide, absolute quantification of expressed proteins is not yet within reach for most eukaryotes. However, large numbers of MS-based protein identifications have been deposited in databases, together with information on the observation frequencies of each peptide spectrum ("spectral counts"). We have conducted a meta-analysis using several million peptide observations from five model eukaryotes, establishing a consistent, semi-quantitative analysis pipeline. By inferring and comparing protein abundances across orthologs, we observe: (i) the accuracy of spectral counting predictions increases with sampling depth and can rival that of direct biochemical measurements, (ii) the quantitative makeup of the consistently observed core proteome in eukaryotes is remarkably stable, with abundance correlations exceeding R_S = 0.7 at an evolutionary distance greater than 1000 million years, and (iii) some groups of proteins are more constrained than others. We argue that our observations reveal stabilizing selection: central parts of the eukaryotic proteome appear to be expressed at well-balanced, near-optimal abundance levels. This is consistent with our further observations that essential proteins show lower abundance variations than non-essential proteins, and that gene families that tend to undergo gene duplications are less well constrained than families that keep a single-copy status.
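The abundance correlation quoted above appears to be a rank correlation between the abundances of orthologous proteins in two species. As a rough illustration, assuming it is the Spearman coefficient, the following Python sketch computes it from two equally long lists of abundance values, one entry per ortholog pair. The function names are hypothetical and the paper's own pipeline is not reproduced here.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A rank-based measure suits semi-quantitative data such as spectral counts, since it depends only on the ordering of abundances and not on their absolute scale.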
DOI: 10.1371/journal.pone.0040064
2012
Cited 58 times
High Confidence Prediction of Essential Genes in Burkholderia Cenocepacia
Background: Essential genes are absolutely required for the survival of an organism. The identification of essential genes, besides being one of the most fundamental questions in biology, is also of interest for the emerging science of synthetic biology and for the development of novel antimicrobials. New antimicrobial therapies are desperately needed to treat multidrug-resistant pathogens, such as members of the Burkholderia cepacia complex. Methodology/Principal Findings: We hypothesize that essential genes may be highly conserved within a group of evolutionarily closely related organisms. Using a bioinformatics approach we determined that the core genome of the order Burkholderiales consists of 649 genes. All but two of these identified genes were located on chromosome 1 of Burkholderia cenocepacia. Although many of the 649 core genes of Burkholderiales have been shown to be essential in other bacteria, we were also able to identify a number of novel essential genes present mainly, or exclusively, within this order. The essentiality of some of the core genes, including the known essential genes infB, gyrB, ubiB, and valS, as well as the so far uncharacterized genes BCAL1882, BCAL2769, BCAL3142 and BCAL3369 has been confirmed experimentally in B. cenocepacia. Conclusions/Significance: We report on the identification of essential genes using a novel bioinformatics strategy and provide bioinformatics and experimental evidence that the large majority of the identified genes are indeed essential. The essential genes identified here may represent valuable targets for the development of novel antimicrobials and their detailed study may shed new light on the functions required to support life.
DOI: 10.1128/jvi.00388-14
2014
Cited 48 times
Genome-Wide Small Interfering RNA Screens Reveal VAMP3 as a Novel Host Factor Required for Uukuniemi Virus Late Penetration
Abstract: The Bunyaviridae constitute a large family of enveloped animal viruses, many of which are important emerging pathogens. How bunyaviruses enter and infect mammalian cells remains largely uncharacterized. We used two genome-wide silencing screens with distinct small interfering RNA (siRNA) libraries to investigate host proteins required during infection of human cells by the bunyavirus Uukuniemi virus (UUKV), a late-penetrating virus. Sequence analysis of the libraries revealed that many siRNAs in the screens inhibited infection by silencing not only the intended targets but additional genes in a microRNA (miRNA)-like manner. That the 7-nucleotide seed regions in the siRNAs can cause a perturbation in infection was confirmed by using synthetic miRNAs (miRs). One of the miRs tested, miR-142-3p, was shown to interfere with the intracellular trafficking of incoming viruses by regulating the v-SNARE VAMP3, a strong hit shared by both siRNA screens. Inactivation of VAMP3 by the tetanus toxin led to a block in infection. Using fluorescence-based techniques in fixed and live cells, we found that the viruses enter VAMP3+ endosomal vesicles 5 min after internalization and that colocalization was maximal 15 min thereafter. At this time, LAMP1 was associated with the VAMP3+ virus-containing endosomes. In cells depleted of VAMP3, viruses were mainly trapped in LAMP1-negative compartments. Together, our results indicated that UUKV relies on VAMP3 for penetration, providing an indication of added complexity in the trafficking of viruses through the endocytic network. Importance: Bunyaviruses represent a growing threat to humans and livestock globally. Unfortunately, relatively little is known about these emerging pathogens. We report here the first human genome-wide siRNA screens for a bunyavirus. The screens resulted in the identification of 562 host cell factors with a potential role in cell entry and virus replication. To demonstrate the robustness of our approach, we confirmed and analyzed the role of the v-SNARE VAMP3 in Uukuniemi virus entry and infection. The information gained lays the basis for future research into the cell biology of bunyavirus infection and new antiviral strategies. In addition, by shedding light on serious caveats in large-scale siRNA screening, our experimental and bioinformatics procedures will be valuable in the comprehensive analysis of past and future high-content screening data.
DOI: 10.1093/bioinformatics/btt657
2013
Cited 48 times
HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences
Motivation: Nucleotide sequence data are being produced at an ever increasing rate. Clustering such sequences by similarity is often an essential first step in their analysis, intended to reduce redundancy, define gene families or suggest taxonomic units. Exact clustering algorithms, such as hierarchical clustering, scale relatively poorly in terms of run time and memory usage, yet they are desirable because heuristic shortcuts taken during clustering might have unintended consequences in later analysis steps. Results: Here we present HPC-CLUST, a highly optimized software pipeline that can cluster large numbers of pre-aligned DNA sequences by running on distributed computing hardware. It allocates both memory and computing resources efficiently, and can process more than a million sequences in a few hours on a small cluster. Availability and implementation: Source code and binaries are freely available at http://meringlab.org/software/hpc-clust/; the pipeline is implemented in C++ and uses the Message Passing Interface (MPI) standard for distributed computing. Contact: mering@imls.uzh.ch Supplementary Information: Supplementary data are available at Bioinformatics online.
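To make concrete what exact hierarchical clustering of pre-aligned sequences involves, and why it scales poorly, here is a deliberately naive single-machine sketch in Python. It is not HPC-CLUST's implementation (which is distributed C++/MPI code). The distance measure (fraction of mismatching alignment columns), the choice of average linkage, the function names and the toy sequences are all assumptions made for illustration.

```python
from itertools import combinations


def distance(a, b):
    """Fraction of mismatching columns between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster(seqs, threshold):
    """Naive average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than `threshold`. Returns lists of indices.
    """
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average linkage: mean distance over all cross-cluster pairs.
            total = sum(distance(seqs[a], seqs[b])
                        for a in clusters[i] for b in clusters[j])
            d = total / (len(clusters[i]) * len(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters


# Toy aligned sequences, for illustration only.
groups = cluster(["ACGT", "ACGA", "TTTT", "TTTA"], threshold=0.3)
```

Every merge step re-examines all cluster pairs, so run time and memory grow quickly with the number of sequences; this is the scaling cost that motivates an optimized, distributed implementation.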
DOI: 10.1038/ismej.2016.139
2016
Cited 40 times
A family of interaction-adjusted indices of community similarity
Interactions between taxa are essential drivers of ecological community structure and dynamics, but they are not taken into account by traditional indices of β diversity. In this study, we propose a novel family of indices that quantify community similarity in the context of taxa interaction networks. Using publicly available datasets, we assessed the performance of two specific indices that are Taxa INteraction-Adjusted (TINA, based on taxa co-occurrence networks), and Phylogenetic INteraction-Adjusted (PINA, based on phylogenetic similarities). TINA and PINA outperformed traditional indices when partitioning human-associated microbial communities according to habitat, even for extremely downsampled datasets, and when organising ocean micro-eukaryotic plankton diversity according to geographical and physicochemical gradients. We argue that interaction-adjusted indices capture novel aspects of diversity outside the scope of traditional approaches, highlighting the biological significance of ecological association networks in the interpretation of community similarity.
DOI: 10.1016/j.molcel.2019.04.021
2019
Cited 34 times
Analysis of the Human Kinome and Phosphatome by Mass Cytometry Reveals Overexpression-Induced Effects on Cancer-Related Signaling
Kinase and phosphatase overexpression drives tumorigenesis and drug resistance. We previously developed a mass-cytometry-based single-cell proteomics approach that enables quantitative assessment of overexpression effects on cell signaling. Here, we applied this approach in a human kinome- and phosphatome-wide study to assess how 649 individually overexpressed proteins modulated cancer-related signaling in HEK293T cells in an abundance-dependent manner. Based on these data, we expanded the functional classification of human kinases and phosphatases and showed that the overexpression effects include non-catalytic roles. We detected 208 previously unreported signaling relationships. The signaling dynamics analysis indicated that the overexpression of ERK-specific phosphatases sustains proliferative signaling. This suggests a phosphatase-driven mechanism of cancer progression. Moreover, our analysis revealed a drug-resistant mechanism through which overexpression of tyrosine kinases, including SRC, FES, YES1, and BLK, induced MEK-independent ERK activation in melanoma A375 cells. These proteins could predict drug sensitivity to BRAF-MEK concurrent inhibition in cells carrying BRAF mutations.
DOI: 10.1038/s41586-022-05599-9
2023
Cited 6 times
Author Correction: Analyses of non-coding somatic drivers in 2,658 cancer whole genomes
DOI: 10.1038/s41467-024-45653-w
2024
Identification of HDV-like theta ribozymes involved in tRNA-based recoding of gut bacteriophages
Trillions of microorganisms, collectively known as the microbiome, inhabit our bodies with the gut microbiome being of particular interest in biomedical research. Bacteriophages, the dominant virome constituents, can utilize suppressor tRNAs to switch to alternative genetic codes (e.g., the UAG stop-codon is reassigned to glutamine) while infecting hosts with the standard bacterial code. However, what triggers this switch and how the bacteriophage manipulates its host is poorly understood. Here, we report the discovery of a subgroup of minimal hepatitis delta virus (HDV)-like ribozymes – theta ribozymes – potentially involved in the code switch leading to the expression of recoded lysis and structural phage genes. We demonstrate their HDV-like self-scission behavior in vitro and find them in an unreported context often located with their cleavage site adjacent to tRNAs, indicating a role in viral tRNA maturation and/or regulation. Every fifth associated tRNA is a suppressor tRNA, further strengthening our hypothesis. The vast abundance of tRNA-associated theta ribozymes – we provide 1753 unique examples – highlights the importance of small ribozymes as an alternative to large enzymes that usually process tRNA 3’-ends. Our discovery expands the short list of biological functions of small HDV-like ribozymes and introduces a previously unknown player likely involved in the code switch of certain recoded gut bacteriophages.
DOI: 10.1038/s41559-024-02357-0
2024
A global survey of prokaryotic genomes reveals the eco-evolutionary pressures driving horizontal gene transfer
Horizontal gene transfer, the exchange of genetic material through means other than reproduction, is a fundamental force in prokaryotic genome evolution. Genomic persistence of horizontally transferred genes has been shown to be influenced by both ecological and evolutionary factors. However, there is limited availability of ecological information about species other than the habitats from which they were isolated, which has prevented a deeper exploration of ecological contributions to horizontal gene transfer. Here we focus on transfers detected through comparison of individual gene trees to the species tree, assessing the distribution of gene-exchanging prokaryotes across over a million environmental sequencing samples. By analysing detected horizontal gene transfer events, we show distinct functional profiles for recent versus old events. Although most genes transferred are part of the accessory genome, genes transferred earlier in evolution tend to be more ubiquitous within present-day species. We find that co-occurring, interacting and high-abundance species tend to exchange more genes. Finally, we show that host-associated specialist species are most likely to exchange genes with other host-associated specialist species, whereas species found across different habitats have similar gene exchange rates irrespective of their preferred habitat. Our study covers an unprecedented scale of integrated horizontal gene transfer and environmental information, highlighting broad eco-evolutionary trends.