DOI: 10.1038/msb.2009.47
¤ OpenAccess: Gold
This work has “Gold” OA status. This means it is published in an Open Access journal that is indexed by the DOAJ.

Pathway databases and tools for their exploitation: benefits, current limitations and challenges

Anna Bauer‐Mehren,Laura I. Furlong,Ferrán Sanz

Biology
Computational biology
Current (fluid)
2009
Perspective. 28 July 2009. Open Access.

Anna Bauer-Mehren (1), Laura I Furlong (1, corresponding author) and Ferran Sanz (1)

(1) Research Unit on Biomedical Informatics (GRIB), IMIM-Hospital del Mar, Universitat Pompeu Fabra, Barcelona Biomedical Research Park, Dr Aiguader 88, Barcelona, Spain

Corresponding author: Research Unit on Biomedical Informatics, Universitat Pompeu Fabra, IMIM-Hospital del Mar, PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain. Tel.: +34 9331 60521; Fax: +34 9331 60550; E-mail: [email protected]

Molecular Systems Biology (2009) 5: 290. https://doi.org/10.1038/msb.2009.47

In past years, comprehensive representations of cell signalling pathways have been developed by manual curation from literature, which requires huge effort and would benefit from information stored in databases and from automatic retrieval and integration methods. Once a reconstruction of the network of interactions is achieved, analysis of its structural features and its dynamic behaviour can take place. Mathematical modelling techniques are used to simulate the complex behaviour of cell signalling networks, which ultimately sheds light on the mechanisms leading to complex diseases or helps in the identification of drug targets. A variety of databases containing information on cell signalling pathways have been developed in conjunction with methodologies to access and analyse the data. In principle, the scenario is prepared to make the most of this information for the analysis of the dynamics of signalling pathways. However, are the knowledge repositories of signalling pathways ready to realize the systems biology promise? In this article we aim to initiate this discussion and to provide some insights on this issue.

Introduction

The past decades of research have led to a better understanding of the processes involved in cell signalling. Cell signalling refers to the biochemical processes by which cells respond to cues in their internal or external environment (Alberts et al, 2007). With the advent of high-throughput experimentation, the identification and characterization of the molecular components involved in cell signalling became possible in a systematic way.
In addition, the discovery of the connections between each of these components promoted the reconstruction of the chain of reactions, which subsequently gives rise to a signalling pathway. Ultimately, our ability to interpret the function and regulation of cell signalling pathways is crucial for understanding the ways in which cells respond to external cues and how they communicate with each other. In this regard, the systematic collection of pathway information in the form of pathway databases and the application of mathematical analysis for pathway modelling are crucial. Several databases containing information on cell signalling pathways have been developed in conjunction with methodologies to access and analyse the data (Suderman and Hallett, 2007). Furthermore, mathematical modelling emerged as a solution to study the complex behaviour of networks (Alves et al, 2006; Fisher and Henzinger, 2007; Karlebach and Shamir, 2008). The models obtained so far allow the formulation of hypotheses that can be tested in the laboratory. Iterative cycles of prediction and experimental verification have resulted in the refinement of our knowledge of cell signalling, and have shed light on different aspects of cell signalling at a systems level (regulatory aspects, such as feedback control circuits, or architectural features, such as modularity). Furthermore, signalling cascades are not isolated units within the cell, but form part of a mesh of interconnected networks through which the signal elicited by an environmental cue can traverse (Yaffe, 2008). Ultimately, each cell is exposed to a variety of signalling cues, and the specificity of the response will be determined by the signalling mechanisms that are activated by the cue (Alberts et al, 2007).
Recent research highlights the importance of the so-called crosstalk between pathways, such as the recently published connections between signalling through the purinergic receptors and Ca2+ sensing (Chaumont et al, 2008); the link between extracellular glycocalyx structure and the nitric oxide signalling pathway (Tarbell and Ebong, 2008); the interactions between insulin and epidermal growth factor signalling (Borisov et al, 2009); and the crosstalk between the phosphoinositide 3 kinase and Ras/extracellular signal-regulated kinase signalling pathways (Wang et al, 2009). An important goal of this research is to reconstruct the network of interactions that gives rise to a signalling pathway in a biologically consistent and meaningful manner, which in turn allows the mathematical analysis of the emerging properties of the network. In this regard, comprehensive maps of signalling pathways have been developed by manual curation from literature (Oda et al, 2005; Oda and Kitano, 2006; Calzone et al, 2008). Building such reference maps requires huge effort and would benefit from information stored in databases and from automatic retrieval and integration methods. Once a reconstruction of the network of interactions is achieved, analysis of the structural features of the network and its dynamic behaviour can take place. A commonly seen architecture of signalling pathways is called 'bow-tie', in which many input and output signals are handled by a common layer constituted by a small number of conserved components. This network architecture provides robustness and flexibility to a variety of external cues due to the redundancy of reactions that are part of the input and output layers (Kitano, 2007a). Robustness refers to the ability of an organism to compensate for the effects of perturbations to maintain the organism's functions (Kitano, 2007b). Such perturbations can be changes in the availability of nutrients as well as the presence of mutagens or toxins.
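The bow-tie idea can be made concrete with a small graph experiment. The following is only an illustrative sketch: the node names and the wiring of `BOW_TIE` are hypothetical and are not taken from any pathway database. It checks whether a signal still reaches a response after selected nodes are removed, which is one simple way to picture redundancy in the input and output layers.

```python
from collections import deque

# Hypothetical toy bow-tie: redundant inputs and outputs around a small core.
BOW_TIE = {
    "ligand_A": ["receptor_1", "receptor_2"],
    "ligand_B": ["receptor_2", "receptor_3"],
    "receptor_1": ["core_kinase"],
    "receptor_2": ["core_kinase"],
    "receptor_3": ["core_kinase"],
    "core_kinase": ["tf_X", "tf_Y"],
    "tf_X": ["response_1"],
    "tf_Y": ["response_1", "response_2"],
}


def reachable(graph, source, removed=frozenset()):
    """Return every node reachable from source while skipping removed nodes."""
    if source in removed:
        return set()
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, []):
            if successor not in removed and successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def signal_survives(graph, source, target, removed):
    """True if target can still be reached from source after the removal."""
    return target in reachable(graph, source, frozenset(removed))
```

In this toy network, removing `receptor_1` or `tf_X` leaves a route from `ligand_A` to `response_1`, whereas removing `core_kinase` disconnects every input from every output. Whether a real pathway behaves this way depends on its actual topology.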
Moreover, systems can be subjected to functional disruptions when facing perturbations for which they are not optimized, thus showing points of fragility of the biological system (Kitano, 2007b). For instance, an undesired effect of a drug can be caused by the unwanted interaction of the drug with molecules that represent points of fragility of the physiological system (Kitano, 2007a). In contrast, drugs can be completely ineffective when the robustness of the system compensates their action. It has been suggested that crosstalks between signalling pathways contribute to the robustness of cells against perturbations (Kitano, 2007a). In addition, the points of fragility of the system are sometimes exploited by pathogens causing diseases, or represent processes that are usually found to malfunction in particular diseases, such as cancer. Diseases that arise from dysfunction in cell signalling are usually not attributed to a single gene but to the failure of emerging control mechanisms in the network. It has been reported that the loss of negative feedback loops characterizes solid tumours (Amit et al, 2007). These diseases are difficult to diagnose and treat unless accurate understanding of the underlying principles regulating the system is in place. Thus, the interpretation of the global properties of signalling pathways has important implications for the elucidation of the mechanisms that lead to complex diseases, and also for the identification of drug targets. At present, there are several repositories of information on cell signalling pathways that cover a wide range of signal transduction mechanisms and include high quality data in terms of annotation and cross references to biological databases. In principle, the scenario is prepared to make use of the information for the analysis of the behaviour of the signalling pathways. Thus, are the knowledge repositories on signalling pathways ready to realize the systems biology promise? 
In this article, we aim to initiate this discussion and to provide some insights on this issue. First, we present an analytical overview of current pathway databases (see Pathway databases). In section 'Case study: EGFR signalling', we present the results of an evaluation exercise conducted to determine the accuracy and completeness of current pathway databases against an expert-curated pathway used as 'gold standard'. Moreover, we propose a strategy for the use of pathway data from public databases for network modelling (Box 1; Table I). Finally, in the section 'Conclusions and perspectives' we discuss the strengths and limitations of the current pathway databases and their usefulness in practical biological problems and applications.

Box 1. Use of data from public pathway databases for modelling purposes

Most publicly available pathway databases offer their data in BioPAX format, which was developed for detailed pathway representation and as a data exchange format. For storing and sharing computational models of biological networks, SBML has emerged as the standard and is supported by most modelling software. BioPAX and SBML, the two main standards for the representation of biological networks, have been discussed in detail by others (Stromback and Lambrix, 2005; Stromback et al, 2006). In Table I, we briefly list the most important features of the SBML and BioPAX standards. A scenario in which pathway data are directly used for network modelling is proposed here. One or more pathways represented in BioPAX format are automatically retrieved from different databases and imported into a pathway visualization and analysis tool. Then, integration of the different pathways can take place to obtain a comprehensive and biologically meaningful representation of the network. In addition, annotations can be added if required or structural analysis of the network can be carried out.
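The integration step of this scenario can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular tool: the `Interaction` record, the `normalise` helper, the database labels and the synonym table are all hypothetical, and real integration would also have to reconcile complexes, entity states and cellular locations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One directed interaction; the field names are illustrative only."""
    source: str
    target: str
    effect: str  # for example "activation" or "inhibition"


def normalise(identifier, synonyms):
    """Map an identifier to a canonical upper-case name via a synonym table."""
    key = identifier.strip().upper()
    return synonyms.get(key, key)


def merge_pathways(pathways, synonyms):
    """Merge interactions from several sources.

    pathways maps a database label to a list of Interaction records.
    The result maps each normalised Interaction to the sorted list of
    database labels that report it.
    """
    merged = {}
    for database, interactions in pathways.items():
        for item in interactions:
            key = Interaction(
                normalise(item.source, synonyms),
                normalise(item.target, synonyms),
                item.effect,
            )
            merged.setdefault(key, set()).add(database)
    return {key: sorted(sources) for key, sources in merged.items()}


# Hypothetical input: two sources naming the same receptor differently.
EXAMPLE = {
    "db_one": [Interaction("EGF", "EGFR", "activation")],
    "db_two": [Interaction("egf", "ERBB1", "activation"),
               Interaction("EGFR", "GRB2", "activation")],
}
SYNONYMS = {"ERBB1": "EGFR"}
```

Calling `merge_pathways(EXAMPLE, SYNONYMS)` collapses the two spellings of the first interaction into one record supported by both sources, which shows why consistent identifiers matter for automatic integration.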
The resulting network, which integrates the original pathways retrieved from the databases, is exported to SBML format and subjected to modelling. If a quantitative approach is chosen, additional information, such as rate constants, is required to start the modelling process. In this process, conversion between the two formats is required to achieve inter-operability between pathway and model representations. Some solutions are already available. The BioModels database (http://www.ebi.ac.uk/biomodels-main/), which contains a variety of curated models in SBML format, offers conversion to BioPAX format. The opposite conversion, from BioPAX to SBML, would open the possibility of modelling the pathways stored in public databases. However, the inter-conversion between BioPAX and SBML is not trivial, as the two formats were developed for different purposes. BioPAX, for instance, does not offer the possibility to store the quantitative information needed for kinetic modelling, whereas SBML does not represent relationships between nodes that are not needed for modelling and that are present in BioPAX. Examples of approaches for the conversion from BioPAX to SBML are BiNoM (Zinovyev et al, 2008), which is available as a Cytoscape plugin, and SyBil, which is part of the quantitative modelling environment VCell (Evelo, 2009). Although compatibility of different pathway and network model exchange formats is still not completely achieved, the efforts made towards this goal represent significant contributions to pathway retrieval, integration and subsequent modelling.

Table 1. Comparison between SBML and BioPAX

| Feature | SBML | BioPAX |
| --- | --- | --- |
| Representation format | XML (Extensible Markup Language) | OWL (Web Ontology Language), XML |
| Main purpose | Representation of computational models of biological networks | Pathway description with all details on reactions, components, information on cellular location etc. |
| Entities and reactions | Based on species and reactions (Hucka et al., 2003): Species (proteins, small molecules etc.); Reactions (how species interact); Compartment (in which interactions take place) | Basic ontology based on three classes (http://www.biopax.org/): Pathway (set of interactions); Physical entity with subclasses, such as RNA, DNA, protein, complex and small molecules; Interaction with subclasses, such as conversion having biochemicalReaction as subclass, etc. |
| Number of pathways represented | One model per SBML file | Several pathways per BioPAX file possible (each object has its own RDF id and is hence uniquely identifiable) |
| Reaction kinetics | Allows representation of kinetics, including parameters for reaction rates, initial concentrations etc. | No kinetics, as BioPAX is not meant for modelling but for pathway representation |
| Levels | Built in levels with different versions. Each level adds new features, such as the incorporation of controlled vocabularies. At the time of writing, the most stable version is SBML Level 2 | BioPAX Level 1: representation of chemical reactions involved in metabolism. BioPAX Level 2: adds molecular interactions and protein post-translational modifications. BioPAX Level 3: any kind of biological reaction, including regulation of gene expression (BioPAX L3 is at the time of writing still in release process). The BioPAX project roadmap envisages two additional levels capturing interactions at the cellular level (http://www.biopax.org/Docs/BioPAX_Roadmap.html) |
| Pathway database support | Reactome; KEGG | Reactome; KEGG (only BioPAX Level 1); PID; PathwayCommons |
| Model database support | BioModels | BioModels (conversion from SBML to BioPAX possible) |
| Library for reading/writing | libSBML (Bornstein et al, 2008) | Paxtools (http://www.biopax.org/paxtools/) |
| Software support | Standard modelling software, such as CellDesigner or Copasi (Hoops et al, 2006); network visualization software, such as Cytoscape | Network visualization software, such as Cytoscape or VisANT |

Pathway databases

Pathway databases serve as repositories of current knowledge on cell signalling. They present pathways in a graphical format comparable to the representation in textbooks, as well as in standard formats allowing exchange between different software platforms and further processing by network analysis, visualization and modelling tools. At present, there exists a wide variety of databases containing biochemical reactions, such as signalling pathways or protein–protein interactions. The Pathguide resource serves as a good overview of current pathway databases (Bader et al, 2006). More than 200 pathway repositories are listed, of which over 60 specialize in human reactions. However, only half of them provide pathways and reactions in the computer-readable formats needed for automatic retrieval and processing, and even fewer support standard formats, such as Biological Pathway Exchange (BioPAX) (http://www.biopax.org) and Systems Biology Markup Language (SBML) (Hucka et al, 2003). To obtain a complete view of the biological process of interest, combination of information from diverse reactions and pathways is often needed. A recent publication (Adriaens et al, 2008) describes a workflow developed for gathering and curating all information on a pathway to obtain a broad and correct representation. However, the described process heavily relies on manual intervention.
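To illustrate what a computer-readable format makes possible, the sketch below reads a tiny BioPAX-like RDF/XML document with Python's standard library and tallies the entities by class. Several things here are assumptions: the embedded `SAMPLE` document is hand-written and heavily simplified, the namespace URI in `BP` is assumed to be the BioPAX Level 2 namespace, and the `summarise` helper is hypothetical. In practice a dedicated library such as Paxtools would be the more robust choice.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed namespace URIs (BioPAX Level 2 and RDF).
BP = "http://www.biopax.org/release/biopax-level2.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hand-written, simplified sample; not exported from any real database.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
  <bp:protein rdf:ID="protein1"><bp:NAME>EGFR</bp:NAME></bp:protein>
  <bp:protein rdf:ID="protein2"><bp:NAME>GRB2</bp:NAME></bp:protein>
  <bp:smallMolecule rdf:ID="sm1"><bp:NAME>ATP</bp:NAME></bp:smallMolecule>
  <bp:biochemicalReaction rdf:ID="reaction1">
    <bp:NAME>receptor phosphorylation</bp:NAME>
  </bp:biochemicalReaction>
</rdf:RDF>"""


def summarise(xml_text):
    """Count top-level BioPAX elements per class and collect names by rdf:ID."""
    root = ET.fromstring(xml_text)
    prefix = "{" + BP + "}"
    counts = Counter()
    names = {}
    for element in root:
        if not element.tag.startswith(prefix):
            continue  # skip anything outside the BioPAX namespace
        counts[element.tag[len(prefix):]] += 1
        rdf_id = element.get("{" + RDF + "}ID")
        if rdf_id is not None:
            names[rdf_id] = element.findtext(prefix + "NAME")
    return counts, names
```

Running `summarise(SAMPLE)` yields a per-class tally and a lookup from identifier to name, the kind of first inspection step that is not possible when a database offers only images or free text.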
Consequently, there is a need for the automation of both the pathway retrieval process and the integration of different data sources. This section is devoted to the description of the main pathway databases: Reactome, Kyoto Encyclopedia of Genes and Genomes (KEGG), WikiPathways, Nature Pathway Interaction Database (PID) and Pathway Commons. Table II lists all pathway databases and protein–protein interaction resources that are mentioned in this section.

Table 2. Online pathway and protein–protein interaction (PPI) databases

| Pathway/PPI database | Web link | Standard exchange formats for download | Web service API |
| --- | --- | --- | --- |
| Reactome | http://www.reactome.org | BioPAX Level 2; BioPAX Level 3 (only some reactions); SBML Level 2 | SOAP web service API; detailed user manual available, example client in Java |
| KEGG | http://www.genome.jp/kegg/pathway.html | KGML (default format); BioPAX Level 1 (only metabolic reactions); SBML (using converter); GPML (using converter) | SOAP web service API; example client in Java, Ruby, Perl; direct import into Cytoscape |
| WikiPathways | http://www.wikipathways.org | GPML (default format); converters to standards, such as SBML and BioPAX, are in progress | SOAP web service API; example clients in Java, Perl, Python, R |
| NCI/Nature Pathway Interaction Database (PID) | http://pid.nci.nih.gov | PID XML (default format); BioPAX Level 2 | Access through Pathway Commons |
| BioCarta | http://www.biocarta.com | BioPAX Level 2 through NCI/Nature Pathway Interaction Database (PID) | |
| Pathway Commons | http://www.pathwaycommons.org | BioPAX Level 2 (default format for pathways); PSI-MI (default format for protein–protein interactions) | HTTP URL-based XML web service through cPath; direct import into Cytoscape |
| Cancer Cell Map | http://cancer.cellmap.org | BioPAX Level 2 | HTTP URL-based XML web service via cPath |
| HumanCyc | http://humancyc.org | BioPAX Level 2; BioPAX Level 3 | Access through Pathway Commons and Pathway Tools (Karp et al, 2002) |
| IntAct | www.ebi.ac.uk/intact/ | PSI-MI | Access through Pathway Commons |
| HPRD | http://www.hprd.org | PSI-MI | Access through Pathway Commons |
| MINT | http://mint.bio.uniroma2.it/mint/ | PSI-MI | Access through Pathway Commons |

Reactome

Reactome is currently one of the most complete and best-curated pathway databases. It covers reactions for any type of biological process and organizes them in a hierarchical manner. In this hierarchy, the lower level corresponds to single reactions, whereas the upper level represents the pathway as a whole. Reactome was first developed as an open source database for pathways and interactions in humans. Equivalent reactions for other species are inferred from the human data (Vastrik et al, 2007), providing coverage of 22 non-human species, including mouse, rat, chicken, puffer fish, worm, fly, yeast, and Escherichia coli. Furthermore, other Reactome projects exist focusing on single species, such as the Arabidopsis Reactome (http://www.arabidopsisreactome.org). All pathway and reaction data in Reactome are extracted from biomedical experiments and literature. For this purpose, PhD-level biologists are invited to work together with the Reactome curators and editors on the curation of data on selected biological processes. Once the first outline of the biological process is created and annotated, it is inspected by peer reviewers and potential inconsistencies and errors are fixed. Every two years the data are reviewed to keep them up to date (Joshi-Tope et al, 2005; Matthews et al, 2009). Moreover, cross references to different databases, such as UniProt (The UniProt Consortium, 2008), Ensembl (http://www.ensembl.org/index.html), NCBI (http://www.ncbi.nlm.nih.gov), Gene Ontology (GO) (Ashburner et al, 2000), Entrez Gene (Maglott et al, 2007), UCSC Genome Browser (http://genome.ucsc.edu), HapMap (http://www.hapmap.org) and PubMed, as well as to other pathway databases, such as KEGG (Kanehisa and Goto, 2000), are provided.
Pathways are presented as chains of chemical reactions and the same data model is used to describe reactions for any biological process, such as transcription, catalysis or binding (Matthews et al, 2007). Altogether, this represents a coherent view of pathway knowledge. The data model is based on classes, such as physical entity or event. Physical entities comprise proteins, DNA, RNA and small molecules, but also complexes of single entities. Proteins, RNA and DNA for which the sequence is known are linked to the appropriate databases. Chemical entities such as small molecules are linked to ChEBI (http://www.ebi.ac.uk/chebi/init.do). An event can be either a ReactionLikeEvent, which represents reactions that convert an input into an output, or a PathwayLikeEvent, which groups together several related events. Each class possesses properties, such as information on the type of interaction (e.g. inhibition or activation). Reactome explicitly considers the different states an entity can show in a reaction. The phosphorylated and the unphosphorylated versions of a protein are, for example, represented as separate entities. In addition, generalization is allowed. This means that if two different entities have exactly the same function in a reaction, such as isoenzymes, the reaction is described only once and the functionally equivalent entities belong to the same defined set. Another interesting element of the Reactome data model is the use of candidate sets, which act as placeholders for all possible entities in a reaction, in case the exact entity involved in the reaction is not yet known. Reactome can either be browsed directly or queried by text search using, for instance, UniProt accession numbers. In addition, some tools for advanced queries are provided. The PathFinder tool allows connecting an input to an output molecule or event by constructing the shortest path between both.
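The idea of connecting an input to an output by a shortest path can be illustrated with a breadth-first search over a directed graph. This sketch makes no claim about how PathFinder is actually implemented: the `shortest_path` function and the abstract node names in `TOY_NETWORK` are hypothetical, and every edge is treated as having the same cost.

```python
from collections import deque

# Hypothetical directed network with a short and a long route to "output".
TOY_NETWORK = {
    "input": ["r1", "r2"],
    "r1": ["x"],
    "x": ["output"],
    "r2": ["y"],
    "y": ["z"],
    "z": ["output"],
}


def shortest_path(graph, start, goal):
    """Return a shortest path from start to goal as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # remembers how each node was first reached
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in graph.get(node, []):
            if successor in previous:
                continue
            previous[successor] = node
            if successor == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(successor)
    return None
```

Because breadth-first search explores nodes in order of distance from the start, the first time the goal is reached the path found has the fewest steps; here `shortest_path(TOY_NETWORK, "input", "output")` follows the route through `r1` and `x` rather than the longer one through `r2`.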
The SkyPainter tool can be used to identify events or pathways that are statistically over-represented for a list of genes or proteins. Moreover, Reactome data can be combined with other databases, such as UniProt, by using the Reactome BioMart (http://www.biomart.org) tool. In addition to browsing pathways through the Reactome web interface, it is possible to download the data for local visualization and analysis using other tools. Different formats are provided for pathway download, including SBML Level 2, and BioPAX Level 2 and Level 3 (for some reactions only), as well as graphical formats. Pathway files in BioPAX format, for instance, can be directly opened in Cytoscape (Shannon et al, 2003), a software tool for the visualization and analysis of networks. Moreover, data can be programmatically accessed through a SOAP web service.

KEGG

KEGG is not only a database for pathways but consists of 19 highly interconnected databases, containing genomic, chemical and phenotypic information (Kanehisa and Goto, 2000; Kanehisa et al, 2008). Here we concentrate on the database storing biological pathways. KEGG categorizes its pathways into metabolic processes, genetic information processing, environmental information processing (including signalling pathways), cellular processes, information on human diseases and drug development. However, the best-organized and most complete information can be found for metabolic pathways. KEGG is not organism specific but covers a wide range of organisms, including human. The pathways are manually curated by experts using literature. In addition to the interconnection of all databases underlying KEGG, links to external databases, such as NCBI Entrez Gene, OMIM, UniProt and GO, are provided. Pathways can either be browsed or queried by free text search. The user can search for gene names, chemical compounds or whole pathways.
A tutorial on how to browse pathways in KEGG and an overview of the multiple representation formats is available (Aoki-Kinoshita and Kanehisa, 2007). Each pathway stored in KEGG can be downloaded in its own XML format named KGML, which is supported by VisANT, a software tool for pathway visualization (Hu et al, 2008b), and indirectly by Cytoscape using scripting plugins. In addition, metabolic pathways are available in BioPAX Level 1, which was especially designed for metabolic reactions, as well as in SBML. For converting KEGG metabolic pathways to SBML, a tool called KEGG2SBML (http://sbml.org/Software/KEGG2SBML) was developed. KEGG data can also be accessed using the KEGG API or KEGG FTP. Moreover, several applications exist for making use of the KEGG resources. KegArray, for example, allows the analysis of microarray data in the context of KEGG pathways.

WikiPathways

A recently developed resource for pathway information that strongly differs from other pathway repositories is WikiPathways. WikiPathways is an open source project based, like Wikipedia, on the MediaWiki software (Pico et al, 2008). It serves as an open and collaborative platform for the creation, editing and curation of biological pathways in different species. WikiPathways aims to achieve a public commitment to pathway storage and curation by keeping the pathway creation and curation processes simple. Whereas curation in the previously described databases is carried out by experts, any user with an account on WikiPathways can create new pathways and edit existing ones. The pathway entities are linked to reference databases, based on the criteria provided by the editor. Hence, the identifiers depend on the chosen reference database and can therefore differ between pathways and even within a single pathway. Pathways in WikiPathways can be browsed by species and categories, for example, Metabolic Process.
They can also be searched using gene, protein or pathway name or any free text query. In addition, pathways can be programmatically accessed through a web service (http://www.wikipathways.org/index.php/Help:WikiPathways_Webservice). For pathway data exchange, WikiPathways does not use standard formats like BioPAX or SBML, but offers a much simpler representation called GenMAPP Pathway Markup Language (GPML) that is compatible with visualization and analysis tools, such as Cytoscape, GenMAPP (Salomonis et al, 2007) and PathVisio (van Iersel et al, 2008). The use of GPML is in agreement with the community annotation nature of the project, as it offers a simple pathway representation and several functionalities for building network diagrams. However, inter-operability with other pathway databases is impeded, and substantial efforts towards combining WikiPathways with the other pathway repositories will be required. In this regard, some approaches for conversion between GPML and standard pathway exchange formats, such as SBML and BioPAX, are under development (Evelo, 2009). In addition, KEGG pathways in KGML format are also available in GPML format ready for download (http://www.pathvisio.org/Download#Step_3) or can be converted into GPML (http://www.bigcat.unimaas.nl/tracprojects/pathvisio/wiki/KeggConverter). The exponential growth of biological data poses a challenge to the high-quality annotation and curation of databases. In this scenario, the use of wikis for community curation of biological data has emerged in the past years with the goal of increasing the quality of data annotation by combining knowledge from multiple experts (Giles, 2007; Waldrop, 2008; Hu et al, 2008a). However, their success will strongly depend on the commitment of the community, and the WikiPathways authors state that the initiative represents an experiment in which the 'community curation' approach is being tested (Pico et al, 2008).
Thus, WikiPathways can be seen as a complementary and enhancing source of information for the major pathway databases, like Reactome or KEGG.

In contrast to the aforementioned databases, the systems described below combine diverse pathway repositories, and can be seen as first attempts towards the integration of pathway information from various sources.

Nature Pathway Interaction Database

PID contains data on cell signalling in humans (Schaefer et al, 2009). PID combines three different sources: the NCI-curated pathways that are obtained from peer-reviewed literature, as well as pathways imported from Reactome and BioCarta. Similar to Reactome, PID structures pathways hierarchically into pathways and their sub-pathways, which are called sub-networks in PID. The PID data model is based on molecular interactions in which input biomolecules are transformed into output biomolecules. Each process can be promoted or inhibited by regulators. Biomolecules are proteins, RNA, complexes or small molecules. DNA is not a part of the PID data model, and only the output RNA and the regulator are represented in transcriptional processes. Each protein is cross-referenced to UniProt, RNA to Entrez Gene and small molecules to the Chemical Abstracts Service (CAS) registry number, and complexes are annotated using GO terms. Different states of biomolecules, such as 'active/inactive' or 'phosphorylated', are part of the annotations of the biomolecule. Cellular location
“Pathway databases and tools for their exploitation: benefits, current limitations and challenges” is a paper by Anna Bauer‐Mehren, Laura I. Furlong and Ferrán Sanz, published in 2009. It has an Open Access status of “gold”.