Anna Bauer‐Mehren

Here are all the papers by Anna Bauer‐Mehren that you can download and read on OA.mg.
DOI: 10.1093/database/bav028
2015
Cited 865 times
DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes
DisGeNET is a comprehensive discovery platform designed to address a variety of questions concerning the genetic underpinning of human diseases. DisGeNET contains over 380 000 associations between >16 000 genes and 13 000 diseases, which makes it one of the largest repositories currently available of its kind. DisGeNET integrates expert-curated databases with text-mined data, covers information on Mendelian and complex diseases, and includes data from animal disease models. It features a score based on the supporting evidence to prioritize gene-disease associations. It is an open access resource available through a web interface, a Cytoscape plugin and as a Semantic Web resource. The web interface supports user-friendly data exploration and navigation. DisGeNET data can also be analysed via the DisGeNET Cytoscape plugin, and enriched with the annotations of other plugins of this popular network analysis software suite. Finally, the information contained in DisGeNET can be expanded and complemented using Semantic Web technologies and linked to a variety of resources already present in the Linked Data cloud. Hence, DisGeNET offers one of the most comprehensive collections of human gene-disease associations and a valuable set of tools for investigating the molecular mechanisms underlying diseases of genetic origin, designed to fulfill the needs of different user profiles, including bioinformaticians, biologists and health-care practitioners. Database URL: http://www.disgenet.org/
DOI: 10.1371/journal.pone.0124653
2015
Cited 268 times
Proton Pump Inhibitor Usage and the Risk of Myocardial Infarction in the General Population
Background and Aims: Proton pump inhibitors (PPIs) have been associated with adverse clinical outcomes amongst clopidogrel users after an acute coronary syndrome. Recent pre-clinical results suggest that this risk might extend to subjects without any prior history of cardiovascular disease. We explore this potential risk in the general population via data-mining approaches. Methods: Using a novel approach for mining clinical data for pharmacovigilance, we queried over 16 million clinical documents on 2.9 million individuals to examine whether PPI usage was associated with cardiovascular risk in the general population. Results: In multiple data sources, we found gastroesophageal reflux disease (GERD) patients exposed to PPIs to have a 1.16-fold increased association (95% CI 1.09–1.24) with myocardial infarction (MI). Survival analysis in a prospective cohort found a two-fold (HR = 2.00; 95% CI 1.07–3.78; P = 0.031) increase in association with cardiovascular mortality. We found that this association exists regardless of clopidogrel use. We also found that H2 blockers, an alternate treatment for GERD, were not associated with increased cardiovascular risk; had they been in place, such pharmacovigilance algorithms could have flagged this risk as early as the year 2000. Conclusions: Consistent with our pre-clinical findings that PPIs may adversely impact vascular function, our data-mining study supports the association of PPI exposure with risk for MI in the general population. These data provide an example of how a combination of experimental studies and data-mining approaches can be applied to prioritize drug safety signals for further investigation.
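As a rough illustration of how a fold-association with a 95% confidence interval, like the one reported in the abstract above, can be derived from a two-by-two exposure/outcome table, the sketch below computes an odds ratio with a Wald interval on the log scale. This is an assumption for illustration only: the abstract does not state which estimator the authors used, the function name odds_ratio_ci is hypothetical, and the counts in the usage line are made up rather than taken from the study.

```python
import math

def odds_ratio_ci(exposed_cases, exposed_noncases,
                  unexposed_cases, unexposed_noncases, z=1.96):
    """Return (odds_ratio, lower, upper) using a Wald interval on the log scale.

    Hypothetical helper; all four counts must be positive.
    """
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases)
    # Standard error of log(odds ratio) for a 2x2 table.
    se = math.sqrt(1 / exposed_cases + 1 / exposed_noncases
                   + 1 / unexposed_cases + 1 / unexposed_noncases)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Made-up counts, purely to show the call; not data from the paper.
estimate, low, high = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {estimate:.2f} (95% CI {low:.2f}-{high:.2f})")
```

An interval that excludes 1 is what supports calling an association increased; a real analysis of clinical records would also need to adjust for confounders, which this sketch does not attempt.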
DOI: 10.1038/clpt.2013.24
2013
Cited 233 times
Performance of Pharmacovigilance Signal-Detection Algorithms for the FDA Adverse Event Reporting System
Signal-detection algorithms (SDAs) are recognized as vital tools in pharmacovigilance. However, their performance characteristics are generally unknown. By leveraging a unique gold standard recently made public by the Observational Medical Outcomes Partnership (OMOP) and by conducting a unique systematic evaluation, we provide new insights into the diagnostic potential and characteristics of SDAs that are routinely applied to the US Food and Drug Administration (FDA) Adverse Event Reporting System (AERS). We find that SDAs can attain reasonable predictive accuracy in signaling adverse events. Two performance classes emerge, indicating that the class of approaches that address confounding and masking effects benefits safety surveillance. Our study shows that not all events are equally detectable, suggesting that specific events might be monitored more effectively using other data sources. We provide performance guidelines for several operating scenarios to inform the trade-off between sensitivity and specificity for specific use cases. We also propose an approach and demonstrate its application in identifying optimal signaling thresholds, given specific misclassification tolerances. Clinical Pharmacology & Therapeutics (2013); 93(6), 539–546. doi:10.1038/clpt.2013.24
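The idea of choosing a signaling threshold under a misclassification tolerance can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from the paper: each drug-event pair is assumed to carry a numeric signal score and a gold-standard label, the tolerance is expressed as a maximum false-positive rate, and the function names and toy values are hypothetical.

```python
def confusion_rates(scores, labels, threshold):
    """Sensitivity and specificity when pairs with score >= threshold are signaled."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and not y)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

def best_threshold(scores, labels, max_false_positive_rate):
    """Return (threshold, sensitivity, specificity) with the highest sensitivity
    among thresholds whose false-positive rate stays within the tolerance,
    or None if no candidate threshold qualifies."""
    best = None
    for t in sorted(set(scores)):  # candidate thresholds: the observed scores
        sens, spec = confusion_rates(scores, labels, t)
        if 1.0 - spec <= max_false_positive_rate:
            if best is None or sens > best[1]:
                best = (t, sens, spec)
    return best

# Toy scores and gold-standard labels (True = known adverse event); illustrative only.
toy_scores = [0.5, 1.2, 2.1, 2.8, 3.5, 4.0]
toy_labels = [False, False, True, False, True, True]
print(best_threshold(toy_scores, toy_labels, max_false_positive_rate=0.34))
```

Tightening the tolerance pushes the threshold up and trades sensitivity for specificity, which is the trade-off the abstract refers to.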
DOI: 10.1093/bioinformatics/btq538
2010
Cited 189 times
DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene–disease networks
DisGeNET is a plugin for Cytoscape to query and analyze human gene-disease networks. DisGeNET allows user-friendly access to a new gene-disease database that we have developed by integrating data from several public sources. DisGeNET permits queries restricted to (i) the original data source, (ii) the association type, (iii) the disease class or (iv) specific gene(s)/disease(s). It represents gene-disease associations in terms of bipartite graphs and provides gene centric and disease centric views of the data. It assists the user in the interpretation and exploration of the genetic basis of human diseases by a variety of built-in functions. Moreover, DisGeNET permits multicolouring of nodes (genes/diseases) according to standard disease classification for expedient visualization. DisGeNET is compatible with Cytoscape 2.6.3 and 2.7.0. Please visit http://ibi.imim.es/DisGeNET/DisGeNETweb.html for the installation guide, user tutorial and download.
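The bipartite representation mentioned in the abstract above can be sketched in a few lines. The example assumes that a disease centric view links two diseases when they share at least one associated gene; that is one common way to project a bipartite graph and is not confirmed by the abstract. The gene and disease names, and the function names build_views and disease_projection, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_views(pairs):
    """Index gene-disease pairs from both sides of the bipartite graph."""
    gene_to_diseases = defaultdict(set)
    disease_to_genes = defaultdict(set)
    for gene, disease in pairs:
        gene_to_diseases[gene].add(disease)
        disease_to_genes[disease].add(gene)
    return gene_to_diseases, disease_to_genes

def disease_projection(disease_to_genes):
    """Assumed disease centric view: connect two diseases that share a gene."""
    edges = {}
    for d1, d2 in combinations(sorted(disease_to_genes), 2):
        shared = disease_to_genes[d1] & disease_to_genes[d2]
        if shared:
            edges[(d1, d2)] = shared
    return edges

# Hypothetical associations, not taken from the DisGeNET database.
toy_pairs = [("G1", "D1"), ("G1", "D2"), ("G2", "D2"), ("G3", "D3")]
genes, diseases = build_views(toy_pairs)
print(disease_projection(diseases))
```

A gene centric projection follows the same pattern with the roles of the two dictionaries swapped.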
DOI: 10.1038/msb.2009.47
2009
Cited 177 times
Pathway databases and tools for their exploitation: benefits, current limitations and challenges
Perspective, 28 July 2009, Open Access. Anna Bauer-Mehren, Laura I Furlong (corresponding author) and Ferran Sanz. Research Unit on Biomedical Informatics (GRIB), IMIM-Hospital del Mar, Universitat Pompeu Fabra, Barcelona Biomedical Research Park, Dr Aiguader 88, Barcelona, Spain. *Corresponding author. Research Unit on Biomedical Informatics, Universitat Pompeu Fabra, IMIM-Hospital del Mar, PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain.
Tel.: +34 9331 60521; Fax: +34 9331 60550; E-mail: [email protected]. Molecular Systems Biology (2009) 5:290. https://doi.org/10.1038/msb.2009.47

In past years, comprehensive representations of cell signalling pathways have been developed by manual curation from literature, which requires huge effort and would benefit from information stored in databases and from automatic retrieval and integration methods. Once a reconstruction of the network of interactions is achieved, analysis of its structural features and its dynamic behaviour can take place. Mathematical modelling techniques are used to simulate the complex behaviour of cell signalling networks, which ultimately sheds light on the mechanisms leading to complex diseases or helps in the identification of drug targets. A variety of databases containing information on cell signalling pathways have been developed in conjunction with methodologies to access and analyse the data. In principle, the scenario is prepared to make the most of this information for the analysis of the dynamics of signalling pathways. However, are the knowledge repositories of signalling pathways ready to realize the systems biology promise? In this article we aim to initiate this discussion and to provide some insights on this issue.

Introduction

The past decades of research have led to a better understanding of the processes involved in cell signalling. Cell signalling refers to the biochemical processes by which cells respond to cues in their internal or external environment (Alberts et al, 2007). With the advent of high throughput experimentation, the identification and characterization of the molecular components involved in cell signalling became possible in a systematic way.
In addition, the discovery of the connections between each of these components promoted the reconstruction of the chain of reactions, which subsequently gives rise to a signalling pathway. Ultimately, our ability to interpret the function and regulation of cell signalling pathways is crucial for understanding the ways in which cells respond to external cues and how they communicate with each other. In this regard, the systematic collection of pathway information in the form of pathway databases and the application of mathematical analysis for pathway modelling are crucial. Several databases containing information on cell signalling pathways have been developed in conjunction with methodologies to access and analyse the data (Suderman and Hallett, 2007). Furthermore, mathematical modelling emerged as a solution to study the complex behaviour of networks (Alves et al, 2006; Fisher and Henzinger, 2007; Karlebach and Shamir, 2008). The models obtained so far allow the formulation of hypotheses that can be tested in the laboratory. Iterative cycles of prediction and experimental verification have resulted in the refinement of our knowledge of cell signalling, and have shed light on different aspects of cell signalling at a systems level (regulatory aspects, such as feedback control circuits, or architectural features, such as modularity). Furthermore, signalling cascades are not isolated units within the cell, but form part of a mesh of interconnected networks through which the signal elicited by an environmental cue can traverse (Yaffe, 2008). Ultimately, each cell is exposed to a variety of signalling cues, and the specificity of the response will be determined by the signalling mechanisms that are activated by the cue (Alberts et al, 2007).
Recent research highlights the importance of the so-called crosstalk between pathways, such as the recently published connections between signalling through the purinergic receptors and the Ca2+ sensing (Chaumont et al, 2008); the link between extracellular glycocalyx structure and nitric oxide signalling pathway (Tarbell and Ebong, 2008); the interactions between insulin and epidermal growth factor signalling (Borisov et al, 2009) and the crosstalk between phosphoinositide 3 kinase and Ras/extracellular signal-regulated kinase signalling pathways (Wang et al, 2009). An important goal of this research is to achieve a reconstruction of the network of interactions that gives rise to a signalling pathway in a biologically consistent and meaningful manner that in turn allows the mathematical analysis of the emerging properties of the network. In this regard, comprehensive maps of signalling pathways have been developed by manual curation from literature (Oda et al, 2005; Oda and Kitano, 2006; Calzone et al, 2008). Building such reference maps requires huge effort and would benefit from information stored in databases and from automatic retrieval and integration methods. Once a reconstruction of the network of interactions is achieved, analysis of the structural features of the network and its dynamic behaviour can take place. A commonly seen architecture of signalling pathways is called ‘bow-tie’, in which many input and output signals are handled by a common layer constituted by a small number of conserved components. This network architecture provides robustness and flexibility to a variety of external cues due to the redundancy of reactions that are part of the input and output layers (Kitano, 2007a). Robustness refers to the ability of an organism to compensate for the effects of perturbations to maintain the organism's functions (Kitano, 2007b). Such perturbations can be changes in the availability of nutrients as well as the presence of mutagens or toxins.
Moreover, systems can be subjected to functional disruptions when facing perturbations for which they are not optimized, thus showing points of fragility of the biological system (Kitano, 2007b). For instance, an undesired effect of a drug can be caused by the unwanted interaction of the drug with molecules that represent points of fragility of the physiological system (Kitano, 2007a). In contrast, drugs can be completely ineffective when the robustness of the system compensates their action. It has been suggested that crosstalks between signalling pathways contribute to the robustness of cells against perturbations (Kitano, 2007a). In addition, the points of fragility of the system are sometimes exploited by pathogens causing diseases, or represent processes that are usually found to malfunction in particular diseases, such as cancer. Diseases that arise from dysfunction in cell signalling are usually not attributed to a single gene but to the failure of emerging control mechanisms in the network. It has been reported that the loss of negative feedback loops characterizes solid tumours (Amit et al, 2007). These diseases are difficult to diagnose and treat unless accurate understanding of the underlying principles regulating the system is in place. Thus, the interpretation of the global properties of signalling pathways has important implications for the elucidation of the mechanisms that lead to complex diseases, and also for the identification of drug targets. At present, there are several repositories of information on cell signalling pathways that cover a wide range of signal transduction mechanisms and include high quality data in terms of annotation and cross references to biological databases. In principle, the scenario is prepared to make use of the information for the analysis of the behaviour of the signalling pathways. Thus, are the knowledge repositories on signalling pathways ready to realize the systems biology promise? 
In this article, we aim to initiate this discussion and to provide some insights on this issue. First, we present an analytical overview of current pathway databases (see Pathway databases). In section ‘Case study: EGFR signalling’, we present the results of an evaluation exercise conducted to determine the accuracy and completeness of current pathway databases against an expert-curated pathway used as ‘gold standard’. Moreover, we propose a strategy for the use of pathway data from public databases for network modelling (Box 1; Table I). Finally, in the section ‘Conclusions and perspectives’ we discuss the strengths and limitations of the current pathway databases and their usefulness in practical biological problems and applications.

Box 1. Use of data from public pathway databases for modelling purposes

Most publicly available pathway databases offer their data in BioPAX format, which was developed for detailed pathway representation and as a data exchange format. For storing and sharing computational models of biological networks, SBML has emerged as the standard and is supported by most modelling software. BioPAX and SBML, the two main standards for the representation of biological networks, have been discussed in detail by others (Stromback and Lambrix, 2005; Stromback et al, 2006). In Table I, we briefly list the most important features of the SBML and BioPAX standards. A scenario in which pathway data are directly used for network modelling is proposed here. One or more pathways represented in BioPAX format are automatically retrieved from different databases and imported into a pathway visualization and analysis tool. Then, integration of the different pathways can take place to obtain a comprehensive and biologically meaningful representation of the network. In addition, annotations can be added if required or structural analysis of the network can be carried out.
The resulting network, which integrates the original pathways retrieved from the databases, is exported to SBML format and subjected to modelling. If a quantitative approach is chosen, additional information, such as rate constants, is required to start the modelling process. In this process, conversion between the two formats is required to achieve inter-operability between pathway and model representations. Some solutions are already available. The BioModels (http://www.ebi.ac.uk/biomodels-main/) database, which contains a variety of curated models in SBML format, offers conversion to BioPAX format. The opposite conversion, from BioPAX to SBML, would open the possibility of modelling the pathways stored in public databases. However, the inter-conversion between BioPAX and SBML is not trivial as both formats were developed for different purposes. BioPAX, for instance, does not offer the possibility to store quantitative information needed for kinetic modelling, whereas SBML does not represent relationships between nodes that are not needed for modelling and that are present in BioPAX. Examples of approaches for the conversion from BioPAX to SBML are BiNoM (Zinovyev et al, 2008), which is available as a Cytoscape plugin, and SyBil, which is part of the model environment for quantitative modelling VCell (Evelo, 2009). Although compatibility of different pathway and network model exchange formats is still not completely achieved, the efforts made towards this goal represent significant contributions to pathway retrieval, integration and subsequent modelling.

Table 1. Comparison between SBML and BioPAX

Feature | SBML | BioPAX
Representation format | XML (Extensible Markup Language) | OWL (Web Ontology Language), XML
Main purpose | Representation of computational models of biological networks | Pathway description with all details on reactions, components, information on cellular location etc.
Entities and reactions | Based on species and reactions (Hucka et al., 2003): Species (proteins, small molecules etc.); Reactions (how species interact); Compartment (in which interactions take place) | Basic ontology based on three classes (http://www.biopax.org/): Pathway (set of interactions); Physical entity with subclasses, such as RNA, DNA, protein, complex and small molecules; Interaction with subclasses, such as conversion having biochemicalReaction as subclass, etc.
Number of pathways represented | One model per SBML file | Several pathways per BioPAX file possible (each object has its own RDF id and is hence uniquely identifiable)
Reaction kinetics | Allows representation of kinetics, including parameters for reaction rates, initial concentrations etc. | No kinetics as BioPAX is not meant for modelling but pathway representation
Levels | Built in levels with different versions. Each level adds new features, such as the incorporation of controlled vocabularies. At the time of writing, the most stable version is SBML Level 2 | BioPAX Level 1: representation of chemical reactions involved in metabolism; BioPAX Level 2: adds molecular interactions and protein post-translational modifications; BioPAX Level 3: any kind of biological reaction, including regulation of gene expression (BioPAX L3 is at the time of writing still in release process). The BioPAX project roadmap envisages two additional levels capturing interactions at the cellular level (http://www.biopax.org/Docs/BioPAX_Roadmap.html)
Pathway database support | Reactome; KEGG | Reactome; KEGG (only BioPAX Level 1); PID; PathwayCommons
Model database support | BioModels | BioModels (conversion from SBML to BioPAX possible)
Library for reading/writing | libSBML (Bornstein et al, 2008) | Paxtools (http://www.biopax.org/paxtools/)
Software support | Standard modelling software, such as CellDesigner or Copasi (Hoops et al, 2006); network visualization software, such as Cytoscape or VisANT | Network visualization software, such as Cytoscape

Pathway databases

Pathway databases serve as repositories of current knowledge on cell signalling. They present pathways in a graphical format comparable to the representation in text books, as well as in standard formats allowing exchange between different software platforms and further processing by network analysis, visualization and modelling tools. At present, there exists a vast variety of databases containing biochemical reactions, such as signalling pathways or protein–protein interactions. The Pathguide resource serves as a good overview of current pathway databases (Bader et al, 2006). More than 200 pathway repositories are listed, of which over 60 are specialized on reactions in human. However, only half of them provide pathways and reactions in computer-readable formats needed for automatic retrieval and processing, and even fewer support standard formats, such as Biological Pathway Exchange (BioPAX) (http://www.biopax.org) and Systems Biology Markup Language (SBML) (Hucka et al, 2003). To obtain a complete view of the biological process of interest, combination of information from diverse reactions and pathways is often needed. A recent publication (Adriaens et al, 2008) describes a workflow developed for gathering and curating all information on a pathway to obtain a broad and correct representation. However, the described process heavily relies on manual intervention.
Consequently, there is a need for the automation of both the pathway retrieval process and the integration of different data sources. This section is devoted to the description of the main pathway databases: Reactome, Kyoto Encyclopedia of Genes and Genomes (KEGG), WikiPathways, Nature Pathway Interaction Database (PID) and Pathway Commons. Table II lists all pathway databases and protein–protein interaction resources that are mentioned in this section.

Table 2. Online pathway and protein–protein interaction (PPI) databases

Pathway/PPI database | Web link | Standard exchange formats for download | Web service API
Reactome | http://www.reactome.org | BioPAX Level 2; BioPAX Level 3 (only some reactions); SBML Level 2 | SOAP web service API; detailed user manual available, example client in Java
KEGG | http://www.genome.jp/kegg/pathway.html | KGML (default format); BioPAX Level 1 (only metabolic reactions); SBML (using converter); GPML (using converter) | SOAP web service API; example client in Java, Ruby, Perl; direct import into Cytoscape
WikiPathways | http://www.wikipathways.org | GPML (default format); converters to standards, such as SBML and BioPAX are in progress | SOAP web service API; example clients in Java, Perl, Python, R
NCI/Nature Pathway Interaction Database (PID) | http://pid.nci.nih.gov | PID XML (default format); BioPAX Level 2 | Access through Pathway Commons
BioCarta | http://www.biocarta.com | BioPAX Level 2 through NCI/Nature Pathway Interaction Database (PID) |
Pathway Commons | http://www.pathwaycommons.org | BioPAX Level 2 (default format for pathways); PSI-MI (default format for protein–protein interactions) | HTTP URL-based XML web service through cPath; direct import into Cytoscape
Cancer cell map | http://cancer.cellmap.org | BioPAX Level 2 | HTTP URL-based XML web service via cPath
HumanCyc | http://humancyc.org | BioPAX Level 2; BioPAX Level 3 | Access through Pathway Commons and Pathway Tools (Karp et al, 2002)
IntAct | www.ebi.ac.uk/intact/ | PSI-MI | Access through Pathway Commons
HPRD | http://www.hprd.org | PSI-MI | Access through Pathway Commons
MINT | http://mint.bio.uniroma2.it/mint/ | PSI-MI | Access through Pathway Commons

Reactome

Reactome is currently one of the most complete and best-curated pathway databases. It covers reactions for any type of biological process and organizes them in a hierarchical manner. In this hierarchy, the lower level corresponds to single reactions, whereas the upper level represents the pathway as a whole. Reactome was first developed as an open source database for pathways and interactions in human. Equivalent reactions for other species are inferred from the human data (Vastrik et al, 2007), providing coverage of 22 non-human species, including mouse, rat, chicken, puffer fish, worm, fly, yeast, and Escherichia coli. Furthermore, other Reactome projects exist focusing on single species, such as the Arabidopsis Reactome (http://www.arabidopsisreactome.org). All pathway and reaction data in Reactome are extracted from biomedical experiments and literature. For this purpose, PhD-level biologists are invited to work together with the Reactome curators and editors on the curation of data on selected biological processes. Once the first outline of the biological process is created and annotated, it is inspected by peer reviewers and potential inconsistencies and errors are fixed. Every two years the data are reviewed to keep them up to date (Joshi-Tope et al, 2005; Matthews et al, 2009). Moreover, cross references to different databases, such as UniProt (The UniProt Consortium, 2008), Ensembl (http://www.ensembl.org/index.html), NCBI (http://www.ncbi.nlm.nih.gov), Gene Ontology (GO) (Ashburner et al, 2000), Entrez Gene (Maglott et al, 2007), UCSC Genome Browser (http://genome.ucsc.edu), HapMap (http://www.hapmap.org), PubMed, as well as to other pathway databases, such as KEGG (Kanehisa and Goto, 2000) are provided.
Pathways are presented as chains of chemical reactions and the same data model is used to describe reactions for any biological process, such as transcription, catalysis or binding (Matthews et al, 2007). Altogether, this represents a coherent view of pathway knowledge. The data model is based on classes, such as physical entity or event. Physical entities comprise proteins, DNA, RNA, small molecules but also complexes of single entities. Proteins, RNA and DNA, for which the sequence is known, are linked to the appropriate databases. Chemical entities such as small molecules are linked to ChEBI (http://www.ebi.ac.uk/chebi/init.do). An event can be either a ReactionLikeEvent, which represents reactions that convert an input into an output, or a PathwayLikeEvent, grouping together several related events. Each class possesses properties, such as information on the type of interaction (e.g. inhibition or activation). Reactome explicitly considers the different states an entity can show in a reaction. The phosphorylated and the unphosphorylated version of a protein are, for example, represented as separate entities. In addition, generalization is allowed. This means that if two different entities have exactly the same function in a reaction, such as isoenzymes, the reaction is only described once and the functional equivalent entities belong to the same defined set. Another interesting element of the Reactome data model is the use of candidate sets, which act as placeholders for all possible entities in a reaction, in case the exact entity involved in the reaction is not yet known. Reactome can either be directly browsed or queried by text search using, for instance, UniProt accession numbers. In addition, some tools for advanced queries are provided. The PathFinder tool allows connecting an input to an output molecule or event by constructing the shortest path between both. 
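The shortest-path idea behind a tool like PathFinder can be illustrated with a breadth-first search over a directed graph in which an edge means that one entity or event leads to another. This is a sketch of the general technique only, under the assumption that all steps count equally; it does not reproduce how Reactome implements PathFinder, and the node names and the function shortest_path are hypothetical.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of a shortest path or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in previous:
                previous[nxt] = node
                if nxt == goal:
                    # Walk back from the goal to the start to rebuild the path.
                    path = [goal]
                    while previous[path[-1]] is not None:
                        path.append(previous[path[-1]])
                    return path[::-1]
                queue.append(nxt)
    return None

# Hypothetical toy network, not Reactome data.
toy_network = {
    "ligand": ["receptor_complex"],
    "receptor_complex": ["adaptor", "side_branch"],
    "adaptor": ["kinase_active"],
    "side_branch": [],
}
print(shortest_path(toy_network, "ligand", "kinase_active"))
```

Breadth-first search finds a path with the fewest steps; weighting reactions differently would call for another algorithm, such as Dijkstra's.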
The SkyPainter tool can be used to identify events or pathways that are statistically over-represented for a list of genes or proteins. Moreover, Reactome data can be combined with other databases such as UniProt, by using the Reactome BioMart (http://www.biomart.org) tool. In addition to browsing pathways through the Reactome web interface, it is possible to download the data for local visualization and analysis using other tools. Different formats are provided for pathway download, including SBML Level 2, and BioPAX Level 2 and Level 3 (for some reactions only), as well as graphical formats. Pathway files, for instance, in BioPAX format can be directly opened in Cytoscape (Shannon et al, 2003), a software tool for the visualization and analysis of networks. Moreover, data can be programmatically accessed through a SOAP web service.

KEGG

KEGG is not only a database for pathways but consists of 19 highly interconnected databases, containing genomic, chemical and phenotypic information (Kanehisa and Goto, 2000; Kanehisa et al, 2008). Here we concentrate on the database storing biological pathways. KEGG categorizes its pathways into metabolic processes, genetic information processing, environmental information processing (including signalling pathways), cellular processes, human diseases and drug development. However, the best-organized and most complete information can be found for metabolic pathways. KEGG is not organism specific but covers a wide range of organisms, including human. The pathways are manually curated by experts using the literature. In addition to the interconnection of all databases underlying KEGG, links to external databases, such as NCBI Entrez Gene, OMIM, UniProt and GO are provided. Pathways can either be browsed or queried by free text search. The user can search for gene names, chemical compounds or whole pathways.
A tutorial on how to browse pathways in KEGG and an overview of the multiple representation formats is available (Aoki-Kinoshita and Kanehisa, 2007). Each pathway stored in KEGG can be downloaded in its own XML format named KGML, which is supported by VisANT, a software tool for pathway visualization (Hu et al, 2008b) and indirectly by Cytoscape using scripting plugins. In addition, metabolic pathways are available in BioPAX Level 1, which was especially designed for metabolic reactions, as well as in SBML. For converting KEGG metabolic pathways to SBML, a tool called KEGG2SBML (http://sbml.org/Software/KEGG2SBML) was developed. KEGG data can also be accessed using the KEGG API or KEGG FTP. Moreover, several applications exist for making use of the KEGG resources. KegArray, for example, allows the analysis of microarray data in the context of KEGG pathways.

WikiPathways

A recently developed resource for pathway information that strongly differs from other pathway repositories is WikiPathways. WikiPathways is an open source project based, like Wikipedia, on the MediaWiki software (Pico et al, 2008). It serves as an open and collaborative platform for creation, editing and curation of biological pathways in different species. WikiPathways aims to achieve a public commitment to pathway storage and curation by keeping pathway creation and curation processes simple. Whereas curation of the previously described databases is carried out by experts, any user with an account on WikiPathways can create new pathways and edit already existing ones. The pathway entities are linked to reference databases, based on the criteria provided by the editor. Hence, the identifiers depend on the chosen reference database and can therefore differ between pathways and even within a single pathway. Pathways in WikiPathways can be browsed by species and categories, for example, Metabolic Process.
They can also be searched using gene, protein or pathway name or any free text query. In addition, pathways can be programmatically accessed through a web service (http://www.wikipathways.org/index.php/Help:WikiPathways_Webservice). For pathway data exchange, WikiPathways does not use standard formats like BioPAX or SBML, but offers a much simpler representation called GenMAPP Pathway Markup Language (GPML) that is compatible with visualization and analysis tools, such as Cytoscape, GenMAPP (Salomonis et al, 2007) and PathVisio (van Iersel et al, 2008). The use of GPML is in agreement with the community annotation nature of the project, as it offers a simple pathway representation and several functionalities for building network diagrams. However, inter-operability with other pathway databases is impeded, and substantial efforts towards combining WikiPathways with the other pathway repositories will be required. In this regard, some approaches with the objective of conversion between GPML and standard pathway exchange formats, such as SBML and BioPAX, are under development (Evelo, 2009). In addition, KEGG pathways in KGML format are also available in GPML format ready for download (http://www.pathvisio.org/Download#Step_3) or can be converted into GPML (http://www.bigcat.unimaas.nl/tracprojects/pathvisio/wiki/KeggConverter). The exponential growth of biological data poses a challenge to the high-quality annotation and curation of databases. In this scenario, the use of wikis for community curation of biological data has emerged in the past years with the goal of increasing the quality of data annotation by combining knowledge from multiple experts (Giles, 2007; Waldrop, 2008; Hu et al, 2008a). However, their success will strongly depend on the commitment of the community, and the WikiPathways authors describe the initiative as an experiment in which the ‘community curation’ approach is being tested (Pico et al, 2008).
Thus, WikiPathways can be seen as a complementary and enhancing source of information for the major pathway databases, like Reactome or KEGG. In contrast to the aforementioned databases, the systems described below combine diverse pathway repositories, and can be seen as first attempts towards the integration of pathway information from various sources.
Nature pathway interaction database
PID contains data on cell signalling in humans (Schaefer et al, 2009). PID combines three different sources: the NCI-curated pathways that are obtained from peer reviewed literature, as well as pathways imported from Reactome and BioCarta. Similar to Reactome, PID structures pathways hierarchically into pathways and their sub-pathways that are called sub-networks in PID. The PID data model is based on molecular interactions in which input biomolecules are transformed into output biomolecules. Each process can be promoted or inhibited by regulators. Biomolecules are proteins, RNA, complexes or small molecules. DNA is not a part of the PID data model and only output RNA and regulator are represented in transcriptional processes. Each protein is cross-referenced to UniProt, RNA to Entrez Gene, small molecules to the Chemical Abstracts Service (CAS) registry number and complexes are annotated using GO terms. Different states of biomolecules, such as ‘active/inactive’ or ‘phosphorylated’ are part of the annotations of the biomolecule. Cellular location
DOI: 10.1371/journal.pone.0020284
2011
Cited 168 times
Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases
Scientists have been trying to understand the molecular mechanisms of diseases to design preventive and therapeutic strategies for a long time. For some diseases, it has become evident that it is not enough to obtain a catalogue of the disease-related genes; one must also uncover how disruptions of molecular networks in the cell give rise to disease phenotypes. Moreover, with the unprecedented wealth of information available, even obtaining such a catalogue is extremely difficult. We developed a comprehensive gene-disease association database by integrating associations from several sources that cover different biomedical aspects of diseases. In particular, we focus on the current knowledge of human genetic diseases including mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, we performed a systematic study of the emergent properties of human gene-disease networks by means of network topology and functional annotation analysis. The results indicate a highly shared genetic origin of human diseases and show that for most diseases, including mendelian, complex and environmental diseases, functional modules exist. Moreover, a core set of biological pathways is found to be associated with most human diseases. We obtained similar results when studying clusters of diseases, suggesting that related diseases might arise due to dysfunction of common biological processes in the cell. For the first time, we include mendelian, complex and environmental diseases in an integrated gene-disease association database and show that the concept of modularity applies for all of them. We furthermore provide a functional analysis of disease-related modules providing important new biological insights, which might not be discovered when considering each of the gene-disease association repositories independently.
Hence, we present a suitable framework for the study of how genetic and environmental factors, such as drugs, contribute to diseases. The gene-disease networks used in this study and part of the analysis are available at http://ibi.imim.es/DisGeNET/DisGeNETweb.html#Download.
DOI: 10.1038/clpt.2013.47
2013
Cited 151 times
Pharmacovigilance Using Clinical Notes
Clinical Pharmacology & Therapeutics, Volume 93, Issue 6, p. 547-555. Open Access.
Authors: P LePendu (corresponding author), S V Iyer, A Bauer-Mehren, R Harpaz, J M Mortensen, T Podchiyska, T A Ferris, N H Shah. All authors are at the Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA, except T Podchiyska and T A Ferris (Stanford Center for Clinical Informatics, Stanford University, Stanford, California, USA).
First published: 04 March 2013. https://doi.org/10.1038/clpt.2013.47
Abstract
With increasing adoption of electronic health records (EHRs), there is an opportunity to use the free-text portion of EHRs for pharmacovigilance. We present novel methods that annotate the unstructured clinical notes and transform them into a deidentified patient–feature matrix encoded using medical terminologies. We demonstrate the use of the resulting high-throughput data for detecting drug–adverse event associations and adverse events associated with drug–drug interactions.
We show that these methods flag adverse events early (in most cases before an official alert), allow filtering of spurious signals by adjusting for potential confounding, and compile prevalence information. We argue that analyzing large volumes of free-text clinical notes enables drug safety surveillance using an as yet untapped data source. Such data mining can be used for hypothesis generation and for rapid analysis of suspected adverse event risk.
Clinical Pharmacology & Therapeutics (2013); 93(6), 547–555. doi:10.1038/clpt.2013.47
Phase IV surveillance is a critical component of drug safety because not all safety issues associated with drugs are detected before market approval. Each year, drug-related events account for up to 50% of adverse events occurring in hospital stays,1 significantly increasing costs and length of stay in hospitals.2 As much as 30% of all drug reactions result from concomitant use—with an estimated 29.4% of elderly patients on six or more drugs.3 Efforts such as the Sentinel Initiative and the Observational Medical Outcomes Partnership4 envision the use of electronic health records (EHRs) for active pharmacovigilance.5,6,7 Complementing the current state of the art—based on reports of suspected adverse drug reactions—active surveillance aims to monitor drugs in near real time and potentially shorten the time that patients are at risk.
Coded discharge diagnoses and insurance claims data from EHRs have already been used for detecting safety signals.8,9,10 However, some experts argue that methods that rely on coded data could be missing >90% of the adverse events that actually occur, in part because of the nature of billing and claims data.1 Researchers have used discharge summaries (which summarize information from a care episode, including the final diagnosis and follow-up plan) for detecting a range of adverse events11 and for demonstrating the feasibility of using the EHR for pharmacovigilance by identifying known adverse events associated with seven drugs using 25,074 notes from 2004.12 Therefore, the clinical text can potentially play an important role in future pharmacovigilance,13,14 particularly if we can transform notes taken daily by doctors, nurses, and other practitioners into more accessible data-mining inputs.15,16,17 Two key barriers to using clinical notes are privacy and accessibility.16 Clinical notes contain identifying information, such as names, dates, and locations, that are difficult to redact automatically, so care organizations are reluctant to share clinical notes. We describe an approach that computationally processes clinical text rapidly and accurately enough to serve use cases such as drug safety surveillance. Like other terminology-based systems, it deidentifies the data as part of the process.18 We trade the "unreasonable effectiveness"24 of large data sets in exchange for sacrificing some individual note-level accuracy in the text processing. Given the large volumes of clinical notes, our method produces a patient–feature matrix encoded using standardized medical terminologies. We demonstrate the use of the resulting patient–feature matrix as a substrate for signal detection algorithms for drug–adverse event associations and drug–drug interactions. 
RESULTS
Our results show that it is possible to detect drug safety signals using clinical notes transformed into a feature matrix encoded using medical terminologies. We evaluate the performance of the resulting data set for pharmacovigilance using curated reference sets of single-drug adverse events as well as adverse events related to drug–drug interactions. In addition, we show that we can simultaneously estimate the prevalence of adverse events resulting from drug–drug interactions. The reference set, described in the Methods section, contains 28 positive associations and 165 negative associations spanning 78 drugs and 12 different events for single drug–adverse event associations. For the drug–drug interactions, the reference set contains 466 positive and 466 negative associations spanning 333 drugs across 10 events.
Feasibility of detecting drug–adverse event associations
To demonstrate the feasibility of using free text–derived features for detecting drug–adverse event associations, we reproduce the well-known association between rofecoxib and myocardial infarction. Rofecoxib was taken off the market because of the increased risk of heart attack and stroke.19,20 We compute an association between rofecoxib and myocardial infarction, keeping track of the temporal order of the diagnosis of rheumatoid arthritis, exposure to the drug, and occurrence of an adverse event as described in the Methods section.
Using data up to 2005, we obtain an odds ratio (OR) of 1.31 (95% confidence interval (CI): 1.16–1.45) for the association, which agrees with previously reported results.19,20 In a previous study, we compared using clinical notes with using the codes from the International Classification of Diseases, Ninth Revision (ICD-9), and found no association (OR: 1.71; 95% CI: 0.74–3.53) using the coded data.21 This is probably due to undercoding: for patients to be counted as exposed requires a prior arthritis indication, and approximately one-third of the patients meet that criterion.
Performance of detecting adverse drug events
Figure 1 shows the adjusted ORs and 95% CIs for the 28 true-positive associations from our single drug–adverse event reference set. As expected, the results show some variation by event across the adverse events.10 Figure 2a shows the overall performance for detecting associations between a single drug and its adverse event, with an area under the receiver operating characteristic curve (AUC) of 75.3% (unadjusted) and 80.4% (adjusted). A threshold of 1.0 (a commonly used cutoff) on the lower bound of the 95% CI of the adjusted ORs translates to 39% sensitivity and 97.5% specificity. Choosing a signaling threshold, defined using minimum specificity of 90%, based on the receiver operating characteristic curve, yields a cutoff of 1.18 (unadjusted) and 0.84 (adjusted) on the lower bound of the 95% CI. Supplementary Data S1 online lists all adjusted results, and Supplementary Data S2 online lists the AUC threshold data.
Figure 1. Adjusted odds ratios (ORs) for positive cases in the single drug–adverse event set. Results show some variability by event.
The 28 positive cases include the following events: myocardial infarction (mi), rhabdomyolysis (rhabd), cardiovascular fibrosis (cvf), acute renal failure (arf), QT prolongation (qt), urinary bladder cancer (ubc), progressive multifocal leukoencephalopathy (pml), aplastic anemia (aa), and venous thrombosis (vt). Some associations are off the scale, and we indicate the OR in parentheses above the line (one exception, Natalizumab–pml (232), is not shown at all due to extreme scale: OR: 79.5; 95% CI: 30.8–270.4). We also include the number of exposed patients in parentheses for each drug–adverse event pair. Typically, a signal occurs when the lower bound of the confidence interval exceeds 1.0; however, this threshold may have different optimal settings on the basis of the event.
Figure 2. Performance of adverse drug reactions and drug–drug interaction detection. Overall performance is measured using areas under the receiver operating characteristic curve (AUCs). (a) The unadjusted (blue) vs. adjusted (red) methods yield AUCs of 75.3 and 80.4% overall. (b) For drug interactions, the adjusted methods (red) reach 81.5% AUC.
Profiling drug–adverse event associations over time
Figure 3 shows the cumulative ORs and exposures over time based on the unadjusted associations for the 10 drugs in our reference set that have had an alert in the past decade. Using a threshold of 1.0 on the lower bound of the CI for the association, we would flag six of nine alerts earlier than the official date (we do not have enough data for one drug, troglitazone). By comparison, the propensity-adjusted method would catch three of the alerts early. The unadjusted associations can flag signals worth investigating, and the adjusted associations may reduce false alarms.
Figure 3. Cumulative (unadjusted) odds and exposure plots for 10 positive cases involving US Food and Drug Administration (FDA) intervention.
Signals are flagged earlier than official alerts in six of nine cases (troglitazone excluded for lack of sufficient exposure). The solid red line is the odds ratio (OR), and the dotted red lines are the confidence intervals (CIs). The solid blue line is the exposure rate. The shaded area marks the period for which FDA intervention applies (e.g., withdrawal). The point estimate marks the earliest year and OR when the lower bound of the 95% CI is above a threshold of 1.0, i.e., when the unadjusted method would flag the drug for monitoring. As more data accumulate and exposure increases, patterns often converge toward more confident signals. cvf, cardiovascular fibrosis; mi, myocardial infarction; qt, QT prolongation; rhabd, rhabdomyolysis; ubc, urinary bladder cancer.
Performance of detecting adverse drug–drug interactions
Figure 2b shows the performance (AUC of 81.5%) for detecting known adverse events arising from drug–drug interactions. Adjusting the associations for potential confounding improves the signal detection capability (red curves in Figure 2b).22 In the drug–drug interaction scenario, we do not constrain by drug indications because of combinatorial complexity. We obtain 52% sensitivity at 91% specificity, using 1.0 as a threshold on the lower bound of the CI for the adjusted associations.
Estimating the prevalence of adverse events
Population-level prevalence data for adverse events are hard to come by. For single drugs, sources such as Side Effect Resource provide information on the frequency of specific adverse events from the drug product label. No such comparable resource exists for adverse events arising from drug–drug interactions. While performing the drug–adverse event association calculations using data from a clinical data warehouse, we can in parallel estimate the prevalence of adverse events associated with drug–drug interactions.
For example, we found that 42.8% (176 of 411) of patients on both levodopa and lorazepam experience parkinsonian symptoms, 19.8% (140 of 707) of patients on paclitaxel and trastuzumab experience neutropenia, and 17.8% (796 of 4,467) of patients on amiodarone and metoprolol experience bradycardia.
DISCUSSION
We have demonstrated that adverse drug events as well as adverse events associated with drug–drug interactions can be detected using a deidentified patient–feature matrix extracted from free-text clinical documents. Blumenthal and others5 envision a scenario in which a new drug comes to market and a nationwide learning system monitors for safety signals. Our results show that deidentified clinical notes can be used to generate drug safety signals—taking a step toward such a scenario. In addition, the patient–feature matrix also provides prevalence data not available from other data sources (e.g., spontaneous reports). Having such prevalence information can assist in prioritizing actionable events and reducing alert fatigue.23 Our approach to processing clinical notes is simple in comparison with advanced natural language processing (NLP) systems that may have better accuracy in identifying nuanced attributions of disease conditions. We sacrifice some individual note-level accuracy in exchange for the ability to detect population-level trends against massive data sets. Our results, based on a reference set of known drug–event pairs, show that when exposure data are numerous enough, the use of relatively simple text mining with standard association strength tests for signal detection can work, reflecting the adage in the machine-learning community that "a dumb algorithm with lots of data beats a clever one with modest amounts of it."24,25 When used in combination with other data sets, clinical notes may address cases that otherwise pass undetected.
We sacrifice sensitivity for specificity because for a new approach, and a new data source (clinical notes), keeping false-discovery rates low is important, particularly in the initial stages of establishing feasibility. We find that ontologies are an excellent source of features and allow systematic normalization and aggregation when the feature set needs reduction.15,26 For example, we can count all patients who experience cardiac arrhythmias as patients with arrhythmias because of the hierarchical relationships. Therefore, ontology hierarchies can organize a very large number of terms into a smaller feature set. Moreover, because names, dates, and locations are not present in the clinical terminologies, those are not extracted as features by dictionary-based methods.18,27 We believe that the information embedded in text is crucial for leveraging EHR data,10,13,14 particularly for rare events for which large amounts of data are needed. Our annotation-based approach produces a feature matrix that complements other structured data such as codes from the ICD-9. Of note, our methods are not dependent on any particular NLP tool (we contrast MGREP and UNITEX in the Methods section), and we expect results to improve given the availability of better and faster clinical NLP tools.28,29 We are currently collaborating with researchers at the Mayo Clinic to improve the speed of the clinical Text Analysis and Knowledge Extraction System,29 one of the state-of-the-art NLP tools available for clinical text. Broader availability of curated clinical NLP data sets and health outcome definitions would accelerate research and validation. Our work has several limitations and opportunities for improvement. Not all conditions are equally identifiable from text using lexical approaches (Supplementary Data S3 online reports validation results by condition). Advanced NLP tools would improve accuracy in these cases. 
Biases in our reference set—although among the largest used for such a study—affect our performance estimation. A new reference standard covering four events has just recently been released by the Observational Medical Outcomes Partnership,4 and we are currently evaluating its utility. Some adverse drug events are dose dependent, and our methods currently ignore this information. The UNITEX tool, described in the Methods section, includes libraries for dosage extraction and thus is a logical next step. We do not distinguish between new users of drugs and existing or chronic ones. Our methods have a limited ability to define eras (durations of medication and illness). We are currently examining the annotation data for the utility of the last mention of a concept, sentence-level co-occurrences, and temporal density of mentions to address this question. The majority of our findings are based on the Stanford Hospital and Clinics, which is a tertiary-care center representing a skewed population. At the same time, this population has added utility for investigating rare events. Variations in signaling thresholds can also occur as a result of the prevalence or rarity of an event,10 and more research is needed to adapt detection algorithms accordingly. The prevalence data estimated in studies such as ours are an important step in this direction.10 Finally, we note that the Observational Medical Outcomes Partnership group suggests that no single method works best uniformly, that different methods be considered for each event and data source, and that profiling performance via receiver operating characteristic curves assists in understanding the utility of a method or data source.4 To conclude, our method extracts from textual clinical notes a deidentified patient–feature matrix encoded using standardized medical terminologies. 
We have demonstrated the use of the resulting patient–feature matrix as a substrate for detecting single drug–adverse event associations (AUC of 80.4%) and for detecting adverse events associated with drug–drug interactions (AUC of 81.5%), illustrating that clinical notes can be a source for detecting drug safety signals at scale.15 The patient–feature matrix can also be used to learn off-label usage30 and to discern drug adverse events from indications.31 Using the textual contents of the EHR complements efforts using billing and claims data or spontaneous reports4,8,14,32,33 and opens up new opportunities for leveraging observational data.
METHODS
Data sources. Our primary data source was the Stanford Translational Research Integrated Database Environment,34 which spans 18 years of patient data from 1.8 million patients; it contains 19 million encounters, 35 million coded ICD-9 diagnoses, and >11 million unstructured clinical notes, which are a combination of pathology, radiology, and transcription reports. The gender split is ~60% female; the average age is 44 with an SD of 25.
The reference standard. We created reference standards of known drug–adverse event associations for testing the performance of our methods in detecting drug safety signals from text. Supplementary Data S4 online lists the single drug–event reference set. For the single-drug adverse events, our reference set included 12 distinct events worth monitoring35 and 78 distinct drugs, 28 positive cases, and 165 negative cases. We started with a validation set from the European Union adverse drug event project (EU-ADR)36 and to that set, we added 10 drug safety signals that involved US Food and Drug Administration intervention in the past decade, manually curating these from the literature and cross-referencing with the agency's website.
We established our false-discovery rate by generating a set of negative associations by creating all combinations of drugs and events and subtracting any known associations that were identified by any one of the EU-ADR filtering workflows,37 the Medi-Span (Wolters Kluwer Health, Indianapolis, IN) Adverse Drug Effects Database, or the Side Effect Resource database.38 For the two-drug case, known drug–drug interactions were extracted (and manually validated) from textual monographs in DrugBank and the Medi-Span Drug Therapy Monitoring System. In this case, we simulated the negative set by associating drug pairs with a randomly chosen event, removing any cases that were already known to be associated on the basis of external knowledge (DrugBank, Medi-Span, Drugs.com, Unified Medical Language System (UMLS), or Side Effect Resource). This reference set included 10 distinct events, 333 distinct drugs, 466 positive cases, and 466 negative cases. Testing for drug safety signals. We followed a two-step process for detecting drug safety signals: first, we computed a raw association in the form of an unadjusted OR, followed by adjustment for potential confounders. The first step is useful for flagging putative signals, and the second step is useful in reducing false alarms. In the first step, we computed unadjusted ORs and 95% CIs by constructing a 2 × 2 contingency table26,33 from the patient–feature matrix. On the basis of first mentions of drug, event, and indication and their temporal order, we assigned patients to specific cells of a 2 × 2 contingency table as shown in Figure 4 (see also Supplementary Data S5 online). The temporal information in the patient–feature matrix is critical for determining whether the event follows exposure.39 Patients having no mention of the indication at any time are excluded from the analysis (see Supplementary Data S6 online for those patients being excluded). 
Using data following the indication, and not counting repeat mentions, the ordering of the drug and event determined into which cell of a 2 × 2 matrix the patient fell. Because all unexposed patients have the indication, they could be on an alternative drug or other treatment, or none at all.
Figure 4. Assignment of patients to 2 × 2 contingency tables. Patients are assigned to cells a, b, c, and d of a 2 × 2 contingency table (C) on the basis of the patterns shown in parts (A) and (B). In the patterns, indications are abbreviated with "I", drugs with "D", and outcomes or events with "E." A patient exposed to the drug is counted in cells "a" or "b" depending on whether the outcome occurs after the drug exposure, based on temporal ordering of first mentions of the I, D, and E. Other patients (i.e., unexposed) are placed in the bottom row of the 2 × 2 contingency table in cells "c" or "d" depending on whether the outcome occurred in the observation duration after the indication. Therefore, for example, an indication followed by a drug and then an event would go into the "a" cell. An indication followed by no drug mention but having an occurrence of the event would go into cell "c." For drug–drug interactions, we do not restrict the assignment on the basis of the indications. Therefore, patients with mentions of both drugs (in either order) before an event would go into the "a" cell.
In the second step, we adjusted for confounding by specific patient factors. We included age, gender, race, and comorbidity and coprescription frequency (as a surrogate for overall health status) in calculating the propensity score.9 The propensity score quantified the likelihood of a patient to be exposed to a drug. Patients with known indications were matched (exposed vs. unexposed) via the propensity score.
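The contingency-table step described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names assign_cell and odds_ratio_ci are hypothetical, first mentions are modeled as integer time stamps (None when a concept is never mentioned), patients whose first drug mention precedes the indication are simply excluded, and the 95% CI uses the standard log (Woolf) method, which the text does not specify.

```python
import math
from collections import Counter
from typing import Optional, Tuple


def assign_cell(indication: Optional[int],
                drug: Optional[int],
                event: Optional[int]) -> Optional[str]:
    """Map first-mention time stamps to a 2 x 2 cell ('a'..'d') or None (excluded)."""
    if indication is None:
        return None  # no mention of the indication: excluded from the analysis
    if drug is not None and drug < indication:
        return None  # assumption of this sketch: drug before indication is excluded
    if drug is not None:
        # Exposed: 'a' if the event follows the drug, otherwise 'b'.
        return "a" if event is not None and event > drug else "b"
    # Unexposed: 'c' if the event follows the indication, otherwise 'd'.
    return "c" if event is not None and event > indication else "d"


def odds_ratio_ci(a: int, b: int, c: int, d: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Unadjusted odds ratio with a log-method (Woolf) confidence interval."""
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the log method")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Toy patients as (indication, drug, event) first-mention times; made-up data.
patients = [(1, 2, 3), (1, 2, None), (1, None, 5), (1, None, None), (None, 2, 3)]
cells = (assign_cell(*p) for p in patients)
counts = Counter(cell for cell in cells if cell is not None)

# Made-up cell counts, used only to exercise the formula.
or_value, lower, upper = odds_ratio_ci(a=20, b=80, c=10, d=90)
signal = lower > 1.0  # flag only when the lower CI bound exceeds 1.0
```

With the threshold of 1.0 on the lower bound of the CI that the text describes as a commonly used cutoff, the made-up counts above would not be flagged, because the interval includes 1.0 even though the point estimate is above it.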
Finally, we included the propensity score as a covariate in logistic regression to compute adjusted ORs and 95% CIs using the coefficients of the regression model. We used the Matching and Survival packages in R.40 For single drug–event associations, we identified the indications of the drug using the Medi-Span Drug Indications Database and the National Drug File–Reference Terminology. In the drug–drug interaction scenario, the key idea is to determine whether the association of the event with the combination of the two drugs outweighs any association of the event with either one of the drugs alone (or none at all). Including the indications adds a degree of combinatorial complexity, so we focused primarily on the temporal order of the two drugs and event (Figure 4b) without restricting by the indications of the drugs.
Generating the patient–feature matrix. Our annotator workflow, described previously,21,30 uses ~5.6 million strings from existing terminologies; filters unambiguous terms that are predominantly noun phrases representing drugs, diseases, devices, or procedures; uses the cleaned up lexicon for term recognition in the clinical notes to tag or annotate41 the text; excludes negated terms or terms that apply to family and medical history;42 normalizes all terms using the ontology hierarchies; and finally uses the time stamps of the note to produce a deidentified, temporally ordered patient–feature matrix. The process is summarized in Figure 5 and the individual steps are detailed below.
Figure 5. Generation of the patient–feature matrix.
The workflow (1) starts by downloading ~5.6 million strings for every term in ontologies from both the Unified Medical Language System (UMLS) and BioPortal, as well as all trigger terms from NegEx and ConText; (2) uses term frequency and syntactic type information (e.g., predominant noun phrases) from MedLine to prune the set of strings into a clean lexicon; (3) applies the lexicon directly against the textual notes using exact string matching; (4) applies NegEx and ConText rules to filter negation and family history contexts; (5) applies UMLS Metathesaurus and BioPortal mappings and semantic type information to normalize terms into concepts that are grouped by drug, disease, device, or procedure; and (6) results finally in the patient–feature matrix. Each row of the matrix represents a single note that is linked to a single patient, and the time stamps of the notes induce a temporal ordering over the entire patient–feature matrix. Using biomedical ontologies for text annotation. We use existing ontologies as a source of (i) a lexicon of strings that are grouped together and linked to over a million concepts via synonymy (referred to as mappings) and (ii) a hierarchy of >14 million parent–child relationships among those concepts. We use the lexicon to recognize terms in the input text using a tool called MGREP,41 which also tracks the relative position at which each term occurs (Figures 5 and 6). In addition to clinical terms, based on the ConText system,42 we include terms corresponding to contextual cues called "triggers" in our lexicon. Cues such as "denies," "no sign of," and "father has a history of" are used in a postprocessing step to identify terms that are negated or that apply to family or medical history. Terms that correspond to mentions in these contexts are ignored; thus, the subsequent analysis relies on positive, present mentions of concepts. Figure 6. Sample annotations. 
(a) A discharge summary is encoded internally using (b) a highly compressed, numerical representation. The strings in parentheses are keyed to the first column of numbers and are included merely for illustration purposes. (c) The annotations keep track of relative positional information and are so rich owing to the vast lexicon that if we reconstruct the note, very little of the useful information is lost (notice the section headers). The blank areas in the reconstruction represent terms that are not recognized, and terms highlighted in red denote ones that will not be attributed to the present patient because of contextual cues (e.g., family history and negated findings). CABG, coronary artery bypass graft; COPD, chronic obstructive pulmonary disease; CT, computed tomography. The resulting annotations for the Stanford Translational Research Integrated Database Environment data set comprise ~3.75 billion records. It takes 1 hour to generate annotations from 3 million documents using a sin
DOI: 10.1136/amiajnl-2013-001612
2014
Cited 138 times
Mining clinical text for signals of adverse drug-drug interactions
Background and objective Electronic health records (EHRs) are increasingly being used to complement the FDA Adverse Event Reporting System (FAERS) and to enable active pharmacovigilance. Over 30% of all adverse drug reactions are caused by drug–drug interactions (DDIs) and result in significant morbidity every year, making their early identification vital. We present an approach for identifying DDI signals directly from the textual portion of EHRs.
DOI: 10.1007/s10654-010-9432-x
2010
Cited 78 times
Prediction of 60 day case-fatality after aneurysmal subarachnoid haemorrhage: results from the International Subarachnoid Aneurysm Trial (ISAT)
Aneurysmal subarachnoid haemorrhage (aSAH) is a devastating event with substantial case-fatality. Our purpose was to examine which clinical and neuro-imaging characteristics, available on admission, predict 60 day case-fatality in aSAH and to evaluate performance of our prediction model. We performed a secondary analysis of patients enrolled in the International Subarachnoid Aneurysm Trial (ISAT), a randomised multicentre trial to compare coiling with clipping in aSAH patients. Multivariable logistic regression analysis was used to develop a prognostic model to estimate the risk of dying within 60 days from aSAH based on clinical and neuro-imaging characteristics. The model was internally validated with bootstrapping techniques. The study population comprised 2,128 patients who had been randomised to either endovascular coiling or neurosurgical clipping. In this population 153 patients (7.2%) died within 60 days. World Federation of Neurosurgical Societies (WFNS) grade was the most important predictor of case-fatality, followed by age, lumen size of the aneurysm and Fisher grade. The model discriminated reasonably between those who died within 60 days and those who survived (c statistic = 0.73), with minor optimism according to bootstrap re-sampling (optimism corrected c statistic = 0.70). Several strong predictors are available to predict 60 day case-fatality in aSAH patients who survived the early stage up till a treatment decision; after external validation these predictors could eventually be used in clinical decision making.
DOI: 10.1371/journal.pone.0063499
2013
Cited 72 times
Practice-Based Evidence: Profiling the Safety of Cilostazol by Text-Mining of Clinical Notes
Peripheral arterial disease (PAD) is a growing problem with few available therapies. Cilostazol is the only FDA-approved medication with a class I indication for intermittent claudication, but carries a black box warning due to concerns for increased cardiovascular mortality. To assess the validity of this black box warning, we employed a novel text-analytics pipeline to quantify the adverse events associated with Cilostazol use in a clinical setting, including patients with congestive heart failure (CHF). We analyzed the electronic medical records of 1.8 million subjects from the Stanford clinical data warehouse spanning 18 years using a novel text-mining/statistical analytics pipeline. We identified 232 PAD patients taking Cilostazol and created a control group of 1,160 PAD patients not taking this drug using 1:5 propensity-score matching. Over a mean follow up of 4.2 years, we observed no association between Cilostazol use and any major adverse cardiovascular event including stroke (OR = 1.13, CI [0.82, 1.55]), myocardial infarction (OR = 1.00, CI [0.71, 1.39]), or death (OR = 0.86, CI [0.63, 1.18]). Cilostazol was not associated with an increase in any arrhythmic complication. We also identified a subset of CHF patients who were prescribed Cilostazol despite its black box warning, and found that it did not increase mortality in this high-risk group of patients. This proof of principle study shows the potential of text-analytics to mine clinical data warehouses to uncover 'natural experiments' such as the use of Cilostazol in CHF patients. We envision this method will have broad applications for examining difficult to test clinical hypotheses and to aid in post-marketing drug safety surveillance. Moreover, our observations argue for a prospective study to examine the validity of a drug safety warning that may be unnecessarily limiting the use of an efficacious therapy.
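To make the idea of 1:5 propensity-score matching concrete, here is a small Python sketch. The abstract does not say which matching algorithm was used, so greedy nearest-neighbour matching without replacement is an assumption, as are the function name, the `caliper` parameter, and the toy scores.

```python
from typing import Dict, List, Optional


def match_one_to_k(treated: Dict[str, float],
                   controls: Dict[str, float],
                   k: int = 5,
                   caliper: Optional[float] = None) -> Dict[str, List[str]]:
    """Greedy 1:k nearest-neighbour matching on the propensity score.

    treated / controls map a patient id to a propensity score.
    Each control is used at most once (matching without replacement).
    """
    available = dict(controls)
    matches: Dict[str, List[str]] = {}
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: kv[1]):
        # Rank the remaining controls by distance to this treated patient.
        ranked = sorted(available, key=lambda c: abs(available[c] - t_ps))
        chosen = [c for c in ranked
                  if caliper is None or abs(available[c] - t_ps) <= caliper][:k]
        for c in chosen:
            del available[c]  # remove so the control cannot be reused
        matches[t_id] = chosen
    return matches


# Hypothetical scores, matched 1:2 to keep the example short.
pairs = match_one_to_k({"t1": 0.30},
                       {"c1": 0.10, "c2": 0.28, "c3": 0.31, "c4": 0.90},
                       k=2)
```

With k = 5, 232 treated patients would yield at most 1,160 matched controls, which is the control-group size reported above.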
DOI: 10.1007/978-1-62703-450-0_3
2013
Cited 47 times
Integration of Genomic Information with Biological Networks Using Cytoscape
Cytoscape is an open-source software for visualizing, analyzing, and modeling biological networks. This chapter explains how to use Cytoscape to analyze the functional effect of sequence variations in the context of biological networks such as protein-protein interaction networks and signaling pathways. The chapter is divided into five parts: (1) obtaining information about the functional effect of sequence variation in a Cytoscape readable format, (2) loading and displaying different types of biological networks in Cytoscape, (3) integrating the genomic information (SNPs and mutations) with the biological networks, and (4) analyzing the effect of the genomic perturbation onto the network structure using Cytoscape built-in functions. Finally, we briefly outline how the integrated data can help in building mathematical network models for analyzing the effect of the sequence variation onto the dynamics of the biological system. Each part is illustrated by step-by-step instructions on an example use case and visualized by many screenshots and figures.
DOI: 10.1371/journal.pone.0072148
2013
Cited 41 times
Drug-Induced Acute Myocardial Infarction: Identifying ‘Prime Suspects’ from Electronic Healthcare Records-Based Surveillance System
Drug-related adverse events remain an important cause of morbidity and mortality and impose huge burden on healthcare costs. Routinely collected electronic healthcare data give a good snapshot of how drugs are being used in 'real-world' settings. To describe a strategy that identifies potentially drug-induced acute myocardial infarction (AMI) from a large international healthcare data network. Post-marketing safety surveillance was conducted in seven population-based healthcare databases in three countries (Denmark, Italy, and the Netherlands) using anonymised demographic, clinical, and prescription/dispensing data representing 21,171,291 individuals with 154,474,063 person-years of follow-up in the period 1996-2010. Primary care physicians' medical records and administrative claims containing reimbursements for filled prescriptions, laboratory tests, and hospitalisations were evaluated using a three-tier triage system of detection, filtering, and substantiation that generated a list of drugs potentially associated with AMI. Outcome of interest was statistically significant increased risk of AMI during drug exposure that has not been previously described in current literature and is biologically plausible. Overall, 163 drugs were identified to be associated with increased risk of AMI during preliminary screening. Of these, 124 drugs were eliminated after adjustment for possible bias and confounding. 
With subsequent application of criteria for novelty and biological plausibility, association with AMI remained for nine drugs ('prime suspects'): azithromycin; erythromycin; roxithromycin; metoclopramide; cisapride; domperidone; betamethasone; fluconazole; and megestrol acetate. Although global health status, co-morbidities, and time-invariant factors were adjusted for, residual confounding cannot be ruled out. A strategy to identify potentially drug-induced AMI from electronic healthcare data has been proposed that takes into account not only statistical association, but also public health relevance, novelty, and biological plausibility. Although this strategy needs to be further evaluated using other healthcare data sources, the list of 'prime suspects' makes a good starting point for further clinical, laboratory, and epidemiologic investigation.
DOI: 10.1002/pds.3375
2012
Cited 40 times
The EU‐ADR Web Platform: delivering advanced pharmacovigilance tools
Purpose: Pharmacovigilance methods have advanced greatly during the last decades, making post‐market drug assessment an essential drug evaluation component. These methods mainly rely on the use of spontaneous reporting systems and health information databases to collect expertise from huge amounts of real‐world reports. The EU‐ADR Web Platform was built to further facilitate accessing, monitoring and exploring these data, enabling an in‐depth analysis of adverse drug reactions risks. Methods: The EU‐ADR Web Platform exploits the wealth of data collected within a large‐scale European initiative, the EU‐ADR project. Millions of electronic health records, provided by national health agencies, are mined for specific drug events, which are correlated with literature, protein and pathway data, resulting in a rich drug–event dataset. Next, advanced distributed computing methods are tailored to coordinate the execution of data‐mining and statistical analysis tasks. This permits obtaining a ranked drug–event list, removing spurious entries and highlighting relationships with high risk potential. Results: The EU‐ADR Web Platform is an open workspace for the integrated analysis of pharmacovigilance datasets. Using this software, researchers can access a variety of tools provided by distinct partners in a single centralized environment. Besides performing standalone drug–event assessments, they can also control the pipeline for an improved batch analysis of custom datasets. Drug–event pairs can be substantiated and statistically analysed within the platform's innovative working environment. Conclusions: A pioneering workspace that helps in explaining the biological path of adverse drug reactions was developed within the EU‐ADR project consortium. This tool, targeted at the pharmacovigilance community, is available online at https://bioinformatics.ua.pt/euadr/ . Copyright © 2012 John Wiley & Sons, Ltd.
DOI: 10.1136/amiajnl-2014-002902
2014
Cited 37 times
Functional evaluation of out-of-the-box text-mining tools for data-mining tasks
The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications. We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice.
DOI: 10.1158/1055-9965.epi-17-0461
2018
Cited 31 times
Regulatory T-cell Genes Drive Altered Immune Microenvironment in Adult Solid Cancers and Allow for Immune Contextual Patient Subtyping
Background: The tumor microenvironment is an important factor in cancer immunotherapy response. To further understand how a tumor affects the local immune system, we analyzed immune gene expression differences between matching normal and tumor tissue. Methods: We analyzed public and new gene expression data from solid cancers and isolated immune cell populations. We also determined the correlation between CD8, FoxP3 IHC, and our gene signatures. Results: We observed that regulatory T cells (Tregs) were one of the main drivers of immune gene expression differences between normal and tumor tissue. A tumor-specific CD8 signature was slightly lower in tumor tissue compared with normal of most (12 of 16) cancers, whereas a Treg signature was higher in tumor tissue of all cancers except liver. Clustering by Treg signature found two groups in colorectal cancer datasets. The high Treg cluster had more samples that were consensus molecular subtype 1/4, right-sided, and microsatellite-instable, compared with the low Treg cluster. Finally, we found that the correlation between signature and IHC was low in our small dataset, but samples in the high Treg cluster had significantly more CD8+ and FoxP3+ cells compared with the low Treg cluster. Conclusions: Treg gene expression is highly indicative of the overall tumor immune environment. Impact: In comparison with the consensus molecular subtype and microsatellite status, the Treg signature identifies more colorectal tumors with high immune activation that may benefit from cancer immunotherapy. Cancer Epidemiol Biomarkers Prev; 27(1); 103-12. ©2017 AACR.
DOI: 10.1371/journal.pcbi.1002457
2012
Cited 36 times
Automatic Filtering and Substantiation of Drug Safety Signals
Drug safety issues pose serious health threats to the population and constitute a major cause of mortality worldwide. Due to the prominent implications to both public health and the pharmaceutical industry, it is of great importance to unravel the molecular mechanisms by which an adverse drug reaction can be potentially elicited. These mechanisms can be investigated by placing the pharmaco-epidemiologically detected adverse drug reaction in an information-rich context and by exploiting all currently available biomedical knowledge to substantiate it. We present a computational framework for the biological annotation of potential adverse drug reactions. First, the proposed framework investigates previous evidences on the drug-event association in the context of biomedical literature (signal filtering). Then, it seeks to provide a biological explanation (signal substantiation) by exploring mechanistic connections that might explain why a drug produces a specific adverse reaction. The mechanistic connections include the activity of the drug, related compounds and drug metabolites on protein targets, the association of protein targets to clinical events, and the annotation of proteins (both protein targets and proteins associated with clinical events) to biological pathways. Hence, the workflows for signal filtering and substantiation integrate modules for literature and database mining, in silico drug-target profiling, and analyses based on gene-disease networks and biological pathways. Application examples of these workflows carried out on selected cases of drug safety signals are discussed. The methodology and workflows presented offer a novel approach to explore the molecular mechanisms underlying adverse drug reactions.
DOI: 10.1186/1546-0096-11-45
2013
Cited 33 times
Profiling risk factors for chronic uveitis in juvenile idiopathic arthritis: a new model for EHR-based research
Juvenile idiopathic arthritis is the most common rheumatic disease in children. Chronic uveitis is a common and serious comorbid condition of juvenile idiopathic arthritis, with insidious presentation and potential to cause blindness. Knowledge of clinical associations will improve risk stratification. Based on clinical observation, we hypothesized that allergic conditions are associated with chronic uveitis in juvenile idiopathic arthritis patients. This study is a retrospective cohort study using Stanford's clinical data warehouse containing data from Lucile Packard Children's Hospital from 2000-2011 to analyze patient characteristics associated with chronic uveitis in a large juvenile idiopathic arthritis cohort. Clinical notes in patients under 16 years of age were processed via a validated text analytics pipeline. Bivariate-associated variables were used in a multivariate logistic regression adjusted for age, gender, and race. Previously reported associations were evaluated to validate our methods. The main outcome measure was presence of terms indicating allergy or allergy medications use overrepresented in juvenile idiopathic arthritis patients with chronic uveitis. Residual text features were then used in unsupervised hierarchical clustering to compare clinical text similarity between patients with and without uveitis. Previously reported associations with uveitis in juvenile idiopathic arthritis patients (earlier age at arthritis diagnosis, oligoarticular-onset disease, antinuclear antibody status, history of psoriasis) were reproduced in our study. Use of allergy medications and terms describing allergic conditions were independently associated with chronic uveitis. 
The association with allergy drugs when adjusted for known associations remained significant (OR 2.54, 95% CI 1.22-5.4). This study shows the potential of using a validated text analytics pipeline on clinical data warehouses to examine practice-based evidence for evaluating hypotheses formed during patient care. Our study reproduces four known associations with uveitis development in juvenile idiopathic arthritis patients, and reports a new association between allergic conditions and chronic uveitis in juvenile idiopathic arthritis patients.
DOI: 10.1016/j.annonc.2020.07.013
2020
Cited 24 times
An enhanced prognostic score for overall survival of patients with cancer derived from a large real-world cohort
Background: By understanding prognostic biomarkers, we gain insights into disease biology and may improve design, conduct, and data analysis of clinical trials and real-world data. In this context, we used the Flatiron Health Electronic Health Record-derived deidentified database that provides treatment outcome and biomarker data from >280 oncology centers in the USA, organized into 17 cohorts defined by cancer type. Patients and methods: In 122 694 patients, we analyzed demographic, clinical, routine hematology, and blood chemistry parameters within a Cox proportional hazard framework to derive a multivariable prognostic risk model for overall survival (OS), the ‘Real wOrld PROgnostic score (ROPRO)’. We validated ROPRO in two independent phase I and III clinical studies. Results: A total of 27 variables contributed independently and homogeneously across cancer indications to OS. In the largest cohort (advanced non-small-cell lung cancer), for example, patients with elevated ROPRO scores (upper 10%) had a 7.91-fold (95% confidence interval 7.45–8.39) increased death hazard compared with patients with low scores (lower 10%). Median survival was 23.9 months (23.3–24.5) in the lowest ROPRO quartile Q1, 14.8 months (14.4–15.2) in Q2, 9.4 months (9.1–9.7) in Q3, and 4.7 months (4.6–4.8) in Q4. The ROPRO model performance indicators [C-index = 0.747 (standard error 0.001), 3-month area under the curve (AUC) = 0.822 (0.819–0.825)] strongly outperformed those of the Royal Marsden Hospital Score [C-index = 0.54 (standard error 0.0005), 3-month AUC = 0.579 (0.577–0.581)]. We confirmed the high prognostic relevance of ROPRO in clinical phase I and III trials. Conclusions: The ROPRO provides improved prognostic power for OS. In oncology clinical development, it has great potential for applications in patient stratification, patient enrichment strategies, data interpretation, and early decision-making in clinical studies.
DOI: 10.1097/ede.0000000000001338
2021
Cited 15 times
Deep Learning-based Propensity Scores for Confounding Control in Comparative Effectiveness Research
Background: Due to the non-randomized nature of real-world data, prognostic factors need to be balanced, which is often done by propensity scores (PSs). This study aimed to investigate whether autoencoders, which are unsupervised deep learning architectures, might be leveraged to compute PS. Methods: We selected patient-level data of 128,368 first-line treated cancer patients from the Flatiron Health EHR-derived de-identified database. We trained an autoencoder architecture to learn a lower-dimensional patient representation, which we used to compute PS. To compare the performance of an autoencoder-based PS with established methods, we performed a simulation study. We assessed the balancing and adjustment performance using standardized mean differences, root mean square errors (RMSE), percent bias, and confidence interval coverage. To illustrate the application of the autoencoder-based PS, we emulated the PRONOUNCE trial by applying the trial’s protocol elements within an observational database setting, comparing two chemotherapy regimens. Results: All methods but the manual variable selection approach led to well-balanced cohorts with average standardized mean differences <0.1. LASSO yielded on average the lowest deviation of resulting estimates (RMSE 0.0205) followed by the autoencoder approach (RMSE 0.0248). Altering the hyperparameter setup in sensitivity analysis, the autoencoder approach led to similar results as LASSO (RMSE 0.0203 and 0.0205, respectively). In the case study, all methods provided a similar conclusion with point estimates clustered around the null (e.g., HR autoencoder 1.01 [95% confidence interval = 0.80, 1.27] vs. HR PRONOUNCE 1.07 [0.83, 1.36]). Conclusions: Autoencoder-based PS computation was a feasible approach to control for confounding but did not perform better than some established approaches like LASSO.
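The balance criterion mentioned in this abstract, a standardized mean difference below 0.1, can be sketched for a single continuous covariate as follows. This uses the common pooled-standard-deviation definition; the abstract does not state which variant the authors used, so the formula choice, the function name, and the toy ages are assumptions.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def standardized_mean_difference(treated: Sequence[float],
                                 control: Sequence[float]) -> float:
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2).

    Uses sample variances; returns 0.0 when both groups are constant.
    """
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical ages in two matched cohorts.
smd_age = standardized_mean_difference([61, 64, 67, 70], [60, 65, 66, 71])
balanced = abs(smd_age) < 0.1  # the threshold quoted in the abstract
```

In practice the statistic is computed per covariate and then averaged or inspected individually.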
DOI: 10.1186/1471-2105-10-s8-s6
2009
Cited 29 times
From SNPs to pathways: integration of functional effect of sequence variations on models of cell signalling pathways
Single nucleotide polymorphisms (SNPs) are the most frequent type of sequence variation between individuals, and represent a promising tool for finding genetic determinants of complex diseases and understanding the differences in drug response. In this regard, it is of particular interest to study the effect of non-synonymous SNPs in the context of biological networks such as cell signalling pathways. UniProt provides curated information about the functional and phenotypic effects of sequence variation, including SNPs, as well as on mutations of protein sequences. However, no strategy has been developed to integrate this information with biological networks, with the ultimate goal of studying the impact of the functional effect of SNPs in the structure and dynamics of biological networks. First, we identified the different challenges posed by the integration of the phenotypic effect of sequence variants and mutations with biological networks. Second, we developed a strategy for the combination of data extracted from public resources, such as UniProt, NCBI dbSNP, Reactome and BioModels. We generated attribute files containing phenotypic and genotypic annotations to the nodes of biological networks, which can be imported into network visualization tools such as Cytoscape. These resources allow the mapping and visualization of mutations and natural variations of human proteins and their phenotypic effect on biological networks (e.g. signalling pathways, protein-protein interaction networks, dynamic models). Finally, an example on the use of the sequence variation data in the dynamics of a network model is presented. In this paper we present a general strategy for the integration of pathway and sequence variation data for visualization, analysis and modelling purposes, including the study of the functional impact of protein sequence variations on the dynamics of signalling pathways. 
This is of particular interest when the SNP or mutation is known to be associated to disease. We expect that this approach will help in the study of the functional impact of disease-associated SNPs on the behaviour of cell signalling pathways, which ultimately will lead to a better understanding of the mechanisms underlying complex diseases.
DOI: 10.1038/s43856-021-00051-x
2021
Cited 12 times
Prior fluid and electrolyte imbalance is associated with COVID-19 mortality
The COVID-19 pandemic represents a major public health threat. Risk of death from the infection is associated with age and pre-existing comorbidities such as diabetes, dementia, cancer, and impairment of immunological, hepatic or renal function. It remains incompletely understood why some patients survive the disease, while others do not. As such, we sought to identify novel prognostic factors for COVID-19 mortality. We performed an unbiased, observational retrospective analysis of real world data. Our multivariable and univariable analyses make use of U.S. electronic health records from 122,250 COVID-19 patients in the early stages of the pandemic. Here we show that a priori diagnoses of fluid, pH and electrolyte imbalance during the year preceding the infection are associated with an increased risk of death independently of age and prior renal comorbidities. We propose that future interventional studies should investigate whether the risk of death can be alleviated by diligent and personalized management of the fluid and electrolyte balance of at-risk individuals during and before COVID-19.
DOI: 10.1371/journal.pone.0083016
2013
Cited 17 times
Gathering and Exploring Scientific Knowledge in Pharmacovigilance
Pharmacovigilance plays a key role in the healthcare domain through the assessment, monitoring and discovery of interactions amongst drugs and their effects in the human organism. However, technological advances in this field have been slowing down over the last decade due to miscellaneous legal, ethical and methodological constraints. Pharmaceutical companies started to realize that collaborative and integrative approaches boost current drug research and development processes. Hence, new strategies are required to connect researchers, datasets, biomedical knowledge and analysis algorithms, allowing them to fully exploit the true value behind state-of-the-art pharmacovigilance efforts. This manuscript introduces a new platform directed towards pharmacovigilance knowledge providers. This system, based on a service-oriented architecture, adopts a plugin-based approach to solve fundamental pharmacovigilance software challenges. With the wealth of collected clinical and pharmaceutical data, it is now possible to connect knowledge providers' analysis and exploration algorithms with real data. As a result, new strategies allow a faster identification of high-risk interactions between marketed drugs and adverse events, and enable the automated uncovering of scientific evidence behind them. With this architecture, the pharmacovigilance field has a new platform to coordinate large-scale drug evaluation efforts in a unique ecosystem, publicly available at http://bioinformatics.ua.pt/euadr/.
DOI: 10.3389/frai.2021.625573
2021
Cited 9 times
Artificial Intelligence for Prognostic Scores in Oncology: a Benchmarking Study
Introduction: Prognostic scores are important tools in oncology to facilitate clinical decision-making based on patient characteristics. To date, classic survival analysis using Cox proportional hazards regression has been employed in the development of these prognostic scores. With the advance of analytical models, this study aimed to determine if more complex machine-learning algorithms could outperform classical survival analysis methods. Methods: In this benchmarking study, two datasets were used to develop and compare different prognostic models for overall survival in pan-cancer populations: a nationwide EHR-derived de-identified database for training and in-sample testing and the OAK (phase III clinical trial) dataset for out-of-sample testing. A real-world database comprised 136K first-line treated cancer patients across multiple cancer types and was split into a 90% training and 10% testing dataset, respectively. The OAK dataset comprised 1,187 patients diagnosed with non-small cell lung cancer. To assess the effect of the covariate number on prognostic performance, we formed three feature sets with 27, 44 and 88 covariates. In terms of methods, we benchmarked ROPRO, a prognostic score based on the Cox model, against eight complex machine-learning models: regularized Cox, Random Survival Forests (RSF), Gradient Boosting (GB), DeepSurv (DS), Autoencoder (AE) and Super Learner (SL). The C-index was used as the performance metric to compare different models. Results: For in-sample testing on the real-world database the resulting C-index [95% CI] values for RSF 0.720 [0.716, 0.725], GB 0.722 [0.718, 0.727], DS 0.721 [0.717, 0.726] and lastly, SL 0.723 [0.718, 0.728] showed significantly better performance as compared to ROPRO 0.701 [0.696, 0.706]. Similar results were derived across all feature sets. However, for the out-of-sample validation on OAK, the stronger performance of the more complex models was not apparent anymore. 
Consistently, the increase in the number of prognostic covariates did not lead to an increase in model performance. Discussion: The stronger performance of the more complex models did not generalize when applied to an out-of-sample dataset. We hypothesize that future research may benefit by adding multimodal data to exploit advantages of more complex models.
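For readers unfamiliar with the performance metric named in this abstract, the following is a minimal sketch of Harrell's concordance index (C-index) in plain Python. It is not the study's code: the function name `concordance_index`, the handling of tied times (skipped here for simplicity) and the toy inputs are illustrative assumptions.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable patient pairs in which the
    patient with the shorter survival time has the higher risk score.

    times       -- observed follow-up times
    events      -- 1 if death was observed, 0 if the patient was censored
    risk_scores -- model output, higher means worse prognosis
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Simplification (assumption): pairs with tied times are skipped.
        # A pair is comparable only if the shorter time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied risk scores count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third one censored.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported differences of roughly 0.02 between models should be read.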
DOI: 10.1124/mol.109.060103
2009
Cited 10 times
A Novel Multilevel Statistical Method for the Study of the Relationships between Multireceptorial Binding Affinity Profiles and In Vivo Endpoints
The present work introduces a novel method for drug research based on the sequential building of linked multivariate statistical models, each one introducing a different level of drug description. The use of multivariate methods allows us to overcome the traditional one-target assumption and to link in vivo endpoints with drug binding profiles, involving multiple receptors. The method starts with a set of drugs, for which in vivo or clinical observations and binding affinities for potentially relevant receptors are known, and allows obtaining predictions of the in vivo endpoints highlighting the most influential receptors. Moreover, provided that the structure of the receptor binding sites is known (experimentally or by homology modeling), the proposed method also highlights receptor regions and ligand-receptor interactions that are more likely to be linked to the in vivo endpoints, which is information of high interest for the design of novel compounds. The method is illustrated by a practical application dealing with the study of the metabolic side effects of antipsychotic drugs. Herein, the method detects related receptors confirmed by experimental results. Moreover, the use of structural models of the receptor binding sites allows identifying regions and ligand-receptor interactions that are involved in the discrimination between antipsychotic drugs that show metabolic side effects and those that do not. The structural results suggest that the topology of a hydrophobic sandwich involving residues in transmembrane helices (TM) 3, 5, and 6, as well as the assembling of polar residues in TM5, are important discriminators between target/antitarget receptors. Ultimately, this will provide useful information for the design of safer compounds inducing fewer side effects.
2013
Cited 8 times
Network analysis of unstructured EHR data for clinical research.
In biomedical research, network analysis provides a conceptual framework for interpreting data from high-throughput experiments. For example, protein-protein interaction networks have been successfully used to identify candidate disease genes. Recently, advances in clinical text processing and the increasing availability of clinical data have enabled analogous analyses on data from electronic medical records. We constructed networks of diseases, drugs, medical devices and procedures using concepts recognized in clinical notes from the Stanford clinical data warehouse. We demonstrate the use of the resulting networks for clinical research informatics in two ways (cohort construction and outcomes analysis) by examining the safety of cilostazol in peripheral artery disease patients as a use case. We show that the network-based approaches can be used for constructing patient cohorts as well as for analyzing differences in outcomes by comparing with standard methods, and discuss the advantages offered by network-based approaches.
2013
Cited 7 times
Learning signals of adverse drug-drug interactions from the unstructured text of electronic health records.
Drug-drug interactions (DDI) account for 30% of all adverse drug reactions, which are the fourth leading cause of death in the US. Current methods for post-marketing surveillance primarily use spontaneous reporting systems for learning DDI signals and validate their signals using the structured portions of Electronic Health Records (EHRs). We demonstrate a fast, annotation-based approach, which uses standard odds ratios for identifying signals of DDIs from the textual portion of EHRs directly and which, to our knowledge, is the first effort of its kind. We developed a gold standard of 1,120 DDIs spanning 14 adverse events and 1,164 drugs. Our evaluations on this gold standard using millions of clinical notes from the Stanford Hospital confirm that identifying DDI signals from clinical text is feasible (AUROC=81.5%). We conclude that the text in EHRs contains valuable information for learning DDI signals and has enormous utility in drug surveillance and clinical decision support.
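The "standard odds ratios" mentioned in this abstract can be illustrated with a small sketch. The abstract does not say how the contingency table is built from clinical notes, so the 2x2 layout, the function name `odds_ratio`, the continuity correction and the counts below are all assumptions made for illustration only.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    a -- patients exposed to the drug pair with the adverse event
    b -- patients exposed to the drug pair without the adverse event
    c -- unexposed patients with the adverse event
    d -- unexposed patients without the adverse event
    z -- normal quantile for the interval (1.96 gives roughly 95%)
    """
    if min(a, b, c, d) == 0:
        # Haldane-Anscombe correction so that empty cells do not divide by zero.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper


# Hypothetical counts, not taken from the paper.
or_est, or_lo, or_hi = odds_ratio(20, 80, 10, 90)
```

In a signal-detection setting one would typically flag a drug pair only when the lower bound of the interval exceeds 1; whether the authors applied such a threshold is not stated in the abstract.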
2013
Cited 5 times
Pharmacovigilance using Clinical Text.
The current state of the art in post-marketing drug surveillance utilizes voluntarily submitted reports of suspected adverse drug reactions. We present data mining methods that transform unstructured patient notes taken by doctors, nurses and other clinicians into a de-identified, temporally ordered, patient-feature matrix using standardized medical terminologies. We demonstrate how to use the resulting high-throughput data to monitor for adverse drug events based on the clinical notes in the EHR.
DOI: 10.1145/1871437.1871744
2010
Cited 3 times
Digging for knowledge with information extraction
We present the information extraction system Text2SemRel. The system (semi-) automatically constructs knowledge bases from textual data consisting of facts about entities using semantic relations. An integral part of the system is a graph-based interactive visualization and search layer. The second contribution in this paper is the presentation of a case study on the (semi-) automatic construction of a knowledge base consisting of gene-disease associations. The resulting knowledge base, the Literature-derived Human Gene-Disease Network (LHGDN), is now an integral part of the Linked Life Data initiative and represents currently the largest publicly available gene-disease repository. The LHGDN is compared against several curated state of the art databases. A unique feature of the LHGDN is that the semantics of the associations constitute a wide variety of biomolecular conditions.
DOI: 10.1038/clpt.2013.125
2013
Response to “Logistic Regression in Signal Detection: Another Piece Added to the Puzzle”
Clinical Pharmacology & Therapeutics, Volume 94, Issue 3 (September 2013), p. 313. Letter to the Editor. Authors: R Harpaz (corresponding author; Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA); W DuMouchel (Oracle Health Sciences, Burlington, Massachusetts, USA; Observational Medical Outcomes Partnership, Bethesda, Maryland, USA); P LePendu (Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA); A Bauer-Mehren (Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA); P Ryan (Observational Medical Outcomes Partnership, Bethesda, Maryland, USA; Janssen Research and Development, Raritan, New Jersey, USA); N H Shah (Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA). First published: 12 June 2013. No abstract is available for this article.
DOI: 10.1371/annotation/8824e161-bf8f-4f78-81e0-a62a1540276d
2013
Correction: Drug-Induced Acute Myocardial Infarction: Identifying ‘Prime Suspects’ from Electronic Healthcare Records-Based Surveillance System
Background: Drug-related adverse events remain an important cause of morbidity and mortality and impose huge burden on healthcare costs. Routinely collected electronic healthcare data give a good snapshot of how drugs are being used in 'real-world' settings. Objective: To describe a strategy that identifies potentially drug-induced acute myocardial infarction (AMI) from a large international healthcare data network. Methods: Post-marketing safety surveillance was conducted in seven population-based healthcare databases in three countries (Denmark, Italy, and the Netherlands) using anonymised demographic, clinical, and prescription/dispensing data representing 21,171,291 individuals with 154,474,063 person-years of follow-up in the period 1996-2010. Primary care physicians' medical records and administrative claims containing reimbursements for filled prescriptions, laboratory tests, and hospitalisations were evaluated using a three-tier triage system of detection, filtering, and substantiation that generated a list of drugs potentially associated with AMI. Outcome of interest was statistically significant increased risk of AMI during drug exposure that has not been previously described in current literature and is biologically plausible. Results: Overall, 163 drugs were identified to be associated with increased risk of AMI during preliminary screening. Of these, 124 drugs were eliminated after adjustment for possible bias and confounding. With subsequent application of criteria for novelty and biological plausibility, association with AMI remained for nine drugs ('prime suspects'): azithromycin; erythromycin; roxithromycin; metoclopramide; cisapride; domperidone; betamethasone; fluconazole; and megestrol acetate. Limitations: Although global health status, co-morbidities, and time-invariant factors were adjusted for, residual confounding cannot be ruled out. Conclusion: A strategy to identify potentially drug-induced AMI from electronic healthcare data has been proposed that takes into account not only statistical association, but also public health relevance, novelty, and biological plausibility. Although this strategy needs to be further evaluated using other healthcare data sources, the list of 'prime suspects' makes a good starting point for further clinical, laboratory, and epidemiologic investigation.
DOI: 10.1200/cci.23.00062
2023
Correlation Between Early Trends of a Prognostic Biomarker and Overall Survival in Non–Small-Cell Lung Cancer Clinical Trials
PURPOSE Overall survival (OS) is the primary end point in phase III oncology trials. Given low success rates, surrogate end points, such as progression-free survival or objective response rate, are used in early go/no-go decision making. Here, we investigate whether early trends of OS prognostic biomarkers, such as the ROPRO and DeepROPRO, can also be used for this purpose. METHODS Using real-world data, we emulated a series of 12 advanced non–small-cell lung cancer (aNSCLC) clinical trials, originally conducted by six different sponsors and evaluating four different mechanisms, in a total of 19,920 individuals. We evaluated early trends (until 6 months) of the OS biomarker alongside early OS within the joint model (JM) framework. Study-level estimates of early OS and ROPRO trends were correlated against the actual final OS hazard ratios (HRs). RESULTS We observed a strong correlation between the JM estimates and final OS HR at 3 months (adjusted R² = 0.88) and at 6 months (adjusted R² = 0.85). In the leave-one-out analysis, there was a low overall prediction error of the OS HR at both 3 months (root-mean-square error [RMSE] = 0.11) and 6 months (RMSE = 0.12). In addition, at 3 months, the absolute prediction error of the OS HR was lower than 0.05 for three trials. CONCLUSION We describe a pipeline to predict trial OS HRs using emulated aNSCLC studies and their early OS and OS biomarker trends. The method has the potential to accelerate and improve decision making in drug development.
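The leave-one-out prediction error reported in this abstract can be illustrated with a short sketch. The paper's predictor comes from a joint model; here an ordinary least-squares line stands in for it, which is a deliberate simplification. The function names and the toy numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_rmse(early_estimates, final_hrs):
    """Hold out each trial in turn, refit on the remaining trials, predict
    the held-out final hazard ratio, and return the RMSE of those errors."""
    errors = []
    for i in range(len(early_estimates)):
        train_x = early_estimates[:i] + early_estimates[i + 1:]
        train_y = final_hrs[:i] + final_hrs[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        errors.append(final_hrs[i] - (slope * early_estimates[i] + intercept))
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical study-level values, not taken from the paper.
toy_early = [0.60, 0.75, 0.85, 0.95, 1.05]
toy_final = [0.65, 0.72, 0.88, 0.97, 1.02]
toy_rmse = leave_one_out_rmse(toy_early, toy_final)
```

Because every prediction is made for a trial that the fit never saw, the resulting RMSE is a less optimistic error estimate than the in-sample residuals behind an R² value.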
DOI: 10.1002/cpt.3109
2023
Matching by OS Prognostic Score to Construct External Controls in Lung Cancer Clinical Trials
External controls (eControls) leverage historical data to create non-randomized control arms. The lack of randomization can result in confounding between the experimental and eControl cohorts. To balance potentially confounding variables between the cohorts, one of the proposed methods is to match on prognostic scores. Still, the performance of prognostic scores to construct eControls in oncology has not been analyzed yet. Using an electronic health record-derived de-identified database, we constructed eControls using one of three methods: ROPRO, a state-of-the-art prognostic score, or a propensity score composed of either five (5Vars) or 27 covariates (ROPROvars). We compared the performance of these methods in estimating the overall survival (OS) hazard ratio (HR) of 11 recent advanced non-small cell lung cancer trials. The ROPRO eControls had a lower OS HR error (median absolute deviation (MAD) 0.072, confidence interval (CI): 0.036-0.185) than the 5Vars (MAD 0.081, CI: 0.025-0.283) and ROPROvars eControls (MAD 0.087, CI: 0.054-0.383). Notably, the OS HR errors for all methods were even lower in the phase III studies. Moreover, the ROPRO eControl cohorts included, on average, more patients than the 5Vars (6.54%) and ROPROvars cohorts (11.7%). The eControls matched with the prognostic score reproduced the controls more reliably than propensity scores composed of the underlying variables. Additionally, prognostic scores could allow eControls to be built on many prognostic variables without a significant increase in the variability of the propensity score, which would decrease the number of matched patients.
DOI: 10.1371/annotation/695450aa-95a0-491d-804d-470cbfa861e8
2012
Correction: Automatic Filtering and Substantiation of Drug Safety Signals
Drug safety issues pose serious health threats to the population and constitute a major cause of mortality worldwide. Due to the prominent implications to both public health and the pharmaceutical industry, it is of great importance to unravel the molecular mechanisms by which an adverse drug reaction can be potentially elicited. These mechanisms can be investigated by placing the pharmaco-epidemiologically detected adverse drug reaction in an information-rich context and by exploiting all currently available biomedical knowledge to substantiate it. We present a computational framework for the biological annotation of potential adverse drug reactions. First, the proposed framework investigates previous evidences on the drug-event association in the context of biomedical literature (signal filtering). Then, it seeks to provide a biological explanation (signal substantiation) by exploring mechanistic connections that might explain why a drug produces a specific adverse reaction. The mechanistic connections include the activity of the drug, related compounds and drug metabolites on protein targets, the association of protein targets to clinical events, and the annotation of proteins (both protein targets and proteins associated with clinical events) to biological pathways. Hence, the workflows for signal filtering and substantiation integrate modules for literature and database mining, in silico drug-target profiling, and analyses based on gene-disease networks and biological pathways. Application examples of these workflows carried out on selected cases of drug safety signals are discussed. The methodology and workflows presented offer a novel approach to explore the molecular mechanisms underlying adverse drug reactions.
2012
Abstract 15727: Analyzing Unstructured Clinical Notes for Phase IV Drug Safety Surveillance
Background: The current state of the art in post-marketing drug surveillance utilizes large collections of submitted reports to detect adverse drug reactions. However, given the limitations of reporting systems, there is an opportunity to meaningfully use electronic health records (EHRs) for next-generation signal detection and advancement of drug safety surveillance. Methods and Results: We present data mining methods that transform unstructured patient notes taken by doctors, nurses and other clinicians into a de-identified, temporally ordered, patient-feature matrix using standardized medical terminologies. We demonstrate how to use the resulting high-throughput data to monitor for adverse drug events based on the clinical notes in the EHR. We analyze the patterns of mentions of disease conditions, drug names, their co-mentions and the temporal ordering of the drugs and diseases, in the output of our text processing pipeline to detect known associations between drugs and their adverse effects. Overall, our methods provide highly accurate results (72% Sensitivity, 83% Specificity) as measured by testing a set of 25 known drug recalls and a set of 200 “true negative” drug-adverse event associations. We are able to detect associations between a drug and the adverse event that results in its recall, on average, 2 years ahead of the official notification. We also describe methods for investigating a suspect drug-adverse event association using stratification, propensity score matching and show that matching based on co-morbidities and co-prescriptions may correct for confounding from unobserved variables; thus making the data-mining methods robust against confounding. Conclusion: We conclude that data-mining of unstructured clinical notes via ontology driven methods can enable meaningful use of the EHR for post-marketing drug surveillance. Such data-mining can be used for hypothesis generation and for rapid retrospective analysis of suspected adverse event risk.
2013
Learning Practice-based Evidence from Unstructured Clinical Notes.
2012
Triage and Evaluation of Potential Safety Signals Identified from Electronic Healthcare Record Databases
DOI: 10.1371/journal.pone.0083016.g005
2013
EU-ADR Web Platform workspace interface for an undisclosed drug (XYZ) exploration scenario containing the signal list that results from distributed knowledge provider algorithm outputs and evidence combination statistical analysis.
2010
Integrative approaches to investigate the molecular basis of diseases and adverse drug reactions: from multivariate statistical analysis to systems biology
2010
DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks
DOI: 10.1158/1538-7445.am2018-1027
2018
Abstract 1027: Regulatory T-cell genes drive altered immune microenvironment in adult solid cancers and allow for immune contextual patient subtyping
The tumor microenvironment is an important factor in cancer immunotherapy response. To further understand how a tumor affects the local immune system, we analyzed immune gene expression differences between matching normal and tumor tissues. We analyzed previously published and new gene expression data from solid cancers and isolated immune cell populations. We also determined the correlation between CD8, FoxP3 immunohistochemistry (IHC) and immune-related genes. Across solid TCGA cancers, we observed that regulatory T-cells (Tregs) were one of the main drivers of immune gene expression differences between normal and tumor tissues. A tumor-specific CD8 signature had slightly lower scores in tumor tissues compared to normal of most (12 of 16) cancers, while a Treg signature score was higher in tumor tissues of all cancers except liver. We clustered TCGA colorectal samples (626 patients) and a new separate testing data set (60 patients) into two groups according to Treg gene signature expression. The High Treg cluster had more colorectal tumors that were Consensus Molecular Subtype 1/4, right-sided and microsatellite-instable, compared to the Low Treg cluster. Finally, we determined the correlation between CD8, FoxP3 immunohistochemistry (IHC) and our gene signatures and found that in this small data set correlation between signature and IHC overall was low, but samples in the High Treg cluster had significantly more CD8+ and FoxP3+ cells compared to the Low Treg cluster. We conclude that high Treg signature expression scores correlate with high overall immune gene expression. Using this novel way of classifying patients, more colorectal tumors with high immune activation were identified compared to other colorectal subtyping methods. Further research will reveal if this Treg-based subtyping improves the identification of patients that may benefit from cancer immunotherapy.
Citation Format: Jurriaan Brouwer, Wei-Yi Cheng, Anna Bauer-Mehren, Daniela Maisel, Katharina Lechner, Emilia Andersson, Joel T. Dudley, Francesca Milletti. Regulatory T-cell genes drive altered immune microenvironment in adult solid cancers and allow for immune contextual patient subtyping [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr 1027.
DOI: 10.1101/2022.10.11.22280399
2022
Longitudinal assessment of ROPRO as an early indicator of overall survival in oncology clinical trials: a retrospective analysis
Background: The gold standard to evaluate treatment efficacy in oncology clinical trials is Overall Survival (OS). Its utility, however, is limited by the need for long trial duration and large sample sizes. Thus methods such as Progression-Free Survival (PFS) are applied to obtain early OS estimates across clinical trial phases, particularly to decide on further development of new molecular entities. Especially for cancer-immunotherapy, these established methods may be less suitable. Therefore, alternative approaches to obtain early OS estimates are required. In this work, we present a first evaluation of a new method, ΔRisk. ΔRisk uses the ROPRO, a state-of-the-art pan-cancer OS prognostic score, or DeepROPRO to predict OS benefit by measuring the patient’s improvement since baseline. Patients and methods: We modeled the ΔRisk using Joint Models and tested whether a significant ΔRisk decrease correlated with OS improvement. We studied this hypothesis by comparing classical OS analysis against ΔRisk in a retrospective analysis of 12 real-world data emulated clinical trials, and 3 additional recent phase III immunotherapy clinical trials. Results: Our new ΔRisk method correlated with the final OS readout in 14 out of 15 clinical trials. The ΔRisk, however, identified the treatment benefit up to seven months earlier than the OS log-rank test. Additionally, in two immunotherapy trials where PFS would have failed as an early OS estimate, the ΔRisk correctly predicted the treatment benefit. Conclusions: We introduced a new method, ΔRisk, and demonstrated its correlation with OS. In retrospective analysis, ΔRisk is able to identify OS benefit earlier than standard methodology, and we show examples of lung cancer trials, where it maintains its predictive relevance whereas PFS does not correlate with OS. ΔRisk may prove useful for early decision support resulting in reduced need of resources. We also show the potential of ΔRisk as a candidate to define surrogate endpoints. To this purpose, more methodological work and further investigation of treatment-specific performance will be done in the future.
DOI: 10.1016/j.annonc.2020.10.467
2021
Prognostic models: clinical impact now within reach
We thank Dr. Halabi [1] for her valuable comments on our article [2] pointing out three main challenges of prognostic models for clinical outcome: (1) generalisability across cancer types, (2) clinical utility and (3) overfitting and availability of orthogonal patient data. We presented the Real wOrld PROgnostic score (ROPRO) as a baseline prognostic score for overall survival (OS), composed of 27 routinely measured clinical parameters. ROPRO is a pan-cancer score which yields consistent performance across 17 different cancer types [2] (supplementary material). In fact, we set out to develop cancer-specific models; however, since we observed high similarity of parameter estimates across indications, we changed our focus towards a general pan-cancer model. ROPRO may largely perform well because it captures features that are common across different indications and thus measures ‘general patient fitness’, relevant for OS prognosis. Assessment of our patients' performance status (PS) to predict survival has always been important for toxicity monitoring, treatment selection and clinical trial eligibility. Some existing tools to determine PS may not be suitable for novel therapies, such as cancer immunotherapy [3], due to their subjective assessment with limited reliability and restricted predictive value in patients with better PS. Tools such as ROPRO may address an unmet need by being more objective and discriminatory. ROPRO is a strong predictor of short-term OS (3–18 months) and can serve as an accompanying biomarker. Long-term prognostic power of ROPRO and validation in large patient cohorts of different ethnicities is needed. We have thus started and encourage testing ROPRO in different patient cohorts for further validation. The need for dynamic models as pointed out by Halabi [1] is also highly relevant to clinicians. As ROPRO consists of routinely collected clinical parameters, it is possible to compute ROPRO (patient fitness) over time, which may inform patient management, such as switching therapy early in case of inactivity. Overfitting refers to choosing a highly parameterised model that fits one particular dataset exceptionally well, but fails to generalise, i.e. reliably predict on new data. To control and test for overfitting, we first used FWER-controlled forward/backward selection and cross-validated LASSO regularisation to penalise model complexity, promoting generalisability over predictive performance. Second, to test for overfitting, we utilised time-stratified data slices from FlatironHealth unavailable at the time of model building and data from 17 clinical studies (Figure 1) showing strong prognostic power with a subtle decrease of the C-index, partially attributed to the fact that patients with an Eastern Cooperative Oncology Group score > 1 are typically excluded from clinical studies. Additional information regarding the patient's medical history, biomarker and tumour-specific genetics are expected to substantially add to indication-specific prognostic power; at the same time, this will also reduce the relevant sample size. We also reported some cancer-specific models [2]. Pan-cancer and cancer-specific scores may each cover a clinical need on its own. There is clearly room for improvement regarding prognostic models. Researchers need wider access to large patient datasets capturing detailed information about each patient, the disease and the treatment. In particular, there remains an unmet clinical need for precise survival prediction to enable improved toxicity monitoring, treatment selection and assessment of clinical trial eligibility. None declared.
References
[1] Halabi S. Pan-cancer prognostic models of clinical outcomes: statistical exercise or clinical tools? Ann Oncol. 2020; 31: 1427-1429.
[2] Becker T, Weberpals J, Jegg AM, et al. An enhanced prognostic score for overall survival of patients with cancer derived from a large real-world cohort. Ann Oncol. 2020; 31: 1561-1568.
[3] Scott JM, Stene G, Edvardsen E, Jones LW. Performance status in cancer: not broken, but time for an upgrade? J Clin Oncol. 2020; 38: 2824-2829.
DOI: 10.21203/rs.3.rs-145823/v1
2021
Fluid, pH and electrolyte imbalance associated with COVID-19 mortality
The threat of COVID-19 has harried the world since early 2020. Risk of death from the infection is associated with age and pre-existing comorbidities such as diabetes, dementia, cancer, and impairment of immunological, hepatic or renal function. It still remains incompletely understood why some patients survive the disease, while others perish. Our univariate and multivariate analyses of real world data from U.S. electronic health records indicate that a priori diagnoses of fluid, pH and electrolyte imbalance are highly and independently associated with COVID-19 mortality. We propose that pre-existing homeostatic aberrations are magnified upon the loss of ACE2, which is a core component of the electrolyte management system as well as the entry point of internalizing SARS-CoV-2 viruses. Moreover, we also suggest such fragility of electrolyte homeostasis may increase the risk of plasma volume disturbances during the infection. Future interventional studies should investigate whether the risk of death can be alleviated by personalized management of the fluid and electrolyte balance of at-risk individuals before and during COVID-19.
DOI: 10.1016/j.annonc.2021.08.625
2021
CN2 ROPRO – Real-World Data Prognostic Score: A novel tool to assess patients' performance status
The assessment of our patients' performance status (PS) to predict survival has always been important for toxicity monitoring, treatment selection and clinical trial eligibility. Some existing tools to determine PS may not be suitable for decision-making due to their reliance on subjective assessment, leading to limited reliability and restricted predictive value in patients with better PS.