
Desmond G. Higgins

Here are all the papers by Desmond G. Higgins that you can download and read on OA.mg.

DOI: 10.1093/nar/22.22.4673
1994
Cited 59,709 times
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to downweight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W which is freely available.
DOI: 10.1093/nar/25.24.4876
1997
Cited 36,779 times
The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools
CLUSTAL X is a new windows interface for the widely-used progressive multiple sequence alignment program CLUSTAL W. The new system is easy to use, providing an integrated system for performing multiple sequence and profile alignments and analysing the results. CLUSTAL X displays the sequence alignment in a window on the screen. A versatile sequence colouring scheme allows the user to highlight conserved features in the alignment. Pull-down menus provide all the options required for traditional multiple sequence and profile alignment. New features include: the ability to cut-and-paste sequences to change the order of the alignment, selection of a subset of the sequences to be realigned, and selection of a sub-range of the alignment to be realigned and inserted back into the original alignment. Alignment quality analysis can be performed and low-scoring segments or exceptional residues can be highlighted. Quality analysis and realignment of selected residue ranges provide the user with a powerful tool to improve and refine difficult alignments and to trap errors in input sequences. CLUSTAL X has been compiled on SUN Solaris, IRIX5.3 on Silicon Graphics, Digital UNIX on DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.
DOI: 10.1093/bioinformatics/btm404
2007
Cited 25,102 times
Clustal W and Clustal X version 2.0
Summary: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. Availability: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/ Contact: clustalw@ucd.ie
DOI: 10.1038/msb.2011.75
2011
Cited 12,493 times
Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega
Multiple Sequence Alignments are fundamental to many sequence analysis methods. Most alignments are computed using the Progressive Alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data-sets of the size of many thousands of sequences. Some methods allow computation of larger datasets while sacrificing quality, and others produce high quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test-cases is similar to that of the high-quality aligners. On larger data-sets Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
DOI: 10.1006/jmbi.2000.4042
2000
Cited 6,499 times
T-Coffee: a novel method for fast and accurate multiple sequence alignment (edited by J. Thornton)
We describe a new method (T-Coffee) for multiple sequence alignment that provides a dramatic improvement in accuracy with a modest sacrifice in speed as compared to the most commonly used alternatives. The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm. With T-Coffee we pre-process a data set of all pair-wise alignments between the sequences. This provides us with a library of alignment information that can be used to guide the progressive alignment. Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other. This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition. Here, we illustrate the power of the approach by using a combination of local and global pair-wise alignments to generate the library. The resulting alignments are significantly more reliable, as determined by comparison with a set of 141 test cases, than any of the popular alternatives that we tried. The improvement, especially clear with the more difficult test cases, is always visible, regardless of the phylogenetic spread of the sequences in the tests.
DOI: 10.1093/nar/gkg500
2003
Cited 4,580 times
Multiple sequence alignment with the Clustal series of programs
The Clustal series of programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).
DOI: 10.1016/0378-1119(88)90330-7
1988
Cited 3,298 times
CLUSTAL: a package for performing multiple sequence alignment on a microcomputer
An approach for performing multiple alignments of large numbers of amino acid or nucleotide sequences is described. The method is based on first deriving a phylogenetic tree from a matrix of all pairwise sequence similarity scores, obtained using a fast pairwise alignment algorithm. Then the multiple alignment is achieved from a series of pairwise alignments of clusters of sequences, following the order of branching in the tree. The method is sufficiently fast and economical with memory to be easily implemented on a microcomputer, and yet the results obtained are comparable to those from packages requiring mainframe computer facilities.
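The tree-guided ordering described in this abstract can be sketched in a few lines. This is a minimal illustration only, not the CLUSTAL implementation: the similarity measure (`pairwise_identity`, an ungapped position-by-position match fraction) and the use of average-linkage clustering are assumptions made for the example, and the function names are hypothetical.

```python
from itertools import combinations

def pairwise_identity(a, b):
    """Crude similarity: fraction of matching positions over the shorter
    length. No real alignment is performed (an assumption of this sketch)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def guide_order(seqs):
    """Return the sequence of cluster merges, most similar clusters first,
    using average linkage over the all-against-all similarity matrix."""
    clusters = {i: (i,) for i in range(len(seqs))}
    sim = {frozenset((i, j)): pairwise_identity(seqs[i], seqs[j])
           for i, j in combinations(range(len(seqs)), 2)}

    def avg_sim(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    next_id = len(seqs)
    while len(clusters) > 1:
        # Pick the two clusters with the highest average similarity.
        a, b = max(combinations(sorted(clusters), 2),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Toy input: two near-identical pairs that are distant from each other.
for left, right in guide_order(["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]):
    print("align", left, "with", right)
```

Each merge corresponds to one pairwise alignment of two clusters. In a progressive aligner the closest sequences are aligned first and the resulting groups are aligned to each other last.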
DOI: 10.1016/s0968-0004(98)01285-7
1998
Cited 2,504 times
Multiple sequence alignment with Clustal X
The Clustal series of programs are widely used for multiple alignment and for preparing phylogenetic trees. The programs have undergone several incarnations, and 1997 saw the release of the Clustal W 1.7 upgrade and of Clustal X, which has a windows interface. Although we like to think that people use Clustal programs because they produce good alignments, undoubtedly one of the reasons for the programs' wide usage has been their portability to all computers. Portability can have a downside, and for many years the Clustal interface has had to be kept simple. Clustal X (Thompson J.D. et al., Nucleic Acids Res. 1997; 25: 4876-4882) now provides a much nicer, graphical user interface (see Fig. 1) for X-, Mac and PC windows, while maintaining portability. It presents alignments in which residue conservation is shown in colour, and has a very useful new tool for marking poor regions of the alignment. In addition, the user can select such regions for realignment. Thus, Clustal X adds further flexibility to the available strategies for preparing multiple alignments.
DOI: 10.1002/0471250953.bi0203s00
2002
Cited 1,566 times
Multiple Sequence Alignment Using ClustalW and ClustalX
The Clustal programs are widely used for carrying out automatic multiple alignment of nucleotide or amino acid sequences. The most familiar version is ClustalW, which uses a simple text menu system that is portable to more or less all computer systems. ClustalX features a graphical user interface and some powerful graphical utilities for aiding the interpretation of alignments and is the preferred version for interactive usage. Users may run Clustal remotely from several sites using the Web or the programs may be downloaded and run locally on PCs, Macintosh, or Unix computers. The protocols in this unit discuss how to use ClustalX and ClustalW to construct an alignment, and create profile alignments by merging existing alignments.
DOI: 10.1016/s0076-6879(96)66024-8
1996
Cited 1,502 times
[22] Using CLUSTAL for multiple sequence alignments
We have tested CLUSTAL W in a wide variety of situations, and it is capable of handling some very difficult protein alignment problems. If the data set consists of enough closely related sequences so that the first alignments are accurate, then CLUSTAL W will usually find an alignment that is very close to ideal. Problems can still occur if the data set includes sequences of greatly different lengths or if some sequences include long regions that are impossible to align with the rest of the data set. Trying to balance the need for long insertions and deletions in some alignments with the need to avoid them in others is still a problem. The default values for our parameters were tested empirically using test cases of sets of globular proteins where some information as to the correct alignment was available. The parameter values may not be very appropriate with nonglobular proteins. We have argued that using one weight matrix and two gap penalties is too simplistic to be of general use in the most difficult cases. We have replaced these parameters with a large number of new parameters designed primarily to help encourage gaps in loop regions. Although these new parameters are largely heuristic in nature, they perform surprisingly well and are simple to implement. The underlying speed of the progressive alignment approach is not adversely affected. The disadvantage is that the parameter space is now huge; the number of possible combinations of parameters is more than can easily be examined by hand. We justify this by asking the user to treat CLUSTAL W as a data exploration tool rather than as a definitive analysis method. It is not sensible to automatically derive multiple alignments and to trust particular algorithms as being capable of always getting the correct answer. One must examine the alignments closely, especially in conjunction with the underlying phylogenetic tree (or estimate of it) and try varying some of the parameters. 
Outliers (sequences that have no close relatives) should be aligned carefully, as should fragments of sequences. The program will automatically delay the alignment of any sequences that are less than 40% identical to any others until all other sequences are aligned, but this can be set from a menu by the user. It may be useful to build up an alignment of closely related sequences first and to then add in the more distant relatives one at a time or in batches, using the profile alignments and weighting scheme described earlier and perhaps using a variety of parameter settings. We give one example using SH2 domains. SH2 domains are widespread in eukaryotic signalling proteins where they function in the recognition of phosphotyrosine-containing peptides. In the chapter by Bork and Gibson ([11], this volume), Blast and pattern/profile searches were used to extract the set of known SH2 domains and to search for new members. (Profiles used in database searches are conceptually very similar to the profiles used in CLUSTAL W: see the chapters [11] and [13] for profile search methods.) The profile searches detected SH2 domains in the JAK family of protein tyrosine kinases, which were thought not to contain SH2 domains. Although the JAK family SH2 domains are rather divergent, they have the necessary core structural residues as well as the critical positively charged residue that binds phosphotyrosine, leaving no doubt that they are bona fide SH2 domains. The five new JAK family SH2 domains were added sequentially to the existing alignment of 65 SH2 domains using the CLUSTAL W profile alignment option. Figure 6 shows part of the resulting alignment. Despite their divergent sequences, the new SH2 domains have been aligned nearly perfectly with the old set. No insertions were placed in the original SH2 domains. In this example, the profile alignment procedure has produced better results than a one-step full alignment of all 70 SH2 domains, and in considerably less time. 
(ABSTRACT TRUNCATED)
DOI: 10.1007/978-1-62703-646-7_6
2013
Cited 952 times
Clustal Omega, Accurate Alignment of Very Large Numbers of Sequences
Clustal Omega is a completely rewritten and revised version of the widely used Clustal series of programs for multiple sequence alignment. It can deal with very large numbers (many tens of thousands) of DNA/RNA or protein sequences due to its use of the mBED algorithm for calculating guide trees. This algorithm allows very large alignment problems to be tackled very quickly, even on personal computers. The accuracy of the program has been considerably improved over earlier Clustal programs, through the use of the HHalign method for aligning profile hidden Markov models. The program currently is used from the command line or can be run on line.
DOI: 10.1093/nar/16.17.8207
1988
Cited 569 times
Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable within-species diversity
The genetic code is degenerate, but alternative synonymous codons are generally not used with equal frequency. Since the pioneering work of Grantham's group (1, 2) it has been apparent that genes from one species often share similarities in codon frequency; under the “genome hypothesis” (1, 2) there is a species-specific pattern to codon usage. However, it has become clear that in most species there are also considerable differences among genes (3–7). Multivariate analyses have revealed that in each species so far examined there is a single major trend in codon usage among genes, usually from highly biased to more nearly even usage of synonymous codons. Thus, to represent the codon usage pattern of an organism it is not sufficient to sum over all genes (8), as this conceals the underlying heterogeneity. Rather, it is necessary to describe the trend among genes seen in that species. We illustrate these trends for six species where codon usage has been examined in detail, by presenting the pooled codon usage for the 10% of genes at either end of the major trend (Table 1). Closely-related organisms have similar patterns of codon usage, and so the six species in Table 1 are representative of wider groups. For example, with respect to codon usage, Salmonella typhimurium closely resembles E. coli (9), while all mammalian species so far examined (principally mouse, rat and cow) largely resemble humans (4, 8).
DOI: 10.1093/nar/gkl091
2006
Cited 529 times
M-Coffee: combining multiple sequence alignment methods with T-Coffee
We introduce M-Coffee, a meta-method for assembling multiple sequence alignments (MSA) by combining the output of several individual methods into one single MSA. M-Coffee is an extension of T-Coffee and uses consistency to estimate a consensus alignment. We show that the procedure is robust to variations in the choice of constituent methods and reasonably tolerant to duplicate MSAs. We also show that performances can be improved by carefully selecting the constituent methods. M-Coffee outperforms all the individual methods on three major reference datasets: HOMSTRAD, Prefab and Balibase. We also show that on a case-by-case basis, M-Coffee is twice as likely to deliver the best alignment than any individual method. Given a collection of pre-computed MSAs, M-Coffee has similar CPU requirements to the original T-Coffee. M-Coffee is a freeware open-source package available from http://www.tcoffee.org/.
DOI: 10.1002/0471250953.bi0313s48
2014
Cited 484 times
Clustal Omega
Clustal Omega is a package for making multiple sequence alignments of amino acid or nucleotide sequences, quickly and accurately. It is a complete upgrade and rewrite of earlier Clustal programs. This unit describes how to run Clustal Omega interactively from a command line, although it can also be run online from several sites. The unit describes a basic protocol for taking a set of unaligned sequences and producing a full alignment. There are also protocols for using an external HMM or iteration to help improve an alignment.
DOI: 10.1093/bioinformatics/bti394
2005
Cited 337 times
MADE4: an R package for multivariate analysis of gene expression data
MADE4, microarray ade4, is a software package that facilitates multivariate analysis of microarray gene-expression data. MADE4 accepts a wide variety of gene-expression data formats. MADE4 takes advantage of the extensive multivariate statistical and graphical functions in the R package ade4, extending these for application to microarray data. In addition, MADE4 provides new graphical and visualization tools that aid in interpretation of multivariate analysis of microarray data.
DOI: 10.1073/pnas.1105380108
2011
Cited 331 times
Functional genome analysis of Bifidobacterium breve UCC2003 reveals type IVb tight adherence (Tad) pili as an essential and conserved host-colonization factor
Development of the human gut microbiota commences at birth, with bifidobacteria being among the first colonizers of the sterile newborn gastrointestinal tract. To date, the genetic basis of Bifidobacterium colonization and persistence remains poorly understood. Transcriptome analysis of the Bifidobacterium breve UCC2003 2.42-Mb genome in a murine colonization model revealed differential expression of a type IVb tight adherence (Tad) pilus-encoding gene cluster designated “tad 2003.” Mutational analysis demonstrated that the tad 2003 gene cluster is essential for efficient in vivo murine gut colonization, and immunogold transmission electron microscopy confirmed the presence of Tad pili at the poles of B. breve UCC2003 cells. Conservation of the Tad pilus-encoding locus among other B. breve strains and among sequenced Bifidobacterium genomes supports the notion of a ubiquitous pili-mediated host colonization and persistence mechanism for bifidobacteria.
DOI: 10.1186/1471-2105-7-359
2006
Cited 327 times
Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data
Numerous feature selection methods have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, classical t-statistic and moderated t-statistics. Even though these methods return gene lists that are often dissimilar, few direct comparisons of these exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to 9 publicly available datasets, and compare both the gene lists produced and how these perform in class prediction of test datasets. In this study, we compared the efficiency of the feature selection methods: significance analysis of microarrays (SAM), analysis of variance (ANOVA), empirical Bayes t-statistic, template matching, maxT, between group analysis (BGA), area under the receiver operating characteristic (ROC) curve, the Welch t-statistic, fold change, rank products, and sets of randomly selected genes. In each case these methods were applied to 9 different binary (two class) microarray datasets. Firstly, we found little agreement in gene lists produced by the different methods. Only 8 to 21% of genes were in common across all 10 feature selection methods. Secondly, we evaluated the class prediction efficiency of each gene list in training and test cross-validation using four supervised classifiers. We report that the choice of feature selection method, the number of genes in the gene list, the number of cases (samples) and the noise in the dataset substantially influence classification success. Recommendations are made for choice of feature selection. Area under a ROC curve performed well with datasets that had low levels of noise and large sample size. Rank products performed well when datasets had low numbers of samples or high levels of noise. The empirical Bayes t-statistic performed well across a range of sample sizes.
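Two of the simplest scores named in this abstract, fold change and the Welch t-statistic, can be sketched as follows to show why their gene lists may disagree. This is an illustrative sketch and not the code used in the study: the function names and the toy expression values are hypothetical, and the fold change assumes strictly positive, untransformed intensities.

```python
import math
from statistics import mean, variance

def log2_fold_change(a, b):
    """log2 ratio of group means; assumes positive, untransformed values."""
    return math.log2(mean(b) / mean(a))

def welch_t(a, b):
    """Welch t-statistic: difference in means over an unequal-variance
    standard error (sample variances, each divided by its group size)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se

def top_genes(expr, score, k):
    """Rank genes by absolute score, largest first, and return the top k names."""
    ranked = sorted(expr, key=lambda g: abs(score(*expr[g])), reverse=True)
    return ranked[:k]

# Hypothetical toy data: gene -> (class A values, class B values).
expr = {
    "geneX": ([1.0, 2.0, 3.0], [4.0, 8.0, 12.0]),   # large change, noisy
    "geneY": ([5.0, 5.1, 4.9], [5.6, 5.5, 5.7]),    # small change, tight
}
print(top_genes(expr, log2_fold_change, 1))
print(top_genes(expr, welch_t, 1))
```

On this toy data the two scores select different top genes: fold change favours the large but noisy difference, while the Welch statistic favours the small but consistent one.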
DOI: 10.1016/j.jmb.2004.04.058
2004
Cited 310 times
3DCoffee: Combining Protein Sequences and Structures within Multiple Sequence Alignments
Most bioinformatics analyses require the assembly of a multiple sequence alignment. It has long been suspected that structural information can help to improve the quality of these alignments, yet the effect of combining sequences and structures has not been evaluated systematically. We developed 3DCoffee, a novel method for combining protein sequences and structures in order to generate high-quality multiple sequence alignments. 3DCoffee is based on TCoffee version 2.00, and uses a mixture of pairwise sequence alignments and pairwise structure comparison methods to generate multiple sequence alignments. We benchmarked 3DCoffee using a subset of HOMSTRAD, the collection of reference structural alignments. We found that combining TCoffee with the threading program Fugue makes it possible to improve the accuracy of our HOMSTRAD dataset by four percentage points when using one structure only per dataset. Using two structures yields an improvement of ten percentage points. The measures carried out on HOM39, a HOMSTRAD subset composed of distantly related sequences, show a linear correlation between multiple sequence alignment accuracy and the ratio of number of provided structure to total number of sequences. Our results suggest that in the case of distantly related sequences, a single structure may not be enough for computing an accurate multiple sequence alignment.
DOI: 10.1111/j.1365-2672.2005.02600.x
2005
Cited 306 times
Getting better with bifidobacteria
... The last 20 years has seen a tremendous increase in commercial and consequent scientific interest in members of the genus Bifidobacterium. Bifidobacteria are Gram‐positive procaryotes that naturally inhabit the gastrointestinal tract of humans and other warm‐blooded animals. Discovered at the start of the last century, bifidobacteria are considered as key commensals in human–microbe interactions, and are believed to play a pivotal role in maintaining a healthy gastrointestinal tract. Despite the generally accepted importance of bifidobacteria in gastrointestinal well‐being, the underlying molecular mechanisms by which these bacteria function as probiotic commensal organisms is far from understood. Recent genome sequencing has given us a revealing insight into the genetic make‐up of some members of the genus Bifidobacterium, although the availability of the full genomic sequence of complete bifidobacterial sequences represents only the first step in moving towards a better understanding of the biology of these organisms. This review will discuss the role that Bifidobacterium species play as a prominent probiotic component of our gastrointestinal microflora and provide some forthcoming insights into the general characteristics of Bifidobacterium genomes.
DOI: 10.1093/nar/gkt1035
2013
Cited 223 times
GWIPS-viz: development of a ribo-seq genome browser
We describe the development of GWIPS-viz (http://gwips.ucc.ie), an online genome browser for viewing ribosome profiling data. Ribosome profiling (ribo-seq) is a recently developed technique that provides genome-wide information on protein synthesis (GWIPS) in vivo. It is based on the deep sequencing of ribosome-protected messenger RNA (mRNA) fragments, which allows the ribosome density along all mRNA transcripts present in the cell to be quantified. Since its inception, ribo-seq has been carried out in a number of eukaryotic and prokaryotic organisms. Owing to the increasing interest in ribo-seq, there is a pertinent demand for a dedicated ribo-seq genome browser. GWIPS-viz is based on The University of California Santa Cruz (UCSC) Genome Browser. Ribo-seq tracks, coupled with mRNA-seq tracks, are currently available for several genomes: human, mouse, zebrafish, nematode, yeast, bacteria (Escherichia coli K12, Bacillus subtilis), human cytomegalovirus and bacteriophage lambda. Our objective is to continue incorporating published ribo-seq data sets so that the wider community can readily view ribosome profiling information from multiple studies without the need to carry out computational processing.
DOI: 10.1002/pro.5560040817
1995
Cited 281 times
Finding flexible patterns in unaligned protein sequences
We present a new method for the identification of conserved patterns in a set of unaligned related protein sequences. It is able to discover patterns of a quite general form, allowing for both ambiguous positions and for variable length wildcard regions. It allows the user to define a class of patterns (e.g., the degree of ambiguity allowed and the length and number of gaps), and the method is then guaranteed to find the conserved patterns in this class scoring highest according to a significance measure defined. Identified patterns may be refined using one of two new algorithms. We present a new (nonstatistical) significance measure for flexible patterns. The method is shown to recover known motifs for PROSITE families and is also applied to some recently described families from the literature.
DOI: 10.1002/j.1460-2075.1994.tb06541.x
1994
Cited 235 times
Evolution of cytochrome oxidase, an enzyme older than atmospheric oxygen.
J. Castresana, M. Lübben, M. Saraste and D.G. Higgins, European Molecular Biology Laboratory, Heidelberg, Germany. The EMBO Journal (1994) 13:2516-2525, 1 June 1994. Cytochrome oxidase is a key enzyme in aerobic metabolism. All the recorded eubacterial (domain Bacteria) and archaebacterial (Archaea) sequences of subunits 1 and 2 of this protein complex have been used for a comprehensive evolutionary analysis. The phylogenetic trees reveal several processes of gene duplication. Some of these are ancient, having occurred in the common ancestor of Bacteria and Archaea, whereas others have occurred in specific lines of Bacteria.
We show that eubacterial quinol oxidase was derived from cytochrome c oxidase in Gram-positive bacteria and that archaebacterial quinol oxidase has an independent origin. A considerable amount of evidence suggests that Proteobacteria (Purple bacteria) acquired quinol oxidase through a lateral gene transfer from Gram-positive bacteria. The prevalent hypothesis that aerobic metabolism arose several times in evolution after oxygenic photosynthesis, is not sustained by two aspects of the molecular data. First, cytochrome oxidase was present in the common ancestor of Archaea and Bacteria whereas oxygenic photosynthesis appeared in Bacteria. Second, an extant cytochrome oxidase in nitrogen-fixing bacteria shows that aerobic metabolism is possible in an environment with a very low level of oxygen, such as the root nodules of leguminous plants. Therefore, we propose that aerobic metabolism in organisms with cytochrome oxidase has a monophyletic and ancient origin, prior to the appearance of eubacterial oxygenic photosynthetic organisms.
DOI: 10.1093/bioinformatics/14.5.407
1998
Cited 220 times
COFFEE: an objective function for multiple sequence alignments.
In order to increase the accuracy of multiple sequence alignments, we designed a new strategy for optimizing multiple sequence alignments by genetic algorithm. We named it COFFEE (Consistency based Objective Function For alignmEnt Evaluation). The COFFEE score reflects the level of consistency between a multiple sequence alignment and a library containing pairwise alignments of the same sequences. We show that multiple sequence alignments can be optimized for their COFFEE score with the genetic algorithm package SAGA. The COFFEE function is tested on 11 test cases made of structural alignments extracted from 3D_ali. These alignments are compared to those produced using five alternative methods. Results indicate that COFFEE outperforms the other methods when the level of identity between the sequences is low. Accuracy is evaluated by comparison with the structural alignments used as references. We also show that the COFFEE score can be used as a reliability index on multiple sequence alignments. Finally, we show that given a library of structure-based pairwise sequence alignments extracted from FSSP, SAGA can produce high-quality multiple sequence alignments. The main advantage of COFFEE is its flexibility. With COFFEE, any method suitable for making pairwise alignments can be extended to making multiple alignments. The package is available along with the test cases through the WWW: http://www.ebi.ac.uk/cedric Contact: cedric.notredame@ebi.ac.uk
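The idea of scoring a multiple alignment by its consistency with a library of pairwise alignments can be illustrated with a small sketch. This is a simplified, unweighted stand-in and not the published COFFEE function: the normalisation (shared residue pairs divided by residue pairs aligned in the multiple alignment), the data layout of `library`, and the function names are all assumptions made for the example.

```python
def aligned_pairs(row_a, row_b):
    """Set of (residue index in a, residue index in b) pairs that share a
    column in a gapped pairwise alignment; '-' marks a gap."""
    pairs, i, j = set(), 0, 0
    for x, y in zip(row_a, row_b):
        if x != '-' and y != '-':
            pairs.add((i, j))
        if x != '-':
            i += 1
        if y != '-':
            j += 1
    return pairs

def consistency_score(msa, library):
    """Fraction of residue pairs aligned in `msa` that are also aligned in
    the pairwise `library`.

    msa: list of equal-length gapped strings.
    library: dict mapping (a, b) with a < b to a (row_a, row_b) alignment.
    """
    shared = total = 0
    for a in range(len(msa)):
        for b in range(a + 1, len(msa)):
            in_msa = aligned_pairs(msa[a], msa[b])
            in_lib = aligned_pairs(*library[(a, b)])
            shared += len(in_msa & in_lib)
            total += len(in_msa)
    return shared / total if total else 0.0

# Toy case: the library places the gap at the end, the MSA in the middle.
msa = ["AC-GT", "ACAGT"]
library = {(0, 1): ("ACGT-", "ACAGT")}
print(consistency_score(msa, library))
```

An optimiser such as a genetic algorithm would then search for the multiple alignment that maximises a score of this kind.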
DOI: 10.1073/pnas.88.8.3140
1991
Cited 216 times
Plasmodium falciparum appears to have arisen as a result of lateral transfer between avian and human hosts.
It has been proposed that the acquisition of Plasmodium falciparum by man is a relatively recent event and that the sustained presence of this disease in man is unlikely to have been possible prior to the establishment of agriculture. To establish phylogenetic relationships among the Plasmodium species and to unravel the mystery of the origin of P. falciparum, we have analyzed and compared phylogenetically the small-subunit ribosomal RNA gene sequences of the species of malaria that infect humans as well as a number of those sequences from species that infect animals. Although this comparison confirmed the three established major subgroups, broadly classed as avian, simian, and rodent, we find that the human pathogen P. falciparum is monophyletic with the avian subgroup, indicating that P. falciparum and avian parasites share a relatively recent avian progenitor. The other important human pathogen, P. vivax, is very similar to a representative of the simian group of Plasmodium. The relationship between P. falciparum and the avian parasites, and the overall phylogeny of the genus, provides evidence of an exception to Farenholz's rule, which propounds synchronous speciation between host and parasite.
DOI: 10.1093/bioinformatics/18.12.1600
2002
Cited 214 times
Between-group analysis of microarray data
Motivation: Most supervised classification methods are limited by the requirement for more cases than variables. In microarray data the number of variables (genes) far exceeds the number of cases (arrays), and thus filtering and pre-selection of genes is required. We describe the application of Between Group Analysis (BGA) to the analysis of microarray data. A feature of BGA is that it can be used when the number of variables (genes) exceeds the number of cases (arrays). BGA is based on carrying out an ordination of groups of samples, using a standard method such as Correspondence Analysis (COA), rather than an ordination of the individual microarray samples. As such, it can be viewed as a method of carrying out COA with grouped data. Results: We illustrate the power of the method using two cancer data sets. In both cases, we can quickly and accurately classify test samples from any number of specified a priori groups and identify the genes which characterize these groups. We obtained very high rates of correct classification, as determined by jack-knife or validation experiments with training and test sets. The results are comparable to those from other methods in terms of accuracy but the power and flexibility of BGA make it an especially attractive method for the analysis of microarray cancer data. Availability: The methods described are implemented in ADE-4 which runs under MacOS and Windows, and is freely available at http://pbil.univ-lyon1.fr/ADE-4/. All scripts are available on request. Contact: A.Culhane@ucc.ie Supplementary information: Supplementary figures and tables are available at http://bioinfo.ucc.ie/BGA/.
DOI: 10.1093/nar/gkm333
2007
Cited 208 times
The M-Coffee web server: a meta-method for computing multiple sequence alignments by combining alternative alignment methods
The M-Coffee server is a web server that makes it possible to compute multiple sequence alignments (MSAs) by running several MSA methods and combining their output into one single model. This allows the user to simultaneously run all his methods of choice without having to arbitrarily choose one of them. The MSA is delivered along with a local estimation of its consistency with the individual MSAs it was derived from. The computation of the consensus multiple alignment is carried out using a special mode of the T-Coffee package [Notredame, Higgins and Heringa (T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000; 302: 205–217); Wallace, O'Sullivan, Higgins and Notredame (M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 2006; 34: 1692–1699)]. Given a set of sequences (DNA or proteins) in FASTA format, M-Coffee delivers a multiple alignment in the most common formats. M-Coffee is a freeware open source package distributed under a GPL license and it is available either as a standalone package or as a web service from www.tcoffee.org.
DOI: 10.1073/pnas.0511060103
2006
Cited 207 times
Multireplicon genome architecture of Lactobacillus salivarius
Lactobacillus salivarius subsp. salivarius strain UCC118 is a bacteriocin-producing strain with probiotic characteristics. The 2.13-Mb genome was shown by sequencing to comprise a 1.83 Mb chromosome, a 242-kb megaplasmid (pMP118), and two smaller plasmids. Megaplasmids previously have not been characterized in lactic acid bacteria or intestinal lactobacilli. Annotation of the genome sequence indicated an intermediate level of auxotrophy compared with other sequenced lactobacilli. No single-copy essential genes were located on the megaplasmid. However, contingency amino acid metabolism genes and carbohydrate utilization genes, including two genes for completion of the pentose phosphate pathway, were megaplasmid encoded. The megaplasmid also harbored genes for the Abp118 bacteriocin, a bile salt hydrolase, a presumptive conjugation locus, and other genes potentially relevant for probiotic properties. Two subspecies of L. salivarius are recognized, salivarius and salicinius, and we detected megaplasmids in both subspecies by pulsed-field gel electrophoresis of sizes ranging from 100 kb to 380 kb. The discovery of megaplasmids of widely varying size in L. salivarius suggests a possible mechanism for genome expansion or contraction to adapt to different environments.
DOI: 10.1385/0-89603-276-0:307
1994
Cited 174 times
Clustal V: Multiple Alignment of DNA and Protein Sequences
CLUSTAL is a package for performing fast and reliable automatic multiple alignment of many DNA or protein sequences. It was originally written for IBM-compatible microcomputers (1,2) and was later reorganized as a single program for VAX mainframes. Recently (3), the package was completely rewritten as a new program, CLUSTAL V, which is freely available for a wide variety of computer systems and which has a number of new features. The main improvements are the calculation of phylogenetic trees from sequence data sets with a bootstrap option for calculating confidence intervals on the groupings and the ability to align alignments with each other.
DOI: 10.1371/journal.ppat.1004365
2014
Cited 116 times
Comparative Phenotypic Analysis of the Major Fungal Pathogens Candida parapsilosis and Candida albicans
Candida parapsilosis and Candida albicans are human fungal pathogens that belong to the CTG clade in the Saccharomycotina. In contrast to C. albicans, relatively little is known about the virulence properties of C. parapsilosis, a pathogen particularly associated with infections of premature neonates. We describe here the construction of C. parapsilosis strains carrying double allele deletions of 100 transcription factors, protein kinases and species-specific genes. Two independent deletions were constructed for each target gene. Growth in >40 conditions was tested, including carbon source, temperature, and the presence of antifungal drugs. The phenotypes were compared to C. albicans strains with deletions of orthologous transcription factors. We found that many phenotypes are shared between the two species, such as the role of Upc2 as a regulator of azole resistance, and of CAP1 in the oxidative stress response. Others are unique to one species. For example, Cph2 plays a role in the hypoxic response in C. parapsilosis but not in C. albicans. We found extensive divergence between the biofilm regulators of the two species. We identified seven transcription factors and one protein kinase that are required for biofilm development in C. parapsilosis. Only three (Efg1, Bcr1 and Ace2) have similar effects on C. albicans biofilms, whereas Cph2, Czf1, Gzf3 and Ume6 have major roles in C. parapsilosis only. Two transcription factors (Brg1 and Tec1) with well-characterized roles in biofilm formation in C. albicans do not have the same function in C. parapsilosis. We also compared the transcription profile of C. parapsilosis and C. albicans biofilms. Our analysis suggests the processes shared between the two species are predominantly metabolic, and that Cph2 and Bcr1 are major biofilm regulators in C. parapsilosis.
DOI: 10.1128/ec.00159-10
2010
Cited 114 times
Regulation of the Hypoxic Response in Candida albicans
The regulation of the response of Candida albicans to hypoxic (low-oxygen) conditions is poorly understood. We used microarray and other transcriptional analyses to investigate the role of the Upc2 and Bcr1 transcription factors in controlling expression of genes involved in cell wall metabolism, ergosterol synthesis, and glycolysis during adaptation to hypoxia. Hypoxic induction of the ergosterol pathway is mimicked by treatment with sterol-lowering drugs (ketoconazole) and requires UPC2. Expression of three members of the family CFEM (common in several fungal extracellular membranes) of cell wall genes (RBT5, PGA7, and PGA10) is also induced by hypoxia and ketoconazole and requires both UPC2 and BCR1. Expression of glycolytic genes is induced by hypoxia but not by treatment with sterol-lowering drugs, whereas expression of respiratory pathway genes is repressed. However, Upc2 does not play a major role in regulating expression of genes required for central carbon metabolism. Our results indicate that regulation of gene expression in response to hypoxia in C. albicans is complex and is signaled both via lowered sterol levels and other unstudied mechanisms. We also show that induction of filamentation under hypoxic conditions requires the Ras1- and Cdc35-dependent pathway.
DOI: 10.1111/j.1365-2958.2011.07794.x
2011
Cited 98 times
The copper regulon of the human fungal pathogen Cryptococcus neoformans H99
Cryptococcus neoformans is a human fungal pathogen that is the causative agent of cryptococcosis and fatal meningitis in immuno-compromised hosts. Recent studies suggest that copper (Cu) acquisition plays an important role in C. neoformans virulence, as mutants that lack Cuf1, which activates the Ctr4 high affinity Cu importer, are hypo-virulent in mouse models. To understand the constellation of Cu-responsive genes in C. neoformans and how their expression might contribute to virulence, we determined the transcript profile of C. neoformans in response to elevated Cu or Cu deficiency. We identified two metallothionein genes (CMT1 and CMT2), encoding cysteine-rich Cu binding and detoxifying proteins, whose expression is dramatically elevated in response to excess Cu. We identified a new C. neoformans Cu transporter, CnCtr1, that is induced by Cu deficiency and is distinct from CnCtr4 and which shows significant phylogenetic relationship to Ctr1 from other fungi. Surprisingly, in contrast to other fungi, we found that induction of both CnCTR1 and CnCTR4 expression under Cu limitation, and CMT1 and CMT2 in response to Cu excess, are dependent on the CnCuf1 Cu metalloregulatory transcription factor. These studies set the stage for the evaluation of the specific Cuf1 target genes required for virulence in C. neoformans.
DOI: 10.1016/j.meegid.2013.03.021
2013
Cited 91 times
Hepatitis B virus subgenotyping: History, effects of recombination, misclassifications, and corrections
Hepatitis B virus (HBV) has evolved into phylogenetically separable genotypes and subgenotypes. Accurately assigning the subgenotype for an HBV strain is of clinical and epidemiological significance. In this paper, we review the recommendations currently employed for HBV subgenotyping, the history of HBV subgenotyping, the effects of recombination on HBV subgenotyping, and misclassifications in HBV subgenotyping, and we make suggestions to correct the misclassifications. Finally, proposals are made to guide future HBV subgenotyping.
DOI: 10.1093/nar/gkw265
2016
Cited 71 times
ProViz—a web-based visualization tool to investigate the functional and evolutionary features of protein sequences
Low-throughput experiments and high-throughput proteomic and genomic analyses have created enormous quantities of data that can be used to explore protein function and evolution. The ability to consolidate these data into an informative and intuitive format is vital to our capacity to comprehend these distinct but complementary sources of information. However, existing tools to visualize protein-related data are restricted by their presentation, sources of information, functionality or accessibility. We introduce ProViz, a powerful browser-based tool to aid biologists in building hypotheses and designing experiments by simplifying the analysis of functional and evolutionary features of proteins. Feature information is retrieved in an automated manner from resources describing protein modular architecture, post-translational modification, structure, sequence variation and experimental characterization of functional regions. These features are mapped to evolutionary information from precomputed multiple sequence alignments. Data are displayed in an interactive and information-rich yet intuitive visualization, accessible through a simple protein search interface. This allows users with limited bioinformatic skills to rapidly access data pertinent to their research. Visualizations can be further customized with user-defined data either manually or using a REST API. ProViz is available at http://proviz.ucd.ie/.
DOI: 10.1371/journal.pgen.1006404
2016
Cited 71 times
Multiple Origins of the Pathogenic Yeast Candida orthopsilosis by Separate Hybridizations between Two Parental Species
Mating between different species produces hybrids that are usually asexual and stuck as diploids, but can also lead to the formation of new species. Here, we report the genome sequences of 27 isolates of the pathogenic yeast Candida orthopsilosis. We find that most isolates are diploid hybrids, products of mating between two unknown parental species (A and B) that are 5% divergent in sequence. Isolates vary greatly in the extent of homogenization between A and B, making their genomes a mosaic of highly heterozygous regions interspersed with homozygous regions. Separate phylogenetic analyses of SNPs in the A- and B-derived portions of the genome produce almost identical trees of the isolates with four major clades. However, the presence of two mutually exclusive genotype combinations at the mating type locus, and recombinant mitochondrial genomes diagnostic of inter-clade mating, show that the species C. orthopsilosis does not have a single evolutionary origin but was created at least four times by separate interspecies hybridizations between parents A and B. Older hybrids have lost more heterozygosity. We also identify two isolates with homozygous genomes derived exclusively from parent A, which are pure non-hybrid strains. The parallel emergence of the same hybrid species from multiple independent hybridization events is common in plant evolution, but is much less documented in pathogenic fungi.
DOI: 10.1186/1471-2105-4-59
2003
Cited 136 times
Cross-platform comparison and visualisation of gene expression data using co-inertia analysis
Rapid development of DNA microarray technology has resulted in different laboratories adopting numerous different protocols and technological platforms, which has severely impacted on the comparability of array data. Current cross-platform comparison of microarray gene expression data is usually based on cross-referencing the annotation of each gene transcript represented on the arrays, extracting a list of genes common to all arrays and comparing expression data of this gene subset. Unfortunately, filtering of genes to a subset represented across all arrays often excludes many thousands of genes, because different subsets of genes from the genome are represented on different arrays. We wish to describe the application of a powerful yet simple method for cross-platform comparison of gene expression data. Co-inertia analysis (CIA) is a multivariate method that identifies trends or co-relationships in multiple datasets which contain the same samples. CIA simultaneously finds ordinations (dimension reduction diagrams) from the datasets that are most similar. It does this by finding successive axes from the two datasets with maximum covariance. CIA can be applied to datasets where the number of variables (genes) far exceeds the number of samples (arrays), as is the case with microarray analyses. We illustrate the power of CIA for cross-platform analysis of gene expression data by using it to identify the main common relationships in expression profiles on a panel of 60 tumour cell lines from the National Cancer Institute (NCI) which have been subjected to microarray studies using both Affymetrix and spotted cDNA array technology. The co-ordinates of the CIA projections of the cell lines from each dataset are graphed in a bi-plot and are connected by a line, the length of which indicates the divergence between the two datasets. Thus, CIA provides graphical representation of consensus and divergence between the gene expression profiles from different microarray platforms. Secondly, the genes that define the main trends in the analysis can be easily identified. CIA is a robust, efficient approach to coupling of gene expression datasets. CIA provides simple graphical representations of the results making it a particularly attractive method for the identification of relationships between large datasets.
DOI: 10.1371/journal.pone.0007850
2009
Cited 112 times
Widespread Dysregulation of MiRNAs by MYCN Amplification and Chromosomal Imbalances in Neuroblastoma: Association of miRNA Expression with Survival
MiRNAs regulate gene expression at a post-transcriptional level and their dysregulation can play major roles in the pathogenesis of many different forms of cancer, including neuroblastoma, an often fatal paediatric cancer originating from precursor cells of the sympathetic nervous system. We have analyzed a set of neuroblastoma (n = 145) that is broadly representative of the genetic subtypes of this disease for miRNA expression (430 loci by stem-loop RT qPCR) and for DNA copy number alterations (array CGH) to assess miRNA involvement in disease pathogenesis. The tumors were stratified and then randomly split into a training set (n = 96) and a validation set (n = 49) for data analysis. Thirty-seven miRNAs were significantly over- or under-expressed in MYCN amplified tumors relative to MYCN single copy tumors, indicating a potential role for the MYCN transcription factor in either the direct or indirect dysregulation of these loci. In addition, we also determined that there was a highly significant correlation between miRNA expression levels and DNA copy number, indicating a role for large-scale genomic imbalances in the dysregulation of miRNA expression. In order to directly assess whether miRNA expression was predictive of clinical outcome, we used the Random Forest classifier to identify miRNAs that were most significantly associated with poor overall patient survival and developed a 15 miRNA signature that was predictive of overall survival with 72.7% sensitivity and 86.5% specificity in the validation set of tumors. We conclude that there is widespread dysregulation of miRNA expression in neuroblastoma tumors caused by both over-expression of the MYCN transcription factor and by large-scale chromosomal imbalances. MiRNA expression patterns are also predictive of clinical outcome, highlighting the potential for miRNA mediated diagnostics and therapeutics.
DOI: 10.1016/j.bbadis.2007.09.005
2008
Cited 106 times
Co-regulation of Gremlin and Notch signalling in diabetic nephropathy
Diabetic nephropathy is currently the leading cause of end-stage renal disease worldwide, and occurs in approximately one third of all diabetic patients. The molecular pathogenesis of diabetic nephropathy has not been fully characterized and novel mediators and drivers of the disease are still being described. Previous data from our laboratory has identified the developmentally regulated gene Gremlin as a novel target implicated in diabetic nephropathy in vitro and in vivo. We used bioinformatic analysis to examine whether Gremlin gene sequence and structure could be used to identify other genes implicated in diabetic nephropathy. The Notch ligand Jagged1 and its downstream effector, hairy enhancer of split-1 (Hes1), were identified as genes with significant similarity to Gremlin in terms of promoter structure and predicted microRNA binding elements. This led us to discover that transforming growth factor-beta (TGFβ1), a primary driver of cellular changes in the kidney during nephropathy, increased Gremlin, Jagged1 and Hes1 expression in human kidney epithelial cells. Elevated levels of Gremlin, Jagged1 and Hes1 were also detected in extracts from renal biopsies from diabetic nephropathy patients, but not in control living donors. In situ hybridization identified specific upregulation and co-expression of Gremlin, Jagged1 and Hes1 in the same tubuli of kidneys from diabetic nephropathy patients, but not controls. Finally, Notch pathway gene clustering showed that samples from diabetic nephropathy patients grouped together, distinct from both control living donors and patients with minimal change disease. Together, these data suggest that Notch pathway gene expression is elevated in diabetic nephropathy, co-incident with Gremlin, and may contribute to the pathogenesis of this disease.
DOI: 10.1002/ijc.22413
2007
Cited 103 times
CENP‐F expression is associated with poor prognosis and chromosomal instability in patients with primary breast cancer
DNA microarrays have the potential to classify tumors according to their transcriptome. Tissue microarrays (TMAs) facilitate the validation of biomarkers by offering a high-throughput approach to sample analysis. We reanalyzed a high profile breast cancer DNA microarray dataset containing 96 tumor samples using a powerful statistical approach, between group analyses. Among the genes we identified was centromere protein-F (CENP-F), a gene associated with poor prognosis. In a published follow-up breast cancer DNA microarray study, comprising 295 tumour samples, we found that CENP-F upregulation was significantly associated with worse overall survival (p<0.001) and reduced metastasis-free survival (p<0.001). To validate and expand upon these findings, we used 2 independent breast cancer patient cohorts represented on TMAs. CENP-F protein expression was evaluated by immunohistochemistry in 91 primary breast cancer samples from cohort I and 289 samples from cohort II. CENP-F correlated with markers of aggressive tumor behavior including ER negativity and high tumor grade. In cohort I, CENP-F was significantly associated with markers of CIN including cyclin E, increased telomerase activity, c-Myc amplification and aneuploidy. In cohort II, CENP-F correlated with VEGFR2, phosphorylated Ets-2 and Ki67, and in multivariate analysis, was an independent predictor of worse breast cancer-specific survival (p=0.036) and overall survival (p=0.040). In conclusion, we identified CENP-F as a biomarker associated with poor outcome in breast cancer and showed several novel associations of biological significance.
DOI: 10.1093/nar/gkn174
2008
Cited 103 times
R-Coffee: a method for multiple alignment of non-coding RNA
R-Coffee is a multiple RNA alignment package, derived from T-Coffee, designed to align RNA sequences while exploiting secondary structure information. R-Coffee uses an alignment-scoring scheme that incorporates secondary structure information within the alignment. It works particularly well as an alignment improver and can be combined with any existing sequence alignment method. In this work, we used R-Coffee to compute multiple sequence alignments combining the pairwise output of sequence aligners and structural aligners. We show that R-Coffee can improve the accuracy of all the sequence aligners. We also show that the consistency-based component of T-Coffee can improve the accuracy of several structural aligners. R-Coffee was tested on 388 BRAliBase reference datasets and on 11 longer Cmfinder datasets. Altogether our results suggest that the best protocol for aligning short sequences (less than 200 nt) is the combination of R-Coffee with the RNA pairwise structural aligner Consan. We also show that the simultaneous combination of the four best sequence alignment programs with R-Coffee produces alignments almost as accurate as those obtained with R-Coffee/Consan. Finally, we show that R-Coffee can also be used to align longer datasets beyond the usual scope of structural aligners. R-Coffee is freely available for download, along with documentation, from the T-Coffee web site (www.tcoffee.org).
DOI: 10.1186/1748-7188-5-21
2010
Cited 101 times
Sequence embedding for fast construction of guide trees for multiple sequence alignment
The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N² for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments. In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from http://www.clustal.org/mbed.tgz.
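The embedding idea described above can be sketched roughly as follows: each sequence is represented by its distances to a small set of seed sequences, so only N × k sequence comparisons are needed instead of all N(N−1)/2 pairs. This is an illustrative sketch under stated assumptions, not the published implementation: the k-mer Jaccard distance, the way seeds are supplied, and the names `kmer_distance`, `embed` and `euclidean` are all hypothetical choices made for this example.

```python
def kmer_set(seq, k=2):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=2):
    """Stand-in sequence distance: 1 minus the Jaccard similarity of k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def embed(sequences, seeds, k=2):
    """Map each sequence to a vector of its distances to the seed sequences.

    Costs len(sequences) * len(seeds) sequence comparisons.
    """
    return [[kmer_distance(s, seed, k) for seed in seeds] for s in sequences]


def euclidean(u, v):
    """Cheap vector distance used in place of a direct sequence comparison."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5
```

Once the sequences are embedded, a clustering method can work on the short vectors (for example with `euclidean`) rather than on the sequences themselves, which is where the savings in time and memory would come from.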
DOI: 10.1016/j.sbi.2005.04.002
2005
Cited 100 times
Multiple sequence alignments
Multiple sequence alignments are very widely used in all areas of DNA and protein sequence analysis. The main methods that are still in use are based on 'progressive alignment' and date from the mid to late 1980s. Recently, some dramatic improvements have been made to the methodology with respect either to speed and capacity to deal with large numbers of sequences or to accuracy. There have also been some practical advances concerning how to combine three-dimensional structural information with primary sequences to give more accurate alignments, when structures are available.
DOI: 10.1074/mcp.m700245-mcp200
2008
Cited 95 times
Comparative Proteomics Profiling of a Phospholamban Mutant Mouse Model of Dilated Cardiomyopathy Reveals Progressive Intracellular Stress Responses
Defective mobilization of Ca²⁺ by cardiomyocytes can lead to cardiac insufficiency, but the causative mechanisms leading to congestive heart failure (HF) remain unclear. In the present study we performed exhaustive global proteomics surveys of cardiac ventricle isolated from a mouse model of cardiomyopathy overexpressing a phospholamban mutant, R9C (PLN-R9C), and exhibiting impaired Ca²⁺ handling and death at 24 weeks and compared them with normal control littermates. The relative expression patterns of 6190 high confidence proteins were monitored by shotgun tandem mass spectrometry at 8, 16, and 24 weeks of disease progression. Significant differential abundance of 593 proteins was detected. These proteins mapped to select biological pathways such as endoplasmic reticulum stress response, cytoskeletal remodeling, and apoptosis and included known biomarkers of HF (e.g. brain natriuretic peptide/atrial natriuretic factor and angiotensin-converting enzyme) and other indicators of presymptomatic functional impairment. These altered proteomic profiles were concordant with cognate mRNA patterns recorded in parallel using high density mRNA microarrays, and top candidates were validated by RT-PCR and Western blotting. Mapping of our highest ranked proteins against a human diseased explant and to available data sets indicated that many of these proteins could serve as markers of disease. Indeed we showed that several of these proteins are detectable in mouse and human plasma and display differential abundance in the plasma of diseased mice and affected patients. These results offer a systems-wide perspective of the dynamic maladaptions associated with impaired Ca²⁺ homeostasis that perturb myocyte function and ultimately converge to cause HF.
DOI: 10.1128/ec.00350-08
2009
Cited 84 times
Correlation between Biofilm Formation and the Hypoxic Response in Candida parapsilosis
The ability of Candida parapsilosis to form biofilms on indwelling medical devices is correlated with virulence. To identify genes that are important for biofilm formation, we used arrays representing approximately 4,000 open reading frames (ORFs) to compare the transcriptional profile of biofilm cells growing in a microfermentor under continuous flow conditions with that of cells in planktonic culture. The expression of genes involved in fatty acid and ergosterol metabolism and in glycolysis is upregulated in biofilms. The transcriptional profile of C. parapsilosis biofilm cells resembles that of Candida albicans cells grown under hypoxic conditions. We therefore subsequently used whole-genome arrays (representing 5,900 ORFs) to determine the hypoxic response of C. parapsilosis and showed that the levels of expression of genes involved in the ergosterol and glycolytic pathways, together with several cell wall genes, are increased. Our results indicate that there is substantial overlap between the hypoxic responses of C. parapsilosis and C. albicans and that this may be important for biofilm development. Knocking out an ortholog of the cell wall gene RBT1, whose expression is induced both in biofilms and under conditions of hypoxia in C. parapsilosis, reduces biofilm development.
DOI: 10.1158/1535-7163.mct-13-0560-t
2014
Cited 76 times
GSK3 Inhibitors Regulate MYCN mRNA Levels and Reduce Neuroblastoma Cell Viability through Multiple Mechanisms, Including p53 and Wnt Signaling
Neuroblastoma is an embryonal tumor accounting for approximately 15% of childhood cancer deaths. There exists a clinical need to identify novel therapeutic targets, particularly for treatment-resistant forms of neuroblastoma. Therefore, we investigated the role of the neuronal master regulator GSK3 in controlling neuroblastoma cell fate. We identified novel GSK3-mediated regulation of MYC (c-MYC and MYCN) mRNA levels, which may have implications for numerous MYC-driven cancers. In addition, we showed that certain GSK3 inhibitors induced large-scale cell death in neuroblastoma cells, primarily through activating apoptosis. mRNA-seq of GSK3 inhibitor-treated cells was performed and subsequent pathway analysis revealed that multiple signaling pathways contributed to the loss of neuroblastoma cell viability. The contribution of two of the signaling pathways highlighted by the mRNA-seq analysis was functionally validated. Inhibition of the p53 tumor suppressor partly rescued the cell death phenotype, whereas activation of canonical Wnt signaling contributed to the loss of viability, in a p53-independent manner. Two GSK3 inhibitors (BIO-acetoxime and LiCl) and one small-molecule Wnt agonist (Wnt Agonist 1) demonstrated therapeutic potential for neuroblastoma treatment. These inhibitors reduced the viability of numerous neuroblastoma cell lines, even those derived from high-risk MYCN-amplified metastatic tumors, for which effective therapeutics are currently lacking. Furthermore, although LiCl was lethal to neuroblastoma cells, it did not reduce the viability of differentiated neurons. Taken together our data suggest that these small molecules may hold potential as effective therapeutic agents for the treatment of neuroblastoma and other MYC-driven cancers.
DOI: 10.1158/1078-0432.ccr-12-1420
2012
Cited 75 times
miR-187 Is an Independent Prognostic Factor in Breast Cancer and Confers Increased Invasive Potential In Vitro
Abstract Purpose: Here, we describe an integrated bioinformatics, functional analysis, and translational pathology approach to identify novel miRNAs involved in breast cancer progression. Experimental Design: Coinertia analysis (CIA) was used to combine a database of predicted miRNA target sites and gene expression data. Using two independent breast cancer cohorts, CIA was combined with correspondence analysis and between group analysis to produce a ranked list of miRNAs associated with disease progression. Ectopic expression studies were carried out in MCF7 cells and miRNA expression evaluated in two additional cohorts of patients with breast cancer by in situ hybridization on tissue microarrays. Results: CIA identified miR-187 as a key miRNA associated with poor outcome in breast cancer. Ectopic expression of miR-187 in breast cancer cells resulted in a more aggressive phenotype. In a test cohort (n = 117), high expression of miR-187 was associated with a trend toward reduced breast cancer–specific survival (BCSS; P = 0.058), and a significant association with reduced BCSS in lymph node–positive patients (P = 0.036). In a validation cohort (n = 470), high miR-187 was significantly associated with reduced BCSS in the entire cohort (P = 0.021) and in lymph node–positive patients (P = 0.012). Multivariate Cox regression analysis revealed that miR-187 is an independent prognostic factor in both cohorts [cohort 1: HR, 7.37; 95% confidence interval (CI), 2.05–26.51; P = 0.002; cohort 2: HR, 2.80; 95% CI, 1.52–5.16; P = 0.001] and in lymph node–positive patients in both cohorts (cohort 1: HR, 13.74; 95% CI, 2.62–72.03; P = 0.002; cohort 2: HR, 2.77; 95% CI, 1.32–5.81; P = 0.007). Conclusions: miR-187 expression in breast cancer leads to a more aggressive, invasive phenotype and acts as an independent predictor of outcome. Clin Cancer Res; 18(24); 6702–13. ©2012 AACR.
DOI: 10.1186/1471-2164-12-628
2011
Cited 70 times
Using RNA-seq to determine the transcriptional landscape and the hypoxic response of the pathogenic yeast Candida parapsilosis
Candida parapsilosis is one of the most common causes of Candida infection worldwide. However, the genome sequence annotation was made without experimental validation and little is known about the transcriptional landscape. The transcriptional response of C. parapsilosis to hypoxic (low oxygen) conditions, such as those encountered in the host, is also relatively unexplored. We used next generation sequencing (RNA-seq) to determine the transcriptional profile of C. parapsilosis growing in several conditions including different media, temperatures and oxygen concentrations. We identified 395 novel protein-coding sequences that had not previously been annotated. We removed > 300 unsupported gene models, and corrected approximately 900. We mapped the 5' and 3' UTR for thousands of genes. We also identified 422 introns, including two introns in the 3' UTR of one gene. This is the first report of 3' UTR introns in the Saccharomycotina. Comparing the introns in coding sequences with other species shows that small numbers have been gained and lost throughout evolution. Our analysis also identified a number of novel transcriptional active regions (nTARs). We used both RNA-seq and microarray analysis to determine the transcriptional profile of cells grown in normoxic and hypoxic conditions in rich media, and we showed that there was a high correlation between the approaches. We also generated a knockout of the UPC2 transcriptional regulator, and we found that similar to C. albicans, Upc2 is required for conferring resistance to azole drugs, and for regulation of expression of the ergosterol pathway in hypoxia. We provide the first detailed annotation of the C. parapsilosis genome, based on gene predictions and transcriptional analysis. We identified a number of novel ORFs and other transcribed regions, and detected transcripts from approximately 90% of the annotated protein coding genes. We found that the transcription factor Upc2 has a conserved role as a major regulator of the hypoxic response in C. parapsilosis and C. albicans.
DOI: 10.1093/oxfordjournals.molbev.a040118
1994
Cited 94 times
Molecular evidence for the inclusion of cetaceans within the order Artiodactyla.
The transition in the cetaceans from terrestrial life to a fully aquatic existence is one of the most enduring evolutionary mysteries. Resolving the phylogenetic relationships between Cetacea and the other orders of eutherian mammals may provide us with important clues to the origin of whales and may help us date the evolutionary transition to aquatic life. Previous paleontological and molecular evidence has indicated that cetaceans and artiodactyls constitute a natural clade within subclass Eutheria. Our present phylogenetic analyses of protein and mitochondrial DNA sequence data indicate that cetaceans are not only intimately related to the artiodactyls; they are in fact deeply nested within the artiodactyl phylogenetic tree; i.e., they are more closely related to the members of one suborder of artiodactyls, the Ruminantia, than either ruminants or cetaceans are to members of the other two artiodactyl suborders: Suiformes and Tylopoda. On the basis of the rate of evolution of mitochondrial DNA sequences and using paleontological reference dates for calibration, we estimate that the whale lineage has branched off a protoruminant lineage < 50 Mya. By implication, the cetacean transition to aquatic life is inferred to be a relatively recent evolutionary event.
DOI: 10.1186/1471-2164-7-3
2006
Cited 81 times
Structural and functional properties of genes involved in human cancer
One of the main goals of cancer genetics is to identify the causative elements at the molecular level leading to cancer. We have conducted an analysis of a set of genes known to be involved in cancer in order to unveil their unique features that can assist towards the identification of new candidate cancer genes. We have detected key patterns in this group of genes in terms of the molecular function or the biological process in which they are involved as well as sequence properties. Based on these features we have developed an accurate Bayesian classification model with which human genes have been scored for their likelihood of involvement in cancer.
DOI: 10.1158/1078-0432.ccr-07-1760
2008
Cited 79 times
Altered Cytoplasmic-to-Nuclear Ratio of Survivin Is a Prognostic Indicator in Breast Cancer
Survivin (BIRC5) is a promising tumor biomarker. Conflicting data exist on its prognostic effect in breast cancer. These data may at least be partly due to the manual interpretation of immunohistochemical staining, especially as survivin can be located in both the nucleus and cytoplasm. Quantitative determination of survivin expression using image analysis offers the opportunity to develop alternative scoring models for survivin immunohistochemistry. Here, we present such a model. A breast cancer tissue microarray containing 102 tumors was stained with an anti-survivin antibody. Whole-slide scanning was used to capture high-resolution images. These images were analyzed using automated algorithms to quantify the staining. Increased nuclear, but not cytoplasmic, survivin was associated with a reduced overall survival (OS; P = 0.038) and disease-specific survival (P = 0.0015). A high cytoplasmic-to-nuclear ratio (CNR) of survivin was associated with improved OS (P = 0.005) and disease-specific survival (P = 0.05). Multivariate analysis revealed that the survivin CNR was an independent predictor of OS (hazard ratio, 0.09; 95% confidence interval, 0.01-0.76; P = 0.027). A survivin CNR of >5 correlated positively with estrogen receptor (P = 0.019) and progesterone receptor (P = 0.033) levels, whereas it was negatively associated with Ki-67 expression (P = 0.04), p53 status (P = 0.005), and c-myc amplification (P = 0.016). Different prognostic information is supplied by nuclear and cytoplasmic survivin in breast cancer. Nuclear survivin is a poor prognostic marker in breast cancer. Moreover, CNR of survivin, as determined by image analysis, is an independent prognostic factor.
DOI: 10.1164/rccm.200712-1890oc
2008
Cited 64 times
Hypoxia Selectively Activates the CREB Family of Transcription Factors in the In Vivo Lung
Pulmonary hypertension is a common complication of chronic hypoxic lung diseases and is associated with increased morbidity and reduced survival. The pulmonary vascular changes in response to hypoxia, both structural and functional, are unique to this circulation. The aim was to identify transcription factor pathways uniquely activated in the lung in response to hypoxia. After exposure to environmental hypoxia (10% O2) for varying periods (3 h to 2 wk), lungs and systemic organs were isolated from groups of adult male mice. Bioinformatic examination of genes whose expression changed in the hypoxic lung (assessed using microarray analysis) identified potential lung-selective transcription factors controlling these changes in gene expression. In separate further experiments, lung-selective activation of these candidate transcription factors was tested in hypoxic mice and by comparing hypoxic responses of primary human pulmonary and cardiac microvascular endothelial cells in vitro. Bioinformatic analysis identified cAMP response element binding (CREB) family members as candidate lung-selective hypoxia-responsive transcription factors. Further in vivo experiments demonstrated activation of CREB and activating transcription factor (ATF)1 and up-regulation of CREB family-responsive genes in the hypoxic lung, but not in other organs. Hypoxia-dependent CREB activation and CREB-responsive gene expression was observed in human primary lung, but not cardiac, microvascular endothelial cells. These findings suggest that activation of CREB and ATF1 plays a key role in the lung-specific responses to hypoxia, and that lung microvascular endothelial cells are important, proximal effector cells in the specific responses of the pulmonary circulation to hypoxia.
DOI: 10.1093/bioinformatics/btt093
2013
Cited 52 times
Making automated multiple alignments of very large numbers of protein sequences
Recent developments in sequence alignment software have made possible multiple sequence alignments (MSAs) of >100 000 sequences in reasonable times. At present, there are no systematic analyses concerning the scalability of the alignment quality as the number of aligned sequences is increased. We benchmarked a wide range of widely used MSA packages using a selection of protein families with some known structures and found that the accuracy of such alignments decreases markedly as the number of sequences grows. This is more or less true of all packages and protein families. The phenomenon is mostly due to the accumulation of alignment errors, rather than problems in guide-tree construction. This is partly alleviated by using iterative refinement or selectively adding sequences. The average accuracy of progressive methods by comparison with structure-based benchmarks can be improved by incorporating information derived from high-quality structural alignments of sequences with solved structures. This suggests that the availability of high quality curated alignments will have to complement algorithmic and/or software developments in the long-term. Benchmark data used in this study are available at http://www.clustal.org/omega/homfam-20110613-25.tar.gz and http://www.clustal.org/omega/bali3fam-26.tar.gz. Supplementary data are available at Bioinformatics online.
DOI: 10.1016/j.virol.2012.01.030
2012
Cited 48 times
Identification of novel inter-genotypic recombinants of human hepatitis B viruses by large-scale phylogenetic analysis
Recombination plays an important role in the evolutionary history of Hepatitis B virus (HBV). We performed a phylogenetic analysis of 3403 full-length HBV genome sequences isolated from humans to define the genotype. The genome sequences were divided into 13 sub-datasets, each approximately 250 bp in length. Genotype designations obtained from the sub-datasets that differed from the genotype defined by the whole genome were assigned as putative recombinants. Our results showed that 3379 out of 3403 sequences belonged to the previously described and putative genotypes A to J respectively, with 315 sequences defined in this analysis. The remaining 24 viruses had sequence divergence of less than 8% with both genotypes B and C and were provisionally assigned genotype "BC". 1047 out of 3403 sequences were identified to be putative recombinants, of which 72 were identified to be novel recombinants. Notably, all viruses of the herein described genotype "BC" were identified to be B/C recombinants.
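The window-based screen described in this abstract can be illustrated with a minimal sketch. It assumes that a genotype has already been assigned to the whole genome and to each sub-dataset (window) by some phylogenetic method; the function names and the input layout are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def discordant_windows(whole_genotype: str, window_genotypes: Sequence[str]) -> List[int]:
    """Return indices of windows whose genotype differs from the whole-genome call."""
    return [i for i, g in enumerate(window_genotypes) if g != whole_genotype]


def is_putative_recombinant(whole_genotype: str, window_genotypes: Sequence[str]) -> bool:
    """A sequence is flagged when at least one window disagrees with the whole genome."""
    return len(discordant_windows(whole_genotype, window_genotypes)) > 0


# Hypothetical example: 13 windows, two of which are assigned genotype C
# although the whole genome is assigned genotype B.
windows = ["B"] * 5 + ["C", "C"] + ["B"] * 6
print(is_putative_recombinant("B", windows), discordant_windows("B", windows))
```

The positions of the discordant windows give a rough indication of where a breakpoint might lie, although window-level calls alone would not locate it precisely.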
DOI: 10.1016/j.celrep.2019.02.038
2019
Cited 37 times
An Integrated Global Analysis of Compartmentalized HRAS Signaling
Modern omics technologies allow us to obtain global information on different types of biological networks. However, integrating these different types of analyses into a coherent framework for a comprehensive biological interpretation remains challenging. Here, we present a conceptual framework that integrates protein interaction, phosphoproteomics, and transcriptomics data. Applying this method to analyze HRAS signaling from different subcellular compartments shows that spatially defined networks contribute specific functions to HRAS signaling. Changes in HRAS protein interactions at different sites lead to different kinase activation patterns that differentially regulate gene transcription. HRAS-mediated signaling is the strongest from the cell membrane, but it regulates the largest number of genes from the endoplasmic reticulum. The integrated networks provide a topologically and functionally resolved view of HRAS signaling. They reveal distinct HRAS functions including the control of cell migration from the endoplasmic reticulum and TP53-dependent cell survival when signaling from the Golgi apparatus.
DOI: 10.1093/nar/21.13.2967
1993
Cited 68 times
The EMBL data library
Catherine M. Rice, Rainer Fuchs, Desmond G. Higgins, Peter J. Stoehr and Graham N. Cameron. European Molecular Biology Laboratory, Meyerhofstrasse 1, W-6900 Heidelberg, Germany. Postal address: Data Submissions, EMBL Data Library, Postfach 10.2209, W-6900 Heidelberg, Germany. Telephone: +49-6221-387258. Telefax: +49-6221-387519 or 387306. Nucleic Acids Research, Volume 21, Issue 13, 1 July 1993, Pages 2967–2971, https://doi.org/10.1093/nar/21.13.2967 Published: 01 July 1993
DOI: 10.1002/pmic.200600898
2007
Cited 65 times
A multivariate analysis approach to the integration of proteomic and gene expression data
In order to understand even the simplest cellular processes, we need to integrate proteomic, gene expression and other biomolecular data. To date, most computational approaches aimed at integrating proteomics and gene expression data use direct gene/protein correlation measures. However, due to post-transcriptional and translational regulations, the correspondence between the expression of a gene and its protein is complicated. We apply a multivariate statistical method, co-inertia analysis (CIA), to visualise gene and proteomic expression data stemming from the same biological samples. Principal components analysis or correspondence analysis can be used for data exploration on single datasets. CIA is then used to explore the relationships between two or more datasets. We further explore the data by projecting gene ontology (GO) information onto these plots to describe the cellular processes in action. We apply these techniques to gene expression and protein abundance data from studies of the human malarial parasite life cycle and the NCI-60 cancer cell lines. In each case, we visualise gene expression, protein abundance and GO classes in the same low dimensional projections and identify GO classes that are likely to be of biological importance.
DOI: 10.1093/bioinformatics/bti159
2004
Cited 65 times
Evaluation of iterative alignment algorithms for multiple alignment
Motivation: Iteration has been used a number of times as an optimization method to produce multiple alignments, either alone or in combination with other methods. Iteration has a great advantage in that it is often very simple both in terms of coding the algorithms and the complexity of the time and memory requirements. In this paper, we systematically test several different iteration strategies by comparing the results on sets of alignment test cases. Results: We tested three schemes where iteration is used to improve an existing alignment. This was found to be remarkably effective and could induce a significant improvement in the accuracy of alignments from most packages. For example the average accuracy of ClustalW was improved by over 6% on the hardest test cases. Iteration was found to be even more powerful when it was directly incorporated into a progressive alignment scheme. Here, iteration was used to improve subalignments at each step of progressive alignment. The beneficial effects of iteration come, in part, from the ability to get round the usual local minimum problem with progressive alignment. This ability can also be used to help reduce the complexity of T-Coffee, without losing accuracy. Alignments can be generated, using T-Coffee, to align subgroups of sequences, which can then be iteratively improved and merged. Availability: All of the scripts are freely available on the web at http://www.bioinf.ucd.ie/people/iain/iteration.html Contact: iain.wallace@ucd.ie
DOI: 10.1128/jb.171.2.1166-1172.1989
1989
Cited 59 times
Replication and segregational stability of Bacillus plasmid pBAA1
A cryptic plasmid, pBAA1, was identified in an industrial Bacillus strain. The plasmid is 6.8 kilobases in size and is present in cells at a copy number of approximately 5 per chromosome equivalent. The plasmid has been maintained under industrial fermentation conditions without apparent selective pressure and so is assumed to be partition proficient. The minimal replicon was localized to a 1.4-kilobase fragment which also contains the functions required for copy number control. The very low level of segregational instability of the minimal replicon suggests that it also contains functions involved in plasmid maintenance. Comparison with other plasmids indicates that pBAA1 belongs to the group of small gram-positive plasmids which replicate by a rolling circle-type mechanism. A sequence was identified which is required for the efficient conversion of the single plus strand to the double-stranded form during plasmid replication. Deletion of this sequence resulted in a low level of segregational plasmid instability.
DOI: 10.1093/nar/gkn278
2008
Cited 54 times
R-Coffee: a web server for accurately aligning noncoding RNA sequences
The R-Coffee web server produces highly accurate multiple alignments of noncoding RNA (ncRNA) sequences, taking into account predicted secondary structures. R-Coffee uses a novel algorithm recently incorporated in the T-Coffee package. R-Coffee works along the same lines as T-Coffee: it uses pairwise or multiple sequence alignment (MSA) methods to compute a primary library of input alignments. The program then computes an MSA highly consistent with both the alignments contained in the library and the secondary structures associated with the sequences. The secondary structures are predicted using RNAplfold. The server provides two modes. The slow/accurate mode is restricted to small datasets (fewer than 5 sequences of less than 150 nucleotides) and combines R-Coffee with Consan, a very accurate pairwise RNA alignment method. For larger datasets a fast method can be used (RM-Coffee mode), which uses R-Coffee to combine the outputs of the three packages found to perform best on RNA (MUSCLE, MAFFT and ProbConsRNA). Our BRAliBase benchmarks indicate that the R-Coffee/Consan combination is one of the best ncRNA alignment methods for short sequences, while RM-Coffee gives comparable results on longer sequences. The R-Coffee web server is available at http://www.tcoffee.org.
DOI: 10.1186/s12859-015-0702-1
2015
Cited 39 times
OD-seq: outlier detection in multiple sequence alignments
Multiple sequence alignments (MSA) are widely used in sequence analysis for a variety of tasks. Outlier sequences can make downstream analyses unreliable or make the alignments less accurate while they are being constructed. This paper describes a simple method for automatically detecting outliers and accompanying software called OD-seq. It is based on finding sequences whose average distance to the rest of the sequences in a dataset is anomalous. The software can take an MSA, distance matrix or set of unaligned sequences as input. Outlier sequences are found by examining the average distance of each sequence to the rest. Anomalous average distances are then found using the interquartile range of the distribution of average distances or by bootstrapping them. The complexity of any analysis of a distance matrix is normally at least O(N^2) for N sequences. This is prohibitive for large N but is reduced here by using the mBed algorithm from Clustal Omega. This reduces the complexity to O(N log(N)), which makes even very large alignments easy to analyse on a single core. We tested the ability of OD-seq to detect outliers using artificial test cases of sequences from Pfam families, seeded with sequences from other Pfam families. Using an MSA as input, OD-seq is able to detect outliers with very high sensitivity and specificity. OD-seq is a practical and simple method to detect outliers in MSAs. It can also detect outliers in sets of unaligned sequences, but with reduced accuracy. For medium sized alignments, of a few thousand sequences, it can detect outliers in a few seconds. Software is available at http://www.bioinf.ucd.ie/download/od-seq.tar.gz.
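The interquartile-range idea in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the OD-seq implementation: it works on a full distance matrix (so it is O(N^2) and does not use mBed), it flags only unusually large average distances, and the multiplier `k = 1.5` is the conventional Tukey fence rather than a value taken from the paper. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def mean_distances(dist: Sequence[Sequence[float]]) -> List[float]:
    """Average distance from each sequence to all the others (diagonal excluded)."""
    n = len(dist)
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(dist)
    ]


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[int]:
    """Indices of values lying above Q3 + k * IQR (assumed one-sided fence)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    upper_fence = q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v > upper_fence]


# Hypothetical average distances for nine sequences; the last one is far
# from everything else and is the only one flagged.
averages = [0.30, 0.32, 0.31, 0.29, 0.33, 0.30, 0.31, 0.32, 0.90]
print(iqr_outliers(averages))
```

In practice the two functions would be chained, `iqr_outliers(mean_distances(matrix))`, with the matrix coming from whatever distance measure is available for the alignment.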
DOI: 10.1073/pnas.1405628111
2014
Cited 37 times
Simple chained guide trees give high-quality protein multiple sequence alignments
Guide trees are used to decide the order of sequence alignment in the progressive multiple sequence alignment heuristic. These guide trees are often the limiting factor in making large alignments, and considerable effort has been expended over the years in making these quickly or accurately. In this article we show that, at least for protein families with large numbers of sequences that can be benchmarked with known structures, simple chained guide trees give the most accurate alignments. These also happen to be the fastest and simplest guide trees to construct, computationally. Such guide trees have a striking effect on the accuracy of alignments produced by some of the most widely used alignment packages. There is a marked increase in accuracy and a marked decrease in computational time, once the number of sequences goes much above a few hundred. This is true, even if the order of sequences in the guide tree is random.
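To make the notion of a chained guide tree concrete, the sketch below builds one as a Newick string: each sequence is joined, in list order, to the growing alignment of all previous sequences. The function name is hypothetical, and the output format is an assumption chosen because many alignment programs read guide trees in Newick form; branch lengths are omitted.

```python
from typing import Sequence


def chained_guide_tree(names: Sequence[str]) -> str:
    """Return a fully chained (caterpillar) guide tree in Newick format."""
    if len(names) < 2:
        raise ValueError("a guide tree needs at least two sequences")
    # Start with the first pair, then attach one sequence per step.
    tree = f"({names[0]},{names[1]})"
    for name in names[2:]:
        tree = f"({tree},{name})"
    return tree + ";"


print(chained_guide_tree(["seqA", "seqB", "seqC", "seqD"]))
# → (((seqA,seqB),seqC),seqD);
```

Shuffling the list before calling the function gives the random-order chained tree mentioned at the end of the abstract.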
DOI: 10.18632/oncotarget.6568
2015
Cited 34 times
Integrative omics reveals MYCN as a global suppressor of cellular signalling and enables network-based therapeutic target discovery in neuroblastoma
David J. Duffy (1,7,*), Aleksandar Krstic (1,*), Melinda Halasz (1,*), Thomas Schwarzl (1,8,*), Dirk Fey (1), Kristiina Iljin (6), Jai Prakash Mehta (1), Kate Killick (1), Jenny Whilde (1), Benedetta Turriziani (1), Saija Haapa-Paananen (6), Vidal Fey (6), Matthias Fischer (5), Frank Westermann (4), Kai-Oliver Henrich (4), Steffen Bannert (4), Desmond G. Higgins (1,2,3) and Walter Kolch (1,2,3). Affiliations: (1) Systems Biology Ireland, University College Dublin, Belfield, Dublin, Ireland; (2) Conway Institute of Biomolecular & Biomedical Research, University College Dublin, Belfield, Dublin, Ireland; (3) School of Medicine and Medical Science, University College Dublin, Belfield, Dublin, Ireland; (4) Division of NB Genomics, German Cancer Research Center (DKFZ), Heidelberg, Germany; (5) Department of Paediatric Haematology and Oncology and Center for Molecular Medicine Cologne (CMMC), University Hospital Cologne, Cologne, Germany; (6) VTT Technical Research Centre of Finland, Tietotie 2, Espoo, Finland; (7) The Whitney Laboratory for Marine Bioscience, University of Florida, St. Augustine, Florida, USA; (8) European Molecular Biology Laboratory (EMBL), Meyerhofstraße, Heidelberg, Germany. (*) These authors have contributed equally to this work. Correspondence to: David J. Duffy. Keywords: MYC (c-MYC), neuroblastoma, transcriptional regulation, mRNA sequencing (mRNA-seq), 4sU-seq. Received: November 15, 2015. Accepted: November 23, 2015. Published: December 11, 2015.
Despite intensive study, many mysteries remain about the MYCN oncogene’s functions. Here we focus on MYCN’s role in neuroblastoma, the most common extracranial childhood cancer. MYCN gene amplification occurs in 20% of cases, but other recurrent somatic mutations are rare. This scarcity of tractable targets has hampered efforts to develop new therapeutic options. We employed a multi-level omics approach to examine MYCN functioning and identify novel therapeutic targets for this largely un-druggable oncogene. We used systems medicine based computational network reconstruction and analysis to integrate a range of omic techniques: sequencing-based transcriptomics, genome-wide chromatin immunoprecipitation, siRNA screening and interaction proteomics, revealing that MYCN controls highly connected networks, with MYCN primarily suppressing the activity of network components. MYCN’s oncogenic functions are likely independent of its classical heterodimerisation partner, MAX. In particular, MYCN controls its own protein interaction network by transcriptionally regulating its binding partners. Our network-based approach identified vulnerable therapeutically targetable nodes that function as critical regulators or effectors of MYCN in neuroblastoma. These were validated by siRNA knockdown screens, functional studies and patient data. We identified β-estradiol and MAPK/ERK as having functional cross-talk with MYCN and being novel targetable vulnerabilities of MYCN-amplified neuroblastoma. These results reveal surprising differences between the functioning of endogenous, overexpressed and amplified MYCN, and rationalise how different MYCN dosages can orchestrate cell fate decisions and cancerous outcomes. Importantly, this work describes a systems-level approach to systematically uncovering network based vulnerabilities and therapeutic targets for multifactorial diseases by integrating disparate omic data types.
DOI: 10.1093/bioinformatics/btg1029
2003
Cited 65 times
APDB: a novel measure for benchmarking sequence alignment methods without reference alignments
We describe APDB, a novel measure for evaluating the quality of a protein sequence alignment, given two or more PDB structures. This evaluation does not require a reference alignment or a structure superposition. APDB is designed to efficiently and objectively benchmark multiple sequence alignment methods. Using existing collections of reference multiple sequence alignments and existing alignment methods, we show that APDB gives results that are consistent with those obtained using conventional evaluations. We also show that APDB is suitable for evaluating sequence alignments that are structurally equivalent. We conclude that APDB provides an alternative to more conventional methods used for benchmarking sequence alignment packages.
DOI: 10.1128/aem.71.12.8692-8705.2005
2005
Cited 62 times
Prophage-Like Elements in Bifidobacteria: Insights from Genomics, Transcription, Integration, Distribution, and Phylogenetic Analysis
So far, there is only fragmentary and unconfirmed information on bacteriophages infecting the genus Bifidobacterium. In this report we analyzed three prophage-like elements that are present in the genomes of Bifidobacterium breve UCC 2003, Bifidobacterium longum NCC 2705, and Bifidobacterium longum DJO10A, designated Bbr-1, Bl-1, and Blj-1, respectively. These prophage-like elements exhibit homology with genes of double-stranded DNA bacteriophages spanning a broad phylogenetic range of host bacteria and are surprisingly closely related to bacteriophages infecting low-G+C bacteria. All three prophage-like elements are integrated in a tRNA(Met) gene, which appears to be reconstructed following phage integration. Analysis of the distribution of this integration site in many bifidobacterial species revealed that the attB sites are well conserved. The Blj-1 prophage is 36.9 kb long and was induced when a B. longum DJO10A culture was exposed to mitomycin C or hydrogen peroxide. The Bbr-1 prophage-like element appears to consist of a noninducible 28.5-kb chimeric DNA fragment composed of a composite mobile element inserted into prophage-like sequences, which do not appear to be widely distributed among B. breve strains. Northern blot analysis of the Bbr-1 prophage-like element showed that large parts of its genome are transcriptionally silent. Interestingly, a gene predicted to encode an extracellular beta-glucosidase carried within the Bbr-1 prophage-like element was shown to be transcribed.
DOI: 10.1093/oxfordjournals.molbev.a040038
1993
Cited 60 times
Evolutionary relatedness of some primate models of Plasmodium.
Primate (and, specifically, monkey) malaria infections are commonly used for understanding the pathology of and immune response to the human disease because they are thought to resemble most closely the host-parasite relationship found in humans. Plasmodium cynomolgi is used extensively as a model for the human parasite, P. vivax, and P. knowlesi is used primarily as a model for the development of erythrocytic-stage vaccines. Both of these simian parasites can naturally infect man, resulting in mildly symptomatic episodes of the disease. The phylogenetic relationship between these two simian parasites and previously characterized Plasmodium species, including P. vivax, was examined by comparison of the asexually expressed small-subunit ribosomal RNA genes. Our analysis confirmed that P. vivax is most closely related to P. cynomolgi and that it remains an appropriate model of the human pathogen. Furthermore, with P. knowlesi and P. fragile, these two species form a group of closely related species, distant from other Plasmodium species. What is considered to be the most ancient of the human malaria pathogens, P. malariae, was also included in the analysis and does not group at all with other simian or human parasites.
DOI: 10.1016/s0378-1119(99)00122-5
1999
Cited 59 times
Molecular evolution of immunoglobulin and fibronectin domains in titin and related muscle proteins
The family of regulatory and structural muscle proteins, which includes the giant kinases titin, twitchin and projectin, has sequences composed predominantly of serially linked immunoglobulin I set (Ig) and fibronectin type III (FN3) domains. This paper explores the evolutionary relationships between 16 members of this family. In titin, groups of Ig and FN3 domains are arranged in a regularly repeating pattern of seven and 11 domains. The 11-domain super-repeat has its origins in the seven-domain super-repeat and a model for the duplications which gave rise to this super-repeat is proposed. A super-repeat composed solely of immunoglobulin domains is found in the skeletal muscle isoform of titin. Twitchin and projectin, which are presumed to be orthologs, have undergone significant insertion/deletion of domains since their divergence. The common ancestry of myomesin, skelemin and M-protein is shown. The relationship between myosin binding proteins (MyBPs) C and H is confirmed, and MyBP–H is proposed to have given rise to MyBP–C by the acquisition of some titin domains.
DOI: 10.1517/14712598.5.8.1069
2005
Cited 57 times
Application of DNA microarray technology in determining breast cancer prognosis and therapeutic response
There are > 1.15 million cases of breast cancer diagnosed worldwide annually, and it is the second leading cause of cancer death in the European Union. The optimum management of patients with breast cancer requires accurate prognostic and predictive factors. At present, only a small number of such factors are used clinically. DNA microarrays have the potential to measure the expression of tens of thousands of genes simultaneously. Recent preliminary findings suggest that DNA microarray-based gene expression profiling can provide powerful and independent prognostic information in patients with newly diagnosed breast cancer. As well as providing prognostic information, emerging results suggest that DNA microarrays can also be used for predicting response or resistance to treatment, especially to neoadjuvant chemotherapy. Prior to clinical application, these preliminary findings must be validated using large-scale prospective studies. This article reviews these advances and also examines the role of DNA microarrays in reducing the number of patients who receive inappropriate chemotherapy. The most recent data supporting the integration of various publicly available data sets is also reviewed in detail.
DOI: 10.1007/bf00116547
1991
Cited 56 times
Molecular phylogeny of the subgenus Sophophora of Drosophila derived from large subunit of ribosomal RNA sequences
DOI: 10.1038/340604a0
1989
Cited 50 times
Malarial proteinase?
DOI: 10.1371/journal.pone.0014454
2010
Cited 37 times
A Complete Analysis of HA and NA Genes of Influenza A Viruses
Background: More and more nucleotide sequences of type A influenza virus are available in public databases. Although these sequences have been the focus of many molecular epidemiological and phylogenetic analyses, most studies only deal with a few representative sequences. In this paper, we present a complete analysis of all Haemagglutinin (HA) and Neuraminidase (NA) gene sequences available to allow large scale analyses of the evolution and epidemiology of type A influenza. Methodology/Principal Findings: This paper describes an analysis and complete classification of all HA and NA gene sequences available in public databases using multivariate and phylogenetic methods. Conclusions/Significance: We analyzed 18975 HA sequences and divided them into 280 subgroups according to multivariate and phylogenetic analyses. Similarly, we divided 11362 NA sequences into 202 subgroups. Compared to previous analyses, this work is more detailed and comprehensive, especially for the bigger datasets. Therefore, it can be used to show the full and complex phylogenetic diversity and provides a framework for studying the molecular evolution and epidemiology of type A influenza virus. More than 85% of the type A influenza HA and NA sequences in GenBank are categorized into one unambiguous and unique group. Therefore, our results are a kind of genetic and phylogenetic annotation for influenza HA and NA sequences. In addition, sequences of swine influenza viruses come from 56 HA and 45 NA subgroups. Most of these subgroups also include viruses from other hosts, indicating cross species transmission of the viruses between pigs and other hosts. Furthermore, the phylogenetic diversity of swine influenza viruses from Eurasia is greater than that of North American strains and both of them are becoming more diverse. Apart from viruses from humans, pigs, birds and horses, viruses from other species show very low phylogenetic diversity. This might indicate that viruses have not become established in these species. Based on current evidence, there is no simple pattern of inter-hemisphere transmission of avian influenza viruses and it appears to happen sporadically. However, for H6 subtype avian influenza viruses, such transmissions might have happened very frequently and multiple and bidirectional transmission events might exist.
DOI: 10.1371/journal.pone.0041997
2012
Cited 33 times
Recombination in Hepatitis C Virus: Identification of Four Novel Naturally Occurring Inter-Subtype Recombinants
Recombination in Hepatitis C virus (HCV) is considered to be rare. In this study, we performed a phylogenetic analysis of 1278 full-length HCV genome sequences to identify potential recombination events. Nine inter-genotype recombinants were identified, all of which have been previously reported. This confirms the rarity of inter-genotype HCV recombinants. The analysis also identified five inter-subtype recombinants, four of which are documented for the first time (EU246930, EU246931, EU246932, and EU246937). Specifically, the latter represent four different novel recombination types (6a/6o, 6e/6o, 6e/6h, and 6n/6o), and this was well supported by seven independent methods embedded in RDP. The breakpoints of the four novel HCV recombinants are located within the NS5B coding region and were different from all previously reported breakpoints. While the locations of the breakpoints identified by RDP were not identical, they are very close. Our study suggests that while recombination in HCV is rare, this warrants further investigation.
DOI: 10.18632/oncotarget.11203
2016
Cited 29 times
Wnt signalling is a bi-directional vulnerability of cancer cells
Wnt signalling is involved in the formation, metastasis and relapse of a wide array of cancers. However, there is ongoing debate as to whether activation or inhibition of the pathway holds the most promise as a therapeutic treatment for cancer, with conflicting evidence from a variety of tumour types. We show that Wnt/β-catenin signalling is a bi-directional vulnerability of neuroblastoma, malignant melanoma and colorectal cancer, with hyper-activation or repression of the pathway both representing a promising therapeutic strategy, even within the same cancer type. Hyper-activation directs cancer cells to undergo apoptosis, even in cells oncogenically driven by β-catenin. Wnt inhibition blocks proliferation of cancer cells and promotes neuroblastoma differentiation. Wnt and retinoic acid co-treatments synergise, representing a promising combination treatment for MYCN-amplified neuroblastoma. Additionally, we report novel cross-talks between MYCN and β-catenin signalling, which repress normal β-catenin mediated transcriptional regulation. A β-catenin target gene signature could predict patient outcome, as could the expression level of its DNA binding partners, the TCF/LEFs. This β-catenin signature provides a tool to identify neuroblastoma patients likely to benefit from Wnt-directed therapy. Taken together, we show that Wnt/β-catenin signalling is a bi-directional vulnerability of a number of cancer entities, and potentially a more broadly conserved feature of malignant cells.
DOI: 10.1016/j.bbrc.2016.04.085
2016
Cited 26 times
Prolyl hydroxylase-1 regulates hepatocyte apoptosis in an NF-κB-dependent manner
Hepatocyte death is an important contributing factor in a number of diseases of the liver. PHD1 confers hypoxic sensitivity upon transcription factors including the hypoxia inducible factor (HIF) and nuclear factor-kappaB (NF-κB). Reduced PHD1 activity is linked to decreased apoptosis. Here, we investigated the underlying mechanism(s) in hepatocytes. Basal NF-κB activity was elevated in PHD1−/− hepatocytes compared to wild type controls. ChIP-seq analysis confirmed enhanced binding of NF-κB to chromatin in regions proximal to the promoters of genes involved in the regulation of apoptosis. Inhibition of NF-κB (but not knock-out of HIF-1 or HIF-2) reversed the anti-apoptotic effects of pharmacologic hydroxylase inhibition. We hypothesize that PHD1 inhibition leads to altered expression of NF-κB-dependent genes resulting in reduced apoptosis. This study provides new information relating to the possible mechanism of therapeutic action of hydroxylase inhibitors that has been reported in pre-clinical models of intestinal and hepatic disease.
DOI: 10.1111/j.1471-4159.2006.04418.x
2006
Cited 43 times
Temporal change in gene expression in the rat dentate gyrus following passive avoidance learning
A learning event initiates a cascade of altered gene expression leading to synaptic remodelling within the hippocampal dentate gyrus, a structure vital to memory formation. To illuminate this transcriptional program of synaptic plasticity we used microarrays to quantify mRNA from the rat dentate gyrus at increasing times following passive avoidance learning. Approximately 500 known genes were transcriptionally regulated across the 24 h post-training period. The 0-2 h period saw up-regulation of genes involved in transcription while genes with a role in synaptic/cytoskeletal structure increased 0-6 h, consistent with structural rearrangements known to occur at these times. The most striking feature was the profound down-regulation, across all functional groups, 12 h post-training. Bioinformatics analysis identified the likely transcription factors controlling gene expression in each post-training period. The role of NF kappa B, implicated in the early post-training period, was subsequently confirmed with activation and nuclear translocation seen in dentate granule neurons following training. mRNA changes for four genes, LRP3 (0 h), alpha actin (3 h), SNAP25 and NSF (6-12 h), were validated at message and/or protein level and shown to be learning specific. Thus, the memory-associated transcriptional cascade supports the cardinal periods of synaptic loosening, reorganisation and selection thought to underpin the process of long-term memory consolidation in the hippocampus.
DOI: 10.1038/sj.bjc.6604931
2009
Cited 36 times
Analysis of differential gene expression in colorectal cancer and stroma using fluorescence-activated cell sorting purification
Tumour stroma gene expression in biopsy specimens may obscure the expression of tumour parenchyma, hampering the predictive power of microarrays. We aimed to assess the utility of fluorescence-activated cell sorting (FACS) for generating cell populations for gene expression analysis and to compare the gene expression of FACS-purified tumour parenchyma to that of whole tumour biopsies. Single cell suspensions were generated from colorectal tumour biopsies and tumour parenchyma was separated using FACS. Fluorescence-activated cell sorting allowed reliable estimation and purification of cell populations, generating parenchymal purity above 90%. RNA from FACS-purified and corresponding whole tumour biopsies was hybridised to Affymetrix oligonucleotide microarrays. Whole tumour and parenchymal samples demonstrated differential gene expression, with 289 genes significantly overexpressed in the whole tumour, many of which were consistent with stromal gene expression (e.g., COL6A3, COL1A2, POSTN, TIMP2). Genes characteristic of colorectal carcinoma were overexpressed in the FACS-purified cells (e.g., HOX2D and RHOB). We found FACS to be a robust method for generating samples for gene expression analysis, allowing simultaneous assessment of parenchymal and stromal compartments. Gross stromal contamination may affect the interpretation of cancer gene expression microarray experiments, with implications for hypotheses generation and the stability of expression signatures used for predicting clinical outcomes.
DOI: 10.1186/1471-2105-11-257
2010
Cited 33 times
Detecting microRNA activity from gene expression data
MicroRNAs (miRNAs) are non-coding RNAs that regulate gene expression by binding to the messenger RNA (mRNA) of protein coding genes. They control gene expression by either inhibiting translation or inducing mRNA degradation. A number of computational techniques have been developed to identify the targets of miRNAs. In this study we used predicted miRNA-gene interactions to analyse mRNA gene expression microarray data to predict miRNAs associated with particular diseases or conditions. Here we combine correspondence analysis, between group analysis and co-inertia analysis (CIA) to determine which miRNAs are associated with differences in gene expression levels in microarray data sets. Using a database of miRNA target predictions from TargetScan, TargetScanS, PicTar4way, PicTar5way, and miRanda and combining these data with gene expression levels from sets of microarrays, this method produces a ranked list of miRNAs associated with a specified split in samples. We applied this to three different microarray datasets: a papillary thyroid carcinoma dataset, an in-house dataset of lipopolysaccharide treated mouse macrophages, and a multi-tissue dataset. In each case we were able to identify miRNAs of biological importance. We describe a technique to integrate gene expression data and miRNA target predictions from multiple sources.
DOI: 10.1371/journal.pone.0084714
2014
Cited 27 times
Loss of Olfactory Receptor Function in Hominin Evolution
The mammalian sense of smell is governed by the largest gene family, which encodes the olfactory receptors (ORs). The gain and loss of OR genes is typically correlated with adaptations to various ecological niches. Modern humans have 853 OR genes but 55% of these have lost their function. Here we show evidence of additional OR loss of function in the Neanderthal and Denisovan hominin genomes using comparative genomic methodologies. Ten Neanderthal and 8 Denisovan ORs show evidence of loss of function that differ from the reference modern human OR genome. Some of these losses are also present in a subset of modern humans, while some are unique to each lineage. Morphological changes in the cranium of Neanderthals suggest different sensory arrangements to that of modern humans. We identify differences in functional olfactory receptor genes among modern humans, Neanderthals and Denisovans, suggesting varied loss of function across all three taxa and we highlight the utility of using genomic information to elucidate the sensory niches of extinct species.
DOI: 10.1186/1471-2164-15-825
2014
Cited 25 times
Genes and signaling networks regulated during zebrafish optic vesicle morphogenesis
The genetic cascades underpinning vertebrate early eye morphogenesis are poorly understood. One gene family essential for eye morphogenesis encodes the retinal homeobox (Rx) transcription factors. Mutations in the human retinal homeobox gene (RAX) can lead to gross morphological phenotypes ranging from microphthalmia to anophthalmia. Zebrafish rx3 null mutants produce a similar striking eyeless phenotype with an associated expanded forebrain. Thus, we used zebrafish rx3-/- mutants as a model to uncover an Rx3-regulated gene network during early eye morphogenesis. Rx3-regulated genes were identified using whole transcriptomic sequencing (RNA-seq) of rx3-/- mutants and morphologically wild-type siblings during optic vesicle morphogenesis. A gene co-expression network was then constructed for the Rx3-regulated genes, identifying gene cross-talk during early eye development. Genes highly connected in the network are hub genes, which tend to exhibit higher expression changes between rx3-/- mutants and normal phenotype siblings. Hub genes down-regulated in rx3-/- mutants encompass homeodomain transcription factors and mediators of retinoid-signaling, both associated with eye development and known human eye disorders. In contrast, genes up-regulated in rx3-/- mutants are centered on Wnt signaling pathways, associated with brain development and disorders. The temporal expression pattern of Rx3-regulated genes was further profiled during early development from maternal stage until visual function is fully mature. Rx3-regulated genes exhibited synchronized expression patterns, and a transition of gene expression during the early segmentation stage when Rx3 was highly expressed. Furthermore, most of these deregulated genes are enriched with multiple RAX-binding motif sequences on the gene promoter. Here, we assembled a comprehensive model of Rx3-regulated genes during early eye morphogenesis. 
Rx3 promotes optic vesicle morphogenesis and represses brain development through a highly correlated and modulated network, exhibiting repression of genes mediating Wnt signaling and concomitant enhanced expression of homeodomain transcription factors and retinoid-signaling genes.
DOI: 10.1002/cfg.173
2002
Cited 41 times
Overlapping Antisense Transcription in the Human Genome
Accumulating evidence indicates an important role for non-coding RNA molecules in eukaryotic cell regulation. A small number of coding and non-coding overlapping antisense transcripts (OATs) in eukaryotes have been reported, some of which regulate expression of the corresponding sense transcript. The prevalence of this phenomenon is unknown, but there may be an enrichment of such transcripts at imprinted gene loci. Taking a bioinformatics approach, we systematically searched a human mRNA database (RefSeq) for complementary regions that might facilitate pairing with other transcripts. We report 56 pairs of overlapping transcripts, in which each member of the pair is transcribed from the same locus. This allows us to make an estimate of 1000 for the minimum number of such transcript pairs in the entire human genome. This is a surprisingly large number of overlapping gene pairs and, clearly, some of the overlaps may not be functionally significant. Nonetheless, this may indicate an important general role for overlapping antisense control in gene regulation. EST databases were also investigated in order to address the prevalence of cases of imprinted genes with associated non-coding overlapping, antisense transcripts. However, EST databases were found to be completely inappropriate for this purpose.
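The search for complementary regions between transcripts described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the exact-match criterion and the default k-mer length are all assumptions chosen for illustration, and a real search against RefSeq would need an alignment tool that tolerates mismatches.

```python
# Hypothetical sketch: flag two transcripts as a candidate antisense pair
# when one contains a stretch that is exactly complementary to the other.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def shares_complementary_region(tx_a: str, tx_b: str, k: int = 20) -> bool:
    """True if some k-mer of tx_a is exactly complementary to part of tx_b.

    The exact-match rule and the value of k are illustrative assumptions.
    """
    tx_a = tx_a.upper()
    rc_b = reverse_complement(tx_b)
    # Index every k-mer of the reverse-complemented second transcript.
    kmers_b = {rc_b[i:i + k] for i in range(len(rc_b) - k + 1)}
    # A shared k-mer means the two original transcripts could base-pair there.
    return any(tx_a[i:i + k] in kmers_b for i in range(len(tx_a) - k + 1))
```

Sequence complementarity alone does not show that two transcripts come from the same locus, which is the additional condition the abstract places on its 56 reported pairs.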
DOI: 10.1093/nar/20.suppl.2071
1992
Cited 40 times
The EMBL Data Library
Desmond G. Higgins, Rainer Fuchs, Peter J. Stoehr and Graham N. Cameron (European Molecular Biology Laboratory, Meyerhofstrasse 1, 6900 Heidelberg, Germany). Nucleic Acids Research, Volume 20, Issue suppl, Pages 2071–2074, https://doi.org/10.1093/nar/20.suppl.2071. Published: 11 May 1992.
DOI: 10.1016/0378-1119(91)90245-7
1991
Cited 39 times
The salmon gene encoding apolipoprotein A-I: cDNA sequence, tissue expression and evolution
A cDNA encoding an apolipoprotein (Apo) has been isolated from the Atlantic salmon (Salmo salar) and sequenced. It encodes a peptide of 258 amino acids (aa), including a signal peptide of 18 aa, with 5'- and 3'-untranslated regions of the mRNA of 12 and 329 nucleotides, respectively. The protein has structural features in common with other Apo's of human and avian origin, including conserved sequences in the signal peptide and a series of internal repeats of 22 aa. The sequence has been identified as salmon Apo A-I (sApoA-I), and has 23% aa identity with human ApoA-I. Northern-blot analysis using the sApoA-I cDNA probe against total RNA prepared from several salmon tissues detects the expression of this gene in liver, intestine and muscle. A phylogenetic analysis reveals that the mammalian ApoA-I, ApoA-IV and Apo-E aa sequences are more closely related to each other than any of them are to sApoA-I. This suggests that the duplication events, from which A-I, A-IV and E arose, occurred after the divergence of the tetrapod and teleost ancestors.
DOI: 10.1093/bioinformatics/btl597
2006
Cited 37 times
Integrating transcription factor binding site information with gene expression datasets
Motivation: Microarrays are widely used to measure gene expression differences between sets of biological samples. Many of these differences will be due to differences in the activities of transcription factors. In principle, these differences can be detected by associating motifs in promoters with differences in gene expression levels between the groups. In practice, this is hard to do. Results: We combine correspondence analysis, between group analysis and co-inertia analysis to determine which motifs, from a database of promoter motifs, are strongly associated with differences in gene expression levels. Given a database of motifs and gene expression levels from a set of arrays, the method produces a ranked list of motifs associated with any specified split in the arrays. We give an example using the Gene Atlas compendium of gene expression levels for human tissues where we search for motifs that are associated with expression in central nervous system (CNS) or muscle tissues. Most of the motifs that we find are known from previous work to be strongly associated with expression in CNS or muscle. We give a second example using a published prostate cancer dataset where we can simply and clearly find which transcriptional pathways are associated with differences between benign and metastatic samples. Availability: The source code is freely available upon request from the authors. Contact: Ian.Jeffery@ucd.ie
DOI: 10.1371/journal.pone.0047271
2012
Cited 26 times
Subgenotyping of Genotype C Hepatitis B Virus: Correcting Misclassifications and Identifying a Novel Subgenotype
More than ten subgenotypes of genotype C Hepatitis B virus (HBV) have been reported, including C1 to C16 and two C/D recombinant subgenotypes (CD1 and CD2); however, inconsistent designations of these subgenotypes still exist. We performed a phylogenetic analysis of all full-length genotype C HBV genome sequences to correct the misclassifications of HBV subgenotypes and to study the influence of recombination on HBV subgenotyping. Our results showed that although inclusion of the recombinant sequences changed the topology of the phylogenetic tree, it did not affect the subgenotyping of the non-recombinant sequences, except subgenotype C2. In addition, most of the subgenotypes have been properly designated. However, several misclassifications of HBV subgenotypes have been identified and corrected. For example, C11 proposed by Utsumi and colleagues in 2011 was found to be grouped with C12 proposed by Mulyanto and colleagues. Two sequences, GQ358157 and GU721029, previously designated as C6 have been re-designated as C12 and C7, respectively. Moreover, a quasi-subgenotype C2 was proposed, which included the old C2, several previously unclassified sequences and previously designated C14. In particular, we identified a novel subgenotype, tentative C14, which was well supported by phylogenetic analysis and sequence divergence of >4%. A number of misclassifications in the subgenotyping of genotype C HBV have been identified in this study. After correcting the misclassifications, we proposed a better classification for the subgenotyping of genotype C HBV, in which a novel quasi-subgenotype C2 and a novel subgenotype, tentative C14, were described. Based on this large-scale analysis, we propose that a novel subgenotype should only be reported after a complete comparison of all relevant sequences rather than a few representative sequences only.
DOI: 10.1186/s13015-015-0057-1
2015
Cited 21 times
Instability in progressive multiple sequence alignment algorithms
Progressive alignment is the standard approach used to align large numbers of sequences. As with all heuristics, this involves a tradeoff between alignment accuracy and computation time. We examine this tradeoff and find that, because of a loss of information in the early steps of the approach, the alignments generated by the most common multiple sequence alignment programs are inherently unstable, and simply reversing the order of the sequences in the input file will cause a different alignment to be generated. Although this effect is more obvious with larger numbers of sequences, it can also be seen with data sets in the order of one hundred sequences. We also outline the means to determine the number of sequences in a data set beyond which the probability of instability will become more pronounced. This has major ramifications for both the designers of large-scale multiple sequence alignment algorithms, and for the users of these alignments.
DOI: 10.1371/journal.pone.0052177
2012
Cited 20 times
Inhibition of the Pim1 Oncogene Results in Diminished Visual Function
Our objective was to profile genetic pathways whose differential expression correlates with maturation of visual function in zebrafish. Bioinformatic analysis of transcriptomic data revealed Jak-Stat signalling as the pathway most enriched in the eye, as visual function develops. Real-time PCR, western blotting, immunohistochemistry and in situ hybridization data confirm that multiple Jak-Stat pathway genes are up-regulated in the zebrafish eye between 3-5 days post-fertilisation, times associated with significant maturation of vision. One of the most up-regulated Jak-Stat genes is the proto-oncogene Pim1 kinase, previously associated with haematological malignancies and cancer. Loss of function experiments using Pim1 morpholinos or Pim1 inhibitors result in significant diminishment of visual behaviour and function. In summary, we have identified that enhanced expression of Jak-Stat pathway genes correlates with maturation of visual function and that the Pim1 oncogene is required for normal visual function.
DOI: 10.1186/1471-230x-12-116
2012
Cited 19 times
Subgenotype reclassification of genotype B hepatitis B virus
Nine subgenotypes from genotype B have been identified for hepatitis B virus (HBV). However, these subgenotypes were less conclusive as they were often designated based on a few representative strains. In addition, subgenotype B6 was designated twice for viruses of different origin. All complete genome sequences of genotype B HBV were phylogenetically analyzed. Sequence divergences between different potential subgenotypes were also assessed. Both phylogenetic and sequence divergence analyses supported the designation of subgenotypes B1, B2, B4, and B6 (from Arctic). However, sequence divergences between previously designated B3, B5, B7, B8, B9 and another B6 (from China) were mostly less than 4%. In addition, subgenotype B3 did not form a monophyly. Current evidence failed to classify original B5, B7, B8, B9, and B6 (from China) as subgenotypes. Instead, they could be considered as a quasi-subgenotype B3 of Southeast Asian and Chinese origin. In addition, previously designated B6 (from Arctic) should be renamed as B5 for continuous numbering. This novel classification is well supported by both the phylogeny and sequence divergence of > 4%.
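The 4% sequence divergence criterion used in this abstract can be made concrete with a small Python sketch. The abstract does not say how divergence was calculated, so the uncorrected p-distance below (the fraction of differing sites in a pairwise alignment, skipping gap columns) is an assumption, as are the function names.

```python
# Hypothetical sketch: uncorrected pairwise divergence between two aligned
# genome sequences, compared against a 4% threshold.
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable (gap-free) sites")
    return differing / compared


def exceeds_divergence(seq_a: str, seq_b: str, threshold: float = 0.04) -> bool:
    """True if the pairwise divergence is strictly greater than the threshold."""
    return p_distance(seq_a, seq_b) > threshold
```

Divergence between two groups of sequences would typically be summarised over all between-group pairs, for example as a mean; that step is left out of this sketch.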
DOI: 10.1007/bf00163156
1994
Cited 33 times
The evolution of titin and related giant muscle proteins
DOI: 10.1186/1471-2105-8-135
2007
Cited 24 times
Supervised multivariate analysis of sequence groups to identify specificity determining residues
Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able to identify residues that cause this change. In this paper we show how a supervised multivariate statistical method, Between Group Analysis (BGA), can be used to identify these residues from families of proteins with different substrate specificities using multiple sequence alignments. We demonstrate the usefulness of this method on three different test cases. Two of these test cases, the Lactate/Malate dehydrogenase family and Nucleotidyl Cyclases, consist of two functional groups. The other family, Serine Proteases, consists of three groups. BGA was used to analyse and visualise these three families using two different encoding schemes for the amino acids. The overall combination of methods in this paper is powerful and flexible while being computationally very fast and simple. BGA is especially useful because it can be used to analyse any number of functional classes. In the examples we used in this paper, we have only used 2 or 3 classes for demonstration purposes but any number can be used and visualised.
DOI: 10.1111/1755-0998.12087
2013
Cited 16 times
Using Illumina next generation sequencing technologies to sequence multigene families in de novo species
The advent of Next Generation Sequencing Technology (NGST) has revolutionized molecular biology research, allowing for rapid gene/genome sequencing from a multitude of diverse species. As high throughput sequencing becomes more accessible, more efficient workflows must be developed to deal with the amounts of data produced and better assemble the genomes of de novo lineages. We combine traditional laboratory methods with Illumina NGST to amplify and sequence the largest mammalian multigene family, the Olfactory Receptor gene family, for species with and without a reference genome. We develop novel assembly methods to annotate and filter these data, which can be utilized for any gene family or any species. We find no significant difference between the ratio of genes within their respective gene families of our data compared with available genomic data. Using simulated data we explore the limitations of short-read sequence data and our assembly in recovering this gene family. We highlight the benefits and shortcomings of these methods. Compared with data generated from traditional polymerase chain reaction, cloning and Sanger sequencing methodologies, sequence data generated using our pipeline increases yield and sequencing efficiency without reducing the number of unique genes amplified. A cloning step is not required, therefore shortening data generation time. The novel downstream methodologies and workflows described provide a tool to be utilized by many fields of biology, to access and analyze the vast quantities of data generated. By combining laboratory and in silico methods, we provide a means of extracting genomic information for multigene families without complete genome sequencing.
DOI: 10.1016/0169-4758(93)90066-o
1993
Cited 28 times
The Phylogeny of malaria: A useful study
Phylogenetic analyses allow an inference of the origins and evolutionary relationships between organisms. Andy Waters, Des Higgins and Tom McCutchan here discuss the benefits of an accurate understanding of these origins and relationships for malaria parasites to the modern fields of vaccine development, pathology and epidemiology.
DOI: 10.1186/1471-2164-11-50
2010
Cited 17 times
Integrating multiple genome annotation databases improves the interpretation of microarray gene expression data
The Affymetrix GeneChip is a widely used gene expression profiling platform. Since the chips were originally designed, the genome databases and gene definitions have been considerably updated. Thus, more accurate interpretation of microarray data requires parallel updating of the specificity of GeneChip probes. We propose a new probe remapping protocol, using the zebrafish GeneChips as an example, by removing nonspecific probes, and grouping the probes into transcript level probe sets using an integrated zebrafish genome annotation. This genome annotation is based on combining transcript information from multiple databases. This new remapping protocol, especially the new genome annotation, is shown here to be an important factor in improving the interpretation of gene expression microarray data. Transcript data from the RefSeq, GenBank and Ensembl databases were downloaded from the UCSC genome browser, and integrated to generate a combined zebrafish genome annotation. Affymetrix probes were filtered and remapped according to the new annotation. The influence of transcript collection and gene definition methods was tested using two microarray data sets. Compared to remapping using a single database, this new remapping protocol results in up to 20% more probes being retained in the remapping, leading to approximately 1,000 more genes being detected. The differentially expressed gene lists are consequently increased by up to 30%. We are also able to detect up to three times more alternative splicing events. A small number of the bioinformatics predictions were confirmed using real-time PCR validation. By combining gene definitions from multiple databases, it is possible to greatly increase the numbers of genes and splice variants that can be detected in microarray gene expression experiments.
DOI: 10.1093/nar/gkp821
2009
Cited 17 times
High DNA melting temperature predicts transcription start site location in human and mouse
The accurate computational prediction of transcription start sites (TSS) in vertebrate genomes is a difficult problem. The physicochemical properties of DNA can be computed in various ways and many combinations of DNA features have been tested in the past for use as predictors of transcription. We looked in detail at melting temperature, which measures the temperature at which two strands of DNA separate, considering the cooperative nature of this process. We find that peaks in melting temperature correspond closely to experimentally determined transcription start sites in human and mouse chromosomes. Using melting temperature alone, and with simple thresholding, we can predict TSS with accuracy that is competitive with the most accurate state-of-the-art TSS prediction methods. Accuracy is measured using both experimentally and manually determined TSS. The method works especially well with CpG island containing promoters, but also works when CpG islands are absent. This result is clear evidence of the important role of the physical properties of DNA in the process of transcription. It also points to the importance for TSS prediction methods to include melting temperature as prior information.
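The "simple thresholding" step mentioned in this abstract can be sketched in Python as follows. The paper's profile comes from a cooperative melting model that is not reproduced here. As a loudly labelled stand-in, the sketch uses sliding-window GC fraction, which is only a crude proxy for melting temperature. The function names, window size and threshold are illustrative assumptions.

```python
# Hypothetical sketch: compute a per-position profile along a DNA sequence
# and report the regions where it rises above a fixed threshold.
# NOTE: GC fraction is a stand-in for the melting temperature profile,
# not the cooperative melting calculation described in the abstract.
def gc_profile(seq: str, window: int) -> list[float]:
    """GC fraction of every full window of the given size along seq."""
    seq = seq.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    is_gc = [1 if base in ("G", "C") else 0 for base in seq]
    count = sum(is_gc[:window])
    profile = [count / window]
    # Slide the window one base at a time, updating the running count.
    for i in range(window, len(seq)):
        count += is_gc[i] - is_gc[i - window]
        profile.append(count / window)
    return profile


def regions_above(profile: list[float], threshold: float) -> list[tuple[int, int]]:
    """Half-open (start, end) index ranges where profile > threshold."""
    regions = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(profile)))
    return regions
```

Indices in the returned regions refer to window start positions, so mapping a region back to genomic coordinates means adding the window offset chosen by the caller.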
DOI: 10.1111/j.1471-4159.2010.06617.x
2010
Cited 15 times
Temporal dysregulation of cortical gene expression in the isolation reared Wistar rat
The critical sequence of molecular, neurotransmission and synaptic disruptions that underpin the emergence of psychiatric disorders like schizophrenia remains to be established, with progress only likely using animal models that capture key features of such disorders. We have related the emergence of behavioural, neurochemical and synapse ultrastructure deficits to transcriptional dysregulation in the medial prefrontal cortex of Wistar rats reared in isolation. Isolation reared animals developed sensorimotor deficits at postnatal day 60 which persisted into adulthood. Analysis of gene expression prior to the emergence of the sensorimotor deficits revealed a significant disruption in transcriptional control, notably of immediate early and interferon-associated genes. At postnatal day 60 many gene transcripts relating particularly to GABA transmission and synapse structure, for example Gabra4, Nsf, Syn2 and Dlgh1, transiently increased expression. A subsequent decrease in genes such as Gria2 and Dlgh2 at postnatal day 80 suggested deficits in glutamatergic transmission and synapse integrity, respectively. Microdialysis studies revealed decreased extracellular glutamate, suggesting a state of hypofrontality, while ultrastructural analysis showed total and perforated synapse complement in layer III to be significantly reduced in the prefrontal cortex of postnatal day 80 isolated animals. These studies provide a molecular framework to understand the developmental emergence of the structural and behavioural characteristics that may in part define psychiatric illness.
DOI: 10.1016/j.jmb.2015.09.006
2015
Cited 13 times
Measuring Transcription Rate Changes via Time-Course 4-Thiouridine Pulse-Labelling Improves Transcriptional Target Identification
Identifying changes in the transcriptional regulation of target genes from high-throughput studies is important for unravelling molecular mechanisms controlled by a given perturbation. When measuring global transcript levels only, the effect of the perturbation [e.g., transcription factor (TF) overexpression or drug treatment] on its target genes is often obscured by delayed feedback and secondary effects until the changes are fully propagated. As a proof of principle, we show that selective measuring of transcripts that are only synthesised after a perturbation [4-thiouridine (4sU) sequencing (4sU-seq)] is a more sensitive method to identify targets and time-dependent transcriptional responses than global transcript profiling. By metabolically labelling RNA in a time-course setup, we could vastly increase the sensitivity of MYCN target gene detection compared to traditional RNA sequencing. The validity of targets identified by 4sU-seq was demonstrated using chromatin immunoprecipitation sequencing and neuroblastoma microarray tumour data. Here, we describe the methodology, both molecular biology and computational aspects, required to successfully apply this 4sU-seq approach.
DOI: 10.1007/s00239-007-9015-y
2007
Cited 19 times
Gene Expression, Intron Density, and Splice Site Strength in Drosophila and Caenorhabditis
DOI: 10.1073/pnas.0504801102
2005
Cited 19 times
Mind the gaps: Progress in progressive alignment
The article by Löytynoja and Goldman (1) in this issue of PNAS describes a novel and useful method of handling gaps in progressive multiple sequence alignments. Gaps are the bits that get left behind when you try to align DNA or protein sequences and have to use padding or null characters to match homologous residues. These could get placed at sites where one sequence has apparently lost some residues (caused by a deletion), and you simply pad out the sequence with gap characters such as hyphens or blanks to make it match up with the sequences that have not lost anything. Similarly, if one or more sequences have some extra residues (caused by an insertion) then these will need to be matched by gap characters in the other sequences. It is the placement of these gaps that creates all of the problems when you try to automatically generate alignments. If insertions and deletions never happened, then sequences could easily be matched by sliding them past each other and taking the alignment that best matched the residues. When gaps are needed, things get complicated and much of the first 20 years of bioinformatics was devoted to how these should be placed and why (e.g., refs. 2–4). When you have just two sequences, there are fast and relatively simple algorithms that can guarantee the best alignment between the sequences, given a scoring function that gives a score for each pair of aligned residues. The most familiar of these is the famous dynamic programming algorithm, first described for sequence alignment by Needleman and Wunsch (5). Gaps can be placed all over both sequences to get the best score so a “gap penalty” function is used to penalize for gaps of different sizes. These scores are used to give a balance between gaps and matches. In an ideal world, if you use appropriate values for the residue match scores such as from a
DOI: 10.1002/dvdy.22573
2011
Cited 12 times
mab21l2 transgenics reveal novel expression patterns of mab21l1 and mab21l2, and conserved promoter regulation without sequence conservation
mab21l1 and mab21l2 paralogs have widespread and dynamic expression patterns during vertebrate development. Both genes are expressed in the developing eye, midbrain, neural tube, and branchial arches. Our goal was to identify promoter regions with activity in mab21l2 expression domains. Assays of mab21l2 promoter-EGFP constructs in zebrafish embryos confirm that constructs containing 7.2 or 4.9 kb of mab21l2 promoter region are sufficient to drive expression in known (e.g., tectum, branchial arches) and unexpected domains (e.g., lens and retinal amacrine cells). A comparative analysis identifies complementary and novel expression domains of endogenous mab21l2 (e.g., lens and ventral iridocorneal canal) and mab21l1 (e.g., retinal amacrine and ganglion cells). Interestingly, therefore, despite the absence of conserved non-coding elements, a 4.9-kb mab21l2 promoter is sufficient to recapitulate expression in tissues unique to mab21l1 or mab21l2.
DOI: 10.1186/1471-2105-15-338
2014
Cited 11 times
Systematic exploration of guide-tree topology effects for small protein alignments
Guide-trees are used as part of an essential heuristic to enable the calculation of multiple sequence alignments. They have been the focus of much method development but there has been little effort at determining systematically, which guide-trees, if any, give the best alignments. Some guide-tree construction schemes are based on pair-wise distances amongst unaligned sequences. Others try to emulate an underlying evolutionary tree and involve various iteration methods. We explore all possible guide-trees for a set of protein alignments of up to eight sequences. We find that pairwise distance based default guide-trees sometimes outperform evolutionary guide-trees, as measured by structure derived reference alignments. However, default guide-trees fall way short of the optimum attainable scores. On average chained guide-trees perform better than balanced ones but are not better than default guide-trees for small alignments. Alignment methods that use Consistency or hidden Markov models to make alignments are less susceptible to sub-optimal guide-trees than simpler methods, that basically use conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary based guide-trees for difficult alignments and negatively for easy alignments. One phylogeny aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.
DOI: 10.1073/pnas.1419351112
2015
Cited 9 times
Reply to Tan et al.: Differences between real and simulated proteins in multiple sequence alignments
Tan et al. (1) comment on our earlier paper regarding the accuracy of multiple sequence alignments (MSAs) using different guide tree topologies (2). We stress that the scope of our result was confined to alignments of very large numbers of protein sequences with known structures, where accuracy was measured against structure-based alignments. We point out that this result could not be translated to a strictly phylogenetic view. Tan et al. (1) demonstrate that, using a phylogenetic perspective, one can get the opposite result to ours. Given how they configure their test system, Tan et al.’s result is to be expected and easy to explain. If one simulates MSAs with many indels at random locations and then tests correspondence between alignments, including gaps in the test, then guide tree topology must have a huge effect. This is more or less inevitable.
DOI: 10.1371/journal.pone.0163235
2016
Cited 8 times
Identification of Non-Coding RNAs in the Candida parapsilosis Species Group
The Candida CTG clade is a monophyletic group of fungal species that translates CTG as serine, and includes the pathogens Candida albicans and Candida parapsilosis. Research has typically focused on identifying protein-coding genes in these species. Here, we use bioinformatic and experimental approaches to annotate known classes of non-coding RNAs in three CTG-clade species, Candida parapsilosis, Candida orthopsilosis and Lodderomyces elongisporus. We also update the annotation of ncRNAs in the C. albicans genome. The majority of ncRNAs identified were snoRNAs. Approximately 50% of snoRNAs (including most of the C/D box class) are encoded in introns. Most are within mono- and polycistronic transcripts with no protein coding potential. Five polycistronic clusters of snoRNAs are highly conserved in fungi. In polycistronic regions, splicing occurs via the classical pathway, as well as by nested and recursive splicing. We identified spliceosomal small nuclear RNAs, the telomerase RNA component, signal recognition particle, RNase P RNA component and the related RNase MRP RNA component in all three genomes. Stem loop IV of the U2 spliceosomal RNA and the associated binding proteins were lost from the ancestor of C. parapsilosis and C. orthopsilosis, following the divergence from L. elongisporus. The RNA component of the MRP is longer in C. parapsilosis, C. orthopsilosis and L. elongisporus than in S. cerevisiae, but is substantially shorter than in C. albicans.
DOI: 10.1007/bf02939839
1986
Cited 18 times
Factors associated with the biochemical changes in vitamin D and calcium metabolism in institutionalized patients with epilepsy