
David J. Lipman

DOI: 10.1016/s0022-2836(05)80360-2
1990
Cited 78,589 times
Basic local alignment search tool
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straight-forward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
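The maximal segment pair (MSP) score named in this abstract can be illustrated with a small brute-force sketch. It assumes an MSP is the highest-scoring pair of equal-length segments aligned without gaps, and it uses hypothetical match/mismatch scores of +5/-4 chosen only for illustration. This is an exhaustive scan of every diagonal, not BLAST's heuristic, which approximates the same quantity far faster.

```python
def msp_score(a: str, b: str, match: int = 5, mismatch: int = -4) -> int:
    """Best score over all ungapped, equal-length segment pairs of a and b.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    best = 0
    # Diagonal d pairs residue a[i] with residue b[i + d].
    for d in range(-(len(a) - 1), len(b)):
        running = 0
        i_start = max(0, -d)
        i_end = min(len(a), len(b) - d)
        for i in range(i_start, i_end):
            s = match if a[i] == b[i + d] else mismatch
            # Restart the segment whenever the running score drops below zero.
            running = max(0, running + s)
            best = max(best, running)
    return best


if __name__ == "__main__":
    print(msp_score("WWACGTWW", "YYACGTYY"))
```

The scan costs time proportional to the product of the two sequence lengths, which is the kind of cost a heuristic search tries to avoid on large databases.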
DOI: 10.1073/pnas.85.8.2444
1988
Cited 10,466 times
Improved tools for biological sequence comparison.
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
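The shuffling idea attributed to RDF2 above can be sketched as follows. This is a minimal illustration, assuming that "preserves local sequence composition" means shuffling residues only within fixed-size windows; the function names, the window size, and the z-score summary are hypothetical and are not taken from the program itself.

```python
import random
from statistics import mean, pstdev


def window_shuffle(seq: str, window: int, rng: random.Random) -> str:
    """Shuffle residues inside consecutive windows, keeping each window's composition."""
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)
        out.extend(chunk)
    return "".join(out)


def shuffled_z_score(score_fn, query, subject, n=100, window=10, seed=0):
    """Compare the observed score with scores against shuffled subjects.

    Returns None when the shuffled scores have zero spread (z is undefined).
    """
    rng = random.Random(seed)
    observed = score_fn(query, subject)
    null = [score_fn(query, window_shuffle(subject, window, rng)) for _ in range(n)]
    sd = pstdev(null)
    if sd == 0:
        return None
    return (observed - mean(null)) / sd
```

Any similarity score can be passed as `score_fn`; a large positive result suggests that the observed score is unlikely to arise from composition alone.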
DOI: 10.1126/science.2983426
1985
Cited 3,923 times
Rapid and Sensitive Protein Similarity Searches
An algorithm was developed which facilitates the search for similarities between newly determined amino acid sequences and sequences already available in databases. Because of the algorithm's efficiency on many microcomputers, sensitive protein database searches may now become a routine procedure for molecular biologists. The method efficiently identifies regions of similar sequence and then scores the aligned identical and differing residues in those regions by means of an amino acid replacability matrix. This matrix increases sensitivity by giving high scores to those amino acid replacements which occur frequently in evolution. The algorithm has been implemented in a computer program designed to search protein databases very rapidly. For example, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).
DOI: 10.1126/science.278.5338.631
1997
Cited 3,323 times
A Genomic Perspective on Protein Families
In order to extract the maximum amount of information from the rapidly accumulating genome sequences, all conserved genes need to be classified according to their homologous relationships. Comparison of proteins encoded in seven complete genomes from five major phylogenetic lineages and elucidation of consistent patterns of sequence similarities allowed the delineation of 720 clusters of orthologous groups (COGs). Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least three lineages. Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. This relation automatically yields a number of functional predictions for poorly characterized genomes. The COGs comprise a framework for functional and evolutionary genome analysis.
DOI: 10.1093/nar/gks1195
2012
Cited 2,530 times
GenBank
GenBank® (http://www.ncbi.nlm.nih.gov) is a comprehensive database that contains publicly available nucleotide sequences for almost 260 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assigns accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI home page: www.ncbi.nlm.nih.gov.
DOI: 10.1073/pnas.80.3.726
1983
Cited 1,366 times
Rapid similarity searches of nucleic acid and protein data banks.
With the development of large data banks of protein and nucleic acid sequences, the need for efficient methods of searching such banks for sequences similar to a given sequence has become evident. We present an algorithm for the global comparison of sequences based on matching k-tuples of sequence elements for a fixed k. The method results in substantial reduction in the time required to search a data bank when compared with prior techniques of similarity analysis, with minimal loss in sensitivity. The algorithm has also been adapted, in a separate implementation, to produce rigorous sequence alignments. Currently, using the DEC KL-10 system, we can compare all sequences in the entire Protein Data Bank of the National Biomedical Research Foundation with a 350-residue query sequence in less than 3 min and carry out a similar analysis with a 500-base query sequence against all eukaryotic sequences in the Los Alamos Nucleic Acid Data Base in less than 2 min.
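The k-tuple matching described in this abstract can be sketched in a few lines. The version below is an assumption about one common way to use such matches: index every k-tuple of the query, scan the target, and count matches per diagonal (target offset minus query offset). The function name and the default k are hypothetical, and the sketch does not reproduce the paper's full scoring or alignment procedure.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_hits(query: str, target: str, k: int = 2) -> Counter:
    """Count shared k-tuples between query and target, grouped by diagonal."""
    # Lookup table: k-tuple -> list of start positions in the query.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            hits[j - i] += 1  # same diagonal means same relative offset
    return hits


if __name__ == "__main__":
    print(ktuple_diagonal_hits("ACGTAC", "TTACGTAC", k=3).most_common(1))
```

A diagonal with many hits marks a region where the two sequences line up without gaps, so a search can examine only those diagonals instead of the full comparison matrix.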
DOI: 10.1093/nar/gkv1276
2015
Cited 1,120 times
GenBank
GenBank® (www.ncbi.nlm.nih.gov/genbank/) is a comprehensive database that contains publicly available nucleotide sequences for over 340 000 formally described species. Recent developments include a new starting page for submitters, a shift toward using accession.version identifiers rather than GI numbers, a wizard for submitting 16S rRNA sequences, and an Identical Protein Report to address growing issues of data redundancy. GenBank organizes the sequence data received from individual laboratories and large-scale sequencing projects into 18 divisions, and GenBank staff assign unique accession.version identifiers upon data receipt. Most submitters use the web-based BankIt or standalone Sequin programs. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the nuccore, nucest, and nucgss databases of the Entrez retrieval system, which integrates these records with a variety of other data including taxonomy nodes, genomes, protein structures, and biomedical journal literature in PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP.
DOI: 10.1002/prot.340090304
1991
Cited 980 times
A workbench for multiple alignment construction and analysis
Multiple sequence alignment can be a useful technique for studying molecular evolution, as well as for analyzing relationships between structure or function and primary sequence. We have developed for this purpose an interactive program, MACAW (Multiple Alignment Construction and Analysis Workbench), that allows the user to construct multiple alignments by locating, analyzing, editing, and combining “blocks” of aligned sequence segments. MACAW incorporates several novel features. (1) Regions of local similarity are located by a new search algorithm that avoids many of the limitations of previous techniques. (2) The statistical significance of blocks of similarity is evaluated using a recently developed mathematical theory. (3) Candidate blocks may be evaluated for potential inclusion in a multiple alignment using a variety of visualization tools. (4) A user interface permits each block to be edited by moving its boundaries or by eliminating particular segments, and blocks may be linked to form a composite multiple alignment. No completely automatic program is likely to deal effectively with all the complexities of the multiple alignment problem; by combining a powerful similarity search algorithm with flexible editing, analysis and display tools, MACAW allows the alignment strategy to be tailored to the problem at hand.
DOI: 10.1128/jvi.02005-07
2008
Cited 942 times
The Influenza Virus Resource at the National Center for Biotechnology Information
Influenza epidemics cause morbidity and mortality worldwide (4). Each year in the United States, more than 200,000 patients are admitted to hospitals because of influenza and there are approximately 36,000 influenza-related deaths (14). In recent years, several subtypes of avian influenza viruses have jumped host species to infect humans. The H5N1 subtype, in particular, has been reported in 328 human cases and has caused 200 human deaths in 12 countries (World Health Organization, http://www.who.int/csr/disease/avian_influenza/country/cases_table_2007_09_10/en/index.html). These viruses have the potential to cause a pandemic in humans. Antiviral drugs and vaccines must be developed to minimize the damage that such a pandemic would bring. To achieve this, it is vital that researchers have free access to viral sequences in a timely fashion, and sequence analysis tools need to be readily available. Historically, the number of influenza virus sequences in public databases has been far less than those of some well-studied viruses, such as human immunodeficiency virus. The number of complete influenza virus genomes has been even smaller. In addition, many of the sequences were collected in the course of influenza surveillance programs that prioritized antigenically novel isolates. Although collecting antigenically novel isolates is appropriate for surveillance, it results in biased samples of sequenced isolates that are not representative of community cases of influenza (2, 13). Therefore, in 2004, the National Institute of Allergy and Infectious Diseases (NIAID) launched the Influenza Genome Sequencing Project (7), which aims to rapidly sequence influenza viruses from samples collected all over the world. Viral sequences were generated at the J. Craig Venter Institute, annotated at the National Center for Biotechnology Information (NCBI), and deposited in GenBank. 
In just over 2 years after the initiation of the project, more than 2,000 complete genomes of influenza viruses A and B had been deposited in GenBank. To help the research community to make full use of the wealth of information from such a large amount of data, which will be increasing continuously, the Influenza Virus Resource was created at NCBI in 2004.
DOI: 10.1093/nar/gkm929
2007
Cited 735 times
GenBank
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
DOI: 10.1093/nar/gkn741
2009
Cited 723 times
Database resources of the National Center for Biotechnology Information
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs), Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART) and the PubChem suite of small molecule databases. Augmenting many of the web applications is custom implementation of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
DOI: 10.1093/nar/gkm1000
2007
Cited 712 times
Database resources of the National Center for Biotechnology Information
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data available through NCBI's web site. NCBI resources include Entrez, the Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genome, Genome Project and related tools, the Trace, Assembly, and Short Read Archives, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups, Influenza Viral Resources, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Entrez Probe, GENSAT, Database of Genotype and Phenotype, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool and the PubChem suite of small molecule databases. Augmenting the web applications are custom implementations of the BLAST program optimized to search specialized data sets. These resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
DOI: 10.1186/1745-6150-7-12
2012
Cited 700 times
Domain enhanced lookup time accelerated BLAST
BLAST is a commonly-used software package for comparing a query sequence to a database of known sequences; in this study, we focus on protein sequences. Position-specific-iterated BLAST (PSI-BLAST) iteratively searches a protein sequence database, using the matches in round i to construct a position-specific score matrix (PSSM) for searching the database in round i + 1. Biegert and Söding developed Context-sensitive BLAST (CS-BLAST), which combines information from searching the sequence database with information derived from a library of short protein profiles to achieve better homology detection than PSI-BLAST, which builds its PSSMs from scratch. We describe a new method, called domain enhanced lookup time accelerated BLAST (DELTA-BLAST), which searches a database of pre-constructed PSSMs before searching a protein-sequence database, to yield better homology detection. For its PSSMs, DELTA-BLAST employs a subset of NCBI’s Conserved Domain Database (CDD). On a test set derived from ASTRAL, with one round of searching, DELTA-BLAST achieves a ROC5000 of 0.270 vs. 0.116 for CS-BLAST. The performance advantage diminishes in iterated searches, but DELTA-BLAST continues to achieve better ROC scores than CS-BLAST. DELTA-BLAST is a useful program for the detection of remote protein homologs. It is available under the “Protein BLAST” link at http://blast.ncbi.nlm.nih.gov. This article was reviewed by Arcady Mushegian, Nick V. Grishin, and Frank Eisenhaber.
DOI: 10.1093/nar/gkl1031
2007
Cited 660 times
Database resources of the National Center for Biotechnology Information
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's Web site. NCBI resources include Entrez, the Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genome, Genome Project and related tools, the Trace and Assembly Archives, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs), Viral Genotyping Tools, Influenza Viral Resources, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART) and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. These resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
DOI: 10.1101/gr.278202
2002
Cited 643 times
CDART: Protein Homology by Domain Architecture
The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the NCBI Entrez Protein Database based on domain architecture, defined as the sequential order of conserved domains in proteins. The algorithm finds protein similarities across significant evolutionary distances using sensitive protein domain profiles rather than by direct sequence similarity. Proteins similar to a query protein are grouped and scored by architecture. Relying on domain profiles allows CDART to be fast, and, because it relies on annotated functional domains, informative. Domain profiles are derived from several collections of domain definitions that include functional annotation. Searches can be further refined by taxonomy and by selecting domains of interest. CDART is available at http://www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi.
DOI: 10.1093/nar/gkn723
2009
Cited 620 times
GenBank
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 300 000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank® staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
DOI: 10.1093/nar/gkq1172
2010
Cited 551 times
Database resources of the National Center for Biotechnology Information
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Electronic PCR, OrfFinder, Splign, ProSplign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), IBIS, Biosystems, Peptidome, OMSSA, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
DOI: 10.1093/nar/gkq1079
2010
Cited 532 times
GenBank
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 380,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system that integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
DOI: 10.1093/nar/gkr1202
2011
Cited 532 times
GenBank
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 250 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI home page: www.ncbi.nlm.nih.gov.
DOI: 10.1101/gr.080531.108
2009
Cited 529 times
The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes
Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.
DOI: 10.1101/gr.2596504
2004
Cited 503 times
The Status, Quality, and Expansion of the NIH Full-Length cDNA Project: The Mammalian Gene Collection (MGC)
The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5′-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is now required to complete the mouse and human collections. Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.
DOI: 10.1137/0148063
1988
Cited 482 times
The Multiple Sequence Alignment Problem in Biology
The study and comparison of sequences of characters from a finite alphabet is relevant to various areas of science, notably molecular biology. The measurement of sequence similarity involves the consideration of the different possible sequence alignments in order to find an optimal one for which the "distance" between sequences is minimum. By associating a path in a lattice to each alignment, a geometric insight can be brought into the problem of finding an optimal alignment. This problem can then be solved by applying a dynamic programming algorithm. However, the computational effort grows rapidly with the number $N$ of sequences to be compared ($O(l^N)$, where $l$ is the mean length of the sequences to be compared). It is proved here that knowledge of the measure of an arbitrarily chosen alignment can be used in combination with information from the pairwise alignments to considerably restrict the size of the region of the lattice in consideration. This reduction implies fewer computations and less memory space needed to carry out the dynamic programming optimization process. The observations also suggest new variants of the multiple alignment problem.
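The dynamic programming formulation mentioned in this abstract can be made concrete for the two-sequence case. The sketch below assumes a unit-cost edit distance as the "distance" (an illustrative choice, not the paper's measure) and adds a hypothetical helper that counts lattice nodes to show why the full computation becomes impractical as the number of sequences grows.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        curr = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr.append(min(prev[j] + 1,          # gap in b
                            curr[j - 1] + 1,      # gap in a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def lattice_nodes(length: int, n_sequences: int) -> int:
    """Number of lattice nodes for n_sequences sequences of the given length."""
    return (length + 1) ** n_sequences


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(lattice_nodes(100, 2), lattice_nodes(100, 3))
```

Each added sequence multiplies the lattice size by roughly the sequence length, which is why restricting the search to a smaller region of the lattice matters.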
DOI: 10.1093/nar/gkr1184
2011
Cited 475 times
Database resources of the National Center for Biotechnology Information
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
DOI: 10.1093/nar/gkw1070
2016
Cited 475 times
GenBank
GenBank® (www.ncbi.nlm.nih.gov/genbank/) is a comprehensive database that contains publicly available nucleotide sequences for 370 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or the NCBI Submission Portal. GenBank staff assign accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Nucleotide database, which links to related information such as taxonomy, genomes, protein sequences and structures, and biomedical journal literature in PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. Recent updates include changes to policies regarding sequence identifiers, an improved 16S submission wizard, targeted loci studies, the ability to submit methylation and BioNano mapping files, and a database of anti-microbial resistance genes.
DOI: 10.1093/nar/27.1.12
1999
Cited 470 times
GenBank
The GenBank® sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from individual laboratories and from large-scale sequencing projects. Most submitters use the BankIt (Web) or Sequin programs to format and send sequence data. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive worldwide coverage. GenBank data is accessible through NCBI's integrated retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome and protein structure information. MEDLINE® abstracts from published articles describing the sequences are included as an additional source of biological annotation through the PubMed search system. Sequence similarity searching is offered through the BLAST series of database search programs. In addition to FTP, Email, and server/client versions of Entrez and BLAST, NCBI offers a wide range of World Wide Web retrieval and analysis services based on GenBank data. The GenBank database and related resources are freely accessible via the URL: http://www.ncbi.nlm.nih.gov
DOI: 10.1073/pnas.86.12.4412
1989
Cited 452 times
A tool for multiple sequence alignment.
Multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence-structure relationships. Until recently, it has been impractical to apply dynamic programming, the most widely accepted method for producing pairwise alignments, to comparisons of more than three sequences. We describe the design and application of a tool for multiple alignment of amino acid sequences that implements a new algorithm that greatly reduces the computational demands of dynamic programming. This tool is able to align in reasonable time as many as eight sequences the length of an average protein.
DOI: 10.1038/nature04239
2005
Cited 447 times
Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution
Influenza viruses are remarkably adept at surviving in the human population over a long timescale. The human influenza A virus continues to thrive even among populations with widespread access to vaccines, and continues to be a major cause of morbidity and mortality. The virus mutates from year to year, making the existing vaccines ineffective on a regular basis, and requiring that new strains be chosen for a new vaccine. Less-frequent major changes, known as antigenic shift, create new strains against which the human population has little protective immunity, thereby causing worldwide pandemics. The most recent pandemics include the 1918 'Spanish' flu, one of the most deadly outbreaks in recorded history, which killed 30-50 million people worldwide, the 1957 'Asian' flu, and the 1968 'Hong Kong' flu. Motivated by the need for a better understanding of influenza evolution, we have developed flexible protocols that make it possible to apply large-scale sequencing techniques to the highly variable influenza genome. Here we report the results of sequencing 209 complete genomes of the human influenza A virus, encompassing a total of 2,821,103 nucleotides. In addition to increasing markedly the number of publicly available, complete influenza virus genomes, we have discovered several anomalies in these first 209 genomes that demonstrate the dynamic nature of influenza transmission and evolution. This new, large-scale sequencing effort promises to provide a more comprehensive picture of the evolution of influenza viruses and of their pattern of transmission through human and animal populations. All data from this project are being deposited, without delay, in public archives.
DOI: 10.1073/pnas.87.14.5509
1990
Cited 437 times
Protein database searches for multiple alignments.
Protein database searches frequently can reveal biologically significant sequence relationships useful in understanding structure and function. Weak but meaningful sequence patterns can be obscured, however, by other similarities due only to chance. By searching a database for multiple as opposed to pairwise alignments, distant relationships are much more easily distinguished from background noise. Recent statistical results permit the power of this approach to be analyzed. Given a typical query sequence, an algorithm described here permits the current protein database to be searched for three-sequence alignments in less than 4 min. Such searches have revealed a variety of subtle relationships that pairwise search methods would be unable to detect.
DOI: 10.1186/s13059-018-1540-z
2018
Cited 379 times
SKESA: strategic k-mer extension for scrupulous assemblies
SKESA is a DeBruijn graph-based de-novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. Comparison with SPAdes and MegaHit shows that SKESA produces assemblies that have high sequence quality and contiguity, handles low-level contamination in reads, is fast, and produces an identical assembly for the same input when assembled multiple times with the same or different compute resources. SKESA has been used for assembling over 272,000 read sets in the Sequence Read Archive at NCBI and for real-time pathogen detection. Source code for SKESA is freely available at https://github.com/ncbi/SKESA/releases.
DOI: 10.1093/nar/gkp967
2009
Cited 372 times
Database resources of the National Center for Biotechnology Information
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, Reference Sequence, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Entrez Probe, GENSAT, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Peptidome, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
DOI: 10.1093/nar/gku1216
2014
Cited 365 times
GenBank
GenBank® (http://www.ncbi.nlm.nih.gov/genbank/) is a comprehensive database that contains publicly available nucleotide sequences for over 300 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP.
DOI: 10.1371/journal.pbio.0030300
2005
Cited 355 times
Whole-Genome Analysis of Human Influenza A Virus Reveals Multiple Persistent Lineages and Reassortment among Recent H3N2 Viruses
Understanding the evolution of influenza A viruses in humans is important for surveillance and vaccine strain selection. We performed a phylogenetic analysis of 156 complete genomes of human H3N2 influenza A viruses collected between 1999 and 2004 from New York State, United States, and observed multiple co-circulating clades with different population frequencies. Strikingly, phylogenies inferred for individual gene segments revealed that multiple reassortment events had occurred among these clades, such that one clade of H3N2 viruses present at least since 2000 had provided the hemagglutinin gene for all those H3N2 viruses sampled after the 2002-2003 influenza season. This reassortment event was the likely progenitor of the antigenically variant influenza strains that caused the A/Fujian/411/2002-like epidemic of the 2003-2004 influenza season. However, despite sharing the same hemagglutinin, these phylogenetically distinct lineages of viruses continue to co-circulate in the same population. These data, derived from the first large-scale analysis of H3N2 viruses, convincingly demonstrate that multiple lineages can co-circulate, persist, and reassort in epidemiologically significant ways, and underscore the importance of genomic analyses for future influenza surveillance.
DOI: 10.1093/nar/gkl986
2007
Cited 341 times
GenBank
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage.
DOI: 10.1093/nar/gkt1030
2013
Cited 325 times
GenBank
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for over 280 000 formally described species. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole-genome shotgun and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. Daily data exchange with the European Nucleotide Archive and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI home page: www.ncbi.nlm.nih.gov.
DOI: 10.1186/1745-6150-3-20
2008
Cited 313 times
Splign: algorithms for computing spliced alignments with identification of paralogs
The computation of accurate alignments of cDNA sequences against a genome is at the foundation of modern genome annotation pipelines. Several factors such as presence of paralogs, small exons, non-consensus splice signals, sequencing errors and polymorphic sites pose recognized difficulties to existing spliced alignment algorithms. We describe a set of algorithms behind a tool called Splign for computing cDNA-to-Genome alignments. The algorithms include a high-performance preliminary alignment, a compartment identification based on a formally defined model of adjacent duplicated regions, and a refined sequence alignment. In a series of tests, Splign has produced more accurate results than other tools commonly used to compute spliced alignments, in a reasonable amount of time. Splign's ability to deal with various issues complicating the spliced alignment problem makes it a helpful tool in eukaryotic genome annotation processes and alternative splicing studies. Its performance is enough to align the largest currently available pools of cDNA data such as the human EST set on a moderate-sized computing cluster in a matter of hours. The duplications identification (compartmentization) algorithm can be used independently in other areas such as the study of pseudogenes.
DOI: 10.1093/nar/gkp1024
2009
Cited 311 times
GenBank
GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bi-monthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI homepage: www.ncbi.nlm.nih.gov.
DOI: 10.1073/pnas.0901808106
2009
Cited 228 times
The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages
The evolutionary rates of protein-coding genes in an organism span, approximately, 3 orders of magnitude and show a universal, approximately log-normal distribution in a broad variety of species from prokaryotes to mammals. This universal distribution implies a steady-state process, with identical distributions of evolutionary rates among genes that are gained and genes that are lost. A mathematical model of such process is developed under the single assumption of the constancy of the distributions of the propensities for gene loss (PGL). This model predicts that genes of different ages, that is, genes with homologs detectable at different phylogenetic depths, substantially differ in those variables that correlate with PGL. We computationally partition protein-coding genes from humans, flies, and Aspergillus fungus into age classes, and show that genes of different ages retain the universal log-normal distribution of evolutionary rates, with a shift toward higher rates in "younger" classes but also with a substantial overlap. The only exception involves human primate-specific genes that show a heavy tail of rapidly evolving genes, probably owing to gene annotation artifacts. As predicted, the gene age classes differ in characteristics correlated with PGL. Compared with "young" genes (e.g., mammal-specific human ones), "old" genes (e.g., eukaryote-specific), on average, are longer, are expressed at a higher level, possess a higher intron density, evolve slower on the short time scale, and are subject to stronger purifying selection. Thus, genome evolution fits a simple model with approximately uniform rates of gene gain and loss, without major bursts of genomic innovation.
DOI: 10.1093/nar/26.17.3986
1998
Cited 311 times
Protein sequence similarity searches using patterns as seeds
Protein families often are characterized by conserved sequence patterns or motifs. A researcher frequently wishes to evaluate the significance of a specific pattern within a protein, or to exploit knowledge of known motifs to aid the recognition of greatly diverged but homologous family members. To assist in these efforts, the pattern-hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein sequence and a pattern of interest that it contains. PHI-BLAST searches a protein database for other instances of the input pattern, and uses those found as seeds for the construction of local alignments to the query sequence. The random distribution of PHIBLAST alignment scores is studied analytically and empirically. In many instances, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HS90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.
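The seeding step described in this abstract (find other instances of an input pattern, then use them as anchors for alignment) can be illustrated with a minimal sketch. This is not the PHI-BLAST implementation: it assumes a PROSITE-style pattern syntax, the helper names `prosite_to_regex` and `find_seeds` are hypothetical, and the alignment-extension and statistics stages are omitted.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (assumed syntax) into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as "(2)" or "(2,4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                         # x matches any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # {P} means "anything but P"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

def find_seeds(pattern: str, database: dict) -> list:
    """Return (sequence name, start offset, matched text) for every pattern hit."""
    # The lookahead lets overlapping occurrences be reported as separate seeds.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    seeds = []
    for name, sequence in database.items():
        for hit in regex.finditer(sequence):
            seeds.append((name, hit.start(), hit.group(1)))
    return seeds

# Toy database; the sequences are made up for illustration.
toy_database = {"seq1": "AACGGCAA", "seq2": "MKLV"}
seeds = find_seeds("C-x(2)-C", toy_database)
```

Each seed records where the pattern occurs in a database sequence; a real search would then build a local alignment to the query around that position.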
DOI: 10.1073/pnas.200346997
2000
Cited 274 times
Lineage-specific loss and divergence of functionally linked genes in eukaryotes
By comparing 4,344 protein sequences from fission yeast Schizosaccharomyces pombe with all available eukaryotic sequences, we identified those genes that are conserved in S. pombe and nonfungal eukaryotes but are missing or highly diverged in the baker's yeast Saccharomyces cerevisiae. Since the radiation from the common ancestor with S. pombe, S. cerevisiae appears to have lost about 300 genes, and about 300 more genes have diverged by far beyond expectation. The most notable feature of the set of genes lost in S. cerevisiae is the coelimination of functionally connected groups of proteins, such as the signalosome and the spliceosome components. We predict similar coelimination of the components of the posttranscriptional gene-silencing system that includes the recently identified RNA-dependent RNA polymerase. Because one of the functions of posttranscriptional silencing appears to be "taming" of retrotransposons, the loss of this system in yeast could have triggered massive retrotransposition, resulting in elimination of introns and subsequent loss of spliceosome components that become dispensable. As the genome database grows, systematic analysis of coordinated gene loss may become a general approach for predicting new components of functional systems or even defining previously unknown functional complexes.
DOI: 10.1093/nar/21.13.2963
1993
Cited 238 times
GenBank
The GenBank sequence database has undergone an expansion in data coverage, annotation content and the development of new services for the scientific community. In addition to nucleotide sequences, data from the major protein sequence and structural databases, and from U.S. and European patents is now included in an integrated system. MEDLINE abstracts from published articles describing the sequences provide an important new source of biological annotation for sequence entries. In addition to the continued support of existing services, new CD-ROM and network-based systems have been implemented for literature retrieval and sequence similarity searching. Major releases of GenBank are now more frequent and the data are distributed in several new forms for both end users and software developers.
DOI: 10.1093/nar/25.9.1665
1997
Cited 231 times
Extracting protein alignment models from the sequence database
Biologists often gain structural and functional insights into a protein sequence by constructing a multiple alignment model of the family. Here a program called Probe fully automates this process of model construction starting from a single sequence. Central to this program is a powerful new method to locate and align only those, often subtly, conserved patterns essential to the family as a whole. When applied to randomly chosen proteins, Probe found on average about four times as many relationships as a pairwise search and yielded many new discoveries. These include: an obscure subfamily of globins in the roundworm Caenorhabditis elegans; two new superfamilies of metallohydrolases; a lipoyl/biotin swinging arm domain in bacterial membrane fusion proteins; and a DH domain in the yeast Bud3 and Fus2 proteins. By identifying distant relationships and merging families into superfamilies in this way, this analysis further confirms the notion that proteins evolved from relatively few ancient sequences. Moreover, this method automatically generates models of these ancient conserved regions for rapid and sensitive screening of sequences.
DOI: 10.1186/1745-6150-1-34
2006
Cited 184 times
Long intervals of stasis punctuated by bursts of positive selection in the seasonal evolution of influenza A virus
The interpandemic evolution of the influenza A virus hemagglutinin (HA) protein is commonly considered a paragon of rapid evolutionary change under positive selection in which amino acid replacements are fixed by virtue of their effect on antigenicity, enabling the virus to evade immune surveillance. We performed phylogenetic analyses of the recently obtained large and relatively unbiased samples of the HA sequences from 1995-2005 isolates of the H3N2 and H1N1 subtypes of influenza A virus. Unexpectedly, it was found that the evolution of H3N2 HA includes long intervals of generally neutral sequence evolution without apparent substantial antigenic change ("stasis" periods) that are characterized by an excess of synonymous over nonsynonymous substitutions per site, lack of association of amino acid replacements with epitope regions, and slow extinction of coexisting virus lineages. These long periods of stasis are punctuated by shorter intervals of rapid evolution under positive selection during which new dominant lineages quickly displace previously coexisting ones. The preponderance of positive selection during intervals of rapid evolution is supported by the dramatic excess of amino acid replacements in the epitope regions of HA compared to replacements in the rest of the HA molecule. In contrast, the stasis intervals showed a much more uniform distribution of replacements over the HA molecule, with a statistically significant difference in the rate of synonymous over nonsynonymous substitution in the epitope regions between the two modes of evolution. A number of parallel amino acid replacements - the same amino acid substitution occurring independently in different lineages - were also detected in H3N2 HA. These parallel mutations were, largely, associated with periods of rapid fitness change, indicating that there are major limitations on evolutionary pathways during antigenic change.
The finding that stasis is the prevailing modality of H3N2 evolution suggests that antigenic changes that lead to an increase in fitness typically result from epistatic interactions between several amino acid substitutions in the HA and, perhaps, other viral proteins. The strains that become dominant due to increased fitness emerge from low frequency strains thanks to the last amino acid replacement that completes the set of replacements required to produce a significant antigenic change; no subset of substitutions results in a biologically significant antigenic change and corresponding fitness increase. In contrast to H3N2, no clear intervals of evolution under positive selection were detected for the H1N1 HA during the same time span. Thus, the ascendancy of H1N1 in some seasons is, most likely, caused by the drop in the relative fitness of the previously prevailing H3N2 lineages as the fraction of susceptible hosts decreases during the stasis intervals. We show that the common view of the evolution of influenza virus as a rapid, positive selection-driven process is, at best, incomplete. Rather, the interpandemic evolution of influenza appears to consist of extended intervals of stasis, which are characterized by neutral sequence evolution, punctuated by shorter intervals of rapid fitness increase when evolutionary change is driven by positive selection. These observations have implications for influenza surveillance and vaccine formulation; in particular, the possibility exists that parallel amino acid replacements could serve as a predictor of new dominant strains. Reviewers: Ron Fouchier (nominated by Andrey Rzhetsky), David Krakauer, Christopher Lee.
DOI: 10.1371/journal.ppat.0020125
2006
Cited 182 times
Stochastic Processes Are Key Determinants of Short-Term Evolution in Influenza A Virus
Understanding the evolutionary dynamics of influenza A virus is central to its surveillance and control. While immune-driven antigenic drift is a key determinant of viral evolution across epidemic seasons, the evolutionary processes shaping influenza virus diversity within seasons are less clear. Here we show with a phylogenetic analysis of 413 complete genomes of human H3N2 influenza A viruses collected between 1997 and 2005 from New York State, United States, that genetic diversity is both abundant and largely generated through the seasonal importation of multiple divergent clades of the same subtype. These clades cocirculated within New York State, allowing frequent reassortment and generating genome-wide diversity. However, relatively low levels of positive selection and genetic diversity were observed at amino acid sites considered important in antigenic drift. These results indicate that adaptive evolution occurs only sporadically in influenza A virus; rather, the stochastic processes of viral migration and clade reassortment play a vital role in shaping short-term evolutionary dynamics. Thus, predicting future patterns of influenza virus evolution for vaccine strain selection is inherently complex and requires intensive surveillance, whole-genome sequencing, and phenotypic analysis.
DOI: 10.1126/science.8456298
1993
Cited 171 times
Ancient Conserved Regions in New Gene Sequences and the Protein Databases
Sets of new gene sequences from human, nematode, and yeast were compared with each other and with a set of Escherichia coli genes in order to detect ancient evolutionarily conserved regions (ACRs) in the encoded proteins. Nearly all of the ACRs so identified were found to be homologous to sequences in the protein databases. This suggests that currently known proteins may already include representatives of most ACRs and that new sequences not similar to any database sequence are unlikely to contain ACRs. Preliminary analyses indicate that moderately expressed genes may be more likely to contain ACRs than rarely expressed genes. It is estimated that there are fewer than 900 ACRs in all.
DOI: 10.1093/nar/22.17.3441
1994
Cited 171 times
GenBank
The GenBank sequence database continues to expand its data coverage, quality control, annotation content and retrieval services for the scientific community. Besides handling direct submissions of sequence data from authors, GenBank also incorporates DNA sequences from all available public sources; an integrated retrieval system, known as Entrez, also makes available data from the major protein sequence and structural databases, and from U.S. and European patents. MEDLINE abstracts from published articles describing the sequences are also included as an additional source of biological annotation for sequence entries. GenBank supports distribution of the data via FTP, CD-ROM, and E-mail servers. Network server-client programs provide access to an integrated database for literature retrieval and sequence similarity searching.
DOI: 10.1016/0022-2836(89)90234-9
1989
Cited 170 times
Weights for data related by a tree
How can one characterize a set of data collected from different biological species, or indeed any set of data related by an evolutionary tree? The structure imposed by the tree implies that the data are not independent, and for most applications this should be taken into account. We describe strategies for weighting the data that circumvent some of the problems of dependency.
DOI: 10.1093/nar/25.1.1
1997
Cited 164 times
GenBank
The GenBank sequence database incorporates DNA sequences from all available public sources, primarily through the direct submission of sequence data from authors and from large-scale sequencing projects. Data exchange with the EMBL Data Library and the DNA Data Bank of Japan helps ensure comprehensive coverage. GenBank continues to focus on quality control and annotation while expanding data coverage and retrieval services. An integrated retrieval system, known as Entrez, incorporates data from the major DNA and protein sequence databases, along with genome maps and protein structure information. MEDLINE abstracts from published articles describing the sequences are also included as an additional source of biological annotation. Sequence similarity searching is offered through the BLAST family of programs. All of NCBI's services are offered through the World Wide Web. In addition, there are specialized server/client versions as well as FTP and e-mail server access.
DOI: 10.7554/elife.28801
2017
Cited 80 times
Towards PubMed 2.0
Staff from the National Center for Biotechnology Information in the US describe recent improvements to the PubMed search engine and outline plans for the future, including a new experimental site called PubMed Labs.
DOI: 10.1186/1471-2148-2-20
2002
Cited 148 times
The relationship of protein conservation and sequence length.
In general, the length of a protein sequence is determined by its function and the wide variance in the lengths of an organism's proteins reflects the diversity of specific functional roles for these proteins. However, additional evolutionary forces that affect the length of a protein may be revealed by studying the length distributions of proteins evolving under weaker functional constraints. We performed sequence comparisons to distinguish highly conserved and poorly conserved proteins from the bacterium Escherichia coli, the archaeon Archaeoglobus fulgidus, and the eukaryotes Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. For all organisms studied, the conserved and nonconserved proteins have strikingly different length distributions. The conserved proteins are, on average, longer than the poorly conserved ones, and the length distributions for the poorly conserved proteins have a relatively narrow peak, in contrast to the conserved proteins whose lengths spread over a wider range of values. For the two prokaryotes studied, the poorly conserved proteins approximate the minimal length distribution expected for a diverse range of structural folds. There is a relationship between protein conservation and sequence length. For all the organisms studied, there seems to be a significant evolutionary trend favoring shorter proteins in the absence of other, more specific functional constraints.
DOI: 10.1038/442981a
2006
Cited 130 times
A global initiative on sharing avian flu data
The GISAID consortium, launched this week, aims to improve the prospects of avoiding or effectively coping with a flu pandemic in the wake of the spread of H5N1 avian influenza virus. The Global Initiative on Sharing Avian Influenza Data would foster international sharing of avian influenza isolates and data, by means of an organization similar to that developed for the successful HapMap project.
DOI: 10.1137/0149012
1989
Cited 128 times
Trees, Stars, and Multiple Biological Sequence Alignment
One important problem in biological sequence comparison is how to simultaneously align several nucleic acid or protein sequences. A multiple alignment avoids possible inconsistencies among several pairwise alignments and can elucidate relationships not evident from pairwise comparisons. The basic dynamic programming algorithm for optimal multiple sequence alignment requires too much time to be practical for more than three sequences of the length of an average protein. Recently, Carrillo and Lipman (SIAM J. Appl. Math., 48 (1988), pp. 1073–1082) have rendered feasible the optimal simultaneous alignment of as many as six sequences by showing that a consideration of minimal pairwise alignment costs can vastly decrease the number of cells a dynamic programming algorithm need consider. Their argument, however, requires the cost of a multiple alignment to be a weighted sum of the costs of its projected pairwise alignments. This paper presents an extension of Carrillo and Lipman's algorithm to the definition of multiple alignment cost as the cost of an evolutionary tree. An interesting generalization of the linear programming problem arises from the analysis.
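The "weighted sum of the costs of its projected pairwise alignments" mentioned in this abstract can be illustrated with a minimal sketch. The scoring scheme here (unit mismatch cost, a flat per-column gap cost) and the function names are assumptions chosen for illustration; they are not taken from the paper.

```python
from itertools import combinations

GAP = "-"


def pairwise_cost(row_a, row_b, mismatch=1, gap=2):
    """Cost of the pairwise alignment obtained by projecting two rows.

    The mismatch and gap costs are illustrative assumptions.
    """
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == GAP and b == GAP:
            continue  # a gap-gap column disappears in the projection
        if a == GAP or b == GAP:
            cost += gap
        elif a != b:
            cost += mismatch
    return cost


def sum_of_pairs_cost(rows, weights=None):
    """Weighted sum of projected pairwise costs for a multiple alignment.

    `weights` maps an index pair (i, j) with i < j to a weight; every pair
    gets weight 1 when it is omitted.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    total = 0
    for i, j in combinations(range(len(rows)), 2):
        w = 1 if weights is None else weights[(i, j)]
        total += w * pairwise_cost(rows[i], rows[j])
    return total


alignment = ["AC-GT", "ACAGT", "A--GT"]
unweighted = sum_of_pairs_cost(alignment)
weighted = sum_of_pairs_cost(alignment, {(0, 1): 1, (0, 2): 1, (1, 2): 0.5})
```

A cost of this form decomposes over pairs of sequences, which is the property the abstract says the pairwise-cost argument relies on; a tree-based cost does not decompose this way.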
DOI: 10.1137/0144038
1984
Cited 84 times
The Context Dependent Comparison of Biological Sequences
A general method for comparing two macromolecules is developed. The method differs from more traditional procedures in that matches are evaluated dependent on sequence context. We first define a context dependent similarity score between sequences and give a dynamic programming algorithm for its calculation. Conditions are then described which allow the conversion of the similarity score to a metric distance. The class of metrics obtained in this manner includes the Sellers metric. An advantage of the method is the ability to make very rapid comparisons and to align long sequences.
DOI: 10.1128/jvi.02257-12
2013
Cited 66 times
Sequential Seasonal H1N1 Influenza Virus Infections Protect Ferrets against Novel 2009 H1N1 Influenza Virus
Individuals >60 years of age had the lowest incidence of infection, with ~25% of these people having preexisting, cross-reactive antibodies to novel 2009 H1N1 influenza. Many people <60 years old also had preexisting antibodies to novel H1N1. These observations are puzzling because the seasonal H1N1 viruses circulating during the last 60 years were not antigenically similar to novel H1N1. We therefore hypothesized that a sequence of exposures to antigenically different seasonal H1N1 viruses can elicit an antibody response that protects against novel 2009 H1N1. Ferrets were preinfected with seasonal H1N1 viruses and assessed for cross-reactive antibodies to novel H1N1. Serum from infected ferrets was assayed for cross-reactivity to both seasonal and novel 2009 H1N1 strains. These results were compared to those of ferrets that were sequentially infected with H1N1 viruses isolated prior to 1957 or more-recently isolated viruses. Following seroconversion, ferrets were challenged with novel H1N1 influenza virus and assessed for viral titers in the nasal wash, morbidity, and mortality. There was no hemagglutination inhibition (HAI) cross-reactivity in ferrets infected with any single seasonal H1N1 influenza virus, with limited protection to challenge. However, sequential H1N1 influenza infections reduced the incidence of disease and elicited cross-reactive antibodies to novel H1N1 isolates. The amount and duration of virus shedding and the frequency of transmission following novel H1N1 challenge were reduced. Exposure to multiple seasonal H1N1 influenza viruses, and not to any single H1N1 influenza virus, elicits a breadth of antibodies that neutralize novel H1N1 even though the host was never exposed to the novel H1N1 influenza viruses.
DOI: 10.1038/nbt.4267
2018
Cited 53 times
How user intelligence is improving PubMed
DOI: 10.1093/nar/gkh313
2004
Cited 92 times
Comparative analysis of orthologous eukaryotic mRNAs: potential hidden functional signals
Sequencing of multiple, nearly complete eukaryotic genomes creates opportunities for detecting previously unnoticed, subtle functional signals in non‐coding regions. A genome‐wide comparative analysis of orthologous sets of mammalian and yeast mRNAs revealed distinct patterns of evolutionary conservation at the boundaries of the untranslated regions (UTRs) and the coding region (CDS). Elevated sequence conservation was detected in ∼30 nt regions around the start codon. There seems to be a complementary relationship between sequence conservation in the ∼30 nt regions of the 5′‐UTR immediately upstream of the start codon and that in the synonymous positions of the 5′‐terminal 30 nt of the CDS: in mammalian mRNAs, the 5′‐UTR shows a greater conservation than the CDS, whereas the opposite trend holds for yeast mRNAs. Unexpectedly, a ∼30 nt region downstream of the stop codon shows a substantially lower level of sequence conservation than the downstream portions of the 3′‐UTRs. However, the sequence in this poorly conserved 30 nt portion of the 3′‐UTR is non‐random in that it has a higher GC content than the rest of the UTR. It is hypothesized that the elevated sequence conservation in the region immediately upstream of the start codon is related to the requirement for initiation factor binding during pre‐initiation ribosomal scanning. In contrast, the poorly conserved region downstream of the stop codon could be involved in the post‐termination scanning and dissociation of the ribosomes from the mRNA, which requires only the mRNA–ribosome interaction. Additionally, it was found that the choice of the stop codon in mammals, but not in yeasts, and the context in the immediate vicinity of the stop codons in both mammals and yeasts are subject to strong selection.
Thus, genome‐wide analysis of orthologous gene sets allows detection of previously unrecognized patterns of sequence conservation, which are likely to reflect hidden functional signals, such as ribosomal filters that could regulate translation by modulating the interaction between the mRNA and ribosomes.
DOI: 10.1093/nar/12.1part1.215
1984
Cited 81 times
On the statistical significance of nucleic acid similarities
When evaluating sequence similarities among nucleic acids by the usual methods, statistical significance is often found when the biological significance of the similarity is dubious. We demonstrate that the known statistical properties of nucleic acid sequences strongly affect the statistical distribution of similarity values when calculated by standard procedures. We propose a series of models which account for some of these known statistical properties. The utility of the method is demonstrated in evaluating high relative similarity scores in four specific cases in which there is little biological context by which to judge the similarities. In two of the cases we identify the statistical properties which are responsible for the apparent similarity. In the other two cases the statistical significance of the similarity persists even when the known statistical properties of sequences are modelled. For one of these cases biological significance is likely while the other case remains an enigma.
DOI: 10.1016/0888-7543(90)90583-g
1990
Cited 80 times
The National Center for Biotechnology Information
The transcriptomic profile of a given organism is the set of RNA molecules expressed under certain conditions. It varies based on the biological process, stage of development, and environment. Different tools have been developed to measure gene transcription and expression in specific cellular states. Sequencing technologies vary from low to high throughput and allow whole-transcriptome analysis. RNA-seq coupled with bioinformatics approaches has been widely used in different studies to reveal the exact process of cellular transcription. RNA-seq data have also recently been used for modeling gene interactions through coexpression networks, a powerful method for making functional inferences about genes involved in a certain process. In this chapter, the evolution of RNA-seq studies, the methods used to isolate and sequence RNA, the programs necessary to assemble the data, and the essential steps for differential expression analysis are described, and the advantages and limitations of these methods are discussed. Herein, we report the applications of transcriptome studies for biotechnological purposes. For example, we describe the potential applications and methods of production of biomolecules from fungi and bacteria by genetic engineering and the development of treatments for human diseases, showing the importance of transcriptomic analyses in revealing the roles of gene expression control.
DOI: 10.1186/1745-6150-1-1
2006
Cited 78 times
A community experiment with fully open and published peer review.
We are pleased to announce a new open access journal, Biology Direct, which will be published online by BioMed Central. Biology Direct is launching with publications in the fields of Systems Biology, Computational Biology, and Evolutionary Biology, with an Immunology section to follow soon. Eventually, the journal will expand to cover other areas of biology. Launching a new research journal in biology in the year 2006 takes a lot of hubris...and/or a clearly defined goal. The crucial open access niche has been taken by the highly successful and still proliferating BMC and PLoS journals, so a new journal hardly would stand a chance and be worth the efforts of the editors and the publisher unless it defines itself in a fundamentally new way. Thus, our goals with this new journal, Biology Direct, are unapologetically ambitious: to establish a new, perhaps, better system of peer review and, in the process, bolster productive scientific debate, and provide scientists with useful guides to the literature.
DOI: 10.1016/0022-2836(83)90063-3
1983
Cited 64 times
Contextual constraints on synonymous codon choice
We have studied the statistical constraints on synonymous codon choice to evaluate various proposals regarding the origin of the bias in synonymous codon usage observed by Fiers et al. (1975), Air et al. (1976), Grantham et al. (1980) and others. We have determined the statistical dependence of the degenerate third base on either of its nearest neighbors in mitochondrial, prokaryotic, and eukaryotic coding sequences. We noted an increasing dependence of the third base on its nearest neighbors in moving from mitochondria to prokaryotes to eukaryotes. A statistical model assuming random equiprobable selection of synonymous codons was found grossly adequate for the mitochondria, but totally inadequate for prokaryotes and eukaryotes. A model assuming selection of synonymous codons reflecting a genomic strategy, i.e. the genome hypothesis of Grantham et al. (1980), gave a good approximation of the mitochondrial sequences. A statistical model which exactly maintains codon frequency, but allows the position of corresponding synonymous codons to vary was only grossly adequate for prokaryotes and totally inadequate for eukaryotes. The results of these simulations are consistent with the measures on experimental sequences and suggest that a “frequency constraint” model such as that of Grantham et al. (1980) may be an adequate explanation of the codon usage in mitochondria. However, in addition to this frequency constraint, there may be constraints on synonymous codon choice in prokaryotes due to codon context. Furthermore, any proposal to explain codon usage in eukaryotes must involve a constraint on the context of a codon in the sequence.
DOI: 10.1016/j.jfp.2022.100037
2023
Cited 6 times
Development and Single Laboratory Evaluation of a Refined and Specific Real-time PCR Detection Method, Using Mitochondrial Primers (Mit1C), for the Detection of Cyclospora cayetanensis in Produce
Regulatory methods for detection of the foodborne protozoan parasite Cyclospora cayetanensis must be specific and sensitive. To that end, we designed and evaluated (in a single laboratory validation) a novel and improved primer/probe combination (Mit1C) for real-time PCR detection of C. cayetanensis in produce. The newly developed primer/probe combination targets a conserved region of the mitochondrial genome of C. cayetanensis that varies in other closely related organisms. The primer/probe combination was evaluated both in silico and using several real-time PCR kits and polymerases against an inclusivity/exclusivity panel comprised of a variety of C. cayetanensis oocysts, as well as DNA from other related Cyclospora spp. and closely related parasites. The new primer/probe combination amplified only C. cayetanensis, thus demonstrating specificity. Sensitivity was evaluated by artificially contaminating cilantro, raspberries, and romaine lettuce with variable numbers (200 and 5) of C. cayetanensis oocysts. As few as 5 oocysts were detected in 75%, 67.7%, and 50% of the spiked produce samples (cilantro, raspberries, and romaine lettuce), respectively. All uninoculated samples and no-template real-time PCR controls were negative. The improved primer/probe combination should prove an effective analytical tool for the specific detection of C. cayetanensis in produce.
DOI: 10.1093/nar/25.18.3580
1997
Cited 79 times
Making (anti)sense of non-coding sequence conservation
A substantial fraction of vertebrate mRNAs contain long conserved blocks in their untranslated regions as well as long blocks without silent changes in their protein coding regions. These conserved blocks are largely comprised of unique sequence within the genome, leaving us with an important puzzle regarding their function. A large body of experimental data shows that these regions are associated with regulation of mRNA stability. Combining this information with the rapidly accumulating data on endogenous antisense transcripts, we propose that the conserved sequences form long perfect duplexes with antisense transcripts. The formation of such duplexes may be essential for recognition by post-transcriptional regulatory systems. The conservation may then be explained by selection against the dominant negative effect of allelic divergence.
2001
Cited 67 times
PubMed: bridging the information gap.
On-line literature searches of bibliographic databases such as PubMed (www.ncbi.nlm.nih.gov/entrez/query.fcgi) are now integral to the lives of clinicians. A huge amount of knowledge can be gleaned from even a basic PubMed search, while the use of advanced functions can add speed and focus. The
DOI: 10.1371/currents.rrn1001
2009
Cited 42 times
Evolutionary Dynamics of N-Glycosylation Sites of Influenza Virus Hemagglutinin
The hemagglutinin protein of influenza virus bears several sites of N-linked asparagine glycosylation. The number and location of these sites vary with strain and substrain. The human H3 hemagglutinin has gained several glycosylation sites on the antigenically important globular head since its introduction to humans, presumably due to selection. Although there is abundant evidence that glycosylation can affect antigenic and functional properties of the protein, direct evidence for selection is lacking. We have analyzed gain and loss of glycosylation sites on the side branches of a large phylogenetic tree of H3 HA1 sequences (branches off of the main, long-term line of descent). Side branches contrast with the main line of descent: losses of glycosylation sites are not uncommon, and they outnumber gains. Although other explanations are possible, this observation is consistent with weak selection for glycosylation sites or a more complicated pattern of selection. Furthermore, terminal and internal branches differ with respect to rates of gain and loss of glycosylation sites. This pattern would not be expected under selective neutrality, but is easily explained by weak selection or selection that changes with the immune state of the host population. Thus, it provides evidence that selection acts on the glycosylation state of hemagglutinin.
DOI: 10.3389/fmicb.2023.1212863
2023
Cited 3 times
Development of a targeted amplicon sequencing method for genotyping Cyclospora cayetanensis from fresh produce and clinical samples with enhanced genomic resolution and sensitivity
Outbreaks of cyclosporiasis, an enteric illness caused by the parasite Cyclospora cayetanensis, have been associated with consumption of various types of fresh produce. Although a method is in use for genotyping C. cayetanensis from clinical specimens, the very low abundance of C. cayetanensis in food and environmental samples presents a greater challenge. To complement epidemiological investigations, a molecular surveillance tool is needed for use in genetic linkage of food vehicles to cyclosporiasis illnesses, estimation of the scope of outbreaks or clusters of illness, and determination of geographical areas involved. We developed a targeted amplicon sequencing (TAS) assay that incorporates a further enrichment step to gain the requisite sensitivity for genotyping C. cayetanensis contaminating fresh produce samples. The TAS assay targets 52 loci, 49 of which are located in the nuclear genome, and encompasses 396 currently known SNP sites. The performance of the TAS assay was evaluated using lettuce, basil, cilantro, salad mix, and blackberries inoculated with C. cayetanensis oocysts. A minimum of 24 markers were haplotyped even at low contamination levels of 10 oocysts in 25 g leafy greens. The artificially contaminated fresh produce samples were included in a genetic distance analysis based on haplotype presence/absence with publicly available C. cayetanensis whole genome sequence assemblies. Oocysts from two different sources were used for inoculation, and samples receiving the same oocyst preparation clustered together, but separately from the other group, demonstrating the utility of the assay for genetically linking samples. Clinical fecal samples with low parasite loads were also successfully genotyped. This work represents a significant advance in the ability to genotype C. cayetanensis contaminating fresh produce along with greatly expanding the genomic diversity included for genetic clustering of clinical specimens.
2003
Cited 46 times
Bethesda Statement on Open Access Publishing
DOI: 10.1128/jvi.53.3.984-987.1985
1985
Cited 45 times
Nucleotide sequence analysis of the BALB/c murine sarcoma virus transforming gene
We determined the nucleotide sequence of the v-H-ras-related oncogene of BALB/c murine sarcoma virus. This oncogene contains an open reading frame of 189 amino acids that initiates and terminates entirely within the mouse cell-derived ras sequence. The protein encoded by this open reading frame matches the sequence predicted for the T24 human bladder carcinoma oncogene product, p21, in all but two positions. The presence of a lysine residue in position 12 of BALB/c murine sarcoma virus p21 likely accounts for its oncogenic properties.
DOI: 10.1007/bf02100090
1985
Cited 42 times
Interaction of silent and replacement changes in eukaryotic coding sequences
DOI: 10.1093/nar/gkp712
2009
Cited 38 times
Selection for minimization of translational frameshifting errors as a factor in the evolution of codon usage
In a wide range of genomes, it was observed that the usage of synonymous codons is biased toward specific codons and codon patterns. Factors that are implicated in the selection for codon usage include facilitation of fast and accurate translation. There are two types of translational errors: missense errors and processivity errors. There is considerable evidence in support of the hypothesis that codon usage is optimized to minimize missense errors. In contrast, little is known about the relationship between codon usage and frameshifting errors, an important form of processivity errors, which appear to occur at frequencies comparable to the frequencies of missense errors. Based on the recently proposed pause-and-slip model of frameshifting, we developed the Frameshifting Robustness Score (FRS). We used this measure to test if the pattern of codon usage indicates optimization against frameshifting errors. We found that the FRS values of protein-coding sequences from four analyzed genomes (the bacteria Bacillus subtilis and Escherichia coli, and the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe) were typically higher than expected by chance. Other properties of FRS patterns observed in B. subtilis, S. cerevisiae and S. pombe, such as the tendency of FRS to increase from the 5'- to 3'-end of protein-coding sequences, were also consistent with the hypothesis of optimization against frameshifting errors in translation. For E. coli, the results of different tests were less consistent, suggestive of a much weaker optimization, if any. Collectively, the results fit the concept of selection against mistranslation-induced protein misfolding being one of the factors shaping the evolution of both coding and non-coding sequences.
DOI: 10.1016/j.bspc.2023.105809
2024
Personalized nutrition and machine-learning: Exploring the scope of continuous glucose monitoring in healthy individuals in uncontrolled settings
Machine-learning models, when combined with continuous glucose monitoring (CGM), can help effectively analyze extensive datasets of glycemic responses to food. A total of 3,296 90-minute-long CGM meal responses from 927 healthy individuals were used to train/test a machine-learning model (XGBRegressor). The model inputs were the individual's anthropometric characteristics, macronutrients, and features related to the 24-hour CGM trace preceding the meal. The model output consisted of the parameters of a bell-shaped equation, used to analytically describe the glycemic response to food. To interpret the machine-learning model, the Shapley values method was employed. A multi-linear regression model was used to study the impact of food macronutrients on the magnitude of the glycemic response. The machine-learning model was able to predict the magnitude of the glycemic response with a root mean square error of 13.2 ± 9.5 mg/dL, and a correlation coefficient of r = 0.48 (p < 0.001) but suffered from a systematic bias (r = 0.83, p < 0.001). The Shapley values revealed that age brings a positive contribution to the magnitude of the glycemic response beyond 40 years. The multi-linear model (R² = 0.14, p < 0.001) highlighted the positive and the negative impact of the carbohydrates (β = 0.263, p < 0.001) and fat (β = −0.108, p < 0.001) on the magnitude of the glycemic response, respectively. The maximum attainable accuracy in predicting the glycemic response to food using this machine-learning model may be inherently limited by the uncontrolled nature of the adopted dataset. Nevertheless, this model holds significant educational value for CGM users, as it facilitates comprehension of the intricate relationships between individual characteristics, meal composition, and glucose levels.
DOI: 10.3390/s24030744
2024
Real World Interstitial Glucose Profiles of a Large Cohort of Physically Active Men and Women
The use of continuous glucose monitors (CGMs) in individuals living without diabetes is increasing. The purpose of this study was to profile various CGM metrics around nutritional intake, sleep and exercise in a large cohort of physically active men and women living without any known metabolic disease diagnosis to better understand the normative glycemic response to these common stimuli. A total of 12,504 physically active adults (age 40 ± 11 years, BMI 23.8 ± 3.6 kg/m²; 23% self-identified as women) wore a real-time CGM (Abbott Libre Sense Sport Glucose Biosensor, Abbott, USA) and used a smartphone application (Supersapiens Inc., Atlanta, GA, USA) to log meals, sleep and exercise activities. A total of >1 M exercise events and 274,344 meal events were analyzed. A majority of participants (85%) presented an overall (24 h) average glucose profile between 90 and 110 mg/dL, with the highest glucose levels associated with meals and exercise and the lowest glucose levels associated with sleep. Men had higher mean 24 h glucose levels than women (men: 100 ± 11 mg/dL, women: 96 ± 10 mg/dL). During exercise, the % time above 140 mg/dL was 10.3 ± 16.7%, while the % time below 70 mg/dL was 11.9 ± 11.6%, with the remaining % within the so-called glycemic tight target range (70–140 mg/dL). Average glycemia was also lower for females during exercise and sleep events (p < 0.001). Overall, we see small differences in glucose trends during activity and sleep in females as compared to males and higher levels of both TAR and TBR when these active individuals are undertaking or competing in endurance exercise training and/or competitive events.
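The time-above-range (TAR), time-below-range (TBR) and in-range percentages quoted in this abstract can be computed from a glucose trace roughly as follows. This is a minimal sketch that assumes evenly spaced readings, so that the share of readings stands in for the share of time; the function name and the sample values are hypothetical, and only the 70 and 140 mg/dL thresholds come from the abstract.

```python
def time_in_ranges(readings_mg_dl, low=70.0, high=140.0):
    """Return the percentage of readings below, within, and above [low, high].

    Assumes evenly spaced CGM readings (an assumption of this sketch).
    """
    if not readings_mg_dl:
        raise ValueError("at least one reading is required")
    n = len(readings_mg_dl)
    below = sum(1 for g in readings_mg_dl if g < low)
    above = sum(1 for g in readings_mg_dl if g > high)
    in_range = n - below - above
    return {
        "TBR": 100.0 * below / n,
        "TIR": 100.0 * in_range / n,
        "TAR": 100.0 * above / n,
    }


# Hypothetical readings in mg/dL, for illustration only.
sample = [65, 80, 95, 110, 150, 145, 120, 100, 68, 90]
summary = time_in_ranges(sample)
```

With real sensor data the sampling interval may be irregular, in which case each reading would need to be weighted by the time it covers.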
DOI: 10.1038/348493c0
1990
Cited 32 times
Equal animals
DOI: 10.1126/science.1131729
2006
Cited 28 times
Comment on "Large-Scale Sequence Analysis of Avian Influenza Isolates"
Obenauer et al. (Research Articles, 17 March 2006, p. 1576) reported that the influenza A virus PB1-F2 gene is evolving under strong positive selection, as documented by an extremely high ratio of the number of nonsynonymous nucleotide substitutions to the number of synonymous substitutions (dN/dS). However, we show that this observation is likely to be an artifact related to the location of PB1-F2 in the +1 reading frame of the PB1 gene.
DOI: 10.1093/gbe/evq010
2010
Cited 24 times
Relative Contributions of Intrinsic Structural–Functional Constraints and Translation Rate to the Evolution of Protein-Coding Genes
A long-standing assumption in evolutionary biology is that the evolution rate of protein-coding genes depends, largely, on specific constraints that affect the function of the given protein. However, recent research in evolutionary systems biology revealed unexpected, significant correlations between evolution rate and characteristics of genes or proteins that are not directly related to specific protein functions, such as expression level and protein-protein interactions. The strongest connections were consistently detected between protein sequence evolution rate and the expression level of the respective gene. A recent genome-wide proteomic study revealed an extremely strong correlation between the abundances of orthologous proteins in distantly related animals, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster. We used the extensive protein abundance data from this study along with short-term evolutionary rates (ERs) of orthologous genes in nematodes and flies to estimate the relative contributions of structural-functional constraints and the translation rate to the evolution rate of protein-coding genes. Together the intrinsic constraints and translation rate account for approximately 50% of the variance of the ERs. The contribution of constraints is estimated to be 3- to 5-fold greater than the contribution of translation rate.
DOI: 10.1371/currents.rrn1200
2010
Cited 14 times
Projection of seasonal influenza severity from sequence and serological data
Severity of seasonal influenza A epidemics is related to the antigenic novelty of the predominant viral strains circulating each year. Support for a strong correlation between epidemic severity and antigenic drift comes from infectious challenge experiments on vaccinated animals and human volunteers, field studies of vaccine efficacy, prospective studies of subjects with laboratory-confirmed prior infections, and analysis of the connection between drift and severity from surveillance data. We show that, given data on the antigenic and sequence novelty of the hemagglutinin protein of clinical isolates of H3N2 virus from a season along with the corresponding data from prior seasons, we can accurately predict the influenza severity for that season. This model therefore provides a framework for making projections of the severity of the upcoming season using assumptions based on viral isolates collected in the current season. Our results based on two independent data sets from the US and Hong Kong suggest that seasonal severity is largely determined by the novelty of the hemagglutinin protein although other factors, including mutations in other influenza genes, co-circulating pathogens and weather conditions, might also play a role. These results should be helpful for the control of seasonal influenza and have implications for improvement of influenza surveillance.
DOI: 10.1371/journal.pone.0039435
2012
Cited 11 times
Complex Patterns of Human Antisera Reactivity to Novel 2009 H1N1 and Historical H1N1 Influenza Strains
During the 2009 influenza pandemic, individuals over the age of 60 had the lowest incidence of infection with approximately 25% of these people having pre-existing, cross-reactive antibodies to novel 2009 H1N1 influenza isolates. It was proposed that older people had pre-existing antibodies induced by previous 1918-like virus infection(s) that cross-reacted to novel H1N1 strains. Using antisera collected from a cohort of individuals collected before the second wave of novel H1N1 infections, only a minority of individuals with 1918 influenza specific antibodies also demonstrated hemagglutination-inhibition activity against the novel H1N1 influenza. In this study, we examined human antisera collected from individuals that ranged between the ages of 1 month and 90 years to determine the profile of seropositive influenza immunity to viruses representing H1N1 antigenic eras over the past 100 years. Even though HAI titers to novel 2009 H1N1 and the 1918 H1N1 influenza viruses were positively associated, the association was far from perfect, particularly for the older and younger age groups. Therefore, there may be a complex set of immune responses that are retained in people infected with seasonal H1N1 that can contribute to the reduced rates of H1N1 influenza infection in older populations.
DOI: 10.1088/1538-3873/aafe86
2019
Cited 10 times
The Breakthrough Listen Search for Intelligent Life: Searching Boyajian's Star for Laser Line Emission
Boyajian's Star (KIC 8462852) has received significant attention due to its unusual periodic brightness fluctuations detected by the Kepler Spacecraft and subsequent ground based observations. Possible explanations for the dips in the photometric measurements include interstellar or circumstellar dust, and it has been speculated that an artificial megastructure could be responsible. We analyze 177 high-resolution spectra of Boyajian's Star in an effort to detect potential laser signals from extraterrestrial civilizations. The spectra were obtained by the Lick Observatory's Automated Planet Finder telescope as part of the Breakthrough Listen Project, and cover the wavelength range of visible light from 374 to 970 nm. We calculate that the APF would be capable of detecting lasers of power greater than approximately 24 MW at the distance of Boyajian's Star, d = 1470 ly. The top candidates from the analysis can all be explained as either cosmic ray hits, stellar emission lines or atmospheric air glow emission lines.
DOI: 10.1093/nar/gkg751
2003
Cited 19 times
Patterns in interspecies similarity correlate with nucleotide composition in mammalian 3'UTRs
Post-transcriptional regulation and the formation of mRNA 3' ends are crucial for gene expression in eukaryotes. Interspecies conservation of many sequences within 3'UTRs reveals selective constraint due to similar function. To study the pattern of conservation within 3'UTRs, we compiled and aligned 50 sets of complete orthologous 3'UTRs from four orders of mammals. We observed a mosaic pattern of conservation, with alternating regions of high (phylogenetic footprints) and low similarity. Conservation in 3'UTRs correlates with their base composition and also with the synonymous substitution rate in corresponding coding regions. The non-uniform distribution of conservation is more pronounced for 3'UTRs with a moderate or low level of overall conservation, where invariant nucleotides are more numerous, and their runs of lengths 4-7 occur more frequently than if conservation were random. Many runs of invariant nucleotides are AU-rich or pyrimidine-rich. Some of these runs coincide with known functional cis-elements of eukaryotic mRNAs, such as the U-rich upstream element, polyadenylation signal and DICE regulatory signal. More divergent regions of multiple alignments of 3'UTRs are often more G- and/or C-rich. Our results provide evidence on the importance of moderately conserved regions in 3'UTRs and suggest that regulatory functions of 3'UTRs might utilize gene-specific information in these regions.
DOI: 10.1016/0022-2836(79)90307-3
1979
Cited 16 times
Structural organization of Escherichia coli tRNATyr gene clusters in four different transducing bacteriophages
The structural organization of the genes coding for tyrosine-accepting transfer RNA in Escherichia coli has been studied by restriction enzyme analysis and DNA sequencing, utilizing four different specialized transducing bacteriophages. The non-defective transducing phage, φ80psuIII+,− (doublet), carries two 85-base-pair sequences corresponding to mature tRNATyr1, separated by a 200-base-pair “spacer”. A derivative of this phage, φ80psu+III (singlet), has lost one of the two 85-base-pair tRNATyr1 sequences and the 200-base-pair spacer. The sequences upstream and downstream of the tRNATyr1 mature structural sequence are identical in both phages. The gene coding for tRNATyr2 has been studied using two specialized transducing phages of completely different origin from each other, λh80dglyTsu+36 and λrifd18. A portion of the E. coli DNA carried by these two phages is identical: three tRNA genes are clustered in a region of approximately 350 base-pairs with the arrangement and orientation (relative to transcription) 5′ … tRNATyr2-tRNAGly2-tRNAThr3. The sequences corresponding to mature tRNAGly2 and tRNAThr3 are separated from each other by six base-pairs and from tRNATyr2 by approximately 100 base-pairs. Surprisingly, sequences upstream of the tRNATyr2 gene are completely different in the two transducing phages, although sequences including, and downstream of, the tRNATyr2 gene are the same. The region immediately upstream of the tRNATyr2 gene on λh80dglyTsu+36 is identical, or nearly identical, for at least 500 base-pairs to the corresponding region upstream of tRNATyr1 in the φ80psuIII+,− (doublet) and φ80psu+III (singlet) transducing phages. In contrast, the E. coli DNA in λrifd18 has an additional tRNA gene immediately upstream of the tRNATyr2 gene, and there is little or no homology with the upstream sequences of any of the other phages.
It thus appears that the regions surrounding and including the tRNATyr2 gene on λh80dglyTsu+36 are a hybrid of tRNATyr1 upstream and tRNATyr2-tRNAGly2-tRNAThr3 downstream sequences. All of the five tRNA genes studied are colinear with the known sequences for their respective mature tRNAs. This rules out the possibility of inserted sequences within those coding for the mature tRNAs, which are removed during processing to a mature tRNA. The implications of these tRNA gene structural studies for extrapolation back to the E. coli chromosome, as well as the relationships between gene structure and transcriptional analyses, are discussed.
DOI: 10.1093/nar/10.8.2723
1982
Cited 16 times
Comparative analysis of nucleic acid sequences by their general constraints
We describe two measures of a nucleic acid sequence, derived from Information Theory, which characterize the constraints toward nonuniform base composition, and the constraints on the ordering of the bases. These two measures distinguish extra-chromosomal coding sequences from all other coding sequences examined. The two measures separate eukaryotic coding sequences into two groups: those with introns and those without introns. We have also found a relationship between the general constraints of a subsequence and its degree of conservation in related genes.
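The abstract does not give formulas for the two measures, so the following is only a sketch of one common information-theoretic formulation, assumed here for illustration: a composition constraint D1 = log2(4) - H1, where H1 is the Shannon entropy of the base frequencies, and an ordering constraint D2 = H1 - H(next base | current base). The function names are hypothetical and the paper's own definitions may differ.

```python
import math
from collections import Counter

def base_entropy(seq):
    """Shannon entropy H1 (bits) of the single-base frequencies."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def composition_constraint(seq):
    """Assumed D1 = log2(4) - H1: departure from uniform base composition."""
    return 2.0 - base_entropy(seq)

def ordering_constraint(seq):
    """Assumed D2 = H1 - H(next | current): departure from independent ordering."""
    pairs = Counter(zip(seq, seq[1:]))   # counts of adjacent base pairs
    firsts = Counter(seq[:-1])           # counts of the leading base of each pair
    total = len(seq) - 1
    h_cond = -sum(
        (c / total) * math.log2(c / firsts[a]) for (a, _), c in pairs.items()
    )
    return base_entropy(seq) - h_cond

periodic = "ACGT" * 3   # uniform composition, fully determined ordering
d1 = composition_constraint(periodic)
d2 = ordering_constraint(periodic)
```

For the periodic toy sequence the composition is uniform, so D1 is zero, while each base fully determines its successor, so D2 reaches its maximum of 2 bits. A homopolymer such as "AAAA" shows the opposite pattern.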
DOI: 10.1080/17461391.2023.2233468
2023
Association between pre-exercise food ingestion timing and reactive hypoglycemia: Insights from a large database of continuous glucose monitoring data
Using a large database of continuous glucose monitoring (CGM) data, this study aimed to gain insights into the association between pre-exercise food ingestion timing and reactive hypoglycemia. A group of 6,761 users self-reported 48,799 pre-exercise food ingestion events and logged minute-by-minute CGM, which was used to detect reactive hypoglycemia (<70 mg/dL) in the first 30 min of exercise. Linear and non-linear binomial logistic regression models were used to investigate the association between food ingestion timing and the probability of experiencing reactive hypoglycemia. An analysis of variance was conducted to compare the predictive ability of the models. On average, reactive hypoglycemia was detected in 8.34 ± 3.04% of the total events, with <15% of individuals experiencing hypoglycemia in >20% of their events. The majority of the reactive hypoglycemia events were found with pre-exercise food timing between ∼30 and ∼90 min, with a peak at ∼60 min. The accuracy (62.05 vs 45.1%) and F-score (0.75 vs 0.59) of the non-linear model were significantly higher than those of the linear model (P < 0.0001). These results support the notion of an unfavourable 30-to-90 min pre-exercise food ingestion time window which can significantly impact the likelihood of reactive hypoglycemia in some individuals.
DOI: 10.5944/openpraxis.11.3.960
2019
Cited 6 times
Do tutors make a difference in online learning? A comparative study in two Open Online Courses
Two free fully online courses were offered by Peoples-uni on its Open Online Courses site, both as self-paced courses available any time and as courses run over four weeks with tutor-led discussions. We tested the hypothesis that there are no measurable differences in outcomes between the two delivery methods. Similar numbers attended both versions of each course; students came from multiple countries and backgrounds. Numbers of discussion forum posts were greater in tutor-led than self-paced courses. Measured outcomes of certificates of completion, quiz completion and marks gained were very similar and not statistically significantly different between the tutor-led and the self-paced versions of either course. In light of little discernible difference in outcome between self-paced learning and courses including tutor-led discussions, the utility of the time cost to tutors is in question. The findings may be relevant to others designing online courses, including MOOCs.
DOI: 10.1016/0167-2789(86)90066-7
1986
Cited 11 times
On the prediction of local patterns in cellular automata
The class of deterministic one-dimensional cellular automata studied recently by Wolfram is considered. We represent a state of an automaton as a probability distribution of patterns of a fixed size. In this way information is lost but it is possible to approximate the stepwise action of the automaton by the iteration of an analytic mapping of the set of probability distributions to itself. Such nonlinear analytic mappings generally have nontrivial attractors and in the most interesting cases (Wolfram Class III) these are single points. The point attractors under appropriate circumstances provide good approximations to the frequencies of local patterns generated by the discrete rules from which they were derived. Two appropriate settings for such approximation are transient patterns generated from random starts and patterns generated in a noisy environment. In the case with noise, improvement is found by correction of the analytic mappings for the effects of noise. Examples of both types of approximation are considered.
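The simplest instance of such an analytic mapping uses patterns of size 1, so the state is summarized by a single density p of cells in state 1 and neighbouring cells are treated as independent. The sketch below makes that assumption; the paper may use larger pattern sizes, and the function names and the choice of elementary rule 90 are for illustration only.

```python
def mean_field_step(rule, p):
    """One step of the size-1 (mean-field) map for an elementary CA rule number."""
    new_p = 0.0
    for neighborhood in range(8):
        if (rule >> neighborhood) & 1:  # the rule outputs 1 for this 3-cell pattern
            ones = bin(neighborhood).count("1")
            # Probability of the pattern if cells are independent with density p.
            new_p += p ** ones * (1 - p) ** (3 - ones)
    return new_p

def iterate_to_attractor(rule, p0, steps=200):
    """Iterate the mean-field map from density p0 for a fixed number of steps."""
    p = p0
    for _ in range(steps):
        p = mean_field_step(rule, p)
    return p

first_step = mean_field_step(90, 0.3)          # rule 90 reduces to 2p(1 - p)
density_rule90 = iterate_to_attractor(90, 0.3)
```

For rule 90 the map simplifies to p' = 2p(1 - p), whose attracting fixed point is p = 0.5. Comparing such a fixed point with densities measured in a direct simulation is one way to check how good the approximation is.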
DOI: 10.1093/gbe/evt024
2013
Cited 5 times
Why Does a Protein’s Evolutionary Rate Vary over Time?
The sequences of different proteins evolve at different rates. The relative evolutionary rate (ER) of a single protein also changes over evolutionary time. The cause of this ER fluctuation remains uncertain, and study of this phenomenon may shed light on protein evolution more broadly. We have characterized ER fluctuation in mammals and Drosophila. We found little correlation between the amount of rate variation observed for a protein and such factors as its expression level or phylogenetic distribution. Perhaps more surprisingly, we found little correlation between our measure of rate variation and ER itself. We also investigated the extent to which the ERs of different domains of a protein vary independently. We found that rates of different domains do tend to vary together. In fact, rates at positions in different domains are coupled just as strongly as rates at equally distant positions in the same domain. These findings provide clues to the protein evolutionary process.
DOI: 10.1186/1745-6150-8-11
2013
Cited 4 times
Biology Direct: celebrating 7 years of open, published peer review
Biology Direct, an online open access journal published by BioMed Central, is celebrating its 7th anniversary. Biology Direct started as an experiment, perhaps a daring one, on a new system of open peer review, under which the signed reviews and the author responses are published as an integral part of the final version of each article. The goals of the journal were set high: we strived to establish a new system of peer review that we hoped would avoid the all too obvious pitfalls of anonymous peer review. In addition, we expected that Biology Direct would generate productive scientific debate that would substantially add to the content of an article, in particular by alerting readers to potential problems with the reviewed work as well as additional relevant data and ideas [1, 2].
DOI: 10.1038/35071270
2001
Cited 9 times
PubMed Central decentralized
DOI: 10.1002/bip.360260106
1987
Cited 9 times
Local sequence patterns of hydrophobicity and solvent accessibility in soluble globular proteins
We examined the variation in the solvent accessibility and hydrophobicity of the amino acids along the sequences of 58 soluble globular proteins with known tertiary structure. We found that there is a significant tendency for the accessibilities to run in clusters along the sequence but that the hydrophobicities are distributed without such nonrandom clusters. These results suggest severe limitations on the power of sequence analysis tools that use average hydrophobicity scores of overlapping subsequences to predict accessibility.
DOI: 10.1371/journal.ppat.0020138
2006
Cited 5 times
Correction: Stochastic Processes Are Key Determinants of Short-Term Evolution in Influenza A Virus
In PLoS Pathogens, volume 2, issue 12: doi: 10.1371/journal.ppat.0020125. The spelling of author name Elodie Ghedi is incorrect; the correct name is Elodie Ghedin.
DOI: 10.1186/1471-2148-8-208
2008
Cited 3 times
Differences in evolutionary pressure acting within highly conserved ortholog groups
In highly conserved widely distributed ortholog groups, the main evolutionary force is assumed to be purifying selection that enforces sequence conservation, with most divergence occurring by accumulation of neutral substitutions. Using a set of ortholog groups from prokaryotes, with a single representative in each studied organism, we asked whether this evolutionary pressure acts similarly on different subgroups of orthologs defined as major lineages (e.g. Proteobacteria or Firmicutes). Using correlations in entropy measures as a proxy for evolutionary pressure, we observed two distinct behaviors within our ortholog collection. The first subset of ortholog groups, called here informational, consisted mostly of proteins associated with information processing (i.e. translation, transcription, DNA replication) and the second, the non-informational ortholog groups, mostly comprised proteins involved in metabolic pathways. The evolutionary pressure acting on non-informational proteins is more uniform relative to their informational counterparts. The non-informational proteins show a higher level of correlation between entropy profiles and more uniformity across subgroups. The low correlation of entropy profiles in the informational ortholog groups suggests that the evolutionary pressure acting on the informational ortholog groups is not uniform across the different clades considered in this study. This might suggest "fine-tuning" of informational proteins in each lineage leading to lineage-specific differences in selection. This, in turn, could make these proteins less exchangeable between lineages. In contrast, the uniformity of the selective pressure acting on the non-informational groups might allow the exchange of the genetic material via lateral gene transfer.
DOI: 10.1249/01.mss.0000987112.85936.98
2023
Greater Number Of Cardio Sessions, Protein For Breakfast, And Reducing Fatigue Can Minimize Glucose Variability
High variability in blood glucose is associated with earlier onset of disease in healthy adults without diabetes. More specifically, prospective studies demonstrate that higher glucose variability is associated with an increased risk of cardiovascular disease, Alzheimer’s disease, frailty, cardiovascular death, and cancer death compared to lower glucose variability. Similarly, other prospective research illustrates that these fluctuations induce endothelial dysfunction and may accelerate the development of atherosclerosis. Unfortunately, there are limited data on glucose concentrations in individuals without diabetes. PURPOSE: To correlate lifestyle variables (exercise, nutrition, sleep, emotions) with glucose variability. METHODS: Thirty-five healthy, active adults (8 women, mean age 47 ± 8 years) wore a continuous glucose monitor for two weeks, maintained their typical routines, and recorded the data. They also completed each planned exercise session with a heart rate chest transmitter. The study participants logged these training sessions (total time, intensity zones, perceived exertion) as well as daily meals (time of day, macronutrient grams), sleep (total time, subjective quality), and emotions (stress, motivation, fatigue). RESULTS: Daily glucose variability was significantly correlated (n = 532) with number of cardio sessions per day (ρ = -0.19, p < 0.0001), protein grams within the first meal of the day (ρ = -0.14, p < 0.0001), percent fat per day (ρ = -0.12, p < 0.0001), and subjective fatigue (ρ = 0.14, p < 0.0001). CONCLUSIONS: Our data demonstrate that there are multiple lifestyle factors that can minimize glucose variability and thereby potentially lower future disease risk. With respect to planned exercise, a greater number of independent cardio sessions is more impactful than a singular session for a longer duration. In terms of nutrition, greater protein grams at breakfast and a higher daily fat percentage lessen glucose variability.
And finally, reducing fatigue through lifestyle choices may diminish detrimental fluctuations.
DOI: 10.1038/493026a
2013
NIH funding: Agency rebuts critique
2012
Maybe I'll Pitch Forever: A Great Baseball Player Tells The Hilarious Story Behind The Legend
Not only was Satchel Paige an amazing athlete, he was one of the great American humorists in the tradition of Mark Twain, Will Rogers, and Yogi Berra. The most famous black player of his era shines through the pages of this remarkable autobiography. (John B. Holway). Lipman ... has preserved the flavor and cadence of Paige's conversation and writes his story honestly, avoiding neither the tragedies nor the escapades which mark his career. (Booklist). Satchel Paige was forty-two years old in 1948 when he became the first black pitcher in the American League. Although the oldest rookie around, he was already a legend. For twenty-two years, beginning in 1926, Paige dazzled throngs with his performance in the Negro Baseball Leagues. Then he outlasted everyone by playing professional baseball, in and out of the majors, until 1965. Struggle - against early poverty and racial discrimination - was part of Paige's story. So was fast living and a humorous point of view. His immortal advice was "Don't look back. Something might be gaining on you." That inimitable personality is recalled in an introduction by John B. Holway, the author of Voices from the Great Black Baseball Leagues (1992). David Lipman's afterword describes the last twenty years of Paige's life, including the proud moment in 1971 when he became one of the first three great players from the Negro Leagues to be inducted into the Baseball Hall of Fame.
DOI: 10.1038/nature28053
2001
PubMed Central decides to decentralize
DOI: 10.1186/gb-2000-1-3-comment2003
2000
A series of reports - and extracts of reports - from the Freedom of Information Conference, 6-7 July, 2000, New York Academy of Medicine. The conference was sponsored by BioMed Central, to promote debate about the communication and validation of biomedical research published on the internet. Details of the meeting and all presentations are available in full online at http://biomedcentral.com/info/conference.asp
DOI: 10.1371/journal/ppat.0020138
2006
Erratum: Stochastic processes are key determinants of short-term evolution in influenza A virus (PLoS pathogens 2, 12 DOI: 10.1371/journal.ppat.0020125)
DOI: 10.1093/nar/10.17.5375
1982
Hierarchical analysis of influenza A hemagglutinin gene sequences
Five recently sequenced hemagglutinin genes from Influenza A virus strains are studied for similarities in a hierarchical fashion. The sequences are compared for similarity, first on the level of sequence homology, and then on several progressively more general levels. Though the HA1 subsequences contain regions where homology drops to that of a Monte Carlo generated reference value, subsequent tests reveal great similarity due to constraints on the level of amino acid sequence. Other tests detect statistically significant differences between subtypes due to constraints acting below the level of amino acid sequence, such as the secondary structure of the viral RNA, or involving translation of the mRNA. The general applicability of the hierarchical approach to sequence analysis is discussed.
DOI: 10.1109/bibmw.2011.6112529
2011
An approach to phylogenomic analysis of bacterial pathogens
From the beginning of the microbial genome sequencing era, researchers have shown a commendable commitment to phylogenetic diversity. The completion of one genome from each prokaryotic division or phylum is still a frequently articulated community goal. However, largely because of the interest in human pathogens and advances in sequencing technologies, there are also now a number of very closely related genomes whose organization and gene content can be directly compared. Studying genetic variability of pathogenic bacteria using whole-genome sequencing provides a way to understand the mechanism of bacterial adaptation to rapid environmental changes and can be a source of useful information on virulence mechanisms. The bacterial genome datasets available in public archives represent a large collection of genomes at different levels of sequence quality and assembly. A fast and reliable method of phylogenetic classification based on genome sequences provides a necessary foundation for a more detailed comparative analysis. NCBI has developed an approach of grouping bacterial organisms into phylogenetic clades using a genome dissimilarity measure based on the comparison of universally conserved markers. Special adjustments have been made to compensate for data inaccuracy and incompleteness. Tests performed on complete and draft genomes from phylum Proteobacteria demonstrated that the proposed robust genomic distance allows stable and reliable species-level clustering and can be used for forming phylogenetic clades. Since the tradeoff for the increased robustness of the method is its limited sensitivity at a very fine level, a phylogenomic refinement could be done within each constructed clade when fine-level phylogenetic resolution of close genomes is necessary.
DOI: 10.1186/1471-8219-1-11
2000
PubMed Central: still on course to revolutionise biomedical publishing
DOI: 10.7916/d85b09q5
2010
Ten Years of PubMed Central
2016
Knowledge and Role Modelling Deficiencies in the Physical Activity Realm, Significant Intervention Required in Australian Medical Students; MEDx Update
DOI: 10.1186/s13062-015-0038-9
2015
Reviewer acknowledgement
The editors of Biology Direct would like to thank all the reviewers who have contributed to the journal in Volume 9 (2014).