
Jason R. Miller

Here are all the papers by Jason R. Miller that you can download and read on OA.mg.

DOI: 10.1101/gr.215087.116
2017
Cited 5,461 times
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
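The abstract above names MinHash as the basis of Canu's overlapping strategy. As a rough illustration of the underlying idea only, the following sketch estimates k-mer set similarity between two reads with plain, unweighted MinHash. It does not implement Canu's tf-idf weighting, and the function names, the choice of `blake2b` as the hash, and the default parameters are all assumptions made for this example.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_sketch(seq, k=5, num_hashes=64):
    """Plain (unweighted) MinHash: keep the smallest hash value per seed.

    Assumes len(seq) >= k; a shorter sequence has no k-mers and raises ValueError.
    """
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # one salted hash function per seed
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for km in kmers(seq, k)
        ))
    return sketch

def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)

read_a = "ACGTACGTTAGCCGATAGGCTA"
read_b = "ACGTACGTTAGCCGATAGGCTT"  # differs from read_a in the last base
similarity = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
```

Reads that share most of their k-mers produce sketches that agree in most positions, so candidate overlaps can be screened by comparing short sketches rather than full sequences.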
DOI: 10.1371/journal.pone.0046688
2012
Cited 2,499 times
Predicting the Functional Effect of Amino Acid Substitutions and Indels
As next-generation sequencing projects generate massive genome-wide sequence variation data, bioinformatics tools are being developed to provide computational predictions on the functional effects of sequence variations and narrow down the search of causal variants for disease phenotypes. Different classes of sequence variations at the nucleotide level are involved in human diseases, including substitutions, insertions, deletions, frameshifts, and non-sense mutations. Frameshifts and non-sense mutations are likely to cause a negative effect on protein function. Existing prediction tools primarily focus on studying the deleterious effects of single amino acid substitutions through examining amino acid conservation at the position of interest among related sequences, an approach that is not directly applicable to insertions or deletions. Here, we introduce a versatile alignment-based score as a new metric to predict the damaging effects of variations not limited to single amino acid substitutions but also in-frame insertions, deletions, and multiple amino acid substitutions. This alignment-based score measures the change in sequence similarity of a query sequence to a protein sequence homolog before and after the introduction of an amino acid variation to the query sequence. Our results showed that the scoring scheme performs well in separating disease-associated variants (n = 21,662) from common polymorphisms (n = 37,022) for UniProt human protein variations, and also in separating deleterious variants (n = 15,179) from neutral variants (n = 17,891) for UniProt non-human protein variations. In our approach, the area under the receiver operating characteristic curve (AUC) for the human and non-human protein variation datasets is ∼0.85. We also observed that the alignment-based score correlates with the deleteriousness of a sequence variation.
In summary, we have developed a new algorithm, PROVEAN (Protein Variation Effect Analyzer), which provides a generalized approach to predict the functional effects of protein sequence variations including single or multiple amino acid substitutions, and in-frame insertions and deletions. The PROVEAN tool is available online at http://provean.jcvi.org.
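The abstract describes the score as the change in similarity between a query and a homolog before and after a variation is introduced. The sketch below illustrates that before/after comparison with a simple global alignment score. It is not the PROVEAN implementation: the match, mismatch, and gap values are illustrative assumptions, a single homolog is used, and the function names are hypothetical.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are placeholders chosen for this example.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def delta_score(query, variant, homolog):
    """Similarity to the homolog after the variation minus similarity before it."""
    return (global_alignment_score(variant, homolog)
            - global_alignment_score(query, homolog))

homolog = "MKTAYIAK"
query = "MKTAYIAK"
substitution = delta_score(query, "MKTWYIAK", homolog)  # A4W substitution
deletion = delta_score(query, "MKTYIAK", homolog)       # in-frame deletion of A4
```

Because the comparison works on whole aligned sequences rather than a single column, the same function handles substitutions, insertions, and deletions; a more negative value means the variant is less similar to the homolog than the original query was.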
DOI: 10.1126/science.1076181
2002
Cited 1,927 times
The Genome Sequence of the Malaria Mosquito Anopheles gambiae
Anopheles gambiae is the principal vector of malaria, a disease that afflicts more than 500 million people and causes more than 1 million deaths each year. Tenfold shotgun sequence coverage was obtained from the PEST strain of A. gambiae and assembled into scaffolds that span 278 million base pairs. A total of 91% of the genome was organized in 303 scaffolds; the largest scaffold was 23.1 million base pairs. There was substantial genetic variation within this strain, and the apparent existence of two haplotypes of approximately equal frequency ("dual haplotypes") in a substantial fraction of the genome likely reflects the outbred nature of the PEST strain. The sequence produced a conservative inference of more than 400,000 single-nucleotide polymorphisms that showed a markedly bimodal density distribution. Analysis of the genome sequence revealed strong evidence for about 14,000 protein-encoding transcripts. Prominent expansions in specific families of proteins likely involved in cell adhesion and immunity were noted. An expressed sequence tag analysis of genes regulated by blood feeding provided insights into the physiological adaptations of a hematophagous insect.
DOI: 10.1126/science.1138878
2007
Cited 1,060 times
Genome Sequence of Aedes aegypti, a Major Arbovirus Vector
We present a draft sequence of the genome of Aedes aegypti, the primary vector for yellow fever and dengue fever, which at ∼1376 million base pairs is about 5 times the size of the genome of the malaria vector Anopheles gambiae. Nearly 50% of the Ae. aegypti genome consists of transposable elements. These contribute to a factor of ∼4 to 6 increase in average gene length and in sizes of intergenic regions relative to An. gambiae and Drosophila melanogaster. Nonetheless, chromosomal synteny is generally maintained among all three insects, although conservation of orthologous gene order is higher (by a factor of ∼2) between the mosquito species than between either of them and the fruit fly. An increase in genes encoding odorant binding, cytochrome P450, and cuticle domains relative to An. gambiae suggests that members of these protein families underpin some of the biological differences between the two mosquito species.
DOI: 10.1016/j.ygeno.2010.03.001
2010
Cited 1,018 times
Assembly algorithms for next-generation sequencing data
The emergence of next-generation sequencing platforms led to a resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.
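For readers unfamiliar with the de Bruijn graph approach mentioned in this abstract, here is a minimal sketch of the core construction: each k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is a toy illustration, not the method of any package listed above; it ignores sequencing errors, reverse complements, and coverage, and the function names are hypothetical.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unbranched(graph, start):
    """Spell a contig by following out-edges while exactly one successor exists.

    Only out-degree is checked, which is enough for this toy example.
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
contig = walk_unbranched(graph, "AC")
```

The two overlapping reads are never aligned to each other directly; their shared k-mer `CGT` collapses onto one edge, which is the key contrast with the overlap/layout/consensus approach, where pairwise read overlaps are computed explicitly.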
DOI: 10.1038/nature17164
2016
Cited 907 times
The Atlantic salmon genome provides insights into rediploidization
The whole-genome duplication 80 million years ago of the common ancestor of salmonids (salmonid-specific fourth vertebrate whole-genome duplication, Ss4R) provides unique opportunities to learn about the evolutionary fate of a duplicated vertebrate genome in 70 extant lineages. Here we present a high-quality genome assembly for Atlantic salmon (Salmo salar), and show that large genomic reorganizations, coinciding with bursts of transposon-mediated repeat expansions, were crucial for the post-Ss4R rediploidization process. Comparisons of duplicate gene expression patterns across a wide range of tissues with orthologous genes from a pre-Ss4R outgroup unexpectedly demonstrate far more instances of neofunctionalization than subfunctionalization. Surprisingly, we find that genes that were retained as duplicates after the teleost-specific whole-genome duplication 320 million years ago were not more likely to be retained after the Ss4R, and that the duplicate retention was not influenced to a great extent by the nature of the predicted protein interactions of the gene products. Finally, we demonstrate that the Atlantic salmon assembly can serve as a reference sequence for the study of other salmonids for a range of purposes.
DOI: 10.1126/science.1183605
2010
Cited 597 times
A Catalog of Reference Genomes from the Human Microbiome
News from the Inner Tube of Life: A major initiative by the U.S. National Institutes of Health to sequence 900 genomes of microorganisms that live on the surfaces and orifices of the human body has established standardized protocols and methods for such large-scale reference sequencing. By combining previously accumulated data with new data, Nelson et al. (p. 994) present an initial analysis of 178 bacterial genomes. The sampling so far barely scratches the surface of the microbial diversity found on humans, but the work provides an important baseline for future analyses.
DOI: 10.1093/bioinformatics/btn548
2008
Cited 535 times
Aggressive assembly of pyrosequencing reads with mates
DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data. Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data. The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License.
DOI: 10.1038/nature11128
2012
Cited 474 times
The bonobo genome compared with the chimpanzee and human genomes
Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours, and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.
DOI: 10.1038/ncomms10507
2016
Cited 411 times
Genomic insights into the Ixodes scapularis tick vector of Lyme disease
Ticks transmit more pathogens to humans and animals than any other arthropod. We describe the 2.1 Gbp nuclear genome of the tick, Ixodes scapularis (Say), which vectors pathogens that cause Lyme disease, human granulocytic anaplasmosis, babesiosis and other diseases. The large genome reflects accumulation of repetitive DNA, new lineages of retro-transposons, and gene architecture patterns resembling ancient metazoans rather than pancrustaceans. Annotation of scaffolds representing ∼57% of the genome reveals 20,486 protein-coding genes and expansions of gene families associated with tick–host interactions. We report insights from genome analyses into parasitic processes unique to ticks, including host ‘questing’, prolonged feeding, cuticle synthesis, blood meal concentration, novel methods of haemoglobin digestion, haem detoxification, vitellogenesis and prolonged off-host survival. We identify proteins associated with the agent of human granulocytic anaplasmosis, an emerging disease, and the encephalitis-causing Langat virus, and a population structure correlated to life-history traits and transmission of the Lyme disease agent.
DOI: 10.1126/science.1069193
2002
Cited 356 times
A Comparison of Whole-Genome Shotgun-Derived Mouse Chromosome 16 and the Human Genome
The high degree of similarity between the mouse and human genomes is demonstrated through analysis of the sequence of mouse chromosome 16 (Mmu 16), which was obtained as part of a whole-genome shotgun assembly of the mouse genome. The mouse genome is about 10% smaller than the human genome, owing to a lower repetitive DNA content. Comparison of the structure and protein-coding potential of Mmu 16 with that of the homologous segments of the human genome identifies regions of conserved synteny with human chromosomes (Hsa) 3, 8, 12, 16, 21, and 22. Gene content and order are highly conserved between Mmu 16 and the syntenic blocks of the human genome. Of the 731 predicted genes on Mmu 16, 509 align with orthologs on the corresponding portions of the human genome, 44 are likely paralogous to these genes, and 164 genes have homologs elsewhere in the human genome; there are 14 genes for which we could find no human counterpart.
DOI: 10.1105/tpc.17.00073
2017
Cited 297 times
ePlant: Visualizing and Exploring Multiple Levels of Data for Hypothesis Generation in Plant Biology
A big challenge in current systems biology research arises when different types of data must be accessed from separate sources and visualized using separate tools. The high cognitive load required to navigate such a workflow is detrimental to hypothesis generation. Accordingly, there is a need for a robust research platform that incorporates all data and provides integrated search, analysis, and visualization features through a single portal. Here, we present ePlant (http://bar.utoronto.ca/eplant), a visual analytic tool for exploring multiple levels of Arabidopsis thaliana data through a zoomable user interface. ePlant connects to several publicly available web services to download genome, proteome, interactome, transcriptome, and 3D molecular structure data for one or more genes or gene products of interest. Data are displayed with a set of visualization tools that are presented using a conceptual hierarchy from big to small, and many of the tools combine information from more than one data type. We describe the development of ePlant in this article and present several examples illustrating its integrative features for hypothesis generation. We also describe the process of deploying ePlant as an “app” on Araport. Building on readily available web services, the code for ePlant is freely available for any other biological species research.
DOI: 10.1186/s12915-017-0402-6
2017
Cited 247 times
Genomic innovations, transcriptional plasticity and gene loss underlying the evolution and divergence of two highly polyphagous and invasive Helicoverpa pest species
Helicoverpa armigera and Helicoverpa zea are major caterpillar pests of Old and New World agriculture, respectively. Both, particularly H. armigera, are extremely polyphagous, and H. armigera has developed resistance to many insecticides. Here we use comparative genomics, transcriptomics and resequencing to elucidate the genetic basis for their properties as pests. We find that, prior to their divergence about 1.5 Mya, the H. armigera/H. zea lineage had accumulated up to more than 100 more members of specific detoxification and digestion gene families and more than 100 extra gustatory receptor genes, compared to other lepidopterans with narrower host ranges. The two genomes remain very similar in gene content and order, but H. armigera is more polymorphic overall, and H. zea has lost several detoxification genes, as well as about 50 gustatory receptor genes. It also lacks certain genes and alleles conferring insecticide resistance found in H. armigera. Non-synonymous sites in the expanded gene families above are rapidly diverging, both between paralogues and between orthologues in the two species. Whole genome transcriptomic analyses of H. armigera larvae show widely divergent responses to different host plants, including responses among many of the duplicated detoxification and digestion genes. The extreme polyphagy of the two heliothines is associated with extensive amplification and neofunctionalisation of genes involved in host finding and use, coupled with versatile transcriptional responses on different hosts. H. armigera's invasion of the Americas in recent years means that hybridisation could generate populations that are both locally adapted and insecticide resistant.
DOI: 10.1093/nar/gku1200
2014
Cited 178 times
Araport: the Arabidopsis Information Portal
The Arabidopsis Information Portal (https://www.araport.org) is a new online resource for plant biology research. It houses the Arabidopsis thaliana genome sequence and associated annotation. It was conceived as a framework that allows the research community to develop and release 'modules' that integrate, analyze and visualize Arabidopsis data that may reside at remote sites. The current implementation provides an indexed database of core genomic information. These data are made available through feature-rich web applications that provide search, data mining, and genome browser functionality, and also by bulk download and web services. Araport uses software from the InterMine and JBrowse projects to expose curated data from TAIR, GO, BAR, EBI, UniProt, PubMed and EPIC CoGe. The site also hosts 'science apps,' developed as prototypes for community modules that use dynamic web pages to present data obtained on-demand from third-party servers via RESTful web services. Designed for sustainability, the Arabidopsis Information Portal strategy exploits existing scientific computing infrastructure, adopts a practical mixture of data integration technologies and encourages collaborative enhancement of the resource by its user community.
DOI: 10.1186/s12864-016-3448-x
2017
Cited 152 times
An improved genome assembly uncovers prolific tandem repeats in Atlantic cod
The first Atlantic cod (Gadus morhua) genome assembly published in 2011 was one of the early genome assemblies exclusively based on high-throughput 454 pyrosequencing. Since then, rapid advances in sequencing technologies have led to a multitude of assemblies generated for complex genomes, although many of these are of a fragmented nature with a significant fraction of bases in gaps. The development of long-read sequencing and improved software now enables the generation of more contiguous genome assemblies. By combining data from Illumina, 454 and the longer PacBio sequencing technologies, as well as integrating the results of multiple assembly programs, we have created a substantially improved version of the Atlantic cod genome assembly. The sequence contiguity of this assembly is increased fifty-fold and the proportion of gap-bases has been reduced fifteen-fold. Compared to other vertebrates, the assembly contains an unusually high density of tandem repeats (TRs). Indeed, retrospective analyses reveal that gaps in the first genome assembly were largely associated with these TRs. We show that 21% of the TRs across the assembly, 19% in the promoter regions and 12% in the coding sequences are heterozygous in the sequenced individual. The inclusion of PacBio reads combined with the use of multiple assembly programs drastically improved the Atlantic cod genome assembly by successfully resolving long TRs. The high frequency of heterozygous TRs within or in the vicinity of genes in the genome indicates a considerable standing genomic variation in Atlantic cod populations, which is likely of evolutionary importance.
DOI: 10.1073/pnas.1102838108
2011
Cited 193 times
Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil)
The Tasmanian devil (Sarcophilus harrisii) is threatened with extinction because of a contagious cancer known as Devil Facial Tumor Disease. The inability to mount an immune response and to reject these tumors might be caused by a lack of genetic diversity within a dwindling population. Here we report a whole-genome analysis of two animals originating from extreme northwest and southeast Tasmania, the maximal geographic spread, together with the genome from a tumor taken from one of them. A 3.3-Gb de novo assembly of the sequence data from two complementary next-generation sequencing platforms was used to identify 1 million polymorphic genomic positions, roughly one-quarter of the number observed between two genetically distant human genomes. Analysis of 14 complete mitochondrial genomes from current and museum specimens, as well as mitochondrial and nuclear SNP markers in 175 animals, suggests that the observed low genetic diversity in today's population preceded the Devil Facial Tumor Disease outbreak by at least 100 y. Using a genetically characterized breeding stock based on the genome sequence will enable preservation of the extant genetic diversity in future Tasmanian devil populations.
DOI: 10.1073/pnas.0307971100
2004
Cited 174 times
Whole-genome shotgun assembly and comparison of human genome assemblies
We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860-921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.
DOI: 10.1101/071282
2016
Cited 131 times
Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation
Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, and achieves a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
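The contig NG50 quoted in this abstract is a standard contiguity statistic: the length of the contig at which the cumulative length of contigs, taken from longest to shortest, first reaches half of the estimated genome size. A minimal sketch of that calculation follows; the function name and the example lengths are made up for illustration and are not taken from the paper.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total reaches half the genome size.

    Returns None when the contigs together cover less than half the genome.
    """
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total * 2 >= genome_size:
            return length
    return None

lengths = [50, 30, 20, 10]                  # made-up contig lengths
ng50_value = ng50(lengths, genome_size=200)
n50_value = ng50(lengths, genome_size=sum(lengths))  # N50 uses the assembly size
```

Using the genome size rather than the assembly size as the denominator keeps the statistic comparable between assemblies of the same genome that differ in total length.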
DOI: 10.1186/s12864-017-3654-1
2017
Cited 86 times
Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes
Previous studies exploring sequence variation in the model legume, Medicago truncatula, relied on mapping short reads to a single reference. However, read-mapping approaches are inadequate to examine large, diverse gene families or to probe variation in repeat-rich or highly divergent genome regions. De novo sequencing and assembly of M. truncatula genomes enables near-comprehensive discovery of structural variants (SVs), analysis of rapidly evolving gene families, and ultimately, construction of a pan-genome. Genome-wide synteny based on 15 de novo M. truncatula assemblies effectively detected different types of SVs indicating that as much as 22% of the genome is involved in large structural changes, altogether affecting 28% of gene models. A total of 63 million base pairs (Mbp) of novel sequence was discovered, expanding the reference genome space for Medicago by 16%. Pan-genome analysis revealed that 42% (180 Mbp) of genomic sequences is missing in one or more accession, while examination of de novo annotated genes identified 67% (50,700) of all ortholog groups as dispensable - estimates comparable to recent studies in rice, maize and soybean. Rapidly evolving gene families typically associated with biotic interactions and stress response were found to be enriched in the accession-specific gene pool. The nucleotide-binding site leucine-rich repeat (NBS-LRR) family, in particular, harbors the highest level of nucleotide diversity, large effect single nucleotide change, protein diversity, and presence/absence variation. However, the leucine-rich repeat (LRR) and heat shock gene families are disproportionately affected by large effect single nucleotide changes and even higher levels of copy number variation. Analysis of multiple M. truncatula genomes illustrates the value of de novo assemblies to discover and describe structural variation, something that is often under-estimated when using read-mapping approaches.
Comparisons among the de novo assemblies also indicate that different large gene families differ in the architecture of their structural variation.
DOI: 10.1093/bioinformatics/btn074
2008
Cited 109 times
Consensus generation and variant detection by Celera Assembler
Motivation: We present an algorithm to identify allelic variation given a Whole Genome Shotgun (WGS) assembly of haploid sequences, and to produce a set of haploid consensus sequences rather than a single consensus sequence. Existing WGS assemblers take a column-by-column approach to consensus generation, and produce a single consensus sequence which can be inconsistent with the underlying haploid alleles, and inconsistent with any of the aligned sequence reads. Our new algorithm uses a dynamic windowing approach. It detects alleles by simultaneously processing the portions of aligned reads spanning a region of sequence variation, assigns reads to their respective alleles, phases adjacent variant alleles and generates a consensus sequence corresponding to each confirmed allele. This algorithm was used to produce the first diploid genome sequence of an individual human. It can also be applied to assemblies of multiple diploid individuals and hybrid assemblies of multiple haploid organisms. Results: Being applied to the individual human genome assembly, the new algorithm detects exactly two confirmed alleles and reports two consensus sequences in 98.98% of the total number 2 033 311 detected regions of sequence variation. In 33 269 out of 460 373 detected regions of size >1 bp, it fixes the constructed errors of a mosaic haploid representation of a diploid locus as produced by the original Celera Assembler consensus algorithm. Using an optimized procedure calibrated against 1 506 344 known SNPs, it detects 438 814 new heterozygous SNPs with false positive rate 12%. Availability: The open source code is available at: http://wgs-assembler.cvs.sourceforge.net/wgs-assembler/ Contact: gdenisov@jcvi.org
DOI: 10.1371/journal.pntd.0000716
2010
Cited 95 times
New Assembly, Reannotation and Analysis of the Entamoeba histolytica Genome Reveal New Genomic Features and Protein Content Information
Background: In order to maintain genome information accurately and relevantly, original genome annotations need to be updated and evaluated regularly. Manual reannotation of genomes is important as it can significantly reduce the propagation of errors and consequently diminishes the time spent on mistaken research. For this reason, after five years from the initial submission of the Entamoeba histolytica draft genome publication, we have re-examined the original 23 Mb assembly and the annotation of the predicted genes. Principal Findings: The evaluation of the genomic sequence led to the identification of more than one hundred artifactual tandem duplications that were eliminated by re-assembling the genome. The reannotation was done using a combination of manual and automated genome analysis. The new 20 Mb assembly contains 1,496 scaffolds and 8,201 predicted genes, of which 60% are identical to the initial annotation and the remaining 40% underwent structural changes. Functional classification of 60% of the genes was modified based on recent sequence comparisons and new experimental data. We have assigned putative function to 3,788 proteins (46% of the predicted proteome) based on the annotation of predicted gene families, and have identified 58 protein families of five or more members that share no homology with known proteins and thus could be entamoeba specific. Genome analysis also revealed new features such as the presence of segmental duplications of up to 16 kb flanked by inverted repeats, and the tight association of some gene families with transposable elements. Significance: This new genome annotation and analysis represents a more refined and accurate blueprint of the pathogen genome, and provides an upgraded tool as reference for the study of many important aspects of E. histolytica biology, such as genome evolution and pathogenesis.
DOI: 10.1186/s12864-017-3971-4
2017
Cited 55 times
Strategies for optimizing BioNano and Dovetail explored through a second reference quality assembly for the legume model, Medicago truncatula
Third generation sequencing technologies, with sequencing reads in the tens of kilobases, facilitate genome assembly by spanning ambiguous regions and improving continuity. This has been critical for plant genomes, which are difficult to assemble due to high repeat content, gene family expansions, segmental and tandem duplications, and polyploidy. Recently, high-throughput mapping and scaffolding strategies have further improved continuity. Together, these long-range technologies enable quality draft assemblies of complex genomes in a cost-effective and timely manner. Here, we present high quality genome assemblies of the model legume plant, Medicago truncatula (R108) using PacBio, Dovetail Chicago (hereafter, Dovetail) and BioNano technologies. To test these technologies for plant genome assembly, we generated five assemblies using all possible combinations and ordering of these three technologies in the R108 assembly. While the BioNano and Dovetail joins overlapped, they also showed complementary gains in continuity and join numbers. Both technologies spanned repetitive regions that PacBio alone was unable to bridge. Combining technologies, particularly Dovetail followed by BioNano, resulted in notable improvements compared to Dovetail or BioNano alone. A combination of PacBio, Dovetail, and BioNano was used to generate a high quality draft assembly of R108, a M. truncatula accession widely used in studies of functional genomics. As a test for the usefulness of the resulting genome sequence, the new R108 assembly was used to pinpoint breakpoints and characterize flanking sequence of a previously identified translocation between chromosomes 4 and 8, identifying more than 22.7 Mb of novel sequence not present in the earlier A17 reference assembly. Adding Dovetail followed by BioNano data yielded complementary improvements in continuity over the original PacBio assembly. This strategy proved efficient and cost-effective for developing a quality draft assembly compared to traditional reference assemblies.
DOI: 10.1186/s12864-017-3927-8
2017
Cited 53 times
Hybrid assembly with long and short reads improves discovery of gene family expansions
Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation. We developed a hybrid assembly pipeline called "Alpaca" that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation. Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca generated the most agreement to a conspecific reference and predicted tandemly repeated genes absent from the other assemblies. Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.
DOI: 10.12688/f1000research.13635.1
2018
Cited 48 times
A draft genome sequence for the Ixodes scapularis cell line, ISE6
Background: The tick cell line ISE6, derived from Ixodes scapularis, is commonly used for amplification and detection of arboviruses in environmental or clinical samples. Methods: To assist with sequence-based assays, we sequenced the ISE6 genome with single-molecule, long-read technology. Results: The draft assembly appears near complete based on gene content analysis, though it appears to lack some instances of repeats in this highly repetitive genome. The assembly appears to have separated the haplotypes at many loci. DNA short read pairs, used for validation only, mapped to the cell line assembly at a higher rate than they mapped to the Ixodes scapularis reference genome sequence. Conclusions: The assembly could be useful for filtering host genome sequence from sequence data obtained from cells infected with pathogens.
DOI: 10.1093/gigascience/gix135
2018
Cited 44 times
Analysis of the Aedes albopictus C6/36 genome provides insight into cell line utility for viral propagation
The 50-year-old Aedes albopictus C6/36 cell line is a resource for the detection, amplification, and analysis of mosquito-borne viruses including Zika, dengue, and chikungunya. The cell line is derived from an unknown number of larvae from an unspecified strain of Aedes albopictus mosquitoes. Toward improved utility of the cell line for research in virus transmission, we present an annotated assembly of the C6/36 genome. The C6/36 genome assembly has the largest contig N50 (3.3 Mbp) of any mosquito assembly, presents the sequences of both haplotypes for most of the diploid genome, reveals independent null mutations in both alleles of the Dicer locus, and indicates a male-specific genome. Gene annotation was computed with publicly available mosquito transcript sequences. Gene expression data from cell line RNA sequence identified enrichment of growth-related pathways and conspicuous deficiency in aquaporins and inward rectifier K+ channels. As a test of utility, RNA sequence data from Zika-infected cells were mapped to the C6/36 genome and transcriptome assemblies. Host subtraction reduced the data set by 89%, enabling faster characterization of nonhost reads. The C6/36 genome sequence and annotation should enable additional uses of the cell line to study arbovirus vector interactions and interventions aimed at restricting the spread of human disease.
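The contig N50 quoted in this abstract is a standard continuity statistic: the length L such that contigs of length L or longer together hold at least half of the assembled bases. A minimal Python sketch of that definition follows; the function name and the contig lengths are made up for illustration and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in base pairs (illustration only).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

Because the statistic depends only on the sorted lengths, a few very long contigs can raise N50 sharply even when many short contigs remain.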
DOI: 10.1101/gr.2889405
2005
Cited 75 times
Gene and alternative splicing annotation with AIR
Designing effective and accurate tools for identifying the functional and structural elements in a genome remains at the frontier of genome annotation owing to incompleteness and inaccuracy of the data, limitations in the computational models, and shifting paradigms in genomics, such as alternative splicing. We present a methodology for the automated annotation of genes and their alternatively spliced mRNA transcripts based on existing cDNA and protein sequence evidence from the same species or projected from a related species using syntenic mapping information. At the core of the method is the splice graph, a compact representation of a gene, its exons, introns, and alternatively spliced isoforms. The putative transcripts are enumerated from the graph and assigned confidence scores based on the strength of sequence evidence, and a subset of the high-scoring candidates are selected and promoted into the annotation. The method is highly selective, eliminating the unlikely candidates while retaining 98% of the high-quality mRNA evidence in well-formed transcripts, and produces annotation that is measurably more accurate than some evidence-based gene sets. The process is fast, accurate, and fully automated, and combines the traditionally distinct gene annotation and alternative splicing detection processes in a comprehensive and systematic way, thus considerably aiding in the ensuing manual curation efforts.
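The splice-graph idea in this abstract (enumerate putative transcripts from the graph, score them by evidence, keep the high-scoring ones) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AIR implementation: the exon labels, the per-junction support counts, and the choice of "weakest junction" as the confidence score are all hypothetical.

```python
def enumerate_transcripts(graph, sources, sinks):
    """Yield every exon path from a source exon to a sink exon.

    graph maps exon -> {next_exon: support}; it is assumed to be acyclic.
    """
    def walk(node, path):
        path = path + [node]
        if node in sinks:
            yield path
        for nxt in sorted(graph.get(node, {})):
            yield from walk(nxt, path)

    for start in sorted(sources):
        yield from walk(start, [])


def score_transcript(graph, path):
    """Assumed scoring rule: the weakest junction support along the path."""
    if len(path) < 2:
        return 0
    return min(graph[a][b] for a, b in zip(path, path[1:]))


def select_transcripts(graph, sources, sinks, min_score):
    """Keep paths whose score reaches min_score, best first."""
    scored = [(p, score_transcript(graph, p))
              for p in enumerate_transcripts(graph, sources, sinks)]
    kept = [item for item in scored if item[1] >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical gene: exon E2 is sometimes skipped (E1 -> E3).
example_graph = {
    "E1": {"E2": 5, "E3": 2},  # junction support = number of supporting cDNAs
    "E2": {"E3": 4},
    "E3": {},
}
for path, score in select_transcripts(example_graph, {"E1"}, {"E3"}, min_score=1):
    print("-".join(path), score)
```

Scoring by the weakest junction is one simple choice; it penalizes an isoform that contains even a single poorly supported intron.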
DOI: 10.1093/plphys/kiac520
2022
Cited 11 times
Spatial and temporal regulation of parent-of-origin allelic expression in the endosperm
Genomic imprinting promotes differential expression of parental alleles in the endosperm of flowering plants and is regulated by epigenetic modification such as DNA methylation and histone tail modifications in chromatin. After fertilization, the endosperm develops through a syncytial stage before it cellularizes and becomes a nutrient source for the growing embryo. Regional compartmentalization has been shown both in early and late endosperm development, and different transcriptional domains suggest divergent spatial and temporal regional functions. The analysis of the role of parent-of-origin allelic expression in the endosperm as a whole and the investigation of domain-specific functions have been hampered by the inaccessibility of the tissue for high-throughput transcriptome analyses and contamination from surrounding tissue. Here, we used fluorescence-activated nuclear sorting (FANS) of nuclear targeted GFP fluorescent genetic markers to capture parental-specific allelic expression from different developmental stages and specific endosperm domains. This approach allowed us to successfully identify differential genomic imprinting with temporal and spatial resolution. We used a systematic approach to report temporal regulation of imprinted genes in the endosperm, as well as region-specific imprinting in endosperm domains. Analysis of our data identified loci that are spatially differentially imprinted in one domain of the endosperm, while biparentally expressed in other domains. These findings suggest that the regulation of genomic imprinting is dynamic and challenge the canonical mechanisms for genomic imprinting.
DOI: 10.1186/gb-2011-12-3-r31
2011
Cited 34 times
A vertebrate case study of the quality of assemblies derived from next-generation sequences
The unparalleled efficiency of next-generation sequencing (NGS) has prompted widespread adoption, but significant problems remain in the use of NGS data for whole genome assembly. We explore the advantages and disadvantages of chicken genome assemblies generated using a variety of sequencing and assembly methodologies. NGS assemblies are equivalent in some ways to a Sanger-based assembly yet deficient in others. Nonetheless, these assemblies are sufficient for the identification of the majority of genes and can reveal novel sequences when compared to existing assembly references.
DOI: 10.1093/pcp/pcw200
2016
Cited 30 times
ThaleMine: A Warehouse for Arabidopsis Data Integration and Discovery
ThaleMine (https://apps.araport.org/thalemine/) is a comprehensive data warehouse that integrates a wide array of genomic information of the model plant Arabidopsis thaliana. The data collection currently includes the latest structural and functional annotation from the Araport11 update, the Col-0 genome sequence, RNA-seq and array expression, co-expression, protein interactions, homologs, pathways, publications, alleles, germplasm and phenotypes. The data are collected from a wide variety of public resources. Users can browse gene-specific data through Gene Report pages, identify and create gene lists based on experiments or indexed keywords, and run GO enrichment analysis to investigate the biological significance of selected gene sets. Developed by the Arabidopsis Information Portal project (Araport, https://www.araport.org/), ThaleMine uses the InterMine software framework, which builds well-structured data, and provides powerful data query and analysis functionality. The warehoused data can be accessed by users via graphical interfaces, as well as programmatically via web-services. Here we describe recent developments in ThaleMine including new features and extensions, and discuss future improvements. InterMine has been broadly adopted by the model organism research community including nematode, rat, mouse, zebrafish, budding yeast, the modENCODE project, as well as being used for human data. ThaleMine is the first InterMine developed for a plant model. As additional new plant InterMines are developed by the legume and other plant research communities, the potential of cross-organism integrative data analysis will be further enabled.
DOI: 10.1104/pp.19.00320
2019
Cited 24 times
Regulation of Parent-of-Origin Allelic Expression in the Endosperm
Genomic imprinting is an epigenetic phenomenon established in the gametes prior to fertilization that causes differential expression of parental alleles, mainly in the endosperm of flowering plants. The overlap between previously identified panels of imprinted genes is limited. To investigate imprinting, we used high-resolution sequencing data acquired with sequence-capture technology. We present a bioinformatics pipeline to assay parent-of-origin allele-specific expression and report more than 300 loci with parental expression bias in Arabidopsis (Arabidopsis thaliana). In most cases, the level of expression from maternal and paternal alleles was not binary, instead supporting a differential dosage hypothesis for the evolution of imprinting in plants. To address imprinting regulation, we systematically employed mutations in regulative epigenetic pathways suggested to be major players in the process. We established the mechanistic mode of imprinting for more than 50 loci regulated by DNA methylation and Polycomb-dependent histone methylation. However, the imprinting patterns of most genes were not affected by these mechanisms. To this end, we also demonstrated that the RNA-directed DNA methylation pathway alone does not substantially influence imprinting patterns, suggesting that more complex epigenetic pathways regulate most of the identified imprinted genes.
DOI: 10.1111/tpj.16401
2023
Cited 3 times
Structural evidence for MADS-box type I family expansion seen in new assemblies of Arabidopsis arenosa and A. lyrata
Arabidopsis thaliana diverged from A. arenosa and A. lyrata at least 6 million years ago. The three species differ by genome-wide polymorphisms and morphological traits. The species are to a high degree reproductively isolated, but hybridization barriers are incomplete. A special type of hybridization barrier is based on the triploid endosperm of the seed, where embryo lethality is caused by endosperm failure to support the developing embryo. The MADS-box type I family of transcription factors is specifically expressed in the endosperm and has been proposed to play a role in endosperm-based hybridization barriers. The gene family is well known for its high evolutionary duplication rate, as well as being regulated by genomic imprinting. Here we address MADS-box type I gene family evolution and the role of type I genes in the context of hybridization. Using two de-novo assembled and annotated chromosome-level genomes of A. arenosa and A. lyrata ssp. petraea we analyzed the MADS-box type I gene family in Arabidopsis to predict orthologs, copy number, and structural genomic variation related to the type I loci. Our findings were compared to gene expression profiles sampled before and after the transition to endosperm cellularization in order to investigate the involvement of MADS-box type I loci in endosperm-based hybridization barriers. We observed substantial differences in type-I expression in the endosperm of A. arenosa and A. lyrata ssp. petraea, suggesting a genetic cause for the endosperm-based hybridization barrier between A. arenosa and A. lyrata ssp. petraea.
DOI: 10.1186/s12915-017-0413-3
2017
Cited 19 times
Erratum to: Genomic innovations, transcriptional plasticity and gene loss underlying the evolution and divergence of two highly polyphagous and invasive Helicoverpa pest species
Upon publication of the original article [1], it was noticed that Dr Papanicolaou's surname was spelt incorrectly. The correct spelling is Papanicolaou, as shown in the author list of this erratum.
DOI: 10.1186/s12859-024-05728-3
2024
Machine learning on alignment features for parent-of-origin classification of simulated hybrid RNA-seq
Parent-of-origin allele-specific gene expression (ASE) can be detected in interspecies hybrids by virtue of RNA sequence variants between the parental haplotypes. ASE is detectable by differential expression analysis (DEA) applied to the counts of RNA-seq read pairs aligned to parental references, but aligners do not always choose the correct parental reference. We used public data for species that are known to hybridize. We measured our ability to assign RNA-seq read pairs to their proper transcriptome or genome references. We tested software packages that assign each read pair to a reference position and found that they often favored the incorrect species reference. To address this problem, we introduce a post process that extracts alignment features and trains a random forest classifier to choose the better alignment. On each simulated hybrid dataset tested, our machine-learning post-processor achieved higher accuracy than the aligner by itself at choosing the correct parent-of-origin per RNA-seq read pair. For the parent-of-origin classification of RNA-seq, machine learning can improve the accuracy of alignment-based methods. This approach could be useful for enhancing ASE detection in interspecies hybrids, though RNA-seq from real hybrids may present challenges not captured by our simulations. We believe this is the first application of machine learning to this problem domain.
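The post-processing idea in this abstract (turn each read pair's two candidate alignments into features, then let a trained ensemble pick the parent) might look roughly like the sketch below. It is not the authors' pipeline. To stay dependency-free it substitutes a small bagged ensemble of one-feature decision stumps for the random forest the abstract names, and the two features (score difference and mismatch difference between the parent A and parent B alignments) and all training values are hypothetical.

```python
import random


def fit_stump(samples, labels, feature):
    """Find the best threshold/polarity on one feature by training accuracy."""
    best = None
    for threshold in sorted({s[feature] for s in samples}):
        for polarity in (0, 1):
            preds = [(1 if s[feature] > threshold else 0) ^ polarity
                     for s in samples]
            correct = sum(p == y for p, y in zip(preds, labels))
            if best is None or correct > best[0]:
                best = (correct, feature, threshold, polarity)
    return best[1:]  # (feature, threshold, polarity)


def fit_forest(samples, labels, n_stumps=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_stumps):
        idx = [rng.randrange(len(samples)) for _ in samples]
        feature = rng.randrange(len(samples[0]))
        forest.append(fit_stump([samples[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict(forest, sample):
    """Majority vote: 0 = read pair assigned to parent A, 1 = parent B."""
    votes = sum((1 if sample[f] > t else 0) ^ p for f, t, p in forest)
    return 1 if 2 * votes > len(forest) else 0


# Hypothetical features per read pair:
# (alignment score A minus B, mismatch count A minus B)
train_features = [(5, -2), (3, -1), (4, -3), (6, -2),
                  (-4, 2), (-3, 1), (-5, 3), (-6, 2)]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(train_features, train_labels)
print(predict(forest, (7, -4)), predict(forest, (-7, 4)))
```

In practice a library implementation with full decision trees would replace the stumps; the point here is only the shape of the workflow: features from both alignments in, one parent label out.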
DOI: 10.1186/1471-2105-11-457
2010
Cited 14 times
An algorithm for automated closure during assembly
Finishing is the process of improving the quality and utility of draft genome sequences generated by shotgun sequencing and computational assembly. Finishing can involve targeted sequencing. Finishing reads may be incorporated by manual or automated means. One automated method uses targeted addition by local re-assembly of gap regions. An obvious alternative uses de novo assembly of all the reads. A procedure called the bounding read algorithm was developed for assembly of shotgun reads plus finishing reads and their constraints, targeting repeat regions. The algorithm was implemented within the Celera Assembler software and its pyrosequencing-specific variant, CABOG. The implementation was tested on Sanger and pyrosequencing data from six genomes. The bounding read assemblies were compared to assemblies from two other methods on the same data. The algorithm generates improved assemblies of repeat regions, closing and tiling some gaps while degrading none. The algorithm is useful for small-genome automated finishing projects. Our implementation is available as open-source from http://wgs-assembler.sourceforge.net under the GNU Public License.
DOI: 10.1002/cpe.3542
2015
Cited 9 times
Araport: an application platform for data discovery
Summary Araport is an open‐source, online community resource for research on the Arabidopsis thaliana genome and related data. Araport is developed through a partnership between J. Craig Venter Institute, the Texas Advanced Computing Center at The University of Texas at Austin, and The University of Cambridge. Part of the open architecture of Araport is the Science Applications Workspace. Taking an ‘app store’ approach, users can choose applications developed both by the Araport team and community developers to create a customized environment for their work. Araport also provides tooling and support for developing applications for Araport, including an application generator, a rapid development and testing tool, and a straightforward deployment path for publishing applications into the Araport workspace. Copyright © 2015 John Wiley & Sons, Ltd.
DOI: 10.1016/j.cpm.2022.11.011
2023
Approach to the Ankle in Adult Acquired Flatfoot Deformity
Adult acquired flatfoot is a progressive deformity of the foot and ankle, which frequently becomes increasingly symptomatic. The posterior tibial tendon is most commonly associated with the deformity. A targeted physical examination with plain film radiographs is the recommended initial assessment, which will further guide a physician toward procuring more advanced imaging or toward surgical intervention. In this chapter the authors review the current literature of their approach to the treatment of the ankle in end stage of adult acquired flatfoot deformity.
DOI: 10.1017/9781009010054.020
2023
Archival Data
Social and behavioral researchers often draw on archival data – data collected by an entity other than the research team – to conduct scientific inquiry. Researchers typically seek to make claims about measured variables that extend beyond the measures themselves, such as interpreting a measure as representing an unobservable theoretical construct. Though researchers using archival data encounter many issues, this chapter focuses on two that have received less attention. The first concerns how researchers should justify the interpretations and uses they attach to archival measures. The second concerns how to justify generalizing findings. This chapter provides a framework to help researchers address these issues by drawing on contemporary validity theory in education and psychology as well as theory regarding causal mechanisms from philosophy and sociology. These concepts are illustrated using multiple examples from published studies.
DOI: 10.1101/060921
2016
Cited 7 times
An improved genome assembly uncovers prolific tandem repeats in Atlantic cod
Abstract Background: The first Atlantic cod (Gadus morhua) genome assembly published in 2011 was one of the early genome assemblies exclusively based on high-throughput 454 pyrosequencing. Since then, rapid advances in sequencing technologies have led to a multitude of assemblies generated for complex genomes, although many of these are of a fragmented nature with a significant fraction of bases in gaps. The development of long-read sequencing and improved software now enable the generation of more contiguous genome assemblies. Results: By combining data from Illumina, 454 and the longer PacBio sequencing technologies, as well as integrating the results of multiple assembly programs, we have created a substantially improved version of the Atlantic cod genome assembly. The sequence contiguity of this assembly is increased fifty-fold and the proportion of gap-bases has been reduced fifteen-fold. Compared to other vertebrates, the assembly contains an unusually high density of tandem repeats (TRs). Indeed, retrospective analyses reveal that gaps in the first genome assembly were largely associated with these TRs. We show that 21% of the TRs across the assembly, 19% in the promoter regions and 12% in the coding sequences are heterozygous in the sequenced individual. Conclusions: The inclusion of PacBio reads combined with the use of multiple assembly programs drastically improved the Atlantic cod genome assembly by successfully resolving long TRs. The high frequency of heterozygous TRs within or in the vicinity of genes in the genome indicates considerable standing genomic variation in Atlantic cod populations, which is likely of evolutionary importance.
DOI: 10.12688/f1000research.11629.1
2017
Cited 6 times
Initial genome sequencing of the sugarcane CP 96-1252 complex hybrid
The CP 96-1252 cultivar of sugarcane is a complex hybrid of commercial importance. DNA was extracted from lab-grown leaf tissue and sequenced. The raw Illumina DNA sequencing results provide 101 Gbp of genome sequence reads. The dataset is available from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA345486/.
DOI: 10.2106/jbjs.17.01461
2018
Cited 4 times
Infographics and Video Summaries Come to JBJS
We are pleased to announce the introduction of exciting new features to JBJS. Starting with our January 2018 volume, two articles in each issue will now have an “infographic” button. Click on it and see a visual representation (or “cartoon”) illustrating the main take-home points of the study (Fig. 1). One of these articles will also have a “video summary” button that will allow you to view a 2 to 3-minute animation providing an engaging, easy-to-digest overview of the key findings (Link to video summary: https://youtu.be/ssDmRk5KM84). These features are functional on both your computer and smartphone. Fig. 1: Infographic for: Andersen MR, Frihagen F, Hellund JC, Madsen JE, Figved W. Randomized trial comparing suture button with single syndesmotic screw for syndesmosis injury. J Bone Joint Surg Am. 2018 Jan 3;100(1):2-12. We at JBJS understand the time crunch of clinical practice, with the increased requirements for documentation and diminishing face-to-face patient time piggybacked onto every clinician’s desire to stay current. We hope that these new features will provide one more tool to help you address this challenge. We also understand researchers’ desire to disseminate their findings to the widest possible audience, which is why we encourage our authors to share the infographic and video summarizing their article on social media and elsewhere. Although we are excited to offer our readers an engaging and convenient way to access scientific information, we know that this is possible only if they can trust that the information is accurate and truly represents the message of the article. Thus, all infographics and videos are created by a team of scientific writers, illustrators, and multimedia designers and undergo a rigorous multistep process involving input and feedback from the author, JBJS copy-editors, and JBJS editors. We hope that you find these new formats useful in your busy professional life. As always, we greatly appreciate feedback from our readers.
DOI: 10.1101/2022.01.29.478178
2022
Spatial and Temporal Regulation of Parent-of-Origin Allelic Expression in the Endosperm
Abstract Genomic imprinting promotes differential expression of parental alleles in the endosperm of flowering plants, and is regulated by epigenetic modification such as DNA methylation and histone tail modifications in chromatin. After fertilization, the endosperm develops through a syncytial stage before it cellularizes and becomes a nutrient source for the growing embryo. Both in early and late endosperm development regional compartmentalization has been shown, and different transcriptional domains suggest divergent spatial and temporal regional functions. The analysis of the role of parent-of-origin allelic expression in the endosperm as a whole and also investigation of domain specific functions has been hampered by the availability of the tissue for high-throughput transcriptome analyses and contamination from surrounding tissue. Here we have used Fluorescence-Activated Nuclear Sorting (FANS) of nuclear targeted eGFP fluorescent genetic markers to capture parental specific allelic expression from different developmental stages and specific endosperm domains. This RNASeq approach allows us to successfully identify differential genomic imprinting with temporal and spatial resolution. In a systematic approach we report temporal regulation of imprinted genes in the endosperm as well as region specific imprinting in endosperm domains. Our data identifies loci that are spatially differentially imprinted in one domain of the endosperm while biparentally expressed in other domains. This suggests that regulation of genomic imprinting is dynamic and challenges the canonical mechanisms for genomic imprinting.
DOI: 10.1089/cmb.2004.11.800
2004
Cited 5 times
ThurGood: Evaluating Assembly-to-Assembly Mapping
The alignment and mapping of large genomic sequences is the focus of much recent research. However, relatively little has been done so far about testing and validating alignment methods. We introduce criteria and new tools we have developed for alignment evaluation. These tools have already proved useful in the evaluation and ranking of several methods for assembly-to-assembly mapping, which were recently used to map multiple versions of the human genome to each other (Istrail et al., 2004).
DOI: 10.2106/jbjs.16.00643
2016
Introducing JBJS Open Access
Open access (OA) has been gaining momentum in the world of scholarly publication for the past 15 years. An article published with the OA model is freely accessible to the public—i.e., it does not sit behind a publisher’s “pay wall.” This movement has been fueled in large part by governmental funders of research who believe that the published product of their support should be available without charge to the entire research community and to the public who provided the funding. To cover the costs of the editorial/peer-review process in the OA model, authors (or their institution) pay an article processing charge (APC), averaging $2,000 to $5,000. This fee is often included in federal or foundation grants that authors receive to conduct their research. After a lengthy period of discussion by the JBJS Editorial Board and Board of Trustees, we are pleased to announce the launch of JBJS Open Access. This new, online-only journal will expand our ability to publish basic-science and clinical findings as well as new research approaches that have the potential of impacting musculoskeletal disease and injury care worldwide. We believe that this journal option will be of greatest interest to authors from non-North American regions, where the OA funding mandate is quite strong. JBJS Open Access manuscripts will undergo the same rigorous peer review for which JBJS is known. This will be overseen by newly appointed Co-editors Dr. Eng Lee from Singapore and Dr. Robin Richards from Canada, both highly experienced basic and clinical science investigators who possess over 60 years of combined experience in scholarly publication activities. In addition, all accepted papers will undergo the same extensive copy-editing process that is provided for all JBJS publications. The APC for publication in JBJS Open Access has been set at $2,250. This fee will be collected after acceptance and will have no influence on the editorial decision-making process for a manuscript. 
JBJS is excited to be able to offer JBJS Open Access as one more step in our continuing effort to meet our authors’ and readers’ evolving needs. We are proud that authors will now have an OA publication option while receiving the outstanding services and brand of excellence that JBJS has been providing its authors for over 125 years—and that a wider audience of readers will have access to the best in orthopaedic information.
DOI: 10.1101/2023.05.30.542816
2023
Structural evidence for MADS-box type I family expansion seen in new assemblies of A. arenosa and A. lyrata
Summary Arabidopsis thaliana diverged from A. arenosa and A. lyrata at least 6 million years ago, and the three species are identified by genome-wide polymorphisms or morphological traits. The species are to a high degree reproductively isolated, but hybridization barriers are incomplete. A special type of hybridization barrier is based in the triploid endosperm of the seed, where embryo lethality is caused by endosperm failure to support the developing embryo. The MADS-box type I family of transcription factors is specifically expressed in the endosperm and has been proposed to play a role in endosperm-based hybridization barriers. The gene family is well known for a high evolutionary duplication rate, as well as being regulated by genomic imprinting. Here we address MADS-box type I gene family evolution and the role of type I genes in the context of hybridization. Using two de-novo assembled and annotated chromosome-level genomes of A. arenosa and A. lyrata ssp. petraea we analyzed the MADS-box type I gene family in Arabidopsis to predict orthologs, copy number and structural genomic variation related to the type I loci. Our findings were compared to gene expression profiles sampled before and after the transition to endosperm cellularization in order to investigate the involvement of MADS-box type I loci in endosperm-based hybridization barriers. We observed substantial differences in type-I expression between A. arenosa and A. lyrata ssp. petraea in the endosperm, suggesting a genetic cause for the endosperm-based hybridization barrier in A. arenosa and A. lyrata ssp. petraea hybrid seeds.
DOI: 10.55632/pwvas.v95i2.974
2023
Analyzing Lung Cancer Data for Machine Learning
Data preparation is a critical step for any machine learning experiment. We have analyzed a dataset derived from images of human male lung cancer tumors. These tumors had been analyzed with genetic markers to identify Y-chromosome loss, which was the case in about half of the samples. Whole slide images (WSI) had been collected and H&E stained by collaborators. We had processed the images with the CellProfiler software to extract numeric features. In this study, we analyzed the data in preparation for training a convolutional neural network to predict Y-chromosome loss from the extracted features, thereby recapitulating the genetic marker analysis. Using Excel and Python, we identified uninformative features and missing data. We predict that data cleaning, informed by these results, will improve the chances of successful machine learning.
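The two checks this abstract mentions, finding uninformative features and finding missing data, can be illustrated with a short Python sketch. The abstract does not describe the authors' code, so everything here is an assumption: the feature names are invented, a missing value is represented as None, and "uninformative" is taken to mean a feature whose observed values never vary.

```python
def profile_features(rows):
    """Return (missing_counts, constant_features) for a list of dict rows.

    missing_counts maps feature -> number of rows lacking a value.
    constant_features lists features with at most one distinct observed value.
    """
    columns = sorted({key for row in rows for key in row})
    missing_counts = {}
    constant_features = []
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        n_missing = len(values) - len(present)
        if n_missing:
            missing_counts[col] = n_missing
        if len(set(present)) <= 1:
            constant_features.append(col)
    return missing_counts, constant_features


# Hypothetical extracted features for three image tiles (illustration only).
example_rows = [
    {"area": 10.0, "eccentricity": 0.5, "channel_count": 1},
    {"area": 12.0, "eccentricity": None, "channel_count": 1},
    {"area": 9.5, "eccentricity": 0.7, "channel_count": 1},
]
missing, constant = profile_features(example_rows)
print("missing:", missing)
print("constant:", constant)
```

A report like this only flags candidates; whether to drop a feature or impute a missing value remains a judgment call for the analyst.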
DOI: 10.55632/pwvas.v95i2.973
2023
Facial Image Generation with Limited Training Data
Deep learning models have a wide range of applications, including generating realistic-looking images. These models typically require large amounts of data, but we wanted to explore how much quality is sacrificed by using smaller amounts of data. We built several models and trained them at different dataset sizes, then we assessed the quality of the generated images with the widely used FID measure. As expected, we measured an inverse correlation of -0.7 between image quality and training set size. However, we observed that the small-training-set results had problems not detectable by this experiment. We therefore present an experimental design for a follow-up study that would further explore the lower limits of training set size. These experiments are important for bringing us closer to understanding how much data is needed to train a successful generative model.
DOI: 10.21203/rs.3.rs-3214264/v1
2023
RNA-seq Parent-of-Origin Classification with Machine Learning applied to Alignment Features
Abstract Background Parent-of-origin allele-specific gene expression (ASE) can be detected in interspecies hybrids by virtue of RNA sequence variants between the parental haplotypes. ASE is detectable by differential expression analysis (DEA) applied to the counts of RNA-seq read pairs aligned to parental references, but aligners do not always choose the correct parental reference. Results We used public data from four species pairs that are known to hybridize. For each pair, we obtained RNA-seq read pairs from both species and measured our ability to assign each read to its proper species by comparing reads to the transcriptome or genome references. We tested four software packages that assign each read pair to a reference position and found that they often favored the incorrect species reference. To address this problem, we introduce a post process that extracts alignment features and trains a random forest classifier to choose the better alignment. On each dataset tested, our machine-learning post-processor achieved higher accuracy than the aligner by itself at choosing the correct species per RNA-seq read pair. Conclusions For the parent-of-origin classification of RNA-seq, machine learning can improve the accuracy of alignment-based methods. This approach should be useful for enhancing ASE detection in interspecies hybrids. We believe this is the first application of machine learning to this problem domain.
DOI: 10.1109/bibm58861.2023.10385263
2023
LncRNA Subcellular Localization Signals – Are the Two Ends Equal? A Machine Learning Analysis Across Multiple Cell Lines
In this work, we studied the question of whether the two ends of long non-coding ribonucleic acids (lncRNAs) (i.e., the 5′ end and 3′ end) carry similar information about subcellular localization of lncRNAs. We considered this problem from three viewpoints using machine learning models: (1) consideration of the classification performance of the machine learning models using features from defined regions (or segments) along the sequence, (2) correlation-based analysis using models built on regions/segments along the lncRNA sequence, and (3) analysis of the relative positions of predicted lncRNA localization motifs along the lncRNA sequence. Our results and observations suggest that the 5′ region of the lncRNA sequences (the prefixes) tends to carry more localization signals when compared with the 3′ region (the suffixes) of the sequences. These could have implications on how we use machine learning models for improved analysis of lncRNA subcellular localization.
DOI: 10.1002/(sici)1522-2594(199901)41:1<72::aid-mrm11>3.3.co;2-1
1999
Cited 5 times
Performance of a high‐temperature superconducting probe for in vivo microscopy at 2.0 T
The use of a high-temperature superconducting probe for in vivo magnetic resonance microscopy at 2 T is described. To evaluate the performance of the probe, a series of SNR comparisons are carried out. The SNR increased by a factor of 3.7 compared with an equivalent copper coil. Quantitative measures of the SNR gain are in good agreement with theoretical predictions. A number of issues that are unique to the application of HTS coils are examined, including the difficulty in obtaining homogeneous excitation without degrading the SNR of the probe. The use of the HTS probe in transmit-receive mode is simple to implement but results in nonuniform excitation. The effect of using the probe in this mode of operation on the T1 and T2* contrast is investigated. Methods for improving homogeneity are explored, such as employing a transmit volume coil. It is found that the cost of using an external transmit coil is an increased probe noise temperature and a reduced SNR by ∼30%. Other important aspects of the probe are considered, including the effect of temperature on probe stability. Three-dimensional in vivo imaging sets are acquired to assess the stability of the probe for long scans. High-resolution images of the rat brain demonstrate the utility of the probe for microscopy applications. Magn Reson Med 41:72-79, 1999. © 1999 Wiley-Liss, Inc.
DOI: 10.5555/502125.502127
2001
Cited 3 times
Visualization challenges for a new cyber-pharmaceutical computing paradigm
In recent years, an explosion in data has been profoundly changing the field of biology and creating the need for new areas of expertise, particularly in the handling of data. One vital area that has so far received insufficient attention is how to communicate the large quantities of diverse and complex information that is being generated. Celera has encountered a number of visualization problems in the course of developing tools for bioinformatics research, applying them to our data generation efforts, and making that data available to our customers. This paper presents several examples from Celera's experience. In the area of genomics, challenging visualization problems have come up in assembling genomes, studying variations between individuals, and comparing different genomes to one another. The emerging area of proteomics has created new visualization challenges in interpreting protein expression data, studying protein regulatory networks, and examining protein structure. These examples illustrate how the field of bioinformatics is posing new challenges concerning the communication of data that are often very different from those that have heretofore dominated scientific computing. Addressing the level of detail, the degree of complexity, and the interdisciplinary barriers that characterize bioinformatic problems can be expected to be a sizable but rewarding task for the field of scientific visualization.
DOI: 10.1101/157081
2017
Analysis of the Aedes albopictus C6/36 genome provides insight into cell line adaptations to in vitro viral propagation
ABSTRACT Background The 50-year-old Aedes albopictus C6/36 cell line is a resource for the detection, amplification, and analysis of mosquito-borne viruses including Zika, dengue, and chikungunya. The cell line is derived from an unknown number of larvae from an unspecified strain of Aedes albopictus mosquitoes. Toward improved utility of the cell line for research in virus transmission, we present an annotated assembly of the C6/36 genome. Results The C6/36 genome assembly has the largest contig N50 (3.3 Mbp) of any mosquito assembly, presents the sequences of both haplotypes for most of the diploid genome, reveals independent null mutations in both alleles of the Dicer locus, and indicates a male-specific genome. Gene annotation was computed with publicly available mosquito transcript sequences. Gene expression data from cell line RNA sequence identified enrichment of growth-related pathways and conspicuous deficiency in aquaporins and inward rectifier K+ channels. As a test of utility, RNA sequence data from Zika-infected cells was mapped to the C6/36 genome and transcriptome assemblies. Host subtraction reduced the data set by 89%, enabling faster characterization of non-host reads. Conclusions The C6/36 genome sequence and annotation should enable additional uses of the cell line to study arbovirus vector interactions and interventions aimed at restricting the spread of human disease.
DOI: 10.1073/pnas.1217345109
2012
Correction for Miller et al., Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil)
1976
Cited 3 times
Reconsideration of a hypothesis.
DOI: 10.1109/bibm49941.2020.9313445
2020
Exploring Neural Network Models for LncRNA Sequence Identification
Distinguishing long non-coding RNA from protein-coding RNA is important to molecular and cellular biology. The problem can be addressed with machine learning in general and with artificial neural networks in particular. We explore the effects of various network design choices on the accuracy of human LncRNA identification. Perceptron-based neural network models were found to be almost as accurate as more complex recurrent neural networks, and K-mer representations of the data seemed to assist both. Size selection of training data affected results. These explorations could assist in neural network design for RNA analysis.
DOI: 10.1016/j.virusres.2021.198545
2021
Novel isoforms of influenza virus PA-X and PB1-F2 indicated by automatic annotation
The influenza A virus genome contains 8 gene segments encoding 10 commonly recognized proteins. Additional protein products have been identified, including PB1-F2 and PA-X. We report the in-silico identification of novel isoforms of PB1-F2 and PA-X in influenza virus genomes sequenced from avian samples. The isoform observed in PA-X includes a mutated stop codon that should extend the protein product by 8 amino acids. The isoform observed in PB1-F2 includes two nonsense mutations that should truncate the N-terminal region of the protein product and remove the entire mitochondrial targeting domain. Both isoforms were uncovered during automatic annotation of CEIRS sequence data. Nominally termed PA-X8 and PB1-F2-Cterm, both predicted isoforms were subsequently found in other annotated influenza genomes previously deposited in GenBank. Both isoforms were noticed due to discrepant annotations output by two annotation engines, indicating a benefit of incorporating multiple algorithms during gene annotation.
DOI: 10.1109/gce.2014.10
2014
The Arabidopsis Information Portal: An Application Platform for Data Discovery
The Arabidopsis Information Portal (AIP) is an open-access online community resource for research on the Arabidopsis thaliana genome and related data. AIP is developed through a partnership between J. Craig Venter Institute, the Texas Advanced Computing Center at The University of Texas at Austin, and The University of Cambridge. Part of the open architecture of AIP is a science applications workspace. Researchers can select applications developed both by the AIP team and the community from an “app store” to construct a customized environment for their work. AIP provides tooling and support for developing applications for AIP including an application generator, interactive development and testing environment, and a straightforward deployment path for deploying an application into the AIP workspace.
2015
Using the Arabidopsis Information Portal
DOI: 10.1007/978-3-662-43883-1_9
2014
Bioinformatics for Genomes and Metagenomes in Ecology Studies
Major technological developments in the field of microbial ecology are redefining the science, moving the focus of research away from studies of individual isolates and species that are studied under carefully controlled conditions in the laboratory, towards the study of entire communities of organisms in their natural environments. Ever more efficient sequencing technologies mean that we can generate huge volumes of sequence data — shifting the cost burden from sequence generation to sequence analysis. The bioinformatic techniques for managing and analyzing both the new types of data and the vastly increased volumes of data are transforming our understanding of life and its interdependencies. These data sets, in conjunction with bioinformatics, are enhancing our understanding of microbial diversity and microbial ecology in many different environments. In this chapter, we provide an overview of some of the genomic, metagenomic and informatics approaches currently being used and/or being developed for the study of microbial diversity and ecology.
DOI: 10.7490/f1000research.1112727.1
2016
A resource for metabolomics and transcriptomics analysis
DOI: 10.2106/jbjs.oa.16.00003
2016
Introducing JBJS Open Access
Open access (OA) has been gaining momentum in the world of scholarly publication for the past 15 years. An article published with the OA model is freely accessible to the public—i.e., it does not sit behind a publisher’s “pay wall.” This movement has been fueled in large part by governmental funders of research who believe that the published product of their support should be available without charge to the entire research community and to the public who provided the funding. To cover the costs of the editorial/peer-review process in the OA model, authors (or their institution) pay an article processing charge (APC), averaging $2,000 to $5,000. This fee is often included in federal or foundation grants that authors receive to conduct their research. After a lengthy period of discussion by the JBJS Editorial Board and Board of Trustees, we are pleased to announce the launch of JBJS Open Access. This new, online-only journal will expand our ability to publish basic-science and clinical findings as well as new research approaches that have the potential of impacting musculoskeletal disease and injury care worldwide. We believe that this journal option will be of greatest interest to authors from non-North American regions, where the OA funding mandate is quite strong. JBJS Open Access manuscripts will undergo the same rigorous peer review for which JBJS is known. This will be overseen by newly appointed Co-editors Dr. Eng Lee from Singapore and Dr. Robin Richards from Canada, both highly experienced basic and clinical science investigators who possess over 60 years of combined experience in scholarly publication activities. In addition, all accepted papers will undergo the same extensive copy-editing process that is provided for all JBJS publications. The APC for publication in JBJS Open Access has been set at $2,250. This fee will be collected after acceptance and will have no influence on the editorial decision-making process for a manuscript. 
JBJS is excited to be able to offer JBJS Open Access as one more step in our continuing effort to meet our authors’ and readers’ evolving needs. We are proud that authors will now have an OA publication option while receiving the outstanding services and brand of excellence that JBJS has been providing its authors for over 125 years—and that a wider audience of readers will have access to the best in orthopaedic information.
2013
The high light inducible genes of the marine cyanobacterium Prochlorococcus: A diverse, dynamic, high-copy-number gene family
DOI: 10.1016/s0749-8063(12)01524-1
2012
Contents
DOI: 10.1002/3527600906.mcb.201100041
2012
Microbiomes
Metagenomics, also referred to as environmental or community genomics, has brought about radical changes in the ability to analyze complex microbial communities by direct sampling of their natural habitat. Metagenomics has truly revolutionized biology and medicine, and changed the way in which genomics is studied. To date, many metagenomic studies have been undertaken, with samples from diverse habitats including the oceans, soil, air, human, and animal hosts having been subject to metagenomic examinations. Currently, huge national and international projects, aimed at elucidating the biogeography of microbial communities living within and on the human body, are well underway. The analysis of human microbiome data has brought about a paradigm shift in the present understanding of the role of resident microorganisms in human health and disease, and brings nontraditional areas such as gut ecology to the forefront of personalized medicine. In parallel, rapid technological advances in DNA sequencing methods have reduced the time and costs associated with sequencing while at the same time significantly increasing the data output. As genome sequencing becomes cheaper, it is being applied to sequence complex metagenomes, and large-scale 16S ribosomal DNA sequencing has become far more routine. Today, metagenomics is proving to be a powerful tool, considerably enhancing the present understanding of the extent and role of microbial diversity in their natural habitats, and in many ecologically important environments, with far greater implications on human health and disease. An overview of the current literature, together with details of projects and the state-of-the-art in microbiome studies, are presented in this chapter.
DOI: 10.2106/jbjs.17.00800
2017
Introducing the New JBJS.org
JBJS is excited to introduce the worldwide orthopaedic community to the new JBJS.org website. The website has been the focus for readers to review the current issue, access content not included in print, and perform searches of JBJS content. In our continuous efforts to optimize these experiences, we recognized that our search capabilities were not ideal. JBJS publishes 6 journals: the flagship journal—with which the international community of orthopaedic surgeons is most familiar and which has been continuously published for 129 years—as well as 5 other journals: JBJS Reviews, JBJS Case Connector, JBJS Essential Surgical Techniques, JBJS Journal of Orthopaedics for Physician Assistants, and JBJS Open Access. Previously, it was possible to search only 1 journal at a time. The new JBJS.org allows users to search across all 6 journals as well as webinars and other educational content. We also recognized that readers wish to focus their reading and searches on their individual practice profile. This has become more critical with the move toward increasing subspecialization. The new JBJS website is built to allow users to quickly find, read, and view relevant and timely material in their specific orthopaedic subspecialization. We have also enhanced the capacity of “My JBJS,” where users can store and organize content that they have found and bookmarked. “My JBJS” will have additional personalization capabilities in the future. In addition, the site offers clearly organized direct links to JBJS CME material, and its new search capability unveils JBJS educational content, images, and videos that are related to the user’s search query. A new feature that we have integrated into JBJS.org is Clinical Summaries. These are concise synopses of the current knowledge on common orthopaedic conditions in 9 subspecialty areas. 
They are authored by experts on each topic and are accompanied by direct links to the most relevant and cited articles published in JBJS and other peer-reviewed orthopaedic and general medical journals. The Clinical Summaries will be updated and expanded in a continuous cycle. We believe that Clinical Summaries represent a uniquely useful and evidence-based contribution to orthopaedic practice and the review process in orthopaedic surgery—and that they will improve patient care and enhance professional satisfaction. We look forward to feedback as the orthopaedic community begins to experience the enhanced search capabilities, additional tools, and new Clinical Summaries offered by JBJS.org.
DOI: 10.6084/m9.figshare.c.3831283_d19
2017
Additional file 8: of Hybrid assembly with long and short reads improves discovery of gene family expansions
Medicago repeat statistics. (XLSX 48 kb)
DOI: 10.6084/m9.figshare.c.3831283_d17
2017
Additional file 6: of Hybrid assembly with long and short reads improves discovery of gene family expansions
Medicago assembly size statistics. (XLSX 37 kb)
DOI: 10.6084/m9.figshare.c.3831283_d14
2017
Additional file 3: of Hybrid assembly with long and short reads improves discovery of gene family expansions
Rice alignment statistics from ATAC. (XLSX 41 kb)
DOI: 10.6084/m9.figshare.c.3831283_d13
2017
Additional file 2: of Hybrid assembly with long and short reads improves discovery of gene family expansions
Rice alignment statistics from Nucmer. (XLSX 43 kb)
DOI: 10.6084/m9.figshare.c.3831283_d16
2017
Additional file 5: of Hybrid assembly with long and short reads improves discovery of gene family expansions
Rice tandem repeat statistics. (XLSX 41 kb)
DOI: 10.6084/m9.figshare.c.3831283_d11
2017
Additional file 1: of Hybrid assembly with long and short reads improves discovery of gene family expansions
Rice assembly size statistics. (XLSX 42 kb)
DOI: 10.6084/m9.figshare.c.3831283_d18
2017
Additional file 7: of Hybrid assembly with long and short reads improves discovery of gene family expansions
Medicago alignment statistics from Nucmer. (XLSX 54 kb)
DOI: 10.6084/m9.figshare.c.3831283_d15
2017
Additional file 4: of Hybrid assembly with long and short reads improves discovery of gene family expansions
Rice alignment statistics from Quast. (XLSX 36 kb)
DOI: 10.55632/pwvas.v94i1.894
2022
Carbon Fiber Resin Composite Strength Relative to Orientation
Carbon fiber resin polymer (CFRP) is a laminate of multiple sheets of carbon fiber woven cloth infused with epoxy resin. The combination is hard, strong, and lightweight, making it ideal for aerospace applications. CFRP is not as well characterized as other construction materials such as metal, wood, and plastic, for which datasheets are easy to find in the public domain. To promote the wider use of CFRP, we used a universal testing machine to measure CFRP sample coupons obtained from manufacturers of experimental aerospace vehicles. The tensile strength and elongation of the CFRP coupons were measured with the weave oriented at an angle of 0, 45, and 90 degrees. Results show that CFRP sheets with a weave orientation of 45 degrees are 1.5 times stronger than those with a weave of 0 degrees, and 1.1 times stronger than those with a weave of 90 degrees. These results imply that the strongest part of CFRP sheets is diagonal relative to the weave. For future work, we propose to use larger sample sizes, incorporate different thicknesses of sample, and compare CFRP samples to metal samples of similar sizes.
DOI: 10.55632/pwvas.v94i1.895
2022
Machine Learning applied to an RNA Classification Problem
Long RNA sequences can be classified as protein-coding messenger RNA (mRNA) or as long non-coding RNA (lncRNA). RNA classification based on sequence alone is a bioinformatics challenge with potential to enlarge our understanding of the causes of many human diseases. We chose to compare two classification approaches: (1) applications of correlation statistics and visualization tools, and (2) training a convolutional neural network. Both approaches used human RNA sequences from the public GenCode database. We found that the machine learning approach was superior, achieving an accuracy of 87.37%, although this is not as high as some published classifiers. These results indicate that machine learning techniques are a more effective solution to the problem of human RNA classification than the other techniques tested. For future work, we propose to build, train, and compare other convolutional neural network models for this classification task.
DOI: 10.55632/pwvas.v94i1.867
2022
Emotion classification of human facial images by a neural network
Facial recognition using artificial neural networks is a biometric technology currently being used in fields such as cybersecurity and criminal investigation. We sought to automatically distinguish between an image of a happy human face and an image of a sad human face with predictions that are better than random guesses. We trained a machine learning model (VGG16, a type of convolutional neural network) on a public image dataset of 12,000 human faces. The resulting model correctly predicted the emotion label 93% of the time when shown a test set of 200 images that the model had never seen during training. The results show that the VGG16 convolutional neural network learned features from our data that produced an output which was sufficient to train the additional layers of the model to perform our task at 93% accuracy. We suspect this is because VGG16 was already familiar with features that it learned from ImageNet (a separate dataset of over 2 million images). It is currently unknown whether our model can predict labels for new images outside of the FER13 dataset, but preliminary tests show promising results. Our model was trained using a cloud computing service and a relatively small amount of data, indicating that these kinds of results are easily obtainable by all.
DOI: 10.55632/pwvas.v94i1.931
2022
Effect of Dropout on RNA Classification by CNN
Long RNA sequences can be classified as long non-coding RNA (lncRNA) or protein-coding messenger RNA (mRNA). Automatic classification, based on sequence alone, could benefit biology and medical science. We trained and evaluated a convolutional neural network (CNN) to classify human RNA sequences. The CNN incorporated dropout, a technique that restricts the network to a random portion of its neurons during training. Dropout can reduce overfitting, which means relying on irrelevant aspects of the data to “memorize” the training set. We varied the dropout rate during training and measured the accuracy during testing for an RNA classification task. At dropout rates of 0.5, 0.6, 0.7, and 0.8, the CNN test accuracy was 93.15%, 90.50%, 89.95%, and 88.35%. We conclude that dropout rates above 50% did not improve learning. In the future, we hope to measure the effects of other hyperparameters and models for this classification task.
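The dropout technique described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' CNN code: the function name `dropout`, its parameters, and the rescaling of surviving activations by 1/(1 - rate) (the common "inverted dropout" convention) are assumptions made for illustration.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training.

    Assumption: survivors are rescaled by 1 / (1 - rate) ("inverted dropout")
    so the expected sum of activations is unchanged. At test time the input
    is returned untouched.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Each neuron is kept independently with probability `keep`.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
hidden = [1.0] * 1000   # hypothetical layer output
dropped = dropout(hidden, rate=0.5, rng=rng)
print(sum(1 for a in dropped if a == 0.0), "of", len(dropped), "neurons silenced")
```

With a rate of 0.5, about half of the neurons are silenced on each training pass; higher rates such as 0.8 leave the network only about a fifth of its neurons per pass.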
2007
Supporting Online Material for Genome Sequence of Aedes aegypti, a Major Arbovirus Vector
Vishvanath Nene,* Jennifer R. Wortman, Daniel Lawson, Brian Haas, Chinnappa Kodira, Zhijian (Jake) Tu, Brendan Loftus, Zhiyong Xi, Karyn Megy, Manfred Grabherr, Quinghu Ren, Evgeny M. Zdobnov, Neil F. Lobo, Kathryn S. Campbell, Susan E. Brown, Maria F. Bonaldo, Jingsong Zhu, Steven P. Sinkins, David G. Hogenkamp, Paolo Amedo, Peter Arensburger, Peter W. Atkinson, Shelby Bidwell, Jim Biedler, Ewan Birney, Robert V. Bruggner, Javier Costas, Monique R. Coy, Jonathan Crabtree, Matt Crawford, Becky deBruyn, David DeCaprio, Karin Eiglmeier, Eric Eisenstadt, Hamza El-Dorry, William M. Gelbart, Suely L. Gomes, Martin Hammond, Linda I. Hannick, James R. Hogan, Michael H. Holmes, David Jaffe, J. Spencer Johnston, Ryan C. Kennedy, Hean Koo, Saul Kravitz, Evgenia V. Kriventseva, David Kulp, Kurt LaButti, Eduardo Lee, Song Li, Diane D. Lovin, Chunhong Mao, Evan Mauceli, Carlos F. M. Menck, Jason R. Miller, Philip Montgomery, Akio Mori, Ana L. Nascimento, Horacio F. Naveira, Chad Nusbaum, Sinead O’Leary, Joshua Orvis, Mihaela Pertea, Hadi Quesneville, Kyanne R. Reidenbach, Yu-Hui Rogers, Charles W. Roth, Jennifer R. Schneider, Michael Schatz, Martin Shumway, Mario Stanke, Eric O. Stinson, Jose M. C. Tubio, Janice P. VanZee, Sergio Verjovski-Almeida, Doreen Werner, Owen White, Stefan Wyder, Qiandong Zeng, Qi Zhao, Yongmei Zhao, Catherine A. Hill, Alexander S. Raikhel, Marcelo B. Soares, Dennis L. Knudson, Norman H. Lee, James Galagan, Steven L. Salzberg, Ian T. Paulsen, George Dimopoulos, Frank H. Collins, Bruce Birren, Claire M. Fraser-Liggett, David W. Severson*
DOI: 10.12688/f1000research.13580.1
2018
A host subtraction database for virus discovery in human cell line sequencing data
The human cell lines HepG2, HuH-7, and Jurkat are commonly used for amplification of the RNA viruses present in environmental samples. To assist with assays by RNAseq, we sequenced these cell lines and developed a subtraction database that contains sequences expected in sequence data from uninfected cells. RNAseq data from cell lines infected with Sendai virus were analyzed to test host subtraction. The process of mapping RNAseq reads to our subtraction database vastly reduced the number of non-viral reads in the dataset to allow for efficient secondary analyses.
DOI: 10.55632/pwvas.v90i1.354
2018
Solving Ordinary Differential Equations in Java.
Ordinary Differential Equations (ODEs) are equations containing multivariate derivatives of one or more functions with respect to a single independent variable. ODEs are components of mathematical models used in biology, chemistry and physics. Solving ODEs can be lengthy and difficult. We investigated the complexity challenges of developing Java programs that solve ODEs within the context of undergraduate study. We evaluated the relative difficulty of writing a program in plain Java, using a Java library specific to ODEs, and solving the equations without programming. This study could help students and instructors find new ways to explore the complexity of ODE problems and generate new methods for solving them with computers.
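As a rough illustration of what solving an ODE "by hand" in a program involves, here is a minimal sketch of the forward Euler method. The study itself used Java; this sketch is in Python, and the function name `euler`, the test equation dy/dt = -y, and the step count are assumptions chosen for illustration rather than anything taken from the study.

```python
import math


def euler(f, t0, y0, t_end, steps):
    """Approximate y(t_end) for dy/dt = f(t, y), y(t0) = y0, by forward Euler."""
    h = (t_end - t0) / steps  # fixed step size
    t, y = t0, y0
    for _ in range(steps):
        y += h * f(t, y)      # follow the local slope for one step
        t += h
    return y


# Hypothetical test problem: dy/dt = -y with y(0) = 1, exact solution exp(-t).
approx = euler(lambda t, y: -y, t0=0.0, y0=1.0, t_end=1.0, steps=1000)
print(approx, math.exp(-1.0))
```

Forward Euler is the simplest such scheme and only first-order accurate; a dedicated ODE library would typically offer higher-order and adaptive-step methods.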
2018
Augmenting Workplace Wellness Programs with Biometric Monitoring
DOI: 10.12688/f1000research.13580.3
2019
A host subtraction database for virus discovery in human cell line sequencing data
The human cell lines HepG2, HuH-7, and Jurkat are commonly used for amplification of the RNA viruses present in environmental samples. To assist with assays by RNAseq, we sequenced these cell lines and developed a subtraction database that contains sequences expected in sequence data from uninfected cells. RNAseq data from cell lines infected with Sendai virus were analyzed to test host subtraction. The process of mapping RNAseq reads to our subtraction database vastly reduced the number of non-viral reads in the dataset to allow for efficient secondary analyses.
DOI: 10.12688/f1000research.13580.2
2018
A host subtraction database for virus discovery in human cell line sequencing data
The human cell lines HepG2, HuH-7, and Jurkat are commonly used for amplification of the RNA viruses present in environmental samples. To assist with assays by RNAseq, we sequenced these cell lines and developed a subtraction database that contains sequences expected in sequence data from uninfected cells. RNAseq data from cell lines infected with Sendai virus were analyzed to test host subtraction. The process of mapping RNAseq reads to our subtraction database vastly reduced the number of non-viral reads in the dataset to allow for efficient secondary analyses.
DOI: 10.55632/pwvas.v91i1.503
2019
Migrating Data from Excel to the Web Using a LAMP Stack
Data held exclusively in a spreadsheet has limited accessibility. Delivering this data via a web application can allow for increased interaction from external users. We sought to expose a dataset collected by the Biology Department at Shepherd University. We speculated that an undergraduate data analytics major could implement a data migration from Excel to the web in the span of one semester. In close collaboration with the stakeholder, a MySQL database was created and normalized, and a prototype was developed using PHP to retrieve records at the request of the user and display the results. Although the application is not yet public, we hope to make it available within the semester. If successful, we will have demonstrated one way in which research teams can expose their data with minimal investment.
DOI: 10.55632/pwvas.v91i1.605
2019
Creating Python Algorithms to Mimic R Functions from the R Stats Package
While R and Python are the most used languages within Data Analytics, the differences between them are becoming less polarized as it is now possible to complete the same tasks using both. A Python library developed from analyzing original R source code would help users cross-reference between both platforms. To design the Python algorithms, pseudocode was produced from registering the source code of different R stats functions. A Python test case was generated that would input result text files from both scripts and strive to output 0 differences to check for compatibility. So far, several R functions have been mimicked, and work continues to expand the scripts. We hope to later integrate the R and Python scripts into a web application to better display the results.
DOI: 10.55632/pwvas.v91i1.559
2019
Bioinformatics in the Sequencing Era
Transcriptomics, or the analysis of gene expression, is enabled by DNA sequencing machines that decode the nucleotide sequences of RNA molecules from living cells. The large volumes of data require analysis by computer. For organisms with a known genome sequence, computers align each RNA sequence to the genome sequence to identify the gene responsible. The aggregate counts per gene are used to rank the genes by expression. The computational challenge becomes harder when any of the genes under study are similar to each other. The challenge is hardest when the study is focused solely on pairs of genes that are nearly identical: the maternal and paternal copies of the same gene. Nevertheless, there is evidence from plant science that flowering plant genes (maternal) compete with pollen genes (paternal) to reduce each other’s gene expression in their seeds. As computer scientists, we built a computational pipeline capable of detecting this phenomenon. We benefited from plant science collaborators who provided insight, validation, and three billion RNA sequences. We exploited bioinformatics technology including noise reduction algorithms, sequence alignment software, parallel computing environments, and statistical analysis software. We believe our pipeline is less sensitive to machine error, and more tolerant of biological variation, than processes employed to date. Interestingly, our project is focused on a question that was not even posed until the era of the Human Genome Project. Thus, as DNA sequencing technology becomes ubiquitous, and more questions become addressable through sequencing, students of computer science and biology can expect to have growing opportunities to collaborate for scientific discovery.
DOI: 10.55632/pwvas.v91i1.504
2019
Bioinformatics Software for the Detection of Allele-specific Gene Expression
Maternal and paternal gene copies are not always expressed in a 50:50 ratio. Transcriptional imbalance, called allele-specific expression (ASE), has been demonstrated in model organisms and may have evolutionary importance. We are developing a bioinformatics pipeline to detect ASE in RNAseq reads from model plants. The pipeline requires RNAseq reads from homozygous parents, plus RNAseq reads from heterozygous offspring. The pipeline uses software for reference-based assembly, transcript alignment, and differential expression analysis. The pipeline outputs lists of genes whose expression ratios differ significantly from the 50:50 null model prediction. We tested the pipeline on genes commonly expressed in seed endosperm tissue using three strains of Arabidopsis thaliana. Previous studies of this system had relied on counting RNAseq reads that contain known, isolated SNPs. On our data, the two methods had correlations between 0.84 and 0.98 across four experiments. In ongoing work, while developing a version in Python, we aim to demonstrate that the new pipeline is actually more accurate than the isolated SNPs method.
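The comparison against a 50:50 null model can be illustrated with an exact two-sided binomial test on per-gene read counts. This is a minimal sketch of one standard way to test such a null, not the pipeline's actual method, which the abstract says relies on existing differential expression software. The function name and the read counts in the usage note are hypothetical.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(maternal, paternal):
    """Exact two-sided binomial p-value for a 50:50 null (illustrative only).

    Sums the probabilities of all outcomes that are no more likely than the
    observed maternal count, given maternal + paternal informative reads.
    """
    n = maternal + paternal
    if n == 0:
        return 1.0  # no reads, no evidence against the null

    def pmf(k):
        # Computed in log space so that large read counts do not overflow.
        log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return exp(log_choose + n * log(0.5))

    probs = [pmf(k) for k in range(n + 1)]
    observed = probs[maternal]
    # The small relative slack guards against floating-point ties.
    total = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(1.0, total)
```

For example, `binomial_two_sided_p(10, 0)` is about 0.002, while `binomial_two_sided_p(5, 5)` is 1. When many genes are tested at once, the resulting p-values would still need a multiple-testing correction before any gene is called significant.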
DOI: 10.55632/pwvas.v91i1.516
2019
Developing Virtual Reality Software to Graph 3-D Curves with Applications in College Level Mathematics Education.
Virtual Reality (VR) is a potentially augmentative technology for classroom learning. We speculate that 3D visualization of solutions to mathematical equations would be assistive in a college level calculus course. To test our hypothesis, we built a VR system using commercially available hardware and software. We implemented visualizations of solutions to equations from textbooks used in calculus classes at Shepherd University, where we are in the process of demonstrating the system to mathematics professors and recording their reactions using a survey. So far, three of three professors reported that they would support the system’s use in the classroom. We hope to add features such as projections of curves onto any plane, and we hope to test the system in a classroom setting.
DOI: 10.55632/pwvas.v90i1.352
2018
Evaluation of Shortest Path Algorithms for Solving Mazes.
Finding the optimal path through a maze is related to the shortest path problem in computer science. The first shortest path algorithm was introduced by Edsger W. Dijkstra in 1956 but multiple other solutions and optimizations have been suggested since then. We tested several algorithms for their applicability to the problem of solving mazes. We programmatically generated random 2D square mazes with independent variables for size and bias, where bias measures the deviation from random probability that a left or right turn contributes to a shorter path. We programmatically transformed each maze into a directed graph with arbitrary start and end positions and then solved for the shortest path through each graph. We speculated that maze bias, which tends to alter the graph density, would affect algorithm efficiency. We detected bias-dependent differences in resource usage between standard implementations of three shortest-path algorithms: Dijkstra’s, Bellman-Ford, and A*. Our results could help guide algorithm selection for robotic exploration of specific terrains given prior knowledge of their path properties. This project was supported by the NASA WV Space Grant Consortium.
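The maze-to-graph transformation and one of the three algorithms named above can be sketched as follows. This is an illustrative example under stated assumptions, not the study's code: the grid encoding (0 for open, 1 for wall), the unit edge weights, and the function names are all hypothetical, and the maze generator with its bias parameter is not reproduced.

```python
import heapq


def maze_to_graph(grid):
    """Map each open cell (0) to a list of (neighbor, weight) edges."""
    rows, cols = len(grid), len(grid[0])
    graph = {}
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:
                continue  # walls are not nodes
            edges = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                    edges.append(((nr, nc), 1))  # every step costs 1
            graph[(r, c)] = edges
    return graph


def dijkstra(graph, start, goal):
    """Return the shortest path length from start to goal, or None."""
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        for nxt, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None  # goal is unreachable
```

With unit weights, a breadth-first search would find the same path lengths. Dijkstra's algorithm is shown because it is one of the algorithms the abstract compares and because it extends directly to weighted terrain.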
2004
rhythm:sequence:interruption
DOI: 10.55632/pwvas.v92i1.720
2020
Efficient Simulation of Allele-Specific Expression.
Diploid organisms such as animals and plants carry maternal and paternal variants of most of their genes. Preferential transcription of either gene variant is called allele-specific expression (ASE). In plant seeds, ASE has been observed at select genes at select developmental stages, so the process is presumably regulated by epigenetic factors such as genomic imprinting. The Informative Reads Pipeline (IRP) is software that we developed previously for the purpose of detecting ASE in RNA sequencing data obtained from plant seeds. To help us validate and generalize the software, we developed a sequence data simulator that harbors a parameterized model of ASE. Whereas the maternal/paternal ratio per gene is always unknown in real data, the simulator provides the opportunity to quantify IRP’s ability to recover the preset ratios from the data provided. The simulator generates and maps sequences using standard software. Simulating ASE at all combinations of all genes would be computationally prohibitive. Therefore, we introduced an optimization that reduces the generate+map computation from exponential to constant time. Correctness of the optimized simulator is demonstrated here.
DOI: 10.55632/pwvas.v92i1.737
2020
Forecasting California Housing Prices Using a Linear Regression Model
With the prevalence of big data, we are able to apply simple Machine Learning techniques to datasets to address many present issues. Utilizing the California Housing Prices dataset, we can apply linear regression models to forecast districts’ future median housing prices. As a result, homebuyers could predict if they are getting a good deal on their home, investors could predict their potential return on investments and capitalize on undervalued properties, and sellers could gauge what they could potentially sell for on the open market. To find a starting base, our research began as an independent study guided by a textbook that provides hands-on machine learning training, including an example that uses the California Housing Prices dataset. From here, we are able to choose a least squares regression line as our model, which results in a linear equation that minimizes the sum of squared differences between the training set’s actual values and the values predicted on the regression line. Once we are able to train our model to make accurate predictions from the testing set, we can expand upon this research by moving beyond linear regression and investigating different Machine Learning techniques further within the textbook to model the data in more sophisticated ways.
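A least squares line for a single predictor can be fitted in closed form, as in the minimal sketch below. It is an illustration only: the function names are hypothetical, and the numbers in the usage note are made-up toy values, not the California Housing Prices data or the textbook's code.

```python
import math


def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(xs, ys, slope, intercept):
    """Root mean squared error of the fitted line on the given points."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))
```

For instance, fitting the toy points `xs = [1, 2, 3, 4]` and `ys = [3, 5, 7, 9]` gives a slope of 2 and an intercept of 1. Evaluating `rmse` on a held-out test set, rather than on the training points, gives a fairer estimate of how well the line predicts unseen prices.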
2000
MCSD Visual C++ 6 distributed applications study guide (Exam 70-015)