
Chad Nusbaum

Here are all the papers by Chad Nusbaum that you can download and read on OA.mg.

DOI: 10.1038/35057062

Cited 21,729 times

Initial sequencing and analysis of the human genome

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

DOI: 10.1038/nbt.1883

Cited 16,716 times

Full-length transcriptome assembly from RNA-Seq data without a reference genome

Massively parallel sequencing of cDNA has enabled deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here we present the Trinity method for de novo assembly of full-length transcripts and evaluate it on samples from fission yeast, mouse and whitefly, whose reference genome is not yet available. By efficiently constructing and analyzing sets of de Bruijn graphs, Trinity fully reconstructs a large fraction of transcripts, including alternatively spliced isoforms and transcripts from recently duplicated genes. Compared with other de novo transcriptome assemblers, Trinity recovers more full-length transcripts across a broad range of expression levels, with a sensitivity similar to methods that rely on genome alignments. Our approach provides a unified solution for transcriptome reconstruction in any sample, especially in the absence of a reference genome.
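
The abstract above mentions constructing sets of de Bruijn graphs from reads. As a rough illustration of that data structure only (not of Trinity's actual pipeline, which the abstract does not detail), the sketch below builds a single de Bruijn graph from a handful of toy reads. The function name `de_bruijn_graph`, the toy reads and the choice of k are all assumptions made for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers.

    Hypothetical helper for illustration; real assemblers also track k-mer
    counts, handle sequencing errors and consider reverse complements.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its prefix (k-1)-mer to its suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    # Return plain sorted lists so the result is easy to inspect.
    return {node: sorted(successors) for node, successors in graph.items()}


# Toy reads (made up for this example) that overlap by five bases.
toy_reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_graph(toy_reads, k=4)
for node, successors in graph.items():
    print(node, "->", successors)
```

Overlapping reads contribute the same edges, so shared sequence collapses into a single path through the graph; an assembler then reads transcripts off such paths.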

DOI: 10.1186/gb-2008-9-9-r137

Cited 13,741 times

Model-based Analysis of ChIP-Seq (MACS)

We present Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short read sequencers such as Solexa's Genome Analyzer. MACS empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, and is freely available.
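
To make the idea of a locally adjusted Poisson background more concrete, here is a minimal sketch. It assumes that the expected tag count for a candidate region is taken as the largest of a genome-wide rate and rates estimated from nearby control windows, and that enrichment is scored with a Poisson upper-tail probability. The abstract only says that a dynamic Poisson distribution is used, so the rule in `local_rate`, both function names and all numbers below are assumptions for illustration, not the MACS implementation.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    # Sum the probability mass for counts 0 .. k-1, then take the complement.
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)


def local_rate(background_rate, window_rates):
    """Assumed rule: use the largest of the background and local window rates."""
    return max([background_rate, *window_rates])


# Made-up numbers: a genome-wide rate of 2 tags per region, and two local
# control windows suggesting rates of 1.5 and 3 tags per region.
lam = local_rate(2.0, [1.5, 3.0])
p_value = poisson_upper_tail(10, lam)
print(f"local rate = {lam}, P(X >= 10) = {p_value:.3g}")
```

Using the larger local rate makes the test more conservative in regions where the control is itself enriched, which is one way local biases could be absorbed.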

DOI: 10.1038/nature06008

Cited 3,771 times

Genome-wide maps of chromatin state in pluripotent and lineage-committed cells

We report the application of single-molecule-based sequencing technology for high-throughput profiling of histone modifications in mammalian cells. By obtaining over four billion bases of sequence from chromatin immunoprecipitated DNA, we generated genome-wide chromatin-state maps of mouse embryonic stem cells, neural progenitor cells and embryonic fibroblasts. We find that lysine 4 and lysine 27 trimethylation effectively discriminates genes that are expressed, poised for expression, or stably repressed, and therefore reflect cell state and lineage potential. Lysine 36 trimethylation marks primary coding and non-coding transcripts, facilitating gene annotation. Trimethylation of lysine 9 and lysine 20 is detected at satellite, telomeric and active long-terminal repeats, and can spread into proximal unique sequences. Lysine 4 and lysine 9 trimethylation marks imprinting control regions. Finally, we show that chromatin state can be read in an allele-specific manner by using single nucleotide polymorphisms. This study provides a framework for the application of comprehensive chromatin profiling towards characterization of diverse mammalian cell populations.

DOI: 10.1126/science.1188021

Cited 3,691 times

A Draft Sequence of the Neandertal Genome

Neandertals, the closest evolutionary relatives of present-day humans, lived in large parts of Europe and western Asia before disappearing 30,000 years ago. We present a draft sequence of the Neandertal genome composed of more than 4 billion nucleotides from three individuals. Comparisons of the Neandertal genome to the genomes of five present-day humans from different parts of the world identify a number of genomic regions that may have been affected by positive selection in ancestral modern humans, including genes involved in metabolism and in cognitive and skeletal development. We show that Neandertals shared more genetic variants with present-day humans in Eurasia than with present-day humans in sub-Saharan Africa, suggesting that gene flow from Neandertals into the ancestors of non-Africans occurred before the divergence of Eurasian groups from each other.

DOI: 10.1038/nature07107

Cited 2,275 times

Genome-scale DNA methylation maps of pluripotent and differentiated cells

DNA methylation is essential for normal development and has been implicated in many pathologies including cancer. Our knowledge about the genome-wide distribution of DNA methylation, how it changes during cellular differentiation and how it relates to histone methylation and other chromatin modifications in mammals remains limited. Here we report the generation and analysis of genome-scale DNA methylation profiles at nucleotide resolution in mammalian cells. Using high-throughput reduced representation bisulphite sequencing and single-molecule-based sequencing, we generated DNA methylation maps covering most CpG islands, and a representative sampling of conserved non-coding elements, transposons and other genomic features, for mouse embryonic stem cells, embryonic-stem-cell-derived and primary neural cells, and eight other primary tissues. Several key findings emerge from the data. First, DNA methylation patterns are better correlated with histone methylation patterns than with the underlying genome sequence context. Second, methylation of CpGs is a dynamic epigenetic mark that undergoes extensive changes during cellular differentiation, particularly in regulatory regions outside of core promoters. Third, analysis of embryonic-stem-cell-derived and primary cells reveals that 'weak' CpG islands associated with a specific set of developmentally regulated genes undergo aberrant hypermethylation during extended proliferation in vitro, in a pattern reminiscent of that reported in some primary tumours. More generally, the results establish reduced representation bisulphite sequencing as a powerful technology for epigenetic profiling of cell populations relevant to developmental biology, cancer and regenerative medicine.

DOI: 10.1126/science.280.5366.1077

Cited 2,008 times

Large-Scale Identification, Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome

Single-nucleotide polymorphisms (SNPs) are the most frequent type of variation in the human genome, and they provide powerful tools for a variety of medical genetic studies. In a large-scale survey for SNPs, 2.3 megabases of human genomic DNA was examined by a combination of gel-based sequencing and high-density variation-detection DNA chips. A total of 3241 candidate SNPs were identified. A genetic map was constructed showing the location of 2227 of these SNPs. Prototype genotyping chips were developed that allow simultaneous genotyping of 500 SNPs. The results provide a characterization of human diversity at the nucleotide level and demonstrate the feasibility of large-scale identification of human SNPs.

DOI: 10.1038/nature03025

Cited 1,824 times

Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype

Tetraodon nigroviridis is a freshwater puffer fish with the smallest known vertebrate genome. Here, we report a draft genome sequence with long-range linkage and substantial anchoring to the 21 Tetraodon chromosomes. Genome analysis provides a greatly improved fish gene catalogue, including identifying key genes previously thought to be absent in fish. Comparison with other vertebrates and a urochordate indicates that fish proteins have diverged markedly faster than their mammalian homologues. Comparison with the human genome suggests approximately 900 previously unannotated human genes. Analysis of the Tetraodon and human genomes shows that whole-genome duplication occurred in the teleost fish lineage, subsequent to its divergence from mammals. The analysis also makes it possible to infer the basic structure of the ancestral bony vertebrate genome, which was composed of 12 chromosomes, and to reconstruct much of the evolutionary history of ancient and recent chromosome rearrangements leading to the modern human karyotype.

DOI: 10.1038/nature01554

Cited 1,560 times

The genome sequence of the filamentous fungus Neurospora crassa

Neurospora crassa is a central organism in the history of twentieth-century genetics, biochemistry and molecular biology. Here, we report a high-quality draft sequence of the N. crassa genome. The approximately 40-megabase genome encodes about 10,000 protein-coding genes, more than twice as many as in the fission yeast Schizosaccharomyces pombe and only about 25% fewer than in the fruitfly Drosophila melanogaster. Analysis of the gene set yields insights into unexpected aspects of Neurospora biology including the identification of genes potentially associated with red light photobiology, genes implicated in secondary metabolism, and important differences in Ca2+ signalling as compared with plants and animals. Neurospora possesses the widest array of genome defence mechanisms known for any eukaryotic organism, including a process unique to fungi called repeat-induced point mutation (RIP). Genome analysis suggests that RIP has had a profound impact on genome evolution, greatly slowing the creation of new genes through genomic duplication and resulting in a genome with an unusually low proportion of closely related genes.

DOI: 10.1073/pnas.1017351108

Cited 1,491 times

High-quality draft assemblies of mammalian genomes from massively parallel sequence data

Massively parallel DNA sequencing technologies are revolutionizing genomics by making it possible to generate billions of relatively short (~100-base) sequence reads at very low cost. Whereas such data can be readily used for a wide range of biomedical applications, it has proven difficult to use them to generate high-quality de novo genome assemblies of large, repeat-rich vertebrate genomes. To date, the genome assemblies generated from such data have fallen far short of those obtained with the older (but much more expensive) capillary-based sequencing approach. Here, we report the development of an algorithm for genome assembly, ALLPATHS-LG, and its application to massively parallel DNA sequence data from the human and mouse genomes, generated on the Illumina platform. The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. In particular, the base accuracy is high (≥99.95%) and the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing. The combination of improved sequencing technology and improved computational methods should now make it possible to increase dramatically the de novo sequencing of large genomes. The ALLPATHS-LG program is available at http://www.broadinstitute.org/science/programs/genome-biology/crd.
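
The scaffold N50 figures quoted above follow the standard definition of N50: the largest length L such that scaffolds of length at least L together cover at least half of the total assembled bases. The sketch below computes it for a list of lengths; the function name `n50` and the example lengths are made up for illustration and are unrelated to the ALLPATHS-LG code base.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical scaffold lengths in kilobases.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

Because N50 weights long scaffolds by the bases they contain, a few very long scaffolds can raise it substantially even when many short ones remain.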

DOI: 10.1038/nature03449

Cited 1,480 times

The genome sequence of the rice blast fungus Magnaporthe grisea

Magnaporthe grisea is the most destructive pathogen of rice worldwide and the principal model organism for elucidating the molecular basis of fungal disease of plants. Here, we report the draft sequence of the M. grisea genome. Analysis of the gene set provides an insight into the adaptations required by a fungus to cause disease. The genome encodes a large and diverse set of secreted proteins, including those defined by unusual carbohydrate-binding domains. This fungus also possesses an expanded family of G-protein-coupled receptors, several new virulence-associated genes and large suites of enzymes involved in secondary metabolism. Consistent with a role in fungal pathogenesis, the expression of several of these genes is upregulated during the early stages of infection-related development. The M. grisea genome has been subject to invasion and proliferation of active transposable elements, reflecting the clonal nature of this fungus imposed by widespread rice cultivation.

DOI: 10.1038/nature08358

Cited 1,318 times

Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans

Phytophthora infestans is the most destructive pathogen of potato and a model organism for the oomycetes, a distinct lineage of fungus-like eukaryotes that are related to organisms such as brown algae and diatoms. As the agent of the Irish potato famine in the mid-nineteenth century, P. infestans has had a tremendous effect on human history, resulting in famine and population displacement. To this day, it affects world agriculture by causing the most destructive disease of potato, the fourth largest food crop and a critical alternative to the major cereal crops for feeding the world’s population. Current annual worldwide potato crop losses due to late blight are conservatively estimated at $6.7 billion. Management of this devastating pathogen is challenged by its remarkable speed of adaptation to control strategies such as genetically resistant cultivars. Here we report the sequence of the P. infestans genome, which at ∼240 megabases (Mb) is by far the largest and most complex genome sequenced so far in the chromalveolates. Its expansion results from a proliferation of repetitive DNA accounting for ∼74% of the genome. Comparison with two other Phytophthora genomes showed rapid turnover and extensive expansion of specific families of secreted disease effector proteins, including many genes that are induced during infection or are predicted to have activities that alter host physiology. These fast-evolving effector genes are localized to highly dynamic and expanded regions of the P. infestans genome. This probably plays a crucial part in the rapid adaptability of the pathogen to host plants and underpins its evolutionary potential.

DOI: 10.1038/nbt.1523

Cited 1,278 times

Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing

Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA 'baits' to fish targets out of a 'pond' of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that approximately 60% of target bases in the exonic 'catch', and approximately 80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.

DOI: 10.1038/nbt.1633

Cited 1,186 times

Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs

Massively parallel cDNA sequencing (RNA-Seq) provides an unbiased way to study a transcriptome, including both coding and noncoding genes. Until now, most RNA-Seq studies have depended crucially on existing annotations and thus focused on expression levels and variation in known transcripts. Here, we present Scripture, a method to reconstruct the transcriptome of a mammalian cell using only RNA-Seq reads and the genome sequence. We applied it to mouse embryonic stem cells, neuronal precursor cells and lung fibroblasts to accurately reconstruct the full-length gene structures for most known expressed genes. We identified substantial variation in protein coding genes, including thousands of novel 5' start sites, 3' ends and internal coding exons. We then determined the gene structures of more than a thousand large intergenic noncoding RNA (lincRNA) and antisense loci. Our results open the way to direct experimental manipulation of thousands of noncoding RNAs and demonstrate the power of ab initio reconstruction to render a comprehensive picture of mammalian transcriptomes.

DOI: 10.1126/science.1259657

Cited 1,135 times

Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak

Evolution of Ebola virus over time. The high rate of mortality in the current Ebola epidemic has made it difficult for researchers to collect samples of the virus and study its evolution. Gire et al. describe Ebola epidemiology on the basis of 99 whole-genome sequences, including samples from 78 affected individuals. The authors analyzed changes in the viral sequence and conclude that the current outbreak probably resulted from the spread of the virus from central Africa in the past decade. The outbreak started from a single transmission event from an unknown animal reservoir into the human population. Two viral lineages from Guinea then spread from person to person into Sierra Leone. Science, this issue p. 1369

DOI: 10.1101/gr.5571506

Cited 1,097 times

Chromosome Conformation Capture Carbon Copy (5C): A massively parallel solution for mapping interactions between genomic elements

Physical interactions between genetic elements located throughout the genome play important roles in gene regulation and can be identified with the Chromosome Conformation Capture (3C) methodology. 3C converts physical chromatin interactions into specific ligation products, which are quantified individually by PCR. Here we present a high-throughput 3C approach, 3C-Carbon Copy (5C), that employs microarrays or quantitative DNA sequencing using 454-technology as detection methods. We applied 5C to analyze a 400-kb region containing the human beta-globin locus and a 100-kb conserved gene desert region. We validated 5C by detection of several previously identified looping interactions in the beta-globin locus. We also identified a new looping interaction in K562 cells between the beta-globin Locus Control Region and the gamma-beta-globin intergenic region. Interestingly, this region has been implicated in the control of developmental globin gene switching. 5C should be widely applicable for large-scale mapping of cis- and trans- interaction networks of genomic elements and for the study of higher-order chromosome structure.

DOI: 10.1038/nature05248

Cited 1,086 times

Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis

Ustilago maydis is a ubiquitous pathogen of maize and a well-established model organism for the study of plant-microbe interactions. This basidiomycete fungus does not use aggressive virulence strategies to kill its host. U. maydis belongs to the group of biotrophic parasites (the smuts) that depend on living tissue for proliferation and development. Here we report the genome sequence for a member of this economically important group of biotrophic fungi. The 20.5-million-base U. maydis genome assembly contains 6,902 predicted protein-encoding genes and lacks pathogenicity signatures found in the genomes of aggressive pathogenic fungi, for example a battery of cell-wall-degrading enzymes. However, we detected unexpected genomic features responsible for the pathogenicity of this organism. Specifically, we found 12 clusters of genes encoding small secreted proteins with unknown function. A significant fraction of these genes exists in small gene families. Expression analysis showed that most of the genes contained in these clusters are regulated together and induced in infected tissue. Deletion of individual clusters altered the virulence of U. maydis in five cases, ranging from a complete lack of symptoms to hypervirulence. Despite years of research into the mechanism of pathogenicity in U. maydis, no 'true' virulence factors had been previously identified. Thus, the discovery of the secreted protein gene clusters and the functional demonstration of their decisive role in the infection process illuminate previously unknown mechanisms of pathogenicity operating in biotrophic fungi. Genomic analysis is, similarly, likely to open up new avenues for the discovery of virulence determinants in other pathogens.

DOI: 10.1126/science.1138878

Cited 1,060 times

Genome Sequence of Aedes aegypti, a Major Arbovirus Vector

We present a draft sequence of the genome of Aedes aegypti, the primary vector for yellow fever and dengue fever, which at ∼1376 million base pairs is about 5 times the size of the genome of the malaria vector Anopheles gambiae. Nearly 50% of the Ae. aegypti genome consists of transposable elements. These contribute to a factor of ∼4 to 6 increase in average gene length and in sizes of intergenic regions relative to An. gambiae and Drosophila melanogaster. Nonetheless, chromosomal synteny is generally maintained among all three insects, although conservation of orthologous gene order is higher (by a factor of ∼2) between the mosquito species than between either of them and the fruit fly. An increase in genes encoding odorant binding, cytochrome P450, and cuticle domains relative to An. gambiae suggests that members of these protein families underpin some of the biological differences between the two mosquito species.

DOI: 10.1186/gb-2011-12-2-r18

Cited 1,000 times

Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries

Despite the ever-increasing output of Illumina sequencing data, loci with extreme base compositions are often under-represented or absent. To evaluate sources of base-composition bias, we traced genomic sequences ranging from 6% to 90% GC through the process by quantitative PCR. We identified PCR during library preparation as a principal source of bias and optimized the conditions. Our improved protocol significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.

DOI: 10.1016/j.cell.2006.10.040

Cited 939 times

Large-Scale Sequencing Reveals 21U-RNAs and Additional MicroRNAs and Endogenous siRNAs in C. elegans

We sequenced approximately 400,000 small RNAs from Caenorhabditis elegans. Another 18 microRNA (miRNA) genes were identified, thereby extending to 112 our tally of confidently identified miRNA genes in C. elegans. Also observed were thousands of endogenous siRNAs generated by RNA-directed RNA polymerases acting preferentially on transcripts associated with spermatogenesis and transposons. In addition, a third class of nematode small RNAs, called 21U-RNAs, was discovered. 21U-RNAs are precisely 21 nucleotides long, begin with a uridine 5'-monophosphate but are diverse in their remaining 20 nucleotides, and appear modified at their 3'-terminal ribose. 21U-RNAs originate from more than 5700 genomic loci dispersed in two broad regions of chromosome IV, primarily between protein-coding genes or within their introns. These loci share a large upstream motif that enables accurate prediction of additional 21U-RNAs. The motif is conserved in other nematodes, presumably because of its importance for producing these diverse, autonomously expressed, small RNAs (dasRNAs).

DOI: 10.1371/journal.pgen.1000242

Cited 922 times

Genomewide Analysis of PRC1 and PRC2 Occupancy Identifies Two Classes of Bivalent Domains

In embryonic stem (ES) cells, bivalent chromatin domains with overlapping repressive (H3 lysine 27 tri-methylation) and activating (H3 lysine 4 tri-methylation) histone modifications mark the promoters of more than 2,000 genes. To gain insight into the structure and function of bivalent domains, we mapped key histone modifications and subunits of Polycomb-repressive complexes 1 and 2 (PRC1 and PRC2) genomewide in human and mouse ES cells by chromatin immunoprecipitation, followed by ultra high-throughput sequencing. We find that bivalent domains can be segregated into two classes—the first occupied by both PRC2 and PRC1 (PRC1-positive) and the second specifically bound by PRC2 (PRC2-only). PRC1-positive bivalent domains appear functionally distinct as they more efficiently retain lysine 27 tri-methylation upon differentiation, show stringent conservation of chromatin state, and associate with an overwhelming number of developmental regulator gene promoters. We also used computational genomics to search for sequence determinants of Polycomb binding. This analysis revealed that the genomewide locations of PRC2 and PRC1 can be largely predicted from the locations, sizes, and underlying motif contents of CpG islands. We propose that large CpG islands depleted of activating motifs confer epigenetic memory by recruiting the full repertoire of Polycomb complexes in pluripotent cells.

DOI: 10.1101/gr.7337908

Cited 799 times

ALLPATHS: De novo assembly of whole-genome shotgun microreads

New DNA sequencing technologies deliver data at dramatically lower costs but demand new analytical methods to take full advantage of the very short reads that they produce. We provide an initial, theoretical solution to the challenge of de novo assembly from whole-genome shotgun "microreads." For 11 genomes of sizes up to 39 Mb, we generated high-quality assemblies from 80x coverage by paired 30-base simulated reads modeled after real Illumina-Solexa reads. The bacterial genomes of Campylobacter jejuni and Escherichia coli assemble optimally, yielding single perfect contigs, and larger genomes yield assemblies that are highly connected and accurate. Assemblies are presented in a graph form that retains intrinsic ambiguities such as those arising from polymorphism, thereby providing information that has been absent from previous genome assemblies. For both C. jejuni and E. coli, this assembly graph is a single edge encompassing the entire genome. Larger genomes produce more complicated graphs, but the vast majority of the bases in their assemblies are present in long edges that are nearly always perfect. We describe a general method for genome assembly that can be applied to all types of DNA sequence data, not only short read data, but also conventional sequence reads.

DOI: 10.1038/s41590-019-0378-1

Cited 779 times

Defining inflammatory cell states in rheumatoid arthritis joint synovial tissues by integrating single-cell transcriptomics and mass cytometry

To define the cell populations that drive joint inflammation in rheumatoid arthritis (RA), we applied single-cell RNA sequencing (scRNA-seq), mass cytometry, bulk RNA sequencing (RNA-seq) and flow cytometry to T cells, B cells, monocytes, and fibroblasts from 51 samples of synovial tissue from patients with RA or osteoarthritis (OA). Utilizing an integrated strategy based on canonical correlation analysis of 5,265 scRNA-seq profiles, we identified 18 unique cell populations. Combining mass cytometry and transcriptomics revealed cell states expanded in RA synovia: THY1(CD90)+HLA-DRAhi sublining fibroblasts, IL1B+ pro-inflammatory monocytes, ITGAX+TBX21+ autoimmune-associated B cells and PDCD1+ peripheral helper T (TPH) cells and follicular helper T (TFH) cells. We defined distinct subsets of CD8+ T cells characterized by GZMK+, GZMB+, and GNLY+ phenotypes. We mapped inflammatory mediators to their source cell populations; for example, we attributed IL6 expression to THY1+HLA-DRAhi fibroblasts and IL1B production to pro-inflammatory monocytes. These populations are potentially key mediators of RA pathogenesis.

DOI: 10.1186/gb-2013-14-5-r51

Cited 756 times

Characterizing and measuring bias in sequence data

DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias. We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage. The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.

DOI: 10.1101/gad.1884710

Cited 744 times

Mammalian microRNAs: experimental evaluation of novel and previously annotated genes

MicroRNAs (miRNAs) are small regulatory RNAs that derive from distinctive hairpin transcripts. To learn more about the miRNAs of mammals, we sequenced 60 million small RNAs from mouse brain, ovary, testes, embryonic stem cells, three embryonic stages, and whole newborns. Analysis of these sequences confirmed 398 annotated miRNA genes and identified 108 novel miRNA genes. More than 150 previously annotated miRNAs and hundreds of candidates failed to yield sequenced RNAs with miRNA-like features. Ectopically expressing these previously proposed miRNA hairpins also did not yield small RNAs, whereas ectopically expressing the confirmed and newly identified hairpins usually did yield small RNAs with the classical miRNA features, including dependence on the Drosha endonuclease for processing. These experiments, which suggest that previous estimates of conserved mammalian miRNAs were inflated, provide a substantially revised list of confidently identified murine miRNAs from which to infer the general features of mammalian miRNAs. Our analyses also revealed new aspects of miRNA biogenesis and modification, including tissue-specific strand preferences, sequential Dicer cleavage of a metazoan precursor miRNA (pre-miRNA), consequential 5′ heterogeneity, newly identified instances of miRNA editing, and evidence for widespread pre-miRNA uridylation reminiscent of miRNA regulation by Lin28.

DOI: 10.1038/nmeth.1491

Cited 660 times

Comprehensive comparative analysis of strand-specific RNA sequencing methods

Strand-specific, massively parallel cDNA sequencing (RNA-seq) is a powerful tool for transcript discovery, genome annotation and expression profiling. There are multiple published methods for strand-specific RNA-seq, but no consensus exists as to how to choose between them. Here we developed a comprehensive computational pipeline to compare library quality metrics from any RNA-seq method. Using the well-annotated Saccharomyces cerevisiae transcriptome as a benchmark, we compared seven library-construction protocols, including both published and our own methods. We found marked differences in strand specificity, library complexity, evenness and continuity of coverage, agreement with known annotations and accuracy for expression profiling. Weighing each method's performance and ease, we identified the dUTP second-strand marking and the Illumina RNA ligation methods as the leading protocols, with the former benefitting from the current availability of paired-end sequencing. Our analysis provides a comprehensive benchmark, and our computational pipeline is applicable for assessment of future protocols in other organisms.

DOI: 10.1101/gr.223902

Cited 580 times

The Genome of M. acetivorans Reveals Extensive Metabolic and Physiological Diversity

Methanogenesis, the biological production of methane, plays a pivotal role in the global carbon cycle and contributes significantly to global warming. The majority of methane in nature is derived from acetate. Here we report the complete genome sequence of an acetate-utilizing methanogen, Methanosarcina acetivorans C2A. Methanosarcineae are the most metabolically diverse methanogens, thrive in a broad range of environments, and are unique among the Archaea in forming complex multicellular structures. This diversity is reflected in the genome of M. acetivorans. At 5,751,492 base pairs it is by far the largest known archaeal genome. The 4524 open reading frames code for a strikingly wide and unanticipated variety of metabolic and cellular capabilities. The presence of novel methyltransferases indicates the likelihood of undiscovered natural energy sources for methanogenesis, whereas the presence of single-subunit carbon monoxide dehydrogenases raises the possibility of nonmethanogenic growth. Although motility has not been observed in any Methanosarcineae, a flagellin gene cluster and two complete chemotaxis gene clusters were identified. The availability of genetic methods, coupled with its physiological and metabolic diversity, makes M. acetivorans a powerful model organism for the study of archaeal biology. [Sequence, data, annotations and analyses are available at http://www-genome.wi.mit.edu/.]

DOI: 10.1186/gb-2011-12-1-r1

Cited 557 times

A scalable, fully automated process for construction of sequence-ready human exome targeted capture libraries

Genome targeting methods enable cost-effective capture of specific subsets of the genome for sequencing. We present here an automated, highly scalable method for carrying out the Solution Hybrid Selection capture approach that provides a dramatic increase in scale and throughput of sequence-ready libraries produced. Significant process improvements and a series of in-process quality control checkpoints are also added. These process improvements can also be used in a manual version of the protocol.

DOI: 10.1016/j.immuni.2015.04.019

Cited 531 times

Cervicovaginal Bacteria Are a Major Modulator of Host Inflammatory Responses in the Female Genital Tract

Colonization by Lactobacillus in the female genital tract is thought to be critical for maintaining genital health. However, little is known about how genital microbiota influence host immune function and modulate disease susceptibility. We studied a cohort of asymptomatic young South African women and found that the majority of participants had genital communities with low Lactobacillus abundance and high ecological diversity. High-diversity communities strongly correlated with genital pro-inflammatory cytokine concentrations in both cross-sectional and longitudinal analyses. Transcriptional profiling suggested that genital antigen-presenting cells sense gram-negative bacterial products in situ via Toll-like receptor 4 signaling, contributing to genital inflammation through activation of the NF-κB signaling pathway and recruitment of lymphocytes by chemokine production. Our study proposes a mechanism by which cervicovaginal microbiota impact genital inflammation and thereby might affect a woman's reproductive health, including her risk of acquiring HIV.

DOI: 10.1038/nbt.1861

Cited 524 times

Metabolic labeling of RNA uncovers principles of RNA production and degradation dynamics in mammalian cells

Cellular RNA levels are determined by the interplay of RNA production, processing and degradation. However, because most studies of RNA regulation do not distinguish the separate contributions of these processes, little is known about how they are temporally integrated. Here we combine metabolic labeling of RNA at high temporal resolution with advanced RNA quantification and computational modeling to estimate RNA transcription and degradation rates during the response of mouse dendritic cells to lipopolysaccharide. We find that changes in transcription rates determine the majority of temporal changes in RNA levels, but that changes in degradation rates are important for shaping sharp 'peaked' responses. We used sequencing of the newly transcribed RNA population to estimate temporally constant RNA processing and degradation rates genome wide. Degradation rates vary significantly between genes and contribute to the observed differences in the dynamic response. Certain transcripts, including those encoding cytokines and transcription factors, mature faster. Our study provides a quantitative approach to study the integrative process of RNA regulation.

DOI: 10.1038/s41590-019-0398-x

Cited 509 times

The immune cell landscape in kidneys of patients with lupus nephritis

Lupus nephritis is a potentially fatal autoimmune disease for which the current treatment is ineffective and often toxic. To develop mechanistic hypotheses of disease, we analyzed kidney samples from patients with lupus nephritis and from healthy control subjects using single-cell RNA sequencing. Our analysis revealed 21 subsets of leukocytes active in disease, including multiple populations of myeloid cells, T cells, natural killer cells and B cells that demonstrated both pro-inflammatory responses and inflammation-resolving responses. We found evidence of local activation of B cells correlated with an age-associated B-cell signature and evidence of progressive stages of monocyte differentiation within the kidney. A clear interferon response was observed in most cells. Two chemokine receptors, CXCR4 and CX3CR1, were broadly expressed, implying a potentially central role in cell trafficking. Gene expression of immune cells in urine and kidney was highly correlated, which would suggest that urine might serve as a surrogate for kidney biopsies.

DOI: 10.1038/nmeth.1276

Cited 477 times

High-resolution mapping of copy-number alterations with massively parallel sequencing

Cancer results from somatic alterations in key genes, including point mutations, copy-number alterations and structural rearrangements. A powerful way to discover cancer-causing genes is to identify genomic regions that show recurrent copy-number alterations (gains and losses) in tumor genomes. Recent advances in sequencing technologies suggest that massively parallel sequencing may provide a feasible alternative to DNA microarrays for detecting copy-number alterations. Here we present: (i) a statistical analysis of the power to detect copy-number alterations of a given size; (ii) SegSeq, an algorithm to segment equal copy numbers from massively parallel sequence data; and (iii) analysis of experimental data from three matched pairs of tumor and normal cell lines. We show that a collection of approximately 14 million aligned sequence reads from human cell lines has comparable power to detect events as the current generation of DNA microarrays and has over twofold better precision for localizing breakpoints (typically, to within approximately 1 kilobase).

DOI: 10.1126/science.1203357

Cited 465 times

Comparative Functional Genomics of the Fission Yeasts

A combined analysis of genome sequence, structure, and expression gives insights into fission yeast biology.

DOI: 10.1126/science.1193070

Cited 405 times

Genome Evolution Following Host Jumps in the Irish Potato Famine Pathogen Lineage

Many plant pathogens, including those in the lineage of the Irish potato famine organism Phytophthora infestans, evolve by host jumps followed by specialization. However, how host jumps affect genome evolution remains largely unknown. To determine the patterns of sequence variation in the P. infestans lineage, we resequenced six genomes of four sister species. This revealed uneven evolutionary rates across genomes with genes in repeat-rich regions showing higher rates of structural polymorphisms and positive selection. These loci are enriched in genes induced in planta, implicating host adaptation in genome evolution. Unexpectedly, genes involved in epigenetic processes formed another class of rapidly evolving residents of the gene-sparse regions. These results demonstrate that dynamic repeat-rich genome compartments underpin accelerated gene evolution following host jumps in this pathogen lineage.

DOI: 10.1016/j.molcel.2012.07.030

Cited 349 times

A High-Throughput Chromatin Immunoprecipitation Approach Reveals Principles of Dynamic Gene Regulation in Mammals

Understanding the principles governing mammalian gene regulation has been hampered by the difficulty in measuring in vivo binding dynamics of large numbers of transcription factors (TF) to DNA. Here, we develop a high-throughput Chromatin ImmunoPrecipitation (HT-ChIP) method to systematically map protein-DNA interactions. HT-ChIP was applied to define the dynamics of DNA binding by 25 TFs and 4 chromatin marks at 4 time-points following pathogen stimulus of dendritic cells. Analyzing over 180,000 TF-DNA interactions we find that TFs vary substantially in their temporal binding landscapes. This data suggests a model for transcription regulation whereby TF networks are hierarchically organized into cell differentiation factors, factors that bind targets prior to stimulus to prime them for induction, and factors that regulate specific gene programs. Overlaying HT-ChIP data on gene-expression dynamics shows that many TF-DNA interactions are established prior to the stimuli, predominantly at immediate-early genes, and identified specific TF ensembles that coordinately regulate gene-induction.

DOI: 10.1002/hep.22549

Cited 340 times

Naturally occurring dominant resistance mutations to hepatitis C virus protease and polymerase inhibitors in treatment-naïve patients

Resistance mutations to hepatitis C virus (HCV) nonstructural protein 3 (NS3) protease inhibitors in <1% of the viral quasispecies may still allow >1000-fold viral load reductions upon treatment, consistent with their reported reduced replicative fitness in vitro. Recently, however, an R155K protease mutation was reported as the dominant quasispecies in a treatment-naïve individual, raising concerns about possible full drug resistance. To investigate the prevalence of dominant resistance mutations against specifically targeted antiviral therapy for HCV (STAT-C) in the population, we analyzed HCV genome sequences from 507 treatment-naïve patients infected with HCV genotype 1 from the United States, Germany, and Switzerland. Phylogenetic sequence analysis and viral load data were used to identify the possible spread of replication-competent, drug-resistant viral strains in the population and to infer the consequences of these mutations upon viral replication in vivo. Mutations described to confer resistance to the protease inhibitors Telaprevir, BILN2061, ITMN-191, SCH6 and Boceprevir; the NS5B polymerase inhibitor AG-021541; and to the NS4A antagonist ACH-806 were observed mostly as sporadic, unrelated cases, at frequencies between 0.3% and 2.8% in the population, including two patients with possible multidrug resistance. Collectively, however, 8.6% of the patients infected with genotype 1a and 1.4% of those infected with genotype 1b carried at least one dominant resistance mutation. Viral loads were high in the majority of these patients, suggesting that drug-resistant viral strains might achieve replication levels comparable to nonresistant viruses in vivo. Naturally occurring dominant STAT-C resistance mutations are common in treatment-naïve patients infected with HCV genotype 1. Their influence on treatment outcome should further be characterized to evaluate possible benefits of drug resistance testing for individual tailoring of drug combinations when treatment options are limited due to previous nonresponse to peginterferon and ribavirin.

DOI: 10.1101/gr.221028.117

Cited 296 times

SvABA: genome-wide detection of structural variants and indels by local assembly

Structural variants (SVs), including small insertion and deletion variants (indels), are challenging to detect through standard alignment-based variant calling methods. Sequence assembly offers a powerful approach to identifying SVs, but is difficult to apply at scale genome-wide for SV detection due to its computational complexity and the difficulty of extracting SVs from assembly contigs. We describe SvABA, an efficient and accurate method for detecting SVs from short-read sequencing data using genome-wide local assembly with low memory and computing requirements. We evaluated SvABA's performance on the NA12878 human genome and in simulated and real cancer genomes. SvABA demonstrates superior sensitivity and specificity across a large spectrum of SVs and substantially improves detection performance for variants in the 20–300 bp range, compared with existing methods. SvABA also identifies complex somatic rearrangements with chains of short (<1000 bp) templated-sequence insertions copied from distant genomic regions. We applied SvABA to 344 cancer genomes from 11 cancer types and found that short templated-sequence insertions occur in ∼4% of all somatic rearrangements. Finally, we demonstrate that SvABA can identify sites of viral integration and cancer driver alterations containing medium-sized (50–300 bp) SVs.

DOI: 10.1016/j.cell.2015.06.007

Cited 291 times

Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone

The 2013-2015 Ebola virus disease (EVD) epidemic is caused by the Makona variant of Ebola virus (EBOV). Early in the epidemic, genome sequencing provided insights into virus evolution and transmission and offered important information for outbreak response. Here, we analyze sequences from 232 patients sampled over 7 months in Sierra Leone, along with 86 previously released genomes from earlier in the epidemic. We confirm sustained human-to-human transmission within Sierra Leone and find no evidence for import or export of EBOV across national borders after its initial introduction. Using high-depth replicate sequencing, we observe both host-to-host transmission and recurrent emergence of intrahost genetic variants. We trace the increasing impact of purifying selection in suppressing the accumulation of nonsynonymous mutations over time. Finally, we note changes in the mucin-like domain of EBOV glycoprotein that merit further investigation. These findings clarify the movement of EBOV within the region and describe viral evolution during prolonged human-to-human transmission.

DOI: 10.1890/12-1693.1

Cited 291 times

A first comprehensive census of fungi in soil reveals both hyperdiversity and fine‐scale niche partitioning

Fungi play key roles in ecosystems as mutualists, pathogens, and decomposers. Current estimates of global species richness are highly uncertain, and the importance of stochastic vs. deterministic forces in the assembly of fungal communities is unknown. Molecular studies have so far failed to reach saturated, comprehensive estimates of fungal diversity. To obtain a more accurate estimate of global fungal diversity, we used a direct molecular approach to census diversity in a boreal ecosystem with precisely known plant diversity, and we carefully evaluated adequacy of sampling and accuracy of species delineation. We achieved the first exhaustive enumeration of fungi in soil, recording 1002 taxa in this system. We show that the fungus : plant ratio in Picea mariana forest soils from interior Alaska is at least 17:1 and is regionally stable. A global extrapolation of this ratio would suggest 6 million species of fungi, as opposed to leading estimates ranging from 616 000 to 1.5 million. We also find that closely related fungi often occupy divergent niches. This pattern is seen in fungi spanning all major functional guilds and four phyla, suggesting a major role of deterministic niche partitioning in community assembly. Extinctions and range shifts are reorganizing biodiversity on Earth, yet our results suggest that 98% of fungi remain undescribed and that many of these species occupy unique niches.

DOI: 10.1101/gr.103697.109

Cited 262 times

Integrative analysis of the melanoma transcriptome

Global studies of transcript structure and abundance in cancer cells enable the systematic discovery of aberrations that contribute to carcinogenesis, including gene fusions, alternative splice isoforms, and somatic mutations. We developed a systematic approach to characterize the spectrum of cancer-associated mRNA alterations through integration of transcriptomic and structural genomic data, and we applied this approach to generate new insights into melanoma biology. Using paired-end massively parallel sequencing of cDNA (RNA-seq) together with analyses of high-resolution chromosomal copy number data, we identified 11 novel melanoma gene fusions produced by underlying genomic rearrangements, as well as 12 novel readthrough transcripts. We mapped these chimeric transcripts to base-pair resolution and traced them to their genomic origins using matched chromosomal copy number information. We also used these data to discover and validate base-pair mutations that accumulated in these melanomas, revealing a surprisingly high rate of somatic mutation and lending support to the notion that point mutations constitute the major driver of melanoma progression. Taken together, these results may indicate new avenues for target discovery in melanoma, while also providing a template for large-scale transcriptome studies across many tumor types.

DOI: 10.1073/pnas.1121491109

Cited 260 times

Genomic epidemiology of the Escherichia coli O104:H4 outbreaks in Europe, 2011

The degree to which molecular epidemiology reveals information about the sources and transmission patterns of an outbreak depends on the resolution of the technology used and the samples studied. Isolates of Escherichia coli O104:H4 from the outbreak centered in Germany in May-July 2011, and the much smaller outbreak in southwest France in June 2011, were indistinguishable by standard tests. We report a molecular epidemiological analysis using multiplatform whole-genome sequencing and analysis of multiple isolates from the German and French outbreaks. Isolates from the German outbreak showed remarkably little diversity, with only two single nucleotide polymorphisms (SNPs) found in isolates from four individuals. Surprisingly, we found much greater diversity (19 SNPs) in isolates from seven individuals infected in the French outbreak. The German isolates form a clade within the more diverse French outbreak strains. Moreover, five isolates derived from a single infected individual from the French outbreak had extremely limited diversity. The striking difference in diversity between the German and French outbreak samples is consistent with several hypotheses, including a bottleneck that purged diversity in the German isolates, variation in mutation rates in the two E. coli outbreak populations, or uneven distribution of diversity in the seed populations that led to each outbreak.

DOI: 10.1038/ncomms3325

Cited 257 times

The Capsaspora genome reveals a complex unicellular prehistory of animals

To reconstruct the evolutionary origin of multicellular animals from their unicellular ancestors, the genome sequences of diverse unicellular relatives are essential. However, only the genome of the choanoflagellate Monosiga brevicollis has been reported to date. Here we completely sequence the genome of the filasterean Capsaspora owczarzaki, the closest known unicellular relative of metazoans besides choanoflagellates. Analyses of this genome alter our understanding of the molecular complexity of metazoans' unicellular ancestors showing that they had a richer repertoire of proteins involved in cell adhesion and transcriptional regulation than previously inferred only with the choanoflagellate genome. Some of these proteins were secondarily lost in choanoflagellates. In contrast, most intercellular signalling systems controlling development evolved later concomitant with the emergence of the first metazoans. We propose that the acquisition of these metazoan-specific developmental systems and the co-option of pre-existing genes drove the evolutionary transition from unicellular protists to metazoans.

DOI: 10.1101/gr.070227.107

Cited 249 times

Quality scores and SNP detection in sequencing-by-synthesis systems

Promising new sequencing technologies, based on sequencing-by-synthesis (SBS), are starting to deliver large amounts of DNA sequence at very low cost. Polymorphism detection is a key application. We describe general methods for improved quality scores and accurate automated polymorphism detection, and apply them to data from the Roche (454) Genome Sequencer 20. We assess our methods using known-truth data sets, which is critical to the validity of the assessments. We developed informative, base-by-base error predictors for this sequencer and used a variant of the phred binning algorithm to combine them into a single empirically derived quality score. These quality scores are more useful than those produced by the system software: They both better predict actual error rates and identify many more high-quality bases. We developed a SNP detection method, with variants for low coverage, high coverage, and PCR amplicon applications, and evaluated it on known-truth data sets. We demonstrate good specificity in single reads, and excellent specificity (no false positives in 215 kb of genome) in high-coverage data.
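
The idea of an empirically derived, phred-style quality score can be sketched as follows. The conversion Q = -10·log10(p) is the standard phred definition. Everything else is an assumption for illustration and is not the paper's variant of the binning algorithm: the integer bin identifiers, the `(bin_id, is_error)` observation format and the add-one smoothing (which only serves to keep error-free bins finite).

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def phred_from_error_rate(p: float) -> float:
    """Standard phred relation: Q = -10 * log10(error probability)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def empirical_quality_by_bin(
    observations: Iterable[Tuple[int, bool]],
) -> Dict[int, float]:
    """Map each bin to a quality score from its observed error rate.

    Each observation is (bin_id, is_error), where is_error comes from
    comparing a base call against a known-truth sequence. The bin ids stand
    in for whatever grouping of error predictors is used (hypothetical here).
    """
    totals: Dict[int, int] = defaultdict(int)
    errors: Dict[int, int] = defaultdict(int)
    for bin_id, is_error in observations:
        totals[bin_id] += 1
        if is_error:
            errors[bin_id] += 1
    # Add-one smoothing (an assumption) avoids log10(0) for error-free bins.
    return {
        b: phred_from_error_rate((errors[b] + 1) / (totals[b] + 2))
        for b in totals
    }


# Toy known-truth data (invented): bin 0 is large and clean, bin 1 is small.
toy = [(0, False)] * 998 + [(1, False)] * 8
for bin_id, q in sorted(empirical_quality_by_bin(toy).items()):
    print(bin_id, round(q, 1))
```

Deriving the score from observed error rates on known-truth data, rather than from a model of the instrument, is what lets such a score track actual error rates.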

DOI: 10.1038/ng.2543

Cited 239 times

Mutations causing medullary cystic kidney disease type 1 lie in a large VNTR in MUC1 missed by massively parallel sequencing

Although genetic lesions responsible for some mendelian disorders can be rapidly discovered through massively parallel sequencing of whole genomes or exomes, not all diseases readily yield to such efforts. We describe the illustrative case of the simple mendelian disorder medullary cystic kidney disease type 1 (MCKD1), mapped more than a decade ago to a 2-Mb region on chromosome 1. Ultimately, only by cloning, capillary sequencing and de novo assembly did we find that each of six families with MCKD1 harbors an equivalent but apparently independently arising mutation in sequence markedly under-represented in massively parallel sequencing data: the insertion of a single cytosine in one copy (but a different copy in each family) of the repeat unit comprising the extremely long (∼1.5-5 kb), GC-rich (>80%) coding variable-number tandem repeat (VNTR) sequence in the MUC1 gene encoding mucin 1. These results provide a cautionary tale about the challenges in identifying the genes responsible for mendelian, let alone more complex, disorders through massively parallel sequencing.

DOI: 10.1186/1471-2164-13-375

Cited 233 times

Pacific biosciences sequencing technology for genotyping and variation discovery in human data

Pacific Biosciences technology provides a fundamentally new data type that provides the potential to overcome some limitations of current next generation sequencing platforms by providing significantly longer reads, single molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical amplicon resequencing projects. We evaluated the Pacific Biosciences technology for SNP discovery in medical resequencing projects using the Genome Analysis Toolkit, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs. We assessed data quality: most errors were indels (~14%) with few apparent miscalls (~1%). In this work, we define a custom data processing pipeline for Pacific Biosciences data for human data analysis. Critically, the error properties were largely free of the context-specific effects that affect other sequencing technologies. These data show excellent utility for follow-up validation and extension studies in human data and medical genetics projects, but can be extended to other organisms with a reference genome.

DOI: 10.1186/gb-2013-14-2-r15

Cited 223 times

Premetazoan genome evolution and the regulation of cell differentiation in the choanoflagellate Salpingoeca rosetta

Metazoan multicellularity is rooted in mechanisms of cell adhesion, signaling, and differentiation that first evolved in the progenitors of metazoans. To reconstruct the genome composition of metazoan ancestors, we sequenced the genome and transcriptome of the choanoflagellate Salpingoeca rosetta, a close relative of metazoans that forms rosette-shaped colonies of cells. A comparison of the 55 Mb S. rosetta genome with genomes from diverse opisthokonts suggests that the origin of metazoans was preceded by a period of dynamic gene gain and loss. The S. rosetta genome encodes homologs of cell adhesion, neuropeptide, and glycosphingolipid metabolism genes previously found only in metazoans and expands the repertoire of genes inferred to have been present in the progenitors of metazoans and choanoflagellates. Transcriptome analysis revealed that all four S. rosetta septins are upregulated in colonies relative to single cells, suggesting that these conserved cytokinesis proteins may regulate incomplete cytokinesis during colony development. Furthermore, genes shared exclusively by metazoans and choanoflagellates were disproportionately upregulated in colonies and the single cells from which they develop. The S. rosetta genome sequence refines the catalog of metazoan-specific genes while also extending the evolutionary history of certain gene families that are central to metazoan biology. Transcriptome data suggest that conserved cytokinesis genes, including septins, may contribute to S. rosetta colony formation and indicate that the initiation of colony development may preferentially draw upon genes shared with metazoans, while later stages of colony maturation are likely regulated by genes unique to S. rosetta.

DOI: 10.1186/gb-2012-13-3-r23

Cited 213 times

Efficient and robust RNA-seq process for cultured bacteria and complex community transcriptomes

We have developed a process for transcriptome analysis of bacterial communities that accommodates both intact and fragmented starting RNA and combines efficient rRNA removal with strand-specific RNA-seq. We applied this approach to an RNA mixture derived from three diverse cultured bacterial species and to RNA isolated from clinical stool samples. The resulting expression profiles were highly reproducible, enriched up to 40-fold for non-rRNA transcripts, and correlated well with profiles representing undepleted total RNA.

DOI: 10.1038/ng.3121

Cited 213 times

Comprehensive variation discovery in single human genomes

Complete knowledge of the genetic variation in individual human genomes is a crucial foundation for understanding the etiology of disease. Genetic variation is typically characterized by sequencing individual genomes and comparing reads to a reference. Existing methods do an excellent job of detecting variants in approximately 90% of the human genome; however, calling variants in the remaining 10% of the genome (largely low-complexity sequence and segmental duplications) is challenging. To improve variant calling, we developed a new algorithm, DISCOVAR, and examined its performance on improved, low-cost sequence data. Using a newly created reference set of variants from the finished sequence of 103 randomly chosen fosmids, we find that some standard variant call sets miss up to 25% of variants. We show that the combination of new methods and improved data increases sensitivity by several fold, with the greatest impact in challenging regions of the human genome.

DOI: 10.1186/1471-2164-13-734

Cited 210 times

How deep is deep enough for RNA-Seq profiling of bacterial transcriptomes?

High-throughput sequencing of cDNA libraries (RNA-Seq) has proven to be a highly effective approach for studying bacterial transcriptomes. A central challenge in designing RNA-Seq-based experiments is estimating a priori the number of reads per sample needed to detect and quantify thousands of individual transcripts with a large dynamic range of abundance. We have conducted a systematic examination of how changes in the number of RNA-Seq reads per sample influence both profiling of a single bacterial transcriptome and the comparison of gene expression among samples. Our findings suggest that the number of reads typically produced in a single lane of the Illumina HiSeq sequencer far exceeds the number needed to saturate the annotated transcriptomes of diverse bacteria growing in monoculture. Moreover, as sequencing depth increases, so too does the detection of cDNAs that likely correspond to spurious transcripts or genomic DNA contamination. Finally, even when dozens of barcoded individual cDNA libraries are sequenced in a single lane, the vast majority of transcripts in each sample can be detected and numerous genes differentially expressed between samples can be identified. Our analysis provides a guide for the many researchers seeking to determine the appropriate sequencing depth for RNA-Seq-based studies of diverse bacterial species.

DOI: 10.1101/gr.141515.112

Cited 207 times

Finished bacterial genomes from shotgun sequence data

Exceptionally accurate genome reference sequences have proven to be of great value to microbial researchers. Thus, to date, about 1800 bacterial genome assemblies have been "finished" at great expense with the aid of manual laboratory and computational processes that typically iterate over a period of months or even years. By applying a new laboratory design and new assembly algorithm to 16 samples, we demonstrate that assemblies exceeding finished quality can be obtained from whole-genome shotgun data and automated computation. Cost and time requirements are thus dramatically reduced.

DOI: 10.1038/ncomms10740

Cited 145 times

Genome analysis of three Pneumocystis species reveals adaptation mechanisms to life exclusively in mammalian hosts

Pneumocystis jirovecii is a major cause of life-threatening pneumonia in immunosuppressed patients including transplant recipients and those with HIV/AIDS, yet surprisingly little is known about the biology of this fungal pathogen. Here we report near complete genome assemblies for three Pneumocystis species that infect humans, rats and mice. Pneumocystis genomes are highly compact relative to other fungi, with substantial reductions of ribosomal RNA genes, transporters, transcription factors and many metabolic pathways, but contain expansions of surface proteins, especially a unique and complex surface glycoprotein superfamily, as well as proteases and RNA processing proteins. Unexpectedly, the key fungal cell wall components chitin and outer chain N-mannans are absent, based on genome content and experimental validation. Our findings suggest that Pneumocystis has developed unique mechanisms of adaptation to life exclusively in mammalian hosts, including dependence on the lungs for gas and nutrients and highly efficient strategies to escape both host innate and acquired immune defenses.

DOI: 10.1101/gr.2674004

Cited 234 times

The Complete Genome and Proteome of Mycoplasma mobile

Although often considered "minimal" organisms, mycoplasmas show a wide range of diversity with respect to host environment, phenotypic traits, and pathogenicity. Here we report the complete genomic sequence and proteogenomic map for the piscine mycoplasma Mycoplasma mobile, noted for its robust gliding motility. For the first time, proteomic data are used in the primary annotation of a new genome, providing validation of expression for many of the predicted proteins. Several novel features were discovered including a long repeating unit of DNA of approximately 2435 bp present in five complete copies that are shown to code for nearly identical yet uniquely expressed proteins. M. mobile has among the lowest DNA GC contents (24.9%) and most reduced set of tRNAs of any organism yet reported (28). Numerous instances of tandem duplication as well as lateral gene transfer are evident in the genome. The multiple available complete genome sequences for other motile and immotile mycoplasmas enabled us to use comparative genomic and phylogenetic methods to suggest several candidate genes that might be involved in motility. The results of these analyses leave open the possibility that gliding motility might have arisen independently more than once in the mycoplasma lineage.

DOI: 10.1073/pnas.0812841106

Cited 213 times

Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing

Defining the transcriptome, the repertoire of transcribed regions encoded in the genome, is a challenging experimental task. Current approaches, relying on sequencing of ESTs or cDNA libraries, are expensive and labor-intensive. Here, we present a general approach for ab initio discovery of the complete transcriptome of the budding yeast, based only on the unannotated genome sequence and millions of short reads from a single massively parallel sequencing run. Using novel algorithms, we automatically construct a highly accurate transcript catalog. Our approach automatically and fully defines 86% of the genes expressed under the given conditions, and discovers 160 previously undescribed transcription units of 250 bp or longer. It correctly demarcates the 5' and 3' UTR boundaries of 86 and 77% of expressed genes, respectively. The method further identifies 83% of known splice junctions in expressed genes, and discovers 25 previously uncharacterized introns, including 2 cases of condition-dependent intron retention. Our framework is applicable to poorly understood organisms, and can lead to greater understanding of the transcribed elements in an explored genome.

DOI: 10.1371/journal.pone.0005683

Cited 209 times

Quantitative Deep Sequencing Reveals Dynamic HIV-1 Escape and Large Population Shifts during CCR5 Antagonist Therapy In Vivo

High-throughput sequencing platforms provide an approach for detecting rare HIV-1 variants and documenting more fully quasispecies diversity. We applied this technology to the V3 loop-coding region of env in samples collected from 4 chronically HIV-infected subjects in whom CCR5 antagonist (vicriviroc [VVC]) therapy failed. Between 25,000-140,000 amplified sequences were obtained per sample. Profound baseline V3 loop sequence heterogeneity existed; predicted CXCR4-using populations were identified in a largely CCR5-using population. The V3 loop forms associated with subsequent virologic failure, either through CXCR4 use or the emergence of high-level VVC resistance, were present as minor variants at 0.8-2.8% of baseline samples. Extreme, rapid shifts in population frequencies toward these forms occurred, and deep sequencing provided a detailed view of the rapid evolutionary impact of VVC selection. Greater V3 diversity was observed post-selection. This previously unreported degree of V3 loop sequence diversity has implications for viral pathogenesis, vaccine design, and the optimal use of HIV-1 CCR5 antagonists.

DOI: 10.1371/journal.pone.0022365

Cited 191 times

High-Resolution Description of Antibody Heavy-Chain Repertoires in Humans

Antibodies' protective, pathological, and therapeutic properties result from their considerable diversity. This diversity is almost limitless in potential, but actual diversity is still poorly understood. Here we use deep sequencing to characterize the diversity of the heavy-chain CDR3 region, the most important contributor to antibody binding specificity, and the constituent V, D, and J segments that comprise it. We find that, during the stepwise D-J and then V-DJ recombination events, the choice of D and J segments exert some bias on each other; however, we find the choice of the V segment is essentially independent of both. V, D, and J segments are utilized with different frequencies, resulting in a highly skewed representation of VDJ combinations in the repertoire. Nevertheless, the pattern of segment usage was almost identical between two different individuals. The pattern of V, D, and J segment usage and recombination was insufficient to explain overlap that was observed between the two individuals' CDR3 repertoires. Finally, we find that while there are a near-infinite number of heavy-chain CDR3s in principle, there are about 3-9 million in the blood of an adult human being.

DOI: 10.1101/gr.3722605

Cited 183 times

Assembly of polymorphic genomes: Algorithms and application to Ciona savignyi

Whole-genome assembly is now used routinely to obtain high-quality draft sequence for the genomes of species with low levels of polymorphism. However, genome assembly remains extremely challenging for highly polymorphic species. The difficulty arises because two divergent haplotypes are sequenced together, making it difficult to distinguish alleles at the same locus from paralogs at different loci. We present here a method for assembling highly polymorphic diploid genomes that involves assembling the two haplotypes separately and then merging them to obtain a reference sequence. Our method was developed to assemble the genome of the sea squirt Ciona savignyi, which was sequenced to a depth of 12.7 x from a single wild individual. By comparing finished clones of the two haplotypes we determined that the sequenced individual had an extremely high heterozygosity rate, averaging 4.6% with significant regional variation and rearrangements at all physical scales. Applied to these data, our method produced a reference assembly covering 157 Mb, with N50 contig and scaffold sizes of 47 kb and 989 kb, respectively. Alignment of ESTs indicates that 88% of loci are present at least once and 81% exactly once in the reference assembly. Our method represented loci in a single copy more reliably and achieved greater contiguity than a conventional whole-genome assembly method.
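
The N50 figures quoted above summarize contiguity: N50 is the length of the shortest contig (or scaffold) in the smallest set of longest sequences that together cover at least half of the total assembled length. The sketch below is a generic illustration of that definition, not the authors' code; the function name `n50` and the toy contig lengths are assumptions made up for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Generic definition: sort lengths from longest to shortest and walk down
    the list until the running total reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in kb (not data from the paper).
toy_contigs_kb = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs_kb))  # 70: the two longest contigs reach 150 of 300 kb
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract also reports how many loci are represented exactly once.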

DOI: 10.1186/gb-2009-10-10-r115

Cited 177 times

Targeted next-generation sequencing of a cancer transcriptome enhances detection of sequence variants and novel fusion transcripts

Targeted RNA-Seq combines next-generation sequencing with capture of sequences from a relevant subset of a transcriptome. When tested by capturing sequences from a tumor cDNA library by hybridization to oligonucleotide probes specific for 467 cancer-related genes, this method showed high selectivity, improved mutation detection enabling discovery of novel chimeric transcripts, and provided RNA expression data. Thus, targeted RNA-Seq produces an enhanced view of the molecular state of a set of "high interest" genes.

DOI: 10.1186/gb-2009-10-10-r103

Cited 175 times

ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads

We demonstrate that genome sequences approaching finished quality can be generated from short paired reads. Using 36 base (fragment) and 26 base (jumping) reads from five microbial genomes of varied GC composition and sizes up to 40 Mb, ALLPATHS2 generated assemblies with long, accurate contigs and scaffolds. Velvet and EULER-SR were less accurate. For example, for Escherichia coli, the fraction of 10-kb stretches that were perfect was 99.8% (ALLPATHS2), 68.7% (Velvet), and 42.1% (EULER-SR).

DOI: 10.1111/j.1364-3703.2009.00593.x

Cited 171 times

Ten things to know about oomycete effectors

Long considered intractable organisms by fungal genetic research standards, the oomycetes have recently moved to the centre stage of research on plant-microbe interactions. Recent work on oomycete effector evolution, trafficking and function has led to major conceptual advances in the science of plant pathology. In this review, we provide a historical perspective on oomycete genetic research and summarize the state of the art in effector biology of plant pathogenic oomycetes by describing what we consider to be the 10 most important concepts about oomycete effectors.

DOI: 10.1371/journal.pgen.1003272

Cited 168 times

Distinctive Expansion of Potential Virulence Genes in the Genome of the Oomycete Fish Pathogen Saprolegnia parasitica

Oomycetes in the class Saprolegniomycetidae of the Eukaryotic kingdom Stramenopila have evolved as severe pathogens of amphibians, crustaceans, fish and insects, resulting in major losses in aquaculture and damage to aquatic ecosystems. We have sequenced the 63 Mb genome of the fresh water fish pathogen, Saprolegnia parasitica. Approximately 1/3 of the assembled genome exhibits loss of heterozygosity, indicating an efficient mechanism for revealing new variation. Comparison of S. parasitica with plant pathogenic oomycetes suggests that during evolution the host cellular environment has driven distinct patterns of gene expansion and loss in the genomes of plant and animal pathogens. S. parasitica possesses one of the largest repertoires of proteases (270) among eukaryotes that are deployed in waves at different points during infection as determined from RNA-Seq data. In contrast, despite being capable of living saprotrophically, parasitism has led to loss of inorganic nitrogen and sulfur assimilation pathways, strikingly similar to losses in obligate plant pathogenic oomycetes and fungi. The large gene families that are hallmarks of plant pathogenic oomycetes such as Phytophthora appear to be lacking in S. parasitica, including those encoding RXLR effectors, Crinkler's, and Necrosis Inducing-Like Proteins (NLP). S. parasitica also has a very large kinome of 543 kinases, 10% of which is induced upon infection. Moreover, S. parasitica encodes several genes typical of animals or animal-pathogens and lacking from other oomycetes, including disintegrins and galactose-binding lectins, whose expression and evolutionary origins implicate horizontal gene transfer in the evolution of animal pathogenesis in S. parasitica.

DOI: 10.1104/pp.105.068718

Cited 163 times

Structure and Architecture of the Maize Genome

Maize (Zea mays or corn) plays many varied and important roles in society. It is not only an important experimental model plant, but also a major livestock feed crop and a significant source of industrial products such as sweeteners and ethanol. In this study we report the systematic analysis of contiguous sequences of the maize genome. We selected 100 random regions averaging 144 kb in size, representing about 0.6% of the genome, and generated a high-quality dataset for sequence analysis. This sampling contains 330 annotated genes, 91% of which are supported by expressed sequence tag data from maize and other cereal species. Genes averaged 4 kb in size with five exons, although the largest was over 59 kb with 31 exons. Gene density varied over a wide range from 0.5 to 10.7 genes per 100 kb and genes did not appear to cluster significantly. The total repetitive element content we observed (66%) was slightly higher than previous whole-genome estimates (58%–63%) and consisted almost exclusively of retroelements. The vast majority of genes can be aligned to at least one sequence read derived from gene-enrichment procedures, but only about 30% are fully covered. Our results indicate that much of the increase in genome size of maize relative to rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana) is attributable to an increase in number of both repetitive elements and genes.

DOI: 10.1186/1471-2164-9-539

Cited 156 times

Short-term genome evolution of Listeria monocytogenes in a non-controlled environment

While increasing data on bacterial evolution in controlled environments are available, our understanding of bacterial genome evolution in natural environments is limited. We thus performed full genome analyses on four Listeria monocytogenes, including human and food isolates from both a 1988 case of sporadic listeriosis and a 2000 listeriosis outbreak, which had been linked to contaminated food from a single processing facility. All four isolates had been shown to have identical subtypes, suggesting that a specific L. monocytogenes strain persisted in this processing plant over at least 12 years. While a genome sequence for the 1988 food isolate has been reported, we sequenced the genomes of the 1988 human isolate as well as a human and a food isolate from the 2000 outbreak to allow for comparative genome analyses. The two L. monocytogenes isolates from 1988 and the two isolates from 2000 had highly similar genome backbone sequences with very few single nucleotide (nt) polymorphisms (1-8 SNPs/isolate; confirmed by re-sequencing). While no genome rearrangements were identified in the backbone genome of the four isolates, a 42 kb prophage inserted in the chromosomal comK gene showed evidence for major genome rearrangements. The human-food isolate pair from each 1988 and 2000 had identical prophage sequence; however, there were significant differences in the prophage sequences between the 1988 and 2000 isolates. Diversification of this prophage appears to have been caused by multiple homologous recombination events or possibly prophage replacement. In addition, only the 2000 human isolate contained a plasmid, suggesting plasmid loss or acquisition events. Surprisingly, besides the polymorphisms found in the comK prophage, a single SNP in the tRNA Thr-4 prophage represents the only SNP that differentiates the 1988 isolates from the 2000 isolates. Our data support the hypothesis that the 2000 human listeriosis outbreak was caused by a L. monocytogenes strain that persisted in a food processing facility over 12 years and show that genome sequencing is a valuable and feasible tool for retrospective epidemiological analyses. Short-term evolution of L. monocytogenes in non-controlled environments appears to involve limited diversification beyond plasmid gain or loss and prophage diversification, highlighting the importance of phages in bacterial evolution.

DOI: 10.7554/elife.01287

Cited 145 times

Regulated aggregative multicellularity in a close unicellular relative of metazoa

The evolution of metazoans from their unicellular ancestors was one of the most important events in the history of life. However, the cellular and genetic changes that ultimately led to the evolution of multicellularity are not known. In this study, we describe an aggregative multicellular stage in the protist Capsaspora owczarzaki, a close unicellular relative of metazoans. Remarkably, transition to the aggregative stage is associated with significant upregulation of orthologs of genes known to establish multicellularity and tissue architecture in metazoans. We further observe transitions in regulated alternative splicing during the C. owczarzaki life cycle, including the deployment of an exon network associated with signaling, a feature of splicing regulation so far only observed in metazoans. Our results reveal the existence of a highly regulated aggregative stage in C. owczarzaki and further suggest that features of aggregative behavior in an ancestral protist may have been co-opted to develop some multicellular properties currently seen in metazoans.

DOI: 10.1186/gb-2010-11-8-r87

Cited 134 times

Strand-specific RNA sequencing reveals extensive regulated long antisense transcripts that are conserved across yeast species

Recent studies in budding yeast have shown that antisense transcription occurs at many loci. However, the functional role of antisense transcripts has been demonstrated only in a few cases and it has been suggested that most antisense transcripts may result from promiscuous bi-directional transcription in a dense genome. Here, we use strand-specific RNA sequencing to study anti-sense transcription in Saccharomyces cerevisiae. We detect 1,103 putative antisense transcripts expressed in mid-log phase growth, ranging from 39 short transcripts covering only the 3' UTR of sense genes to 145 long transcripts covering the entire sense open reading frame. Many of these antisense transcripts overlap sense genes that are repressed in mid-log phase and are important in stationary phase, stress response, or meiosis. We validate the differential regulation of 67 antisense transcripts and their sense targets in relevant conditions, including nutrient limitation and environmental stresses. Moreover, we show that several antisense transcripts and, in some cases, their differential expression have been conserved across five species of yeast spanning 150 million years of evolution. Divergence in the regulation of antisense transcripts to two respiratory genes coincides with the evolution of respiro-fermentation. Our work provides support for a global and conserved role for antisense transcription in yeast gene regulation.

DOI: 10.7554/elife.19090

Cited 133 times

Structure of the germline genome of Tetrahymena thermophila and relationship to the massively rearranged somatic genome

The germline genome of the binucleated ciliate Tetrahymena thermophila undergoes programmed chromosome breakage and massive DNA elimination to generate the somatic genome. Here, we present a complete sequence assembly of the germline genome and analyze multiple features of its structure and its relationship to the somatic genome, shedding light on the mechanisms of genome rearrangement as well as the evolutionary history of this remarkable germline/soma differentiation. Our results strengthen the notion that a complex, dynamic, and ongoing interplay between mobile DNA elements and the host genome have shaped Tetrahymena chromosome structure, locally and globally. Non-standard outcomes of rearrangement events, including the generation of short-lived somatic chromosomes and excision of DNA interrupting protein-coding regions, may represent novel forms of developmental gene regulation. We also compare Tetrahymena's germline/soma differentiation to that of other characterized ciliates, illustrating the wide diversity of adaptations that have occurred within this phylum.

DOI: 10.1073/pnas.0905222106

Cited 129 times

High-resolution profiling and discovery of planarian small RNAs

Freshwater planarian flatworms possess uncanny regenerative capacities mediated by abundant and collectively totipotent adult stem cells. Key functions of these cells during regeneration and tissue homeostasis have been shown to depend on PIWI, a molecule required for Piwi-interacting RNA (piRNA) expression in planarians. Nevertheless, the full complement of piRNAs and microRNAs (miRNAs) in this organism has yet to be defined. Here we report on the large-scale cloning and sequencing of small RNAs from the planarian Schmidtea mediterranea, yielding altogether millions of sequenced, unique small RNAs. We show that piRNAs are in part organized in genomic clusters and that they share characteristic features with mammalian and fly piRNAs. We further identify 61 novel miRNA genes and thus double the number of known planarian miRNAs. Sequencing, as well as quantitative PCR of small RNAs, uncovered 10 miRNAs enriched in planarian stem cells. These miRNAs are down-regulated in animals in which stem cells have been abrogated by irradiation, and thus constitute miRNAs likely associated with specific stem-cell functions. Altogether, we present the first comprehensive small RNA analysis in animals belonging to the third animal superphylum, the Lophotrochozoa, and single out a number of miRNAs that may function in regeneration. Several of these miRNAs are deeply conserved in animals.

DOI: 10.1111/mec.12743

Cited 122 times

Rich and cold: diversity, distribution and drivers of fungal communities in patterned‐ground ecosystems of the North American Arctic

Fungi are abundant and functionally important in the Arctic, yet comprehensive studies of their diversity in relation to geography and environment are not available. We sampled soils in paired plots along the North American Arctic Transect (NAAT), which spans all five bioclimatic subzones of the Arctic. Each pair of plots contrasted relatively bare, cryoturbated patterned‐ground features (PGFs) and adjacent vegetated between patterned‐ground features (bPGFs). Fungal communities were analysed via sequencing of 7834 ITS‐LSU clones. We recorded 1834 OTUs – nearly half the fungal richness previously reported for the entire Arctic. These OTUs spanned eight phyla, 24 classes, 75 orders and 120 families, but were dominated by Ascomycota, with one‐fifth belonging to lichens. Species richness did not decline with increasing latitude, although there was a decline in mycorrhizal taxa that was offset by an increase in lichen taxa. The dominant OTUs were widespread even beyond the Arctic, demonstrating no dispersal limitation. Yet fungal communities were distinct in each subzone and were correlated with soil pH, climate and vegetation. Communities in subzone E were distinct from the other subzones, but similar to those of the boreal forest. Fungal communities on disturbed PGFs differed significantly from those of paired stable areas in bPGFs. Indicator species for PGFs included lichens and saprotrophic fungi, while bPGFs were characterized by ectomycorrhizal and pathogenic fungi. Our results suggest that the Arctic does not host a unique mycoflora, while Arctic fungi are highly sensitive to climate and vegetation, with potential to migrate rapidly as global change unfolds.

DOI: 10.1371/journal.pone.0096094

Cited 114 times

Massively Parallel Sequencing of Human Urinary Exosome/Microvesicle RNA Reveals a Predominance of Non-Coding RNA

Intact RNA from exosomes/microvesicles (collectively referred to as microvesicles) has sparked much interest as potential biomarkers for the non-invasive analysis of disease. Here we use the Illumina Genome Analyzer to determine the comprehensive array of nucleic acid reads present in urinary microvesicles. Extraneous nucleic acids were digested using RNase and DNase treatment and the microvesicle inner nucleic acid cargo was analyzed with and without DNase digestion to examine both DNA and RNA sequences contained in microvesicles. Results revealed that a substantial proportion (∼87%) of reads aligned to ribosomal RNA. Of the non-ribosomal RNA sequences, ∼60% aligned to non-coding RNA and repeat sequences including LINE, SINE, satellite repeats, and RNA repeats (tRNA, snRNA, scRNA and srpRNA). The remaining ∼40% of non-ribosomal RNA reads aligned to protein coding genes and splice sites encompassing approximately 13,500 of the known 21,892 protein coding genes of the human genome. Analysis of protein coding genes specific to the renal and genitourinary tract revealed that complete segments of the renal nephron and collecting duct as well as genes indicative of the bladder and prostate could be identified. This study reveals that the entire genitourinary system may be mapped using microvesicle transcript analysis and that the majority of non-ribosomal RNA sequences contained in microvesicles is potentially functional non-coding RNA, which play an emerging role in cell regulation.
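
The percentages above are nested: the ∼60% and ∼40% figures are shares of the non-ribosomal reads, not of all reads. As a rough illustration, the sketch below converts them to shares of the total read count using only the rounded values quoted in the abstract; the function name `nested_fractions` and its parameter names are hypothetical and not taken from the paper.

```python
def nested_fractions(rrna_share, noncoding_share_of_rest):
    """Convert nested read shares into fractions of all reads.

    rrna_share: fraction of all reads aligning to ribosomal RNA.
    noncoding_share_of_rest: fraction of the non-rRNA reads aligning to
    non-coding RNA and repeats; the remainder is treated as protein coding.
    """
    non_rrna = 1.0 - rrna_share
    return {
        "rRNA": rrna_share,
        "non-coding RNA and repeats": non_rrna * noncoding_share_of_rest,
        "protein coding": non_rrna * (1.0 - noncoding_share_of_rest),
    }


# Rounded values from the abstract: ~87% rRNA, then ~60% / ~40% of the rest.
shares = nested_fractions(0.87, 0.60)
for category, share in shares.items():
    print(f"{category}: {share:.1%}")

# Share of known protein-coding genes with aligned reads (13,500 of 21,892).
print(f"genes detected: {13_500 / 21_892:.1%}")
```

On these rounded inputs, non-coding RNA and repeats come to roughly 7.8% of all reads and protein-coding genes to roughly 5.2%, with about 62% of known protein-coding genes represented.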

DOI: 10.1016/j.cub.2009.11.035

Cited 113 times

A Cellular Memory of Developmental History Generates Phenotypic Diversity in C. elegans

Early life experiences have a major impact on adult phenotypes [1-3]. However, the mechanisms by which animals retain a cellular memory of early experience are not well understood. Here we show that adult wild-type Caenorhabditis elegans that transiently pass through the stress-resistant dauer larval stage exhibit distinct gene expression profiles and life history traits, as compared to adult animals that bypassed this stage. Using chromatin immunoprecipitation experiments coupled with massively parallel sequencing, we found that genome-wide levels of specific histone tail modifications are markedly altered in postdauer animals. Mutations in subsets of genes implicated in chromatin remodeling abolish, or alter, the observed changes in gene expression and life history traits in postdauer animals. Modifications to the epigenome as a consequence of early experience may contribute in part to a memory of early experience and generate phenotypic variation in an isogenic population.

DOI: 10.1186/gb-2011-12-8-r73

Cited 103 times

Hybrid selection for sequencing pathogen genomes from clinical samples

We have adapted a solution hybrid selection protocol to enrich pathogen DNA in clinical samples dominated by human genetic material. Using mock mixtures of human and Plasmodium falciparum malaria parasite DNA as well as clinical samples from infected patients, we demonstrate an average of approximately 40-fold enrichment of parasite DNA after hybrid selection. This approach will enable efficient genome sequencing of pathogens from clinical samples, as well as sequencing of endosymbiotic organisms such as Wolbachia that live inside diverse metazoan phyla.

DOI: 10.1186/s13075-018-1631-y

Methods for high-dimensional analysis of cells dissociated from cryopreserved synovial tissue

Detailed molecular analyses of cells from rheumatoid arthritis (RA) synovium hold promise in identifying cellular phenotypes that drive tissue pathology and joint damage. The Accelerating Medicines Partnership RA/SLE Network aims to deconstruct autoimmune pathology by examining cells within target tissues through multiple high-dimensional assays. Robust standardized protocols need to be developed before cellular phenotypes at a single cell level can be effectively compared across patient samples. Multiple clinical sites collected cryopreserved synovial tissue fragments from arthroplasty and synovial biopsy in a 10% DMSO solution. Mechanical and enzymatic dissociation parameters were optimized for viable cell extraction and surface protein preservation for cell sorting and mass cytometry, as well as for reproducibility in RNA sequencing (RNA-seq). Cryopreserved synovial samples were collectively analyzed at a central processing site by a custom-designed and validated 35-marker mass cytometry panel. In parallel, each sample was flow sorted into fibroblast, T-cell, B-cell, and macrophage suspensions for bulk population RNA-seq and plate-based single-cell CEL-Seq2 RNA-seq. Upon dissociation, cryopreserved synovial tissue fragments yielded a high frequency of viable cells, comparable to samples undergoing immediate processing. Optimization of synovial tissue dissociation across six clinical collection sites with ~30 arthroplasty and ~20 biopsy samples yielded a consensus digestion protocol using 100 μg/ml of Liberase™ TL enzyme preparation. This protocol yielded immune and stromal cell lineages with preserved surface markers and minimized variability across replicate RNA-seq transcriptomes. Mass cytometry analysis of cells from cryopreserved synovium distinguished diverse fibroblast phenotypes, distinct populations of memory B cells and antibody-secreting cells, and multiple CD4+ and CD8+ T-cell activation states. Bulk RNA-seq of sorted cell populations demonstrated robust separation of synovial lymphocytes, fibroblasts, and macrophages. Single-cell RNA-seq produced transcriptomes of over 1000 genes/cell, including transcripts encoding characteristic lineage markers identified. We have established a robust protocol to acquire viable cells from cryopreserved synovial tissue with intact transcriptomes and cell surface phenotypes. A centralized pipeline to generate multiple high-dimensional analyses of synovial tissue samples collected across a collaborative network was developed. Integrated analysis of such datasets from large patient cohorts may help define molecular heterogeneity within RA pathology and identify new therapeutic targets and biomarkers.

DOI: 10.1016/j.cell.2023.02.018

The EN-TEx resource of multi-tissue personal epigenomes & variant-impact models

Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (∼30 tissues × ∼15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding, non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.

Cited 134 times

INTERNATIONAL HUMAN GENOME SEQUENCING CONSORTIUM. INITIAL SEQUENCING AND ANALYSIS OF THE HUMAN GENOME

DOI: 10.1038/nature04689

Cited 127 times

DNA sequence of human chromosome 17 and analysis of rearrangement in the human lineage

Chromosome 17 is unusual among the human chromosomes in many respects. It is the largest human autosome with orthology to only a single mouse chromosome, mapping entirely to the distal half of mouse chromosome 11. Chromosome 17 is rich in protein-coding genes, having the second highest gene density in the genome. It is also enriched in segmental duplications, ranking third in density among the autosomes. Here we report a finished sequence for human chromosome 17, as well as a structural comparison with the finished sequence for mouse chromosome 11, the first finished mouse chromosome. Comparison of the orthologous regions reveals striking differences. In contrast to the typical pattern seen in mammalian evolution, the human sequence has undergone extensive intrachromosomal rearrangement, whereas the mouse sequence has been remarkably stable. Moreover, although the human sequence has a high density of segmental duplication, the mouse sequence has a very low density. Notably, these segmental duplications correspond closely to the sites of structural rearrangement, demonstrating a link between duplication and rearrangement. Examination of the main classes of duplicated segments provides insight into the dynamics underlying expansion of chromosome-specific, low-copy repeats in the human genome.

DOI: 10.1126/science.284.5421.1800

Cited 123 times

Dosage Compensation Proteins Targeted to X Chromosomes by a Determinant of Hermaphrodite Fate

In many organisms, master control genes coordinately regulate sex-specific aspects of development. SDC-2 was shown to induce hermaphrodite sexual differentiation and activate X chromosome dosage compensation in Caenorhabditis elegans. To control these distinct processes, SDC-2 acts as a strong gene-specific repressor and a weaker chromosome-wide repressor. To initiate hermaphrodite development, SDC-2 associates with the promoter of the male sex-determining gene her-1 to repress its transcription. To activate dosage compensation, SDC-2 triggers assembly of a specialized protein complex exclusively on hermaphrodite X chromosomes to reduce gene expression by half. SDC-2 can localize to X chromosomes without other components of the dosage compensation complex, suggesting that SDC-2 targets dosage compensation machinery to X chromosomes.

DOI: 10.1086/380648

Cited 122 times

The Breakpoint Region of the Most Common Isochromosome, i(17q), in Human Neoplasia Is Characterized by a Complex Genomic Architecture with Large, Palindromic, Low-Copy Repeats

Although a great deal of information has accumulated regarding the mechanisms underlying constitutional DNA rearrangements associated with inherited disorders, virtually nothing is known about the molecular processes involved in acquired neoplasia-associated chromosomal rearrangements. Isochromosome 17q, or "i(17q)," is one of the most common structural abnormalities observed in human neoplasms. We previously identified a breakpoint cluster region for i(17q) formation in 17p11.2 and hypothesized that genome architectural features could be responsible for this clustering. To address this hypothesis, we precisely mapped the i(17q) breakpoints in 11 patients with different hematologic malignancies and determined the genomic structure of the involved region. Our results reveal a complex genomic architecture in the i(17q) breakpoint cluster region, characterized by large (approximately 38-49-kb), palindromic, low-copy repeats, strongly suggesting that somatic rearrangements are not random events but rather reflect susceptibilities due to the genomic structure.

DOI: 10.1101/gr.5338906

Cited 110 times

Uneven chromosome contraction and expansion in the maize genome

Maize (Zea mays or corn), both a major food source and an important cytogenetic model, evolved from a tetraploid that arose about 4.8 million years ago (Mya). As a result, maize has extensive duplicated regions within its genome. We have sequenced the two copies of one such region, generating 7.8 Mb of sequence spanning 17.4 cM of the short arm of chromosome 1 and 6.6 Mb (25.6 cM) from the long arm of chromosome 9. Rice, which did not undergo a similar whole genome duplication event, has only one orthologous region (4.9 Mb) on the short arm of chromosome 3, and can be used as reference for the maize homoeologous regions. Alignment of the three regions allowed identification of syntenic blocks, and indicated that the maize regions have undergone differential contraction in genic and intergenic regions and expansion by the insertion of retrotransposable elements. Approximately 9% of the predicted genes in each duplicated region are completely missing in the rice genome, and almost 20% have moved to other genomic locations. Predicted genes within these regions tend to be larger in maize than in rice, primarily because of the presence of predicted genes in maize with larger introns. Interestingly, the general gene methylation patterns in the maize homoeologous regions do not appear to have changed with contraction or expansion of their chromosomes. In addition, no differences in methylation of single genes and tandemly repeated gene copies have been detected. These results, therefore, provide new insights into the diploidization of polyploid species.

DOI: 10.1038/nature04406

Cited 110 times

DNA sequence and analysis of human chromosome 8

The finished sequence for human chromosome 8 is now published. It features a 15-megabase stretch that has a much greater mutation rate in hominids than the corresponding region in other mammals. Included in this region are genes influencing brain size and the immune system. The International Human Genome Sequencing Consortium (IHGSC) recently completed a sequence of the human genome. As part of this project, we have focused on chromosome 8. Although some chromosomes exhibit extreme characteristics in terms of length, gene content, repeat content and fraction segmentally duplicated, chromosome 8 is distinctly typical in character, being very close to the genome median in each of these aspects. This work describes a finished sequence and gene catalogue for the chromosome, which represents just over 5% of the euchromatic human genome. A unique feature of the chromosome is a vast region of ∼15 megabases on distal 8p that appears to have a strikingly high mutation rate, which has accelerated in the hominids relative to other sequenced mammals. This fast-evolving region contains a number of genes related to innate immunity and the nervous system, including loci that appear to be under positive selection; these include the major defensin (DEF) gene cluster and MCPH1, a gene that may have contributed to the evolution of expanded brain size in the great apes. The data from chromosome 8 should allow a better understanding of both normal and disease biology and genome evolution.

DOI: 10.1186/gb-2010-11-2-r15

Cited 101 times

A scalable, fully automated process for construction of sequence-ready barcoded libraries for 454

We present an automated, high throughput library construction process for 454 technology. Sample handling errors and cross-contamination are minimized via end-to-end barcoding of plasticware, along with molecular DNA barcoding of constructs. Automation-friendly magnetic bead-based size selection and cleanup steps have been devised, eliminating major bottlenecks and significant sources of error. Using this methodology, one technician can create 96 sequence-ready 454 libraries in 2 days, a dramatic improvement over the standard method.

DOI: 10.1186/gb-2012-13-4-r27

Genome-wide identification and characterization of replication origins by deep sequencing

DNA replication initiates at distinct origins in eukaryotic genomes, but the genomic features that define these sites are not well understood. We have taken a combined experimental and bioinformatic approach to identify and characterize origins of replication in three distantly related fission yeasts: Schizosaccharomyces pombe, Schizosaccharomyces octosporus and Schizosaccharomyces japonicus. Using single-molecule deep sequencing to construct amplification-free high-resolution replication profiles, we located origins and identified sequence motifs that predict origin function. We then mapped nucleosome occupancy by deep sequencing of mononucleosomal DNA from the corresponding species, finding that origins tend to occupy nucleosome-depleted regions. The sequences that specify origins are evolutionarily plastic, with low complexity nucleosome-excluding sequences functioning in S. pombe and S. octosporus, and binding sites for trans-acting nucleosome-excluding proteins functioning in S. japonicus. Furthermore, chromosome-scale variation in replication timing is conserved independently of origin location and via a mechanism distinct from known heterochromatic effects on origin function. These results are consistent with a model in which origins are simply the nucleosome-depleted regions of the genome with the highest affinity for the origin recognition complex. This approach provides a general strategy for understanding the mechanisms that define DNA replication origins in eukaryotes.

DOI: 10.1261/rna.055095.115

Use of the MS2 aptamer and coat protein for RNA localization in yeast: A response to “MS2 coat proteins bound to yeast mRNAs block 5′ to 3′ degradation and trap mRNA decay products: implications for the localization of mRNAs by MS2-MCP system”

The MS2 system has been extensively used to visualize single mRNA molecules in live cells and follow their localization and behavior. In their Letter to the Editor recently published, Garcia and Parker suggest that use of the MS2 system may yield erroneous mRNA localization results due to the accumulation of 3' decay products. Here we cite published works and provide new data which demonstrate that this is not a phenomenon general to endogenously expressed MS2-tagged transcripts, and that some of the results obtained in their study could have arisen from artifacts of gene expression.

DOI: 10.1172/jci.insight.132508

Th17 reprogramming of T cells in systemic juvenile idiopathic arthritis

Systemic juvenile idiopathic arthritis (sJIA) begins with fever, rash, and high-grade systemic inflammation but commonly progresses to a persistent afebrile arthritis. The basis for this transition is unknown. To evaluate a role for lymphocyte polarization, we characterized T cells from patients with acute and chronic sJIA using flow cytometry, mass cytometry, and RNA sequencing. Acute and chronic sJIA each featured an expanded population of activated Tregs uncommon in healthy controls or in children with nonsystemic JIA. In acute sJIA, Tregs expressed IL-17A and a gene expression signature reflecting Th17 polarization. In chronic sJIA, the Th17 transcriptional signature was identified in T effector cells (Teffs), although expression of IL-17A at the protein level remained rare. Th17 polarization was abrogated in patients responding to IL-1 blockade. These findings identify evolving Th17 polarization in sJIA that begins in Tregs and progresses to Teffs, likely reflecting the impact of the cytokine milieu and consistent with a biphasic model of disease pathogenesis. The results support T cells as a potential treatment target in sJIA.

DOI: 10.1038/11962

Cited 110 times

Radiation hybrid map of the mouse genome

DOI: 10.1093/genetics/122.3.579

The Caenorhabditis elegans gene sdc-2 controls sex determination and dosage compensation in XX animals.

We have identified a new X-linked gene, sdc-2, that controls the hermaphrodite (XX) modes of both sex determination and X chromosome dosage compensation in Caenorhabditis elegans. Mutations in sdc-2 cause phenotypes that appear to result from a shift of both the sex determination and dosage compensation processes in XX animals to the XO modes of expression. Twenty-eight independent sdc-2 mutations have no apparent effect in XO animals, but cause two distinct phenotypes in XX animals: masculinization, reflecting a defect in sex determination, and lethality or dumpiness, reflecting a disruption in dosage compensation. The dosage compensation defect can be demonstrated directly by showing that sdc-2 mutations cause elevated levels of several X-linked transcripts in XX but not XO animals. While the masculinization is blocked by mutations in sex determining genes required for male development (her-1 and fem-3), the lethality, dumpiness and overexpression of X-linked genes are not, indicating that the effects of sdc-2 mutations on sex determination and dosage compensation are ultimately implemented by two independent pathways. We propose a model in which sdc-2 is involved in the coordinate control of both sex determination and dosage compensation in XX animals and acts in the regulatory hierarchy at a step prior to the divergence of the two pathways.

DOI: 10.1371/journal.pone.0009083

Analysis of High-Throughput Sequencing and Annotation Strategies for Phage Genomes

Bacterial viruses (phages) play a critical role in shaping microbial populations as they influence both host mortality and horizontal gene transfer. As such, they have a significant impact on local and global ecosystem function and human health. Despite their importance, little is known about the genomic diversity harbored in phages, as methods to capture complete phage genomes have been hampered by the lack of knowledge about the target genomes, and difficulties in generating sufficient quantities of genomic DNA for sequencing. Of the approximately 550 phage genomes currently available in the public domain, fewer than 5% are marine phage. To advance the study of phage biology through comparative genomic approaches we used marine cyanophage as a model system. We compared DNA preparation methodologies (DNA extraction directly from either phage lysates or CsCl purified phage particles), and sequencing strategies that utilize either Sanger sequencing of a linker amplification shotgun library (LASL) or of a whole genome shotgun library (WGSL), or 454 pyrosequencing methods. We demonstrate that genomic DNA sample preparation directly from a phage lysate, combined with 454 pyrosequencing, is best suited for phage genome sequencing at scale, as this method is capable of capturing complete continuous genomes with high accuracy. In addition, we describe an automated annotation informatics pipeline that delivers high-quality annotation and yields few false positives and negatives in ORF calling. These DNA preparation, sequencing and annotation strategies enable a high-throughput approach to the burgeoning field of phage genomics.

DOI: 10.1111/j.1365-294x.2010.04472.x

Key considerations for measuring allelic expression on a genomic scale using high‐throughput sequencing

Differences in gene expression are thought to be an important source of phenotypic diversity, so dissecting the genetic components of natural variation in gene expression is important for understanding the evolutionary mechanisms that lead to adaptation. Gene expression is a complex trait that, in diploid organisms, results from transcription of both maternal and paternal alleles. Directly measuring allelic expression rather than total gene expression offers greater insight into regulatory variation. The recent emergence of high-throughput sequencing offers an unprecedented opportunity to study allelic transcription at a genomic scale for virtually any species. By sequencing transcript pools derived from heterozygous individuals, estimates of allelic expression can be directly obtained. The statistical power of this approach is influenced by the number of transcripts sequenced and the ability to unambiguously assign individual sequence fragments to specific alleles on the basis of transcribed nucleotide polymorphisms. Here, using mathematical modelling and computer simulations, we determine the minimum sequencing depth required to accurately measure relative allelic expression and detect allelic imbalance via high-throughput sequencing under a variety of conditions. We conclude that, within a species, a minimum of 500-1000 sequencing reads per gene are needed to test for allelic imbalance, and consequently, at least five to 10 million reads are required for studying a genome expressing 10 000 genes. Finally, using 454 sequencing, we illustrate an application of allelic expression by testing for cis-regulatory divergence between closely related Drosophila species.
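The read totals quoted in this abstract follow from simple multiplication of the per-gene minimum by the number of expressed genes. The sketch below reproduces that arithmetic only; the function name `total_reads_needed` is hypothetical, and the calculation assumes reads are spread evenly across genes, which is a simplification rather than the paper's simulation model.

```python
def total_reads_needed(reads_per_gene: int, expressed_genes: int) -> int:
    """Total reads required if every expressed gene receives the same depth.

    Assumption (not from the paper): coverage is uniform across genes.
    Real transcriptomes are skewed, so this is a lower bound at best.
    """
    if reads_per_gene < 0 or expressed_genes < 0:
        raise ValueError("counts must be non-negative")
    return reads_per_gene * expressed_genes


# Figures taken from the abstract: 500-1000 reads per gene, 10 000 genes.
low = total_reads_needed(500, 10_000)
high = total_reads_needed(1_000, 10_000)
print(f"{low:,} to {high:,} reads")
```

With the abstract's figures this gives the stated range of five to 10 million reads.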

DOI: 10.1038/nmeth.1286

Sensitive, specific polymorphism discovery in bacteria using massively parallel sequencing

This variant ascertainment algorithm, or VAAL, uses short sequence reads of haploid bacterial genomes to first locally assemble the reads and then compare these assemblies to the reference genome. This allows VAAL to detect all types of variants ranging from single-nucleotide polymorphisms to large insertions or deletions. Our variant ascertainment algorithm, VAAL, uses massively parallel DNA sequence data to identify differences between bacterial genomes with high sensitivity and specificity. VAAL detected ∼98% of differences (including large insertion-deletions) between pairs of strains from three species while calling no false positives. VAAL also pinpointed a single mutation between Vibrio cholerae genomes, identifying an antibiotic's site of action by identifying sequence differences between drug-sensitive strains and drug-resistant derivatives.

DOI: 10.1111/j.1365-294x.2009.04192.x

Molecular phylogenetic biodiversity assessment of arctic and boreal ectomycorrhizal Lactarius Pers. (Russulales; Basidiomycota) in Alaska, based on soil and sporocarp DNA

Despite the critical roles fungi play in the functioning of ecosystems, especially as symbionts of plants and recyclers of organic matter, their biodiversity is poorly known in high-latitude regions. In this paper, we discuss the molecular diversity of one of the most diverse and abundant groups of ectomycorrhizal fungi: the genus Lactarius Pers. We analysed internal transcribed spacer rDNA sequences from both curated sporocarp collections and soil polymerase chain reaction clone libraries sampled in the arctic tundra and boreal forests of Alaska. Our genetic diversity assessment, based on various phylogenetic methods and operational taxonomic unit (OTU) delimitations, suggests that the genus Lactarius is diverse in Alaska, with at least 43 putative phylogroups, and 24 and 38 distinct OTUs based on 95% and 97% internal transcribed spacer sequence similarity, respectively. Some OTUs were identified to known species, while others were novel, previously unsequenced groups. Non-asymptotic species accumulation curves, the disparity between observed and estimated richness, and the high number of singleton OTUs indicated that many Lactarius species remain to be found and identified in Alaska. Many Lactarius taxa show strong habitat preference to one of the three major vegetation types in the sampled regions (arctic tundra, black spruce forests, and mixed birch-aspen-white spruce forests), as supported by statistical tests of UniFrac distances and principal coordinates analyses (PCoA). Together, our data robustly demonstrate great diversity and nonrandom ecological partitioning in an important boreal ectomycorrhizal genus within a relatively small geographical region. The observed diversity of Lactarius was much higher in either type of boreal forest than in the arctic tundra, supporting the widely recognized pattern of decreasing species richness with increasing latitude.

DOI: 10.1101/gr.138925.112

Paired-end sequencing of Fosmid libraries by Illumina

Eliminating the bacterial cloning step has been a major factor in the vastly improved efficiency of massively parallel sequencing approaches. However, this also has made it a technical challenge to produce the modern equivalent of the Fosmid- or BAC-end sequences that were crucial for assembling and analyzing complex genomes during the Sanger-based sequencing era. To close this technology gap, we developed Fosill, a method for converting Fosmids to Illumina-compatible jumping libraries. We constructed Fosmid libraries in vectors with Illumina primer sequences and specific nicking sites flanking the cloning site. Our family of pFosill vectors allows multiplex Fosmid cloning of end-tagged genomic fragments without physical size selection and is compatible with standard and multiplex paired-end Illumina sequencing. To excise the bulk of each cloned insert, we introduced two nicks in the vector, translated them into the inserts, and cleaved them. Recircularization of the vector via coligation of insert termini followed by inverse PCR generates a jumping library for paired-end sequencing with 101-base reads. The yield of unique Fosmid-sized jumps is sufficiently high, and the background of short, incorrectly spaced and chimeric artifacts sufficiently low, to enable applications such as mapping of structural variation and scaffolding of de novo assemblies. We demonstrate the power of Fosill to map genome rearrangements in a cancer cell line and identify three fusion genes that were corroborated by RNA-seq data. Our Fosill-powered assembly of the mouse genome has an N50 scaffold length of 17.0 Mb, rivaling the connectivity (16.9 Mb) of the Sanger-sequencing based draft assembly.
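For readers unfamiliar with the N50 statistic used to compare the two assemblies, the following is a minimal sketch of the usual definition: the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name `n50` and the toy lengths are illustrative assumptions and are not data from the paper.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the cumulative sum first covers at least
    half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one scaffold length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # integer test avoids float rounding
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical toy assembly: total length 20, half is 10.
# Cumulative sums from the longest scaffold: 6, then 11 (reaches 10).
toy_scaffolds = [2, 3, 4, 5, 6]
print(n50(toy_scaffolds))
```

A higher N50 means that more of the assembly sits in long scaffolds, which is why the abstract uses it as a measure of connectivity.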

DOI: 10.1038/11967

A YAC-based physical map of the mouse genome

DOI: 10.1038/nature04601

Analysis of the DNA sequence and duplication history of human chromosome 15

Here we present a finished sequence of human chromosome 15, together with a high-quality gene catalogue. As chromosome 15 is one of seven human chromosomes with a high rate of segmental duplication, we have carried out a detailed analysis of the duplication structure of the chromosome. Segmental duplications in chromosome 15 are largely clustered in two regions, on proximal and distal 15q; the proximal region is notable because recombination among the segmental duplications can result in deletions causing Prader-Willi and Angelman syndromes. Sequence analysis shows that the proximal and distal regions of 15q share extensive ancient similarity. Using a simple approach, we have been able to reconstruct many of the events by which the current duplication structure arose. We find that most of the intrachromosomal duplications seem to share a common ancestry. Finally, we demonstrate that some remaining gaps in the genome sequence are probably due to structural polymorphisms between haplotypes; this may explain a significant fraction of the gaps remaining in the human genome.

DOI: 10.1038/nature04632

Human chromosome 11 DNA sequence and analysis including novel gene identification

Chromosome 11, although average in size, is one of the most gene- and disease-rich chromosomes in the human genome. Initial gene annotation indicates an average gene density of 11.6 genes per megabase, including 1,524 protein-coding genes, some of which were identified using novel methods, and 765 pseudogenes. One-quarter of the protein-coding genes shows overlap with other genes. Of the 856 olfactory receptor genes in the human genome, more than 40% are located in 28 single- and multi-gene clusters along this chromosome. Out of the 171 disorders currently attributed to the chromosome, 86 remain for which the underlying molecular basis is not yet known, including several mendelian traits, cancer and susceptibility loci. The high-quality data presented here--nearly 134.5 million base pairs representing 99.8% coverage of the euchromatic sequence--provide scientists with a solid foundation for understanding the genetic basis of these disorders and other biological phenomena.

DOI: 10.1074/jbc.m111.317370

Yeast Sterol Regulatory Element-binding Protein (SREBP) Cleavage Requires Cdc48 and Dsc5, a Ubiquitin Regulatory X Domain-containing Subunit of the Golgi Dsc E3 Ligase

Schizosaccharomyces pombe Sre1 is a membrane-bound transcription factor that controls adaptation to hypoxia. Like its mammalian homolog, sterol regulatory element-binding protein (SREBP), Sre1 activation requires release from the membrane. However, in fission yeast, this release occurs through a strikingly different mechanism that requires the Golgi Dsc E3 ubiquitin ligase complex and the proteasome. The mechanistic details of Sre1 cleavage, including the link between the Dsc E3 ligase complex and proteasome, are not well understood. Here, we present results of a genetic selection designed to identify additional components required for Sre1 cleavage. From the selection, we identified two new components of the fission yeast SREBP pathway: Dsc5 and Cdc48. The AAA (ATPase associated with diverse cellular activities) ATPase Cdc48 and Dsc5, a ubiquitin regulatory X domain-containing protein, interact with known Dsc complex components and are required for SREBP cleavage. These findings provide a mechanistic link between the Dsc E3 ligase complex and the proteasome in SREBP cleavage and add to a growing list of similarities between the Dsc E3 ligase and membrane E3 ligases involved in endoplasmic reticulum-associated degradation.

DOI: 10.1038/35087627

Correction: Initial sequencing and analysis of the human genome

transfection, cells were collected and processed for CAT or luciferase activity using standard techniques (ref. 14). GST pull downs and immunoprecipitations. GST-Rb (wild type and mutants; ref. 15) and other GST fusion proteins were expressed and purified from Escherichia coli XA90 (ref. 16). GST fusion proteins that were immobilized on glutathione-sepharose (Pharmacia), or H3-derived peptides (ref. 3) bound to Sulfolink beads (Pierce), were incubated with extract in IPH buffer (ref. 16). Complexes were washed four times in IPH buffer before processing for methylase assays or western blotting. Antibodies against HA (12CA5, Roche), Gal4-DBD (DNA-binding domain; sc-510, Santa Cruz), SUV39H1 (M. Cleary), Rb (G3-245; XZ55, PharMingen) or HP1 (ref. 3) were used. For immunoprecipitations antibodies were incubated with HeLa nuclear extract (Cell Culture Center) or U2OS nuclear extract in IPH buffer at 4 °C (ref. 17). After 2 h a 50:50 mixture of protein A/G-sepharose beads (Pharmacia) was added. To avoid the possibility that DNA mediates the interaction between SUV39H1 and Rb, the immunoprecipitations were probed for the presence of histones with negative results. Histone methylase assays and protein sequencing. Precipitations from pull downs or immunoprecipitations were incubated with 20 mg histones (Sigma) and 1 ml [3H-Me]-S-adenosyl methionine (NEN, 80 Ci mmol⁻¹) in PBS at 30 °C for 1 h. Assays were analysed by SDS-PAGE followed by western blotting and autoradiography or were spotted onto P-81 cationic exchange paper (Whatman), washed in carbonate buffer and quantified by scintillation counting (ref. 3). For amino-terminal sequencing, radiolabelled H3 was blotted to polyvinylidene fluoride and sequenced by Edman degradation (Protein Sequencing Facility, University of Cambridge). We counted fractions for the presence of tritium. RNA purification and RT-PCR analysis. Total RNA (0.5 mg) was isolated from W12 (wild type) and D3 (SUV39H1 and SUV39H2 double knockout; D.O. and T.J., unpublished observations) female mouse cells, and was used for quantitative RT-PCR, following the Qiagen One Step protocol, for 20, 25 and 30 PCR cycles. Antibody generation. Rabbits were immunized with a H3 N-terminal lysine-methylated peptide corresponding to amino acids 1-16. Immunoreactive serum was applied to a H3 Lys-9-methylated peptide column to affinity purify specific antibodies, as the antiserum crossreacted with H3 methylated at Lys 4. Chromatin immunoprecipitation. Chromatin immunoprecipitations were performed using HeLa cells and MEF cells essentially as described (refs 18, 19). Immunoprecipitates were analysed for the presence of cyclin E or Cdc25C promoter fragments by PCR using primers specific for single nucleosomes. PCR reactions were repeated exhaustively using varying cycle numbers and different amounts of templates to ensure that results were within the linear range of the PCR.

DOI: 10.1111/j.1755-0998.2008.02094.x

Increasing ecological inference from high throughput sequencing of fungi in the environment through a tagging approach

High throughput sequencing methods are widely used in analyses of microbial diversity, but are generally applied to small numbers of samples, which precludes characterization of patterns of microbial diversity across space and time. We have designed a primer‐tagging approach that allows pooling and subsequent sorting of numerous samples, which is directed to amplification of a region spanning the nuclear ribosomal internal transcribed spacers and partial large subunit from fungi in environmental samples. To test the method for phylogenetic biases, we constructed a controlled mixture of four taxa representing the Chytridiomycota, Zygomycota, Ascomycota and Basidiomycota. Following cloning and colony restriction fragment length polymorphism analysis, we found no significant difference in representation in 19 of the 23 tested primers. We also generated a clone library from two soil DNA extracts using two primers for each extract and compared 456 clone sequences. Community diversity statistics and contingency table tests applied to counts of operational taxonomic units revealed that the two DNA extracts differed significantly, while the pairs of tagged primers from each extract were indistinguishable. Similar results were obtained using UniFrac phylogenetic comparisons. Together, these results suggest that the pig‐tagged primers can be used to increase ecological inference in high throughput sequencing projects on fungi.

DOI: 10.1128/jvi.01466-17

Early Epstein-Barr Virus Genomic Diversity and Convergence toward the B95.8 Genome in Primary Infection

Over 90% of the world's population is persistently infected with Epstein-Barr virus. While EBV does not cause disease in most individuals, it is the common cause of acute infectious mononucleosis (AIM) and has been associated with several cancers and autoimmune diseases, highlighting a need for a preventive vaccine. At present, very few primary, circulating EBV genomes have been sequenced directly from infected individuals. While low levels of diversity and low viral evolution rates have been predicted for double-stranded DNA (dsDNA) viruses, recent studies have demonstrated appreciable diversity in common dsDNA pathogens (e.g., cytomegalovirus). Here, we report 40 full-length EBV genome sequences obtained from matched oral wash and B cell fractions from a cohort of 10 AIM patients. Both intra- and interpatient diversity were observed across the length of the entire viral genome. Diversity was most pronounced in viral genes required for establishing latent infection and persistence, with appreciable levels of diversity also detected in structural genes, including envelope glycoproteins. Interestingly, intrapatient diversity declined significantly over time (P < 0.01), and this was particularly evident on comparison of viral genomes sequenced from B cell fractions in early primary infection and convalescence (P < 0.001). B cell-associated viral genomes were observed to converge, becoming nearly identical to the B95.8 reference genome over time (Spearman rank-order correlation test; r = −0.5589, P = 0.0264). The reduction in diversity was most marked in the EBV latency genes. In summary, our data suggest independent convergence of diverse viral genome sequences toward a reference-like strain within a relatively short period following primary EBV infection.

IMPORTANCE Identification of viral proteins with low variability and high immunogenicity is important for the development of a protective vaccine. Knowledge of genome diversity within circulating viral populations is a key step in this process, as is the expansion of intrahost genomic variation during infection. We report full-length EBV genomes sequenced from the blood and oral wash of 10 individuals early in primary infection and during convalescence. Our data demonstrate considerable diversity within the pool of circulating EBV strains, as well as within individual patients. Overall viral diversity decreased from early to persistent infection, particularly in latently infected B cells, which serve as the viral reservoir. Reduction in B cell-associated viral genome diversity coincided with a convergence toward a reference-like EBV genotype. Greater convergence positively correlated with time after infection, suggesting that the reference-like genome is the result of selection.

DOI: 10.1186/s13072-016-0100-6

Systematic comparison of monoclonal versus polyclonal antibodies for mapping histone modifications by ChIP-seq

The robustness of ChIP-seq datasets is highly dependent upon the antibodies used. Currently, polyclonal antibodies are the standard despite several limitations: They are non-renewable, vary in performance between lots and need to be validated with each new lot. In contrast, monoclonal antibody lots are renewable and provide consistent performance. To increase ChIP-seq standardization, we investigated whether monoclonal antibodies could replace polyclonal antibodies. We compared monoclonal antibodies that target five key histone modifications (H3K4me1, H3K4me3, H3K9me3, H3K27ac and H3K27me3) to their polyclonal counterparts in both human and mouse cells. Overall performance was highly similar for four monoclonal/polyclonal pairs, including when we used two distinct lots of the same monoclonal antibody. In contrast, the binding patterns for H3K27ac differed substantially between polyclonal and monoclonal antibodies. However, this was most likely due to the distinct immunogen used rather than the clonality of the antibody. Altogether, we found that monoclonal antibodies as a class perform equivalently to polyclonal antibodies for the detection of histone post-translational modifications in both human and mouse. Accordingly, we recommend the use of monoclonal antibodies in ChIP-seq experiments.

DOI: 10.1172/jci125375

TGF-β signaling underlies hematopoietic dysfunction and bone marrow failure in Shwachman-Diamond syndrome

Shwachman-Diamond Syndrome (SDS) is a rare and clinically-heterogeneous bone marrow (BM) failure syndrome caused by mutations in the Shwachman-Bodian-Diamond Syndrome (SBDS) gene. Although SDS was described over 50 years ago, the molecular pathogenesis is poorly understood due, in part, to the rarity and heterogeneity of the affected hematopoietic progenitors. To address this, we used single cell RNA sequencing to profile scant hematopoietic stem and progenitor cells from SDS patients. We generated a single cell map of early lineage commitment and found that SDS hematopoiesis was left-shifted with selective loss of granulocyte-monocyte progenitors. Transcriptional targets of transforming growth factor-beta (TGFβ) were dysregulated in SDS hematopoietic stem cells and multipotent progenitors, but not in lineage-committed progenitors. TGFβ inhibitors (AVID200 and SD208) increased hematopoietic colony formation of SDS patient BM. Finally, TGFβ3 and other TGFβ pathway members were elevated in SDS patient blood plasma. These data establish the TGFβ pathway as a novel candidate biomarker and therapeutic target in SDS and translate insights from single cell biology into a potential therapy.

DOI: 10.7554/elife.66050

Multiplexed mRNA assembly into ribonucleoprotein particles plays an operon-like role in the control of yeast cell physiology

Prokaryotes utilize polycistronic messages (operons) to co-translate proteins involved in the same biological processes. Whether eukaryotes achieve similar regulation by selectively assembling and translating monocistronic messages derived from different chromosomes is unknown. We employed transcript-specific RNA pulldowns and RNA-seq/RT-PCR to identify yeast mRNAs that co-precipitate as ribonucleoprotein (RNP) complexes. Consistent with the hypothesis of eukaryotic RNA operons, mRNAs encoding components of the mating pathway, heat shock proteins, and mitochondrial outer membrane proteins multiplex in trans, forming discrete messenger ribonucleoprotein (mRNP) complexes (called transperons). Chromatin capture and allele tagging experiments reveal that genes encoding multiplexed mRNAs physically interact; thus, RNA assembly may result from co-regulated gene expression. Transperon assembly and function depend upon histone H4, and its depletion leads to defects in RNA multiplexing, decreased pheromone responsiveness and mating, and increased heat shock sensitivity. We propose that intergenic associations and non-canonical histone H4 functions contribute to transperon formation in eukaryotic cells and regulate cell physiology.

DOI: 10.1038/nature03983

DNA sequence and analysis of human chromosome 18

Chromosome 18 appears to have the lowest gene density of any human chromosome and is one of only three chromosomes for which trisomic individuals survive to term. There are also a number of genetic disorders stemming from chromosome 18 trisomy and aneuploidy. Here we report the finished sequence and gene annotation of human chromosome 18, which will allow a better understanding of the normal and disease biology of this chromosome. Despite the low density of protein-coding genes on chromosome 18, we find that the proportion of non-protein-coding sequences evolutionarily conserved among mammals is close to the genome-wide average. Extending this analysis to the entire human genome, we find that the density of conserved non-protein-coding sequences is largely uncorrelated with gene density. This has important implications for the nature and roles of non-protein-coding sequence elements.