
Naomi Altman

Here are all the papers by Naomi Altman that you can download and read on OA.mg.

DOI: 10.1080/00031305.1992.10475879
1992
Cited 2,427 times
An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression
Nonparametric regression is a set of techniques for estimating a regression curve without making strong assumptions about the shape of the true regression function. These techniques are therefore useful for building and checking parametric models, as well as for data description. Kernel and nearest-neighbor regression estimators are local versions of univariate location estimators, and so they can readily be introduced to beginning students and consulting clients who are familiar with such summaries as the sample mean and median. Key Words: Confidence intervals; Local linear regression; Model building; Model checking; Smoothing
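To make "local versions of univariate location estimators" concrete, here is a minimal sketch, not taken from the paper. It assumes a Gaussian kernel and a plain mean as the local summary. The function names `kernel_regression` and `knn_regression` are illustrative only.

```python
import math


def gaussian_kernel(u):
    # Unnormalized Gaussian weight; the normalizing constant cancels in the ratio below.
    return math.exp(-0.5 * u * u)


def kernel_regression(x0, xs, ys, bandwidth):
    """Kernel-weighted local mean of ys around x0 (Nadaraya-Watson form)."""
    weights = [gaussian_kernel((x0 - x) / bandwidth) for x in xs]
    total = sum(weights)
    # Note: total can underflow to zero if bandwidth is tiny relative to the data spacing.
    return sum(w * y for w, y in zip(weights, ys)) / total


def knn_regression(x0, xs, ys, k):
    """Plain mean of the k responses whose x values lie closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / k
```

Both functions reduce to the ordinary sample mean when the neighborhood covers all the data (a very large bandwidth, or `k` equal to the sample size). Replacing the mean with a median would give the corresponding local median.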
DOI: 10.2307/2685209
1992
Cited 916 times
An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression
DOI: 10.1038/nmeth.4642
2018
Cited 869 times
Statistics versus machine learning
Statistics draws population inferences from a sample, and machine learning finds generalizable predictive patterns.
DOI: 10.1038/nmeth.4370
2017
Cited 863 times
Classification and regression trees
DOI: 10.1038/nmeth.4346
2017
Cited 841 times
Principal component analysis
PCA helps you interpret your data, but it will not always find the important patterns.
DOI: 10.1126/science.1241089
2013
Cited 726 times
The Amborella Genome and the Evolution of Flowering Plants
Shaping Plant Evolution: Amborella trichopoda is understood to be the most basal extant flowering plant and its genome is anticipated to provide insights into the evolution of plant life on Earth (see the Perspective by Adams). To validate and assemble the sequence, Chamala et al. (p. 1516) combined fluorescent in situ hybridization (FISH), genomic mapping, and next-generation sequencing. The Amborella Genome Project (p. 10.1126/science.1241089) was able to infer that a whole-genome duplication event preceded the evolution of this ancestral angiosperm, and Rice et al. (p. 1468) found that numerous genes in the mitochondrion were acquired by horizontal gene transfer from other plants, including almost four entire mitochondrial genomes from mosses and algae.
DOI: 10.1038/nmeth.2813
2014
Cited 436 times
Visualizing samples with box plots
Use box plots to illustrate the spread and differences of samples.
DOI: 10.1038/nmeth.3968
2016
Cited 430 times
Model selection and overfitting
DOI: 10.1038/nmeth.3945
2016
Cited 261 times
Classification evaluation
It is important to understand both what a classification metric expresses and what it hides.
DOI: 10.1038/s41592-018-0019-x
2018
Cited 233 times
The curse(s) of dimensionality
There is such a thing as too much of a good thing.
DOI: 10.1038/nmeth.2659
2013
Cited 223 times
Error bars
The meaning of error bars is often misinterpreted, as is the statistical significance of their overlap.
DOI: 10.1038/nmeth.3587
2015
Cited 222 times
Association, correlation and causation
Correlation implies association, but not causation. Conversely, causation implies association, but not correlation.
DOI: 10.1038/s41583-020-0313-3
2020
Cited 214 times
Reproducibility of animal research in light of biological variation
Context-dependent biological variation presents a unique challenge to the reproducibility of results in experimental animal research, because organisms’ responses to experimental treatments can vary with both genotype and environmental conditions. In March 2019, experts in animal biology, experimental design and statistics convened in Blonay, Switzerland, to discuss strategies addressing this challenge. In contrast to the current gold standard of rigorous standardization in experimental animal research, we recommend the use of systematic heterogenization of study samples and conditions by actively incorporating biological variation into study design through diversifying study samples and conditions. Here we provide the scientific rationale for this approach in the hope that researchers, regulators, funders and editors can embrace this paradigm shift. We also present a road map towards better practices in view of improving the reproducibility of animal research. In this Perspective, Hanno Würbel and colleagues argue that a disregard for incorporating biological variation in study design is an important cause of poor reproducibility in animal research. They put the case for the use of systematic heterogenization of study samples and conditions in studies to improve reproducibility.
DOI: 10.1038/nmeth.4551
2018
Cited 200 times
Machine learning: supervised methods
Supervised learning algorithms extract general principles from observed examples guided by a specific prediction objective.
DOI: 10.1038/nmeth.4438
2017
Cited 139 times
Ensemble methods: bagging and random forests
Many heads are better than one.
DOI: 10.1038/s41592-020-0856-2
2020
Cited 123 times
The SEIRS model for infectious disease dynamics
Realistic models of epidemics account for latency, loss of immunity, births and deaths.
DOI: 10.1093/molbev/msj051
2005
Cited 278 times
Expression Pattern Shifts Following Duplication Indicative of Subfunctionalization and Neofunctionalization in Regulatory Genes of Arabidopsis
Gene duplication plays an important role in the evolution of diversity and novel function and is especially prevalent in the nuclear genomes of flowering plants. Duplicate genes may be maintained through subfunctionalization and neofunctionalization at the level of expression or coding sequence. In order to test the hypothesis that duplicated regulatory genes will be differentially expressed in a specific manner indicative of regulatory subfunctionalization and/or neofunctionalization, we examined expression pattern shifts in duplicated regulatory genes in Arabidopsis. A two-way analysis of variance was performed on expression data for 280 phylogenetically identified paralogous pairs. Expression data were extracted from global expression profiles for wild-type root, stem, leaf, developing inflorescence, nearly mature flower buds, and seedpod. Gene, organ, and gene by organ interaction (G x O) effects were examined. Results indicate that 85% of the paralogous pairs exhibited a significant G x O effect indicative of regulatory subfunctionalization and/or neofunctionalization. A significant G x O effect was associated with complementary expression patterns in 45% of pairwise comparisons. No association was detected between a G x O effect and a relaxed evolutionary constraint as detected by the ratio of nonsynonymous to synonymous substitutions. Ancestral gene expression patterns inferred across a Type II MADS-box gene phylogeny suggest several cases of regulatory neofunctionalization and organ-specific nonfunctionalization. Complete linkage clustering of gene expression levels across organs suggests that regulatory modules for each organ are independent or ancestral genes had limited expression. We propose a new classification, regulatory hypofunctionalization, for an overall decrease in expression level in one member of a paralogous pair while still having a significant G x O effect. 
We conclude that expression divergence specifically indicative of subfunctionalization and/or neofunctionalization contributes to the maintenance of most if not all duplicated regulatory genes in Arabidopsis and hypothesize that this results in increasing expression diversity or specificity of regulatory genes after each round of duplication.
DOI: 10.1104/pp.104.040436
2004
Cited 249 times
Genome-Wide Analysis of the Cyclin Family in Arabidopsis and Comparative Phylogenetic Analysis of Plant Cyclin-Like Proteins
Cyclins are primary regulators of the activity of cyclin-dependent kinases, which are known to play critical roles in controlling eukaryotic cell cycle progression. While there has been extensive research on cell cycle mechanisms and cyclin function in animals and yeasts, only a small number of plant cyclins have been characterized functionally. In this paper, we describe an exhaustive search for cyclin genes in the Arabidopsis genome and among available sequences from other vascular plants. Based on phylogenetic analysis, we define 10 classes of plant cyclins, four of which are plant-specific, and a fifth is shared between plants and protists but not animals. Microarray and reverse transcriptase-polymerase chain reaction analyses further provide expression profiles of cyclin genes in different tissues of wild-type Arabidopsis plants. Comparative phylogenetic studies of 174 plant cyclins were also performed. The phylogenetic results imply that the cyclin gene family in plants has experienced more gene duplication events than in animals. Expression patterns and phylogenetic analyses of Arabidopsis cyclin genes suggest potential gene redundancy among members belonging to the same group. We discuss possible divergence and conservation of some plant cyclins. Our study provides an opportunity to rapidly assess the position of plant cyclin genes in terms of evolution and classification, serving as a guide for further functional study of plant cyclins.
DOI: 10.1080/01621459.1990.10474936
1990
Cited 219 times
Kernel Smoothing of Data with Correlated Errors
Kernel smoothing is a common method of estimating the mean function in the nonparametric regression model y = f(x) + ε, where f(x) is a smooth deterministic mean function and ε is an error process with mean zero. In this article, the mean squared error of kernel estimators is computed for processes with correlated errors, and the estimators are shown to be consistent when the sequence of error processes converges to a mixing sequence. The standard techniques for bandwidth selection, such as cross-validation and generalized cross-validation, are shown to perform very badly when the errors are correlated. Standard selection techniques are shown to favor undersmoothing when the correlations are predominantly positive and oversmoothing when negative. The selection criteria can, however, be adjusted to correct for the effect of correlation. In simulations, the standard selection criteria are shown to behave as predicted. The corrected criteria are shown to be very effective when the correlation function is known. Estimates of correlation based on the data are shown, by simulation, to be sufficiently good for correcting the selection criteria, particularly if the signal to noise ratio is small.
DOI: 10.1186/1471-2164-10-347
2009
Cited 193 times
Comparison of next generation sequencing technologies for transcriptome characterization
Background: We have developed a simulation approach to help determine the optimal mixture of sequencing methods for most complete and cost effective transcriptome sequencing. We compared simulation results for traditional capillary sequencing with "Next Generation" (NG) ultra high-throughput technologies. The simulation model was parameterized using mappings of 130,000 cDNA sequence reads to the Arabidopsis genome (NCBI Accession SRA008180.19). We also generated 454-GS20 sequences and de novo assemblies for the basal eudicot California poppy (Eschscholzia californica) and the magnoliid avocado (Persea americana) using a variety of methods for cDNA synthesis. Results: The Arabidopsis reads tagged more than 15,000 genes, including new splice variants and extended UTR regions. Of the total 134,791 reads (13.8 MB), 119,518 (88.7%) mapped exactly to known exons, while 1,117 (0.8%) mapped to introns, 11,524 (8.6%) spanned annotated intron/exon boundaries, and 3,066 (2.3%) extended beyond the end of annotated UTRs. Sequence-based inference of relative gene expression levels correlated significantly with microarray data. As expected, NG sequencing of normalized libraries tagged more genes than non-normalized libraries, although non-normalized libraries yielded more full-length cDNA sequences. The Arabidopsis data were used to simulate additional rounds of NG and traditional EST sequencing, and various combinations of each. Our simulations suggest a combination of FLX and Solexa sequencing for optimal transcriptome coverage at modest cost. We have also developed ESTcalc (http://fgp.huck.psu.edu/NG_Sims/ngsim.pl), an online webtool, which allows users to explore the results of this study by specifying individualized costs and sequencing characteristics. Conclusion: NG sequencing technologies are a highly flexible set of platforms that can be scaled to suit different project goals. In terms of sequence coverage alone, the NG sequencing is a dramatic advance over capillary-based sequencing, but NG sequencing also presents significant challenges in assembly and sequence accuracy due to short read lengths, method-specific sequencing errors, and the absence of physical clones. These problems may be overcome by hybrid sequencing strategies using a mixture of sequencing methodologies, by new assemblers, and by sequencing more deeply. Sequencing and microarray outcomes from multiple experiments suggest that our simulator will be useful for guiding NG transcriptome sequencing projects in a wide range of organisms.
DOI: 10.1038/nmeth.2738
2013
Cited 170 times
Power and sample size
The ability to detect experimental effects is undermined in studies that lack power.
DOI: 10.1016/0378-3758(94)00102-2
1995
Cited 163 times
Bandwidth selection for kernel distribution function estimation
Leave-one-out cross-validation is a popular and readily implemented heuristic for bandwidth selection in nonparametric smoothing problems. In this note we elucidate the role of leave-one-out selection criteria by discussing a criterion introduced by Sarda (J. Statist. Plann. Inference 35 (1993) 65–75) for bandwidth selection for kernel distribution function estimators (KDFEs). We show that for this problem, use of the leave-one-out KDFE in the selection procedure is asymptotically equivalent to leaving none out. This contrasts with kernel density estimation, where use of the leave-one-out density estimator in the selection procedure is critical. Unfortunately, simulations show that neither method works in practice, even for samples of size as large as 1000. In fact, we show that for any fixed bandwidth, the expected value of the derivative of the leave-none-out criterion is asymptotically positive. This result and our simulations suggest that the criteria are increasing and that for sufficiently large samples (e.g., n = 100), the smallest available bandwidth will always be selected, thus contradicting the optimality result of Sarda for this estimator. As an alternative to minimizing a selection criterion, we propose a plug-in estimator of the asymptotically optimal bandwidth. Simulations suggest that the plug-in is a good estimator of the asymptotically optimal bandwidth even for samples as small as 10 observations and is not too far from the finite sample bandwidth.
DOI: 10.1038/nmeth.3091
2014
Cited 152 times
Replication
Quality is often more important than quantity.
DOI: 10.1093/molbev/msu343
2014
Cited 137 times
Comparative Transcriptome Analyses Reveal Core Parasitism Genes and Suggest Gene Duplication and Repurposing as Sources of Structural Novelty
The origin of novel traits is recognized as an important process underlying many major evolutionary radiations. We studied the genetic basis for the evolution of haustoria, the novel feeding organs of parasitic flowering plants, using comparative transcriptome sequencing in three species of Orobanchaceae. Around 180 genes are upregulated during haustorial development following host attachment in at least two species, and these are enriched in proteases, cell wall modifying enzymes, and extracellular secretion proteins. Additionally, about 100 shared genes are upregulated in response to haustorium inducing factors prior to host attachment. Collectively, we refer to these newly identified genes as putative "parasitism genes." Most of these parasitism genes are derived from gene duplications in a common ancestor of Orobanchaceae and Mimulus guttatus, a related nonparasitic plant. Additionally, the signature of relaxed purifying selection and/or adaptive evolution at specific sites was detected in many haustorial genes, and may play an important role in parasite evolution. Comparative analysis of gene expression patterns in parasitic and nonparasitic angiosperms suggests that parasitism genes are derived primarily from root and floral tissues, but with some genes co-opted from other tissues. Gene duplication, often taking place in a nonparasitic ancestor of Orobanchaceae, followed by regulatory neofunctionalization, was an important process in the origin of parasitic haustoria.
DOI: 10.1038/nmeth.2698
2013
Cited 134 times
Significance, P values and t-tests
The P value reported by tests is a probabilistic significance, not a biological one.
DOI: 10.1038/nmeth.4526
2017
Cited 134 times
Machine learning: a primer
Machine learning extracts patterns from data without explicit instructions.
DOI: 10.1371/journal.pone.0146062
2016
Cited 97 times
Selecting Superior De Novo Transcriptome Assemblies: Lessons Learned by Leveraging the Best Plant Genome
Whereas de novo assemblies of RNA-Seq data are being published for a growing number of species across the tree of life, there are currently no broadly accepted methods for evaluating such assemblies. Here we present a detailed comparison of 99 transcriptome assemblies, generated with 6 de novo assemblers including CLC, Trinity, SOAP, Oases, ABySS and NextGENe. Controlled analyses of de novo assemblies for Arabidopsis thaliana and Oryza sativa transcriptomes provide new insights into the strengths and limitations of transcriptome assembly strategies. We find that the leading assemblers generate reassuringly accurate assemblies for the majority of transcripts. At the same time, we find a propensity for assemblers to fail to fully assemble highly expressed genes. Surprisingly, the instance of true chimeric assemblies is very low for all assemblers. Normalized libraries are reduced in highly abundant transcripts, but they also lack 1000s of low abundance transcripts. We conclude that the quality of de novo transcriptome assemblies is best assessed through consideration of a combination of metrics: 1) proportion of reads mapping to an assembly 2) recovery of conserved, widely expressed genes, 3) N50 length statistics, and 4) the total number of unigenes. We provide benchmark Illumina transcriptome data and introduce SCERNA, a broadly applicable modular protocol for de novo assembly improvement. Finally, our de novo assembly of the Arabidopsis leaf transcriptome revealed ~20 putative Arabidopsis genes lacking in the current annotation.
DOI: 10.1038/nmeth.4120
2017
Cited 97 times
P values and the search for significance
Little P value What are you trying to say Of significance? —Steve Ziliak
DOI: 10.1038/nmeth.4299
2017
Cited 95 times
Clustering
Clustering finds patterns in data—whether they are there or not.
DOI: 10.1038/srep15667
2015
Cited 93 times
Intraspecific diversity among partners drives functional variation in coral symbioses
The capacity of coral-dinoflagellate mutualisms to adapt to a changing climate relies in part on standing variation in host and symbiont populations, but rarely have the interactions between symbiotic partners been considered at the level of individuals. Here, we tested the importance of inter-individual variation with respect to the physiology of coral holobionts. We identified six genetically distinct Acropora palmata coral colonies that all shared the same isoclonal Symbiodinium 'fitti' dinoflagellate strain. No other Symbiodinium could be detected in host tissues. We exposed fragments of each colony to extreme cold and found that the stress-induced change in symbiont photochemical efficiency varied up to 3.6-fold depending on host genetic background. The S. 'fitti' strain was least stressed when associating with hosts that significantly altered the expression of 184 genes under cold shock; it was most stressed in hosts that only adjusted 14 genes. Key expression differences among hosts were related to redox signaling and iron availability pathways. Fine-scale interactions among unique host colonies and symbiont strains provide an underappreciated source of raw material for natural selection in coral symbioses.
DOI: 10.1038/nmeth.3665
2015
Cited 93 times
Multiple linear regression
When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.
DOI: 10.1038/nmeth.3627
2015
Cited 92 times
Simple linear regression
“The statistician knows...that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world.”
DOI: 10.1038/nmeth.3414
2015
Cited 90 times
Sampling distributions and the bootstrap
The bootstrap can be used to assess uncertainty of sample estimates.
DOI: 10.1073/pnas.1608765113
2016
Cited 85 times
Horizontal gene transfer is more frequent with increased heterotrophy and contributes to parasite adaptation
Significance: Horizontal gene transfer (HGT) is the nonsexual transfer and genomic integration of genetic materials between organisms. In eukaryotes, HGT appears rare, but parasitic plants may be exceptions, as haustorial feeding connections between parasites and their hosts provide intimate cellular contacts that could facilitate DNA transfer between unrelated species. Through analysis of genome-scale data, we identified >50 expressed and likely functional HGT events in one family of parasitic plants. HGT reflected parasite preferences for different host plants and was much more frequent in plants with increasing parasitic dependency. HGT was strongly biased toward expression and protein types likely to contribute to haustorial function, suggesting that functional HGT of host genes may play an important role in adaptive evolution of parasites.
DOI: 10.1038/s41592-020-0822-z
2020
Cited 85 times
Modeling infectious epidemics
“Every day sadder and sadder news of its increase. In the City died this week 7496; and of them, 6102 of the plague. But it is feared that the true number of the dead this week is near 10,000 ....” —Samuel Pepys, 1665
DOI: 10.1038/nmeth.3904
2016
Cited 84 times
Logistic regression
DOI: 10.1038/s41592-018-0083-2
2018
Cited 77 times
Optimal experimental design
Customize the experiment for the setting instead of adjusting the setting to fit a classical design.
DOI: 10.1038/s41477-019-0458-0
2019
Cited 74 times
Convergent horizontal gene transfer and cross-talk of mobile nucleic acids in parasitic plants
DOI: 10.1890/0012-9658(2002)083[1831:ibhfal]2.0.co;2
2002
Cited 128 times
Interactions between Herbivorous Fishes and Limiting Nutrients in a Tropical Stream Ecosystem
Ecologists have long been interested in understanding the strengths of consumer and resource limitation in influencing communities. Here we ask three questions concerning the relative importance of nutrients and grazing fishes to primary producers of a tropical Andean stream: (1) Are stream algae nutrient limited? (2) Are top-down and bottom-up forces of dual importance in limiting primary producers? (3) Do grazing fishes modulate the degree of resource limitation? We obtained several lines of evidence suggesting that Andean stream algae are nitrogen limited. Addition of nitrogen in flow-through channels resulted in major increases in algal standing crop, whereas there were no measurable effects of phosphorus enrichment. Interestingly, the N2-fixing cyanobacteria Anabaena was one of the taxa that responded most dramatically to the addition of nitrogen. Moreover, nutrient uptake rates were significantly higher for inorganic nitrogen (NO3-N and NH4-N) compared to phosphorus (PO4-P). Nutrients and the presence of grazing fishes were manipulated simultaneously in a series of experiments by using nutrient-diffusing substrates in fish exclusions vs. open cages accessible to the natural fish assemblage. We observed strong effects of both nitrogen addition and consumers on algal standing crop, although consumer limitation was found to be of considerably greater magnitude than resource limitation in influencing algal biomass and composition. Finally, the degree of resource limitation varied as a consequence of grazing fishes. Experiments examining nutrient limitation in the presence and absence of fishes showed that the response to nitrogen enrichment was significantly greater on substrates accessible to natural fish assemblages compared to substrates where grazing fishes were excluded. These experiments demonstrate simultaneous and interactive effects of top-down and bottom-up factors in limiting primary producers of tropical Andean streams. 
Whereas other studies have shown that consumers affect nutrient supply in ecosystems, our findings suggest that consumers can play an important role in influencing nutrient demand.
DOI: 10.1094/phyto-81-539
1991
Cited 126 times
Aphid Transmission of Barley Yellow Dwarf Virus: Acquisition Access Periods and Virus Concentration Requirements
The duration of access periods and the availability of virus in source plants are two factors that influence the transmission of barley yellow dwarf virus (BYDV) by its aphid vectors. This study was conducted to quantify the relationships among acquisition access period (AAP), virus titer in infected oats, and transmission of three isolates of BYDV from New York by two aphid vector species. Thirteen AAPs, ranging from 15 min to 72 hr, were examined, and virus titer was quantified from each virus source leaf by using enzyme-linked immunosorbent assay (ELISA) (...)
DOI: 10.1016/j.tplants.2007.06.012
2007
Cited 106 times
The floral genome: an evolutionary history of gene duplication and shifting patterns of gene expression
Through multifaceted genome-scale research involving phylogenomics, targeted gene surveys, and gene expression analyses in diverse basal lineages of angiosperms, our studies provide insights into the most recent common ancestor of all extant flowering plants. MADS-box gene duplications have played an important role in the origin and diversification of angiosperms. Furthermore, early angiosperms possessed a diverse tool kit of floral genes and exhibited developmental ‘flexibility’, with broader patterns of expression of key floral organ identity genes than are found in eudicots. In particular, homologs of B-function MADS-box genes are more broadly expressed across the floral meristem in basal lineages. These results prompted formulation of the ‘fading borders’ model, which states that the gradual transitions in floral organ morphology observed in some basal angiosperms (e.g. Amborella) result from a gradient in the level of expression of floral organ identity genes across the developing floral meristem.
DOI: 10.1007/s00338-013-1012-6
2013
Cited 86 times
Genotypic variation influences reproductive success and thermal stress tolerance in the reef building coral, Acropora palmata
DOI: 10.1214/09-aos737
2010
Cited 83 times
On dimension folding of matrix- or array-valued statistical objects
We consider dimension reduction for regression or classification in which the predictors are matrix- or array-valued. This type of predictor arises when measurements are obtained for each combination of two or more underlying variables; for example, the voltage measured at different channels and times in electroencephalography data. For these applications, it is desirable to preserve the array structure of the reduced predictor (e.g., time versus channel), but this cannot be achieved within the conventional dimension reduction formulation. In this paper, we introduce a dimension reduction method, to be called dimension folding, for matrix- and array-valued predictors that preserves the array structure. In an application of dimension folding to an electroencephalography data set, we correctly classify 97 out of 122 subjects as alcoholic or nonalcoholic based on their electroencephalography in a cross-validation sample.
DOI: 10.1038/nmeth.2613
2013
Cited 78 times
Importance of being uncertain
Statistics does not tell us whether we are right. It tells us the chances of being wrong.
DOI: 10.1038/nmeth.3335
2015
Cited 73 times
Bayes' theorem
Incorporate new evidence to update prior information.
DOI: 10.1038/nmeth.3812
2016
Cited 67 times
Analyzing outliers: influential or nuisance?
Some outliers influence the regression fit more than others.
DOI: 10.1038/nmeth.2937
2014
Cited 66 times
Nonparametric tests
Nonparametric tests robustly compare skewed or ranked data.
DOI: 10.1186/1471-2164-13-9
2012
Cited 66 times
Rootstock-regulated gene expression patterns associated with fire blight resistance in apple
Desirable apple varieties are clonally propagated by grafting vegetative scions onto rootstocks. Rootstocks influence many phenotypic traits of the scion, including resistance to pathogens such as Erwinia amylovora, which causes fire blight, the most serious bacterial disease of apple. The purpose of the present study was to quantify rootstock-mediated differences in scion fire blight susceptibility and to identify transcripts in the scion whose expression levels correlated with this response. Rootstock influence on scion fire blight resistance was quantified by inoculating three-year old, orchard-grown apple trees, consisting of 'Gala' scions grafted to a range of rootstocks, with E. amylovora. Disease severity was measured by the extent of shoot necrosis over time. 'Gala' scions grafted to G.30 or MM.111 rootstocks showed the lowest rates of necrosis, while 'Gala' on M.27 and B.9 showed the highest rates of necrosis. 'Gala' scions on M.7, S.4 or M.9F56 had intermediate necrosis rates. Using an apple DNA microarray representing 55,230 unique transcripts, gene expression patterns were compared in healthy, un-inoculated, greenhouse-grown 'Gala' scions on the same series of rootstocks. We identified 690 transcripts whose steady-state expression levels correlated with the degree of fire blight susceptibility of the scion/rootstock combinations. Transcripts known to be differentially expressed during E. amylovora infection were disproportionately represented among these transcripts. A second-generation apple microarray representing 26,000 transcripts was developed and was used to test these correlations in an orchard-grown population of trees segregating for fire blight resistance.
Of the 690 transcripts originally identified using the first-generation array, 39 had expression levels that correlated with fire blight resistance in the breeding population. Rootstocks had significant effects on the fire blight susceptibility of 'Gala' scions, and rootstock-regulated gene expression patterns could be correlated with differences in susceptibility. The results suggest a relationship between rootstock-regulated fire blight susceptibility and sorbitol dehydrogenase, phenylpropanoid metabolism, protein processing in the endoplasmic reticulum, and endocytosis, among others. This study illustrates the utility of our rootstock-regulated gene expression data sets for candidate trait-associated gene data mining.
DOI: 10.1038/nmeth.4210
2017
Cited 66 times
Interpreting P values
A P value measures a sample's compatibility with a hypothesis, not the truth of the hypothesis.
DOI: 10.1186/1471-2229-13-9
2013
Cited 64 times
Functional genomics of a generalist parasitic plant: Laser microdissection of host-parasite interface reveals host-specific patterns of parasite gene expression
Orobanchaceae is the only plant family with members representing the full range of parasitic lifestyles plus a free-living lineage sister to all parasitic lineages, Lindenbergia. A generalist member of this family, and an important parasitic plant model, Triphysaria versicolor regularly feeds upon a wide range of host plants. Here, we compare de novo assembled transcriptomes generated from laser micro-dissected tissues at the host-parasite interface to uncover details of the largely uncharacterized interaction between parasitic plants and their hosts. The interaction of Triphysaria with the distantly related hosts Zea mays and Medicago truncatula reveals dramatic host-specific gene expression patterns. Relative to above ground tissues, gene families are disproportionally represented at the interface including enrichment for transcription factors and genes of unknown function. Quantitative Real-Time PCR of a T. versicolor β-expansin shows strong differential (120x) upregulation in response to the monocot host Z. mays; a result that is concordant with our read count estimates. Pathogenesis-related proteins, other cell wall modifying enzymes, and orthologs of genes with unknown function (annotated as such in sequenced plant genomes) are among the parasite genes highly expressed by T. versicolor at the parasite-host interface. Laser capture microdissection makes it possible to sample the small region of cells at the epicenter of parasite host interactions. The results of our analysis suggest that T. versicolor's generalist strategy involves a reliance on overlapping but distinct gene sets, depending upon the host plant it is parasitizing. The massive upregulation of a T. versicolor β-expansin is suggestive of a mechanism for parasite success on grass hosts. In this preliminary study of the interface transcriptomes, we have shown that T. versicolor, and the Orobanchaceae in general, provide excellent opportunities for the characterization of plant genes with unknown functions.
DOI: 10.1111/mec.12163
2013
Cited 61 times
Variation in the transcriptional response of threatened coral larvae to elevated temperatures
Coral populations have declined worldwide largely due to increased sea surface temperatures. Recovery of coral populations depends in part upon larval recruitment. Many corals reproduce during the warmest time of year when further increases in temperature can lead to low fertilization rates of eggs and high larval mortality. Microarray experiments were designed to capture and assess variability in the thermal stress responses of Acropora palmata larvae from Puerto Rico. Transcription profiles showed a striking acceleration of normal developmental gene expression patterns with increased temperature. The transcriptional response to heat suggested rapid depletion of larval energy stores via peroxisomal lipid oxidation and included key enzymes that indicated the activation of the glyoxylate cycle. High temperature also resulted in expression differences in key developmental signalling genes including the conserved WNT pathway that is critical for pattern formation and tissue differentiation in developing embryos. Expression of these and other important developmental and thermal stress genes such as ferritin, heat shock proteins, cytoskeletal components, cell adhesion and autophagy proteins also varied among larvae derived from different parent colonies. Disruption of normal developmental and metabolic processes will have negative impacts on larval survival and dispersal as temperatures rise. However, it appears that variation in larval response to high temperature remains despite the dramatic population declines. Further research is needed to determine whether this variation is heritable or attributable to maternal effects.
DOI: 10.1038/nmeth.2900
2014
Cited 57 times
Comparing samples—part II
When a large number of tests are performed, P values must be interpreted differently.
DOI: 10.1038/nmeth.3005
2014
Cited 57 times
Analysis of variance and blocking
Good experimental designs mitigate experimental error and the impact of factors not under study.
DOI: 10.1038/nmeth.3854
2016
Cited 57 times
Regression diagnostics
Residual plots can be used to validate assumptions about the regression model.
DOI: 10.1038/nmeth.2858
2014
Cited 57 times
Comparing samples—part I
Robustly comparing pairs of independent or related samples requires different approaches to the t-test.
DOI: 10.1038/nmeth.3293
2015
Cited 55 times
Split plot design
When some factors are harder to vary than others, a split plot design can be efficient.
DOI: 10.1038/s41592-019-0406-y
2019
Cited 54 times
Quantile regression
DOI: 10.1038/s41592-021-01302-4
2021
Cited 34 times
The class imbalance problem
DOI: 10.1038/s41592-020-01036-9
2021
Cited 29 times
The standardization fallacy
“We demand rigidly defined areas of doubt and uncertainty!” —D. Adams
DOI: 10.1016/0168-1605(90)90013-u
1990
Cited 90 times
Growth of Listeria monocytogenes Scott A, serotype 4 and competitive spoilage organisms in raw chicken packaged under modified atmospheres and in air
The development of Listeria monocytogenes Scott A, serotype 4 and aerobic plate counts on minced raw chicken were determined independently at 4, 10, and 27°C. Samples were packaged in flexible film under two modified atmospheres (one containing oxygen and one containing no oxygen) or air. The anaerobic modified atmosphere (75:25, CO2:N2) resulted in the failure of both the aerobic plate counts and L. monocytogenes to grow at all temperatures. Both the L. monocytogenes and aerobic plate counts grew in air at all temperatures. The aerobic modified atmosphere (72.5:22.5:5, CO2:N2:O2), which more closely duplicates commercial practice, inhibited the increase in aerobic plate counts by more than 4 log10 cfu/g compared to air at 4°C. However, the L. monocytogenes was not affected by this atmosphere and increased in numbers by nearly 6 log10 cfu/g at 4°C in 21 days. Regression analysis of the log10 growth and 95% confidence intervals showed that the differences between aerobic plate counts and L. monocytogenes in modified atmosphere were large. The ability of L. monocytogenes to grow in the aerobic modified atmosphere was not affected by level of the L. monocytogenes inoculum nor by the initial level of aerobic plate counts. These data show that modified atmosphere packaging of raw chicken (and probably other meats) can substantially inhibit the aerobic spoilage flora while allowing pathogenic L. monocytogenes to increase.
DOI: 10.1093/nar/gkm972
2007
Cited 80 times
PlantTribes: a gene and gene family resource for comparative genomics in plants
The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) is a plant gene family database based on the inferred proteomes of five sequenced plant species: Arabidopsis thaliana, Carica papaya, Medicago truncatula, Oryza sativa and Populus trichocarpa. We used the graph-based clustering algorithm MCL [Van Dongen (Technical Report INS-R0010 2000) and Enright et al. (Nucleic Acids Res. 2002; 30: 1575-1584)] to classify all of these species' protein-coding genes into putative gene families, called tribes, using three clustering stringencies (low, medium and high). For all tribes, we have generated protein and DNA alignments and maximum-likelihood phylogenetic trees. A parallel database of microarray experimental results is linked to the genes, which lets researchers identify groups of related genes and their expression patterns. Unified nomenclatures were developed, and tribes can be related to traditional gene families and conserved domain identifiers. SuperTribes, constructed through a second iteration of MCL clustering, connect distant, but potentially related gene clusters. The global classification of nearly 200 000 plant proteins was used as a scaffold for sorting approximately 4 million additional cDNA sequences from over 200 plant species. All data and analyses are accessible through a flexible interface allowing users to explore the classification, to place query sequences within the classification, and to download results for further study.
DOI: 10.1007/s11295-009-0228-7
2009
Cited 79 times
Rootstock-regulated gene expression patterns in apple tree scions
DOI: 10.1186/1471-2105-8-294
2007
Cited 78 times
Quantitative sequence-function relationships in proteins based on gene ontology
The relationship between divergence of amino-acid sequence and divergence of function among homologous proteins is complex. The assumption that homologs share function--the basis of transfer of annotations in databases--must therefore be regarded with caution. Here, we present a quantitative study of sequence and function divergence, based on the Gene Ontology classification of function. We determined the relationship between sequence divergence and function divergence in 6828 protein families from the PFAM database. Within families there is a broad range of sequence similarity from very closely related proteins--for instance, orthologs in different mammals--to very distantly-related proteins at the limit of reliable recognition of homology. We correlated the divergence in sequences determined from pairwise alignments, and the divergence in function determined by path lengths in the Gene Ontology graph, taking into account the fact that many proteins have multiple functions. Our results show that, among homologous proteins, the proportion of divergent functions decreases dramatically above a threshold of sequence similarity at about 50% residue identity. For proteins with more than 50% residue identity, transfer of annotation between homologs will lead to an erroneous attribution with a totally dissimilar function in fewer than 6% of cases. This means that for very similar proteins (about 50% identical residues) the chance of completely incorrect annotation is low; however, because of the phenomenon of recruitment, it is still non-zero. Our results describe general features of the evolution of protein function, and serve as a guide to the reliability of annotation transfer, based on the closeness of the relationship between a new protein and its nearest annotated relative.
DOI: 10.1073/pnas.0902417106
2009
Cited 74 times
Transcriptome of embryonic and neonatal mouse cortex by high-throughput RNA sequencing
Brain structure and function experience dramatic changes from embryonic to postnatal development. Microarray analyses have detected differential gene expression at different stages and in disease models, but gene expression information during early brain development is limited. We have generated >27 million reads to identify mRNAs from the mouse cortex for >16,000 genes at either embryonic day 18 (E18) or postnatal day 7 (P7), a period of significant synaptogenesis for neural circuit formation. In addition, we devised strategies to detect alternative splice forms and uncovered more splice variants. We observed differential expression of 3,758 genes between the 2 stages, many with known functions or predicted to be important for neural development. Neurogenesis-related genes, such as those encoding Sox4, Sox11, and zinc-finger proteins, were more highly expressed at E18 than at P7. In contrast, the genes encoding synaptic proteins such as synaptotagmin, complexin 2, and syntaxin were up-regulated from E18 to P7. We also found that several neurological disorder-related genes were highly expressed at E18. Our transcriptome analysis may serve as a blueprint for gene expression pattern and provide functional clues of previously unknown genes and disease-related genes during early brain development.
DOI: 10.1073/pnas.1013395108
2010
Cited 66 times
Conservation and canalization of gene expression during angiosperm diversification accompany the origin and evolution of the flower
The origin and rapid diversification of the angiosperms (Darwin's "Abominable Mystery") has engaged generations of researchers. Here, we examine the floral genetic programs of phylogenetically pivotal angiosperms (water lily, avocado, California poppy, and Arabidopsis) and a nonflowering seed plant (a cycad) to obtain insight into the origin and subsequent evolution of the flower. Transcriptional cascades with broadly overlapping spatial domains, resembling the hypothesized ancestral gymnosperm program, are deployed across morphologically intergrading organs in water lily and avocado flowers. In contrast, spatially discrete transcriptional programs in distinct floral organs characterize the more recently derived angiosperm lineages represented by California poppy and Arabidopsis. Deep evolutionary conservation in the genetic programs of putatively homologous floral organs traces to those operating in gymnosperm reproductive cones. Female gymnosperm cones and angiosperm carpels share conserved genetic features, which may be associated with the ovule developmental program common to both organs. However, male gymnosperm cones share genetic features with both perianth (sterile attractive and protective) organs and stamens, supporting the evolutionary origin of the floral perianth from the male genetic program of seed plants.
DOI: 10.1038/nmeth.3137
2014
Cited 49 times
Nested designs
For studies with hierarchical noise sources, use a nested analysis of variance approach.
DOI: 10.1038/s41598-017-01005-x
2017
Cited 44 times
Novel metrics to measure coverage in whole exome sequencing datasets reveal local and global non-uniformity
Whole Exome Sequencing (WES) is a powerful clinical diagnostic tool for discovering the genetic basis of many diseases. A major shortcoming of WES is uneven coverage of sequence reads over the exome targets contributing to many low coverage regions, which hinders accurate variant calling. In this study, we devised two novel metrics, Cohort Coverage Sparseness (CCS) and Unevenness (UE) Scores for a detailed assessment of the distribution of coverage of sequence reads. Employing these metrics we revealed non-uniformity of coverage and low coverage regions in the WES data generated by three different platforms. This non-uniformity of coverage is both local (coverage of a given exon across different platforms) and global (coverage of all exons across the genome in the given platform). The low coverage regions encompassing functionally important genes were often associated with high GC content, repeat elements and segmental duplications. While a majority of the problems associated with WES are due to the limitations of the capture methods, further refinements in WES technologies have the potential to enhance its clinical applications.
DOI: 10.1098/rsos.160576
2016
Cited 40 times
Chemical communication is not sufficient to explain reproductive inhibition in the bumblebee Bombus impatiens
Reproductive division of labour is a hallmark of eusociality, but disentangling the underlying proximate mechanisms can be challenging. In bumblebees, workers isolated from the queen can activate their ovaries and lay haploid, male eggs. We investigated if volatile, contact, visual or behavioural cues produced by the queen or brood mediate reproductive dominance in Bombus impatiens. Exposure to queen-produced volatiles, brood-produced volatiles and direct contact with pupae did not reduce worker ovary activation; only direct contact with the queen could reduce ovary activation. We evaluated behaviour, physiology and gene expression patterns in workers that were reared in chambers with all stages of brood and a free queen, caged queen (where workers could contact the queen, but the queen was unable to initiate interactions) or no queen. Workers housed with a caged queen or no queen fully activated their ovaries, whereas ovary activation in workers housed with a free queen was completely inhibited. The caged queen marginally reduced worker aggression and expression of an aggression-associated gene relative to queenless workers. Thus, queen-initiated behavioural interactions appear necessary to establish reproductive dominance. Queen-produced chemical cues may function secondarily in a context-specific manner to augment behavioural cues, as reliable or honest signal.
DOI: 10.1038/s41592-023-01973-1
2023
Cited 6 times
Convolutional neural networks
DOI: 10.1007/s11103-005-5434-6
2005
Cited 70 times
Genome-wide expression profiling and identification of gene activities during early flower development in Arabidopsis
DOI: 10.1186/gb-2008-9-3-402
2008
Cited 62 times
The Amborella genome: an evolutionary reference for plant biology
The nuclear genome sequence of Amborella trichopoda, the sister species to all other extant angiosperms, will be an exceptional resource for plant genomics.
DOI: 10.2307/2290011
1990
Cited 61 times
Kernel Smoothing of Data With Correlated Errors
Kernel smoothing is a common method of estimating the mean function in the nonparametric regression model y = f(x) + ε, where f(x) is a smooth deterministic mean function and ε is an error process with mean zero. In this article, the mean squared error of kernel estimators is computed for processes with correlated errors, and the estimators are shown to be consistent when the sequence of error processes converges to a mixing sequence. The standard techniques for bandwidth selection, such as cross-validation and generalized cross-validation, are shown to perform very badly when the errors are correlated. Standard selection techniques are shown to favor undersmoothing when the correlations are predominantly positive and oversmoothing when negative. The selection criteria can, however, be adjusted to correct for the effect of correlation. In simulations, the standard selection criteria are shown to behave as predicted. The corrected criteria are shown to be very effective when the correlation function is known. Estimates of correlation based on the data are shown, by simulation, to be sufficiently good for correcting the selection criteria, particularly if the signal to noise ratio is small.
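As a rough illustration of the model y = f(x) + ε described in this abstract, the sketch below computes one common kernel estimator, the Nadaraya-Watson local weighted mean. The Gaussian kernel, the function names, and the fixed `bandwidth` argument are assumptions made for illustration only; the abstract does not say which kernel or estimator the article uses, and the sketch does not implement the correlation-adjusted bandwidth selection it discusses.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_smooth(x_obs, y_obs, x0, bandwidth):
    """Nadaraya-Watson estimate of f(x0): a kernel-weighted mean of the y values.

    Observations whose x is close to x0 (relative to the bandwidth) get
    large weights; distant observations get weights near zero.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if len(x_obs) != len(y_obs):
        raise ValueError("x_obs and y_obs must have the same length")
    weights = [gaussian_kernel((x0 - xi) / bandwidth) for xi in x_obs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return sum(w * yi for w, yi in zip(weights, y_obs)) / total
```

A larger bandwidth averages over more neighbours and gives a smoother curve, while a smaller one follows the data more closely; the choice of bandwidth is the selection problem the abstract addresses.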
DOI: 10.1073/pnas.0811476106
2009
Cited 50 times
Transcriptional signatures of ancient floral developmental genetics in avocado (Persea americana; Lauraceae)
The debate on the origin and evolution of flowers has recently entered the field of developmental genetics, with focus on the design of the ancestral floral regulatory program. Flowers can differ dramatically among angiosperm lineages, but in general, male and female reproductive organs surrounded by a sterile perianth of sepals and petals constitute the basic floral structure. However, the basal angiosperm lineages exhibit spectacular diversity in the number, arrangement, and structure of floral organs, whereas the evolutionarily derived monocot and eudicot lineages share a far more uniform floral ground plan. Here we show that broadly overlapping transcriptional programs characterize the floral transcriptome of the basal angiosperm Persea americana (avocado), whereas floral gene expression domains are considerably more organ specific in the model eudicot Arabidopsis thaliana. Our findings therefore support the “fading borders” model for organ identity determination in basal angiosperm flowers and extend it from the action of regulatory genes to downstream transcriptional programs. Furthermore, the declining expression of components of the staminal transcriptome in central and peripheral regions of Persea flowers concurs with elements of a previous hypothesis for developmental regulation in a gymnosperm “floral progenitor.” Accordingly, in contrast to the canalized organ-specific regulatory apparatus of Arabidopsis, floral development may have been originally regulated by overlapping transcriptional cascades with fading gradients of influence from focal to bordering organs.
DOI: 10.1038/nmeth.4014
2016
Cited 31 times
Regularization
Constraining the magnitude of parameters of a model can control its complexity.
DOI: 10.1038/nmeth.3180
2014
Cited 29 times
Two-factor designs
When multiple factors can affect a system, allowing for interaction can increase sensitivity.
DOI: 10.1038/nmeth.3550
2015
Cited 29 times
Bayesian networks
For making probabilistic inferences, a graph is worth a thousand words.
DOI: 10.1038/s41592-019-0369-z
2019
Cited 23 times
Analyzing outliers: robust methods to the rescue
DOI: 10.1017/s0016672307008476
2006
Cited 45 times
Extending the loop design for two-channel microarray experiments
The loop design of Kerr and Churchill is a clever application of incomplete blocks of size 2 to two-channel microarray experiments. In this paper, we extend the loop design to include more replicates, biological and technical replication, multi-factor experiments, and blocking. Loop and extended loop designs are shown to be more efficient than the reference design for any given number of arrays. We also show that adding new treatments to a loop design requires the same number of additional arrays as adding treatments to a reference design, with a greater gain in power. Given the flexibility of extended loop designs and their power, we propose that these should be the designs of choice for most experiments using two-channel microarrays.
DOI: 10.2165/00822942-200504010-00004
2005
Cited 44 times
Replication, Variation and Normalisation in Microarray Experiments
DOI: 10.1186/gb-2010-11-10-r101
2010
Cited 32 times
Comparative transcriptomics among floral organs of the basal eudicot Eschscholzia californica as reference for floral evolutionary developmental studies
Molecular genetic studies of floral development have concentrated on several core eudicots and grasses (monocots), which have canalized floral forms. Basal eudicots possess a wider range of floral morphologies than the core eudicots and grasses and can serve as an evolutionary link between core eudicots and monocots, and provide a reference for studies of other basal angiosperms. Recent advances in genomics have enabled researchers to profile gene activities during floral development, primarily in the eudicot Arabidopsis thaliana and the monocots rice and maize. However, our understanding of floral developmental processes among the basal eudicots remains limited. Using a recently generated expressed sequence tag (EST) set, we have designed an oligonucleotide microarray for the basal eudicot Eschscholzia californica (California poppy). We performed microarray experiments with an interwoven-loop design in order to characterize the E. californica floral transcriptome and to identify differentially expressed genes in flower buds with pre-meiotic and meiotic cells, four floral organs at pre-anthesis stages (sepals, petals, stamens and carpels), developing fruits, and leaves. Our results provide a foundation for comparative gene expression studies between eudicots and basal angiosperms. We identified whorl-specific gene expression patterns in E. californica and examined the floral expression of several gene families. Interestingly, most E. californica homologs of Arabidopsis genes important for flower development, except for genes encoding MADS-box transcription factors, show different expression patterns between the two species. Our comparative transcriptomics study highlights the unique evolutionary position of E. californica compared with basal angiosperms and core eudicots.
DOI: 10.1186/1471-2105-14-165
2013
Cited 29 times
Separate-channel analysis of two-channel microarrays: recovering inter-spot information
Two-channel (or two-color) microarrays are cost-effective platforms for comparative analysis of gene expression. They are traditionally analysed in terms of the log-ratios (M-values) of the two channel intensities at each spot, but this analysis does not use all the information available in the separate channel observations. Mixed models have been proposed to analyse intensities from the two channels as separate observations, but such models can be complex to use and the gain in efficiency over the log-ratio analysis is difficult to quantify. Mixed models yield test statistics for which the null distributions can be specified only approximately, and some approaches do not borrow strength between genes. This article reformulates the mixed model to clarify the relationship with the traditional log-ratio analysis, to facilitate information borrowing between genes, and to obtain an exact distributional theory for the resulting test statistics. The mixed model is transformed to operate on the M-values and A-values (average log-expression for each spot) instead of on the log-expression values. The log-ratio analysis is shown to ignore information contained in the A-values. The relative efficiency of the log-ratio analysis is shown to depend on the size of the intraspot correlation. A new separate channel analysis method is proposed that assumes a constant intra-spot correlation coefficient across all genes. This approach permits the mixed model to be transformed into an ordinary linear model, allowing the data analysis to use a well-understood empirical Bayes analysis pipeline for linear modeling of microarray data. This yields statistically powerful test statistics that have an exact distributional theory. The log-ratio, mixed model and common correlation methods are compared using three case studies. The results show that separate channel analyses that borrow strength between genes are more powerful than log-ratio analyses. The common correlation analysis is the most powerful of all. The common correlation method proposed in this article for separate-channel analysis of two-channel microarray data is no more difficult to apply in practice than the traditional log-ratio analysis. It provides an intuitive and powerful means to conduct analyses and make comparisons that might otherwise not be possible.
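The M-values and A-values mentioned in this abstract can be illustrated with a small sketch. It assumes base-2 logarithms and takes the first channel as the numerator of the log-ratio; both are conventions chosen here for illustration, and the function names are hypothetical rather than taken from the article or any package.

```python
import math


def ma_values(channel_1, channel_2):
    """Return (M, A) for one spot from two positive channel intensities.

    M is the log-ratio of the two channels and A is the average
    log-expression, both on the log2 scale (an assumed convention).
    """
    if channel_1 <= 0 or channel_2 <= 0:
        raise ValueError("intensities must be positive")
    log_1 = math.log2(channel_1)
    log_2 = math.log2(channel_2)
    return log_1 - log_2, 0.5 * (log_1 + log_2)


def log_intensities(m, a):
    """Invert the transform: recover the two log2 intensities from (M, A)."""
    return a + 0.5 * m, a - 0.5 * m
```

Because the transform is invertible, the pair (M, A) carries the same information as the two separate log-intensities, which is why an analysis that uses only M discards whatever the A-values contain.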
DOI: 10.1038/nmeth.3224
2014
Cited 25 times
Sources of variation
To generalize conclusions to a population, we must sample its variation.
DOI: 10.1007/s10886-017-0858-4
2017
Cited 25 times
Do Bumble Bee, Bombus impatiens, Queens Signal their Reproductive and Mating Status to their Workers?
DOI: 10.1038/nmeth.3368
2015
Cited 24 times
Bayesian statistics
DOI: 10.1038/s41592-019-0532-6
2019
Cited 22 times
Markov models — hidden Markov models
DOI: 10.1038/s41592-019-0476-x
2019
Cited 20 times
Markov models—Markov chains
DOI: 10.1002/env.888
2007
Cited 37 times
Regression with spatially misaligned data
We present a simple approach to the problem of estimating the regression slope parameter from spatially misaligned point data. We assume a linear regression model with errors and covariates from two independent Gaussian spatial processes where covariate and response are observed at different locations. Correlation in the covariate is exploited to predict unobserved covariates via kriging. Kriged values are used to find weighted least squares estimates of regression parameters in a ‘krige‐and‐regress’ (KR) procedure. The variance of this estimator is calculated, and a variance estimator is proposed. Because the model and assumptions make it possible to write down the joint likelihood of the data, a maximum likelihood (ML) estimator can be found. Under regularity conditions, this estimator is asymptotically normal with asymptotic variance given by the inverse information matrix, which yields a variance estimator for the ML estimator of the regression parameters. The KR and ML estimators are compared in an example using Environmental Protection Agency data and a simulation study is conducted. While the ML estimator of the slope parameter has a smaller variance than the KR estimator, the ML variance estimator is too small to be used for inference whereas the KR variance estimator gives approximately correct inference. Copyright © 2007 John Wiley & Sons, Ltd.
DOI: 10.1038/s41592-023-02119-z
2024
Errors in predictor variables
DOI: 10.1186/s12915-024-01831-2
2024
A combination of conserved and diverged responses underlies Theobroma cacao’s defense response to Phytophthora palmivora
Plants have complex and dynamic immune systems that have evolved to resist pathogens. Humans have worked to enhance these defenses in crops through breeding. However, many crops harbor only a fraction of the genetic diversity present in wild relatives. Increased utilization of diverse germplasm to search for desirable traits, such as disease resistance, is therefore a valuable step towards breeding crops that are adapted to both current and emerging threats. Here, we examine diversity of defense responses across four populations of the long-generation tree crop Theobroma cacao L., as well as four non-cacao Theobroma species, with the goal of identifying genetic elements essential for protection against the oomycete pathogen Phytophthora palmivora. We began by creating a new, highly contiguous genome assembly for the P. palmivora-resistant genotype SCA 6 (Additional file 1: Tables S1-S5), deposited in GenBank under accessions CP139290-CP139299. We then used this high-quality assembly to combine RNA and whole-genome sequencing data to discover several genes and pathways associated with resistance. Many of these are unique, i.e., differentially regulated in only one of the four populations (diverged 40 k-900 k generations). Among the pathways shared across all populations is phenylpropanoid biosynthesis, a metabolic pathway with well-documented roles in plant defense. One gene in this pathway, caffeoyl shikimate esterase (CSE), was upregulated across all four populations following pathogen treatment, indicating its broad importance for cacao's defense response. Further experimental evidence suggests this gene hydrolyzes caffeoyl shikimate to create caffeic acid, an antimicrobial compound and known inhibitor of Phytophthora spp. Our results indicate most expression variation associated with resistance is unique to populations. Moreover, our findings demonstrate the value of using a broad sample of evolutionarily diverged populations for revealing the genetic bases of cacao resistance to P. palmivora. This approach has promise for further revealing and harnessing valuable genetic resources in this and other long-generation plants.
DOI: 10.1111/j.1365-313x.2010.04357.x
2010
Cited 22 times
Evolutionary trends in the floral transcriptome: insights from one of the basalmost angiosperms, the water lily Nuphar advena (Nymphaeaceae)
Current understanding of floral developmental genetics comes primarily from the core eudicot model Arabidopsis thaliana. Here, we explore the floral transcriptome of the basal angiosperm, Nuphar advena (water lily), for insights into the ancestral developmental program of flowers. We identify several thousand Nuphar genes with significantly upregulated floral expression, including homologs of the well-known ABCE floral regulators, deployed in broadly overlapping transcriptional programs across floral organ categories. Strong similarities in the expression profiles of different organ categories in Nuphar flowers are shared with the magnoliid Persea americana (avocado), in contrast to the largely organ-specific transcriptional cascades evident in Arabidopsis, supporting the inference that this is the ancestral condition in angiosperms. In contrast to most eudicots, floral organs are weakly differentiated in Nuphar and Persea, with staminodial intermediates between stamens and perianth in Nuphar, and between stamens and carpels in Persea. Consequently, the predominantly organ-specific transcriptional programs that characterize Arabidopsis flowers (and perhaps other eudicots) are derived, and correlate with a shift towards morphologically distinct floral organs, including differentiated sepals and petals, and a perianth distinct from stamens and carpels. Our findings suggest that the genetic regulation of more spatially discrete transcriptional programs underlies the evolution of floral morphology.
2018
Cited 18 times
Points of Significance: Statistics versus Machine Learning
DOI: 10.1093/bioinformatics/btv104
2015
Cited 17 times
Estimating the proportion of true null hypotheses when the statistics are discrete
Abstract Motivation: In high-dimensional testing problems, π0, the proportion of null hypotheses that are true, is an important parameter. For discrete test statistics, the P values come from a discrete distribution with finite support and the null distribution may depend on an ancillary statistic such as a table margin that varies among the test statistics. Methods for estimating π0 developed for continuous test statistics, which depend on a uniform or identical null distribution of P values, may not perform well when applied to discrete testing problems. Results: This article introduces a number of π0 estimators, the regression and ‘T’ methods, that perform well with discrete test statistics, and also assesses how well methods developed for or adapted from continuous tests perform with discrete tests. We demonstrate the usefulness of these estimators in the analysis of high-throughput biological RNA-seq and single-nucleotide polymorphism data. Availability and implementation: Implemented in R. Contact: nsa1@psu.edu or naomi@psu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
DOI: 10.1038/s41592-020-0943-4
2020
Cited 14 times
Uncertainty and the management of epidemics
“I have no idea what’s awaiting me, or what will happen when this all ends. For the moment I know this: there are sick people and they need curing.” ―Albert Camus, The Plague
DOI: 10.1093/g3journal/jkad120
2023
Transcriptomic approach to uncover dynamic events in the development of mid-season sunburn in apple fruit
Apples grown in high heat, high light, and low humidity environments are at risk for sun injury disorders like sunburn and associated crop losses. Understanding the physiological and molecular mechanisms underlying sunburn will support improvement of mitigation strategies and breeding for more resilient varieties. Numerous studies have highlighted key biochemical processes involved in sun injury, such as the phenylpropanoid and reactive oxygen species (ROS) pathways, demonstrating both enzyme activities and expression of related genes in response to sunburn conditions. Most previous studies have focused on at-harvest activity of a small number of genes in response to heat stress. Thus, it remains unclear how stress events earlier in the season affect physiology and gene expression. Here, we applied heat stress to mid-season apples in the field and collected tissue along a time course (24, 48, and 72 h following a heat stimulus) to investigate dynamic gene expression changes using a transcriptomic lens. We found a relatively small number of differentially expressed genes (DEGs) and enriched functional terms in response to heat treatments. Only a few of these belonged to pathways previously described to be involved in sunburn, such as the AsA-GSH pathway, while most DEGs had not yet been implicated in sunburn or heat stress in pome fruit.
DOI: 10.1038/s41592-024-02234-5
2024
Comparing classifier performance with baselines
DOI: 10.1094/phyto-83-716
1993
Cited 33 times
Barley Yellow Dwarf Virus Isolate-Specific Resistance in Spring Oats Reduced Virus Accumulation and Aphid Transmission
Resistance to barley yellow dwarf virus in a spring oat genotype was manifested as a reduction in accumulation of viral antigen in whole plants. The resistance was quantified for five isolates of BYDV from New York (MAV, PAV, SGV, RPV, and RMV) and found to be BYDV isolate specific. Similar levels of resistance were quantified for MAV, PAV, and SGV in which the reduction in viral antigen ranged from 58-63 percent, relative to levels in a susceptible genotype. RMV antigen was reduced up to 82 percent, but no resistance was expressed against the RPV isolate. Reduced viral antigen was correlated with reduced levels of transmission of MAV, PAV, SGV, and RMV, but not RPV. Further reductions in transmission efficiency were possible by limiting the acquisition access period [...]
DOI: 10.1175/mwr-d-10-05081.1
2011
Cited 18 times
Investigation of Ensemble Variance as a Measure of True Forecast Variance
Abstract The uncertainty in meteorological predictions is of interest for applications ranging from economic to recreational to public safety. One common method to estimate uncertainty is by using meteorological ensembles. These ensembles provide an easily quantifiable measure of the uncertainty in the forecast in the form of the ensemble variance. However, ensemble variance may not accurately reflect the actual uncertainty, so any measure of uncertainty derived from the ensemble should be calibrated to provide a more reliable estimate of the actual uncertainty in the forecast. A previous study introduced the linear variance calibration (LVC) as a simple method to determine the ensemble variance to error variance relationship and demonstrated this technique on real ensemble data. The LVC parameters (the slopes and y intercepts), however, are generally different from the ideal values. This current study uses a stochastic model to examine the LVC in a controlled setting. The stochastic model is capable of simulating underdispersive and overdispersive ensembles as well as perfectly reliable ensembles. Because the underlying relationship is specified, LVC results can be compared to theoretical values of the slope and y intercept. Results indicate that all types of ensembles produce calibration slopes that are smaller than their theoretical values for ensemble sizes less than several hundred members, with corresponding y intercepts greater than their theoretical values. This indicates that all ensembles, even otherwise perfect ensembles, should be calibrated if the ensemble size is less than several hundred. In addition, it is shown that an adjustment factor can be computed for inadequate ensemble size. This adjustment factor is independent of the stochastic model and is applicable to any linear regression of error variance on ensemble variance.
When applied to experiments using the stochastic model, the adjustment produces LVC parameters near their theoretical values for all ensemble sizes. Although the adjustment is unnecessary when applying LVC, it allows for a more accurate assessment of the reliability of ensembles, and a fair comparison of the reliability for differently sized ensembles.
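The effect described in this abstract can be illustrated with a small simulation. The sketch below is not the paper's stochastic model or its adjustment factor. It is a minimal illustration under stated assumptions: each case has a true variance drawn uniformly from [0.5, 2.0], the members and the truth are Gaussian draws with that variance (a perfectly reliable ensemble), and cases are grouped into equal-count bins by ensemble variance before a least-squares line is fitted. The function names `simulate_cases` and `lvc_fit` are hypothetical.

```python
import random


def simulate_cases(n_members, n_cases, rng):
    """Return a list of (ensemble variance, squared error of ensemble mean)."""
    cases = []
    for _ in range(n_cases):
        true_var = rng.uniform(0.5, 2.0)  # assumed range of true variances
        sd = true_var ** 0.5
        members = [rng.gauss(0.0, sd) for _ in range(n_members)]
        truth = rng.gauss(0.0, sd)  # truth behaves like one more member
        mean = sum(members) / n_members
        ens_var = sum((m - mean) ** 2 for m in members) / (n_members - 1)
        cases.append((ens_var, (mean - truth) ** 2))
    return cases


def lvc_fit(cases, n_bins=20):
    """Bin cases by ensemble variance, then regress the bin-mean squared
    error on the bin-mean ensemble variance. Returns (slope, intercept)."""
    ordered = sorted(cases)
    size = len(ordered) // n_bins
    xs, ys = [], []
    for b in range(n_bins):
        chunk = ordered[b * size:(b + 1) * size]
        xs.append(sum(c[0] for c in chunk) / len(chunk))
        ys.append(sum(c[1] for c in chunk) / len(chunk))
    mx = sum(xs) / n_bins
    my = sum(ys) / n_bins
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


if __name__ == "__main__":
    rng = random.Random(0)
    for n_members in (5, 200):
        slope, intercept = lvc_fit(simulate_cases(n_members, 2000, rng))
        print(n_members, round(slope, 2), round(intercept, 2))
```

In this setup the small ensemble should give a noticeably flatter fitted line than the large one, because the sample variance of a few members is a noisy estimate of the true variance. That is consistent with the abstract's statement that calibration slopes fall below their theoretical values for small ensembles.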
DOI: 10.1038/nmeth.2974
2014
Cited 15 times
Designing comparative experiments
Good experimental designs limit the impact of variability and reduce sample-size requirements.
DOI: 10.1038/s41592-022-01563-7
2022
Cited 6 times
Survival analysis—time-to-event data and censoring
DOI: 10.1038/s41592-019-0335-9
2019
Cited 12 times
Two-level factorial experiments
Simultaneous examination of multiple factors at two levels can reveal which have an effect.