
Fabian J. Theis

Here are all the papers by Fabian J. Theis that you can download and read on OA.mg.

DOI: 10.1038/s41587-020-0591-3
2020
Cited 1,534 times
Generalizing RNA velocity to transient cell states through dynamical modeling
RNA velocity has opened up new ways of studying cellular differentiation in single-cell RNA-sequencing data. It describes the rate of gene expression change for an individual gene at a given time point based on the ratio of its spliced and unspliced messenger RNA (mRNA). However, errors in velocity estimates arise if the central assumptions of a common splicing rate and the observation of the full splicing dynamics with steady-state mRNA levels are violated. Here we present scVelo, a method that overcomes these limitations by solving the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. This generalizes RNA velocity to systems with transient cell states, which are common in development and in response to perturbations. We apply scVelo to disentangling subpopulation kinetics in neurogenesis and pancreatic endocrinogenesis. We infer gene-specific rates of transcription, splicing and degradation, recover each cell's position in the underlying differentiation processes and detect putative driver genes. scVelo will facilitate the study of lineage decisions and gene regulation.
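As a rough orientation, the splicing kinetics that this abstract refers to are commonly written as a pair of rate equations. The symbols below ($u$, $s$, $\alpha$, $\beta$, $\gamma$, $v$) are notation assumed here for illustration; the abstract itself only names the rates of transcription, splicing and degradation.

```latex
% u(t): unspliced mRNA abundance, s(t): spliced mRNA abundance
% alpha: transcription rate, beta: splicing rate, gamma: degradation rate
\begin{align}
  \frac{du}{dt} &= \alpha(t) - \beta\, u(t), \\
  \frac{ds}{dt} &= \beta\, u(t) - \gamma\, s(t), \\
  v(t) &= \frac{ds}{dt} = \beta\, u(t) - \gamma\, s(t).
\end{align}
% At steady state (ds/dt = 0) the ratio u/s equals gamma/beta, which is the
% quantity a steady-state velocity estimate relies on.
```

Under this reading, a steady-state estimate fits the ratio $\gamma/\beta$ from cells assumed to be in equilibrium, whereas a dynamical model fits all three rates per gene without requiring that equilibrium to be observed.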
DOI: 10.15252/msb.20188746
2019
Cited 1,384 times
Current best practices in single‐cell RNA‐seq analysis: a tutorial
Single-cell RNA-seq has enabled gene expression to be studied at an unprecedented resolution. The promise of this technology is attracting a growing user base for single-cell analysis methods. As more analysis tools are becoming available, it is becoming increasingly difficult to navigate this landscape and produce an up-to-date workflow to analyse one's data. Here, we detail the steps of a typical single-cell RNA-seq analysis, including pre-processing (quality control, normalization, data correction, feature selection, and dimensionality reduction) and cell- and gene-level downstream analysis. We formulate current best-practice recommendations for these steps based on independent comparison studies. We have integrated these best-practice recommendations into a workflow, which we apply to a public dataset to further illustrate how these steps work in practice. Our documented case study can be found at https://www.github.com/theislab/single-cell-tutorial. This review will serve as a workflow tutorial for new entrants into the field, and help established users update their analysis pipelines.
DOI: 10.1038/ng.2982
2014
Cited 1,120 times
An atlas of genetic influences on human blood metabolites
Genome-wide association scans with high-throughput metabolic profiling provide unprecedented insights into how genetic variation influences metabolism and complex disease. Here we report the most comprehensive exploration of genetic loci influencing human metabolism thus far, comprising 7,824 adult individuals from 2 European population studies. We report genome-wide significant associations at 145 metabolic loci and their biochemical connectivity with more than 400 metabolites in human blood. We extensively characterize the resulting in vivo blueprint of metabolism in human blood by integrating it with information on gene expression, heritability and overlap with known loci for complex disorders, inborn errors of metabolism and pharmacological targets. We further developed a database and web-based resources for data mining and results visualization. Our findings provide new insights into the role of inherited variation in blood metabolic diversity and identify potential new opportunities for drug development and for understanding disease.
DOI: 10.1038/nbt.3102
2015
Cited 1,096 times
Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells
Recent technical developments have enabled the transcriptomes of hundreds of cells to be assayed in an unbiased manner, opening up the possibility that new subpopulations of cells can be found. However, the effects of potential confounding factors, such as the cell cycle, on the heterogeneity of gene expression and therefore on the ability to robustly identify subpopulations remain unclear. We present and validate a computational approach that uses latent variable models to account for such hidden factors. We show that our single-cell latent variable model (scLVM) allows the identification of otherwise undetectable subpopulations of cells that correspond to different stages during the differentiation of naive T cells into T helper 2 cells. Our approach can be used not only to identify cellular subpopulations but also to tease apart different sources of gene expression heterogeneity in single-cell transcriptomes.
DOI: 10.1038/nmeth.3971
2016
Cited 1,006 times
Diffusion pseudotime robustly reconstructs lineage branching
Diffusion pseudotime (DPT) enables robust and scalable inference of cellular trajectories, branching events, metastable states and underlying gene dynamics from snapshot single-cell gene expression data. The temporal order of differentiating cells is intrinsically encoded in their single-cell expression profiles. We describe an efficient way to robustly estimate this order according to diffusion pseudotime (DPT), which measures transitions between cells using diffusion-like random walks. Our DPT software implementations make it possible to reconstruct the developmental progression of cells and identify transient or metastable states, branching decisions and differentiation endpoints.
DOI: 10.1038/s41576-019-0122-6
2019
Cited 709 times
Deep learning: new computational modelling techniques for genomics
DOI: 10.1038/s41467-018-07931-2
2019
Cited 681 times
Single-cell RNA-seq denoising using a deep count autoencoder
Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at a cellular resolution. However, noise due to amplification and dropout may obstruct analyses, so scalable denoising methods for increasingly large but sparse scRNA-seq data are needed. We propose a deep count autoencoder network (DCA) to denoise scRNA-seq datasets. DCA takes the count distribution, overdispersion and sparsity of the data into account using a negative binomial noise model with or without zero-inflation, and nonlinear gene-gene dependencies are captured. Our method scales linearly with the number of cells and can, therefore, be applied to datasets of millions of cells. We demonstrate that DCA denoising improves a diverse set of typical scRNA-seq data analyses using simulated and real datasets. DCA outperforms existing methods for data imputation in quality and speed, enhancing biological discovery.
DOI: 10.1093/bioinformatics/btv325
2015
Cited 576 times
Diffusion maps for high-dimensional single-cell analysis of differentiation data
Motivation: Single-cell technologies have recently gained popularity in cellular differentiation studies regarding their ability to resolve potential heterogeneities in cell populations. Analyzing such high-dimensional single-cell data has its own statistical and computational challenges. Popular multivariate approaches are based on data normalization, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we would not expect clear clusters to be present but instead expect the cells to follow continuous branching lineages. Results: Here, we propose the use of diffusion maps to deal with the problem of defining differentiation trajectories. We adapt this method to single-cell data by adequate choice of kernel width and inclusion of uncertainties or missing measurement values, which enables the establishment of a pseudotemporal ordering of single cells in a high-dimensional gene expression space. We expect this output to reflect cell differentiation trajectories, where the data originates from intrinsic diffusion-like dynamics. Starting from a pluripotent stage, cells move smoothly within the transcriptional landscape towards more differentiated states with some stochasticity along their path. We demonstrate the robustness of our method with respect to extrinsic noise (e.g. measurement noise) and sampling density heterogeneities on simulated toy data as well as two single-cell quantitative polymerase chain reaction datasets (i.e. mouse haematopoietic stem cells and mouse embryonic stem cells) and an RNA-Seq dataset of human pre-implantation embryos. We show that diffusion maps perform considerably better than Principal Component Analysis and are advantageous over other techniques for non-linear dimension reduction such as t-distributed Stochastic Neighbour Embedding for preserving the global structures and pseudotemporal ordering of cells.
Availability and implementation: The Matlab implementation of diffusion maps for single-cell data is available at https://www.helmholtz-muenchen.de/icb/single-cell-diffusion-map. Contact: fbuettner.phys@gmail.com, fabian.theis@helmholtz-muenchen.de. Supplementary information: Supplementary data are available at Bioinformatics online.
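The core idea of a diffusion map (Gaussian kernel, row normalization into a Markov transition matrix, leading non-trivial eigenvector as a pseudotemporal coordinate) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: it omits the kernel-width selection and the handling of missing or uncertain values described above, uses plain power iteration instead of a full eigensolver, and the function name `diffusion_component`, the toy data and the value of `sigma` are all hypothetical.

```python
import math
import random

def diffusion_component(points, sigma, n_iter=500):
    """Return the first non-trivial diffusion-map coordinate of `points`.

    Builds a Gaussian kernel, row-normalises it into a Markov transition
    matrix T, and approximates the second right eigenvector of T by power
    iteration with the constant (trivial) eigenvector projected out.
    """
    n = len(points)
    K = [[math.exp(-math.dist(p, q) ** 2 / (2 * sigma ** 2)) for q in points]
         for p in points]
    degree = [sum(row) for row in K]
    T = [[K[i][j] / degree[i] for j in range(n)] for i in range(n)]
    total = sum(degree)
    pi = [d / total for d in degree]  # stationary distribution of T

    rng = random.Random(0)
    v = [rng.uniform(-1, 1) for _ in range(n)]
    for _ in range(n_iter):
        # Remove the component along the constant eigenvector, using the
        # pi-weighted inner product under which eigenvectors of T are orthogonal.
        mean = sum(p * x for p, x in zip(pi, v))
        v = [x - mean for x in v]
        # One diffusion step followed by renormalisation.
        v = [sum(T[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Toy data: 20 hypothetical "cells" along a straight trajectory, in shuffled order.
times = [k / 19 for k in range(20)]
random.Random(1).shuffle(times)
cells = [(t, 0.5 * t) for t in times]

coord = diffusion_component(cells, sigma=0.15)
order = sorted(range(len(cells)), key=lambda i: coord[i])
# Latent times read off in the order given by the diffusion coordinate.
pseudo_sorted_times = [times[i] for i in order]
```

Sorting the cells by `coord` recovers their order along the trajectory, up to an overall reversal, because the sign of an eigenvector is arbitrary. Real data would additionally need a principled kernel width and more than one diffusion component to represent branching.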
DOI: 10.1093/bioinformatics/btv715
2015
Cited 520 times
destiny: diffusion maps for large-scale single-cell data in R
Summary: Diffusion maps are a spectral method for non-linear dimension reduction and have recently been adapted for the visualization of single-cell expression data. Here we present destiny, an efficient R implementation of the diffusion map algorithm. Our package includes a single-cell specific noise model allowing for missing and censored values. In contrast to previous implementations, we further present an efficient nearest-neighbour approximation that allows for the processing of hundreds of thousands of cells and a functionality for projecting new data on existing diffusion maps. As an example, we apply destiny to a recent time-resolved mass cytometry dataset of cellular reprogramming. Availability and implementation: destiny is an open-source R/Bioconductor package (bioconductor.org/packages/destiny), also available at www.helmholtz-muenchen.de/icb/destiny. A detailed vignette describing functions and workflows is provided with the package. Contact: carsten.marr@helmholtz-muenchen.de or f.buettner@helmholtz-muenchen.de. Supplementary information: Supplementary data are available at Bioinformatics online.
DOI: 10.1038/s41592-021-01336-8
2021
Cited 470 times
Benchmarking atlas-level data integration in single-cell genomics
Single-cell atlases often include samples that span locations, laboratories and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. To guide integration method choice, we benchmarked 68 method and preprocessing combinations on 85 batches of gene expression, chromatin accessibility and simulation data from 23 publications, altogether representing >1.2 million cells distributed in 13 atlas-level integration tasks. We evaluated methods according to scalability, usability and their ability to remove batch effects while retaining biological variation using 14 evaluation metrics. We show that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, scANVI, Scanorama, scVI and scGen perform well, particularly on complex integration tasks, while single-cell ATAC-sequencing integration performance is strongly affected by choice of feature space. Our freely available Python module and benchmarking pipeline can identify optimal data integration methods for new data, benchmark new methods and improve method development.
DOI: 10.1126/science.aaq1723
2018
Cited 404 times
Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics
Flatworms of the species Schmidtea mediterranea are immortal: adult animals contain a large pool of pluripotent stem cells that continuously differentiate into all adult cell types. Therefore, single-cell transcriptome profiling of adult animals should reveal mature and progenitor cells. By combining perturbation experiments, gene expression analysis, a computational method that predicts future cell states from transcriptional changes, and a lineage reconstruction method, we placed all major cell types onto a single lineage tree that connects all cells to a single stem cell compartment. We characterized gene expression changes during differentiation and discovered cell types important for regeneration. Our results demonstrate the importance of single-cell transcriptome analysis for mapping and reconstructing fundamental processes of developmental and regenerative biology at high resolution.
DOI: 10.1038/s41586-021-03583-3
2021
Cited 395 times
Swarm Learning for decentralized and confidential clinical machine learning
Fast and reliable detection of patients with severe and heterogeneous illnesses is a major goal of precision medicine [1,2]. Patients with leukaemia can be identified using machine learning on the basis of their blood transcriptomes [3]. However, there is an increasing divide between what is technically possible and what is allowed, because of privacy legislation [4,5]. Here, to facilitate the integration of any medical data from any data owner worldwide without violating privacy laws, we introduce Swarm Learning, a decentralized machine-learning approach that unites edge computing, blockchain-based peer-to-peer networking and coordination while maintaining confidentiality without the need for a central coordinator, thereby going beyond federated learning. To illustrate the feasibility of using Swarm Learning to develop disease classifiers using distributed data, we chose four use cases of heterogeneous diseases (COVID-19, tuberculosis, leukaemia and lung pathologies). With more than 16,400 blood transcriptomes derived from 127 clinical studies with non-uniform distributions of cases and controls and substantial study biases, as well as more than 95,000 chest X-ray images, we show that Swarm Learning classifiers outperform those developed at individual sites. In addition, Swarm Learning completely fulfils local confidentiality regulations by design. We believe that this approach will notably accelerate the introduction of precision medicine.
DOI: 10.1016/j.stem.2015.04.004
2015
Cited 389 times
Combined Single-Cell Functional and Gene Expression Analysis Resolves Heterogeneity within Stem Cell Populations
Heterogeneity within the self-renewal durability of adult hematopoietic stem cells (HSCs) challenges our understanding of the molecular framework underlying HSC function. Gene expression studies have been hampered by the presence of multiple HSC subtypes and contaminating non-HSCs in bulk HSC populations. To gain deeper insight into the gene expression program of murine HSCs, we combined single-cell functional assays with flow cytometric index sorting and single-cell gene expression assays. Through bioinformatic integration of these datasets, we designed an unbiased sorting strategy that separates non-HSCs away from HSCs, and single-cell transplantation experiments using the enriched population were combined with RNA-seq data to identify key molecules that associate with long-term durable self-renewal, producing a single-cell molecular dataset that is linked to functional stem cell activity. Finally, we demonstrated the broader applicability of this approach for linking key molecules with defined cellular functions in another stem cell system.
DOI: 10.1371/journal.pcbi.1000385
2009
Cited 378 times
Hypergraphs and Cellular Networks
Author affiliations: (1) Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany; (2) Institute for Mathematical Optimization, Faculty of Mathematics, Otto-von-Guericke University Magdeburg, Magdeburg, Germany; (3) Institute for Bioinformatics and Systems Biology, Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany; (4) Max Planck Institute for Dynamics and Self-Organization, Göttingen, Germany.
DOI: 10.1038/nbt.3154
2015
Cited 366 times
Decoding the regulatory network of early blood development from single-cell gene expression measurements
An early stage in mouse blood development is reconstructed from gene expression data on thousands of single cells. Reconstruction of the molecular pathways controlling organ development has been hampered by a lack of methods to resolve embryonic progenitor cells. Here we describe a strategy to address this problem that combines gene expression profiling of large numbers of single cells with data analysis based on diffusion maps for dimensionality reduction and network synthesis from state transition graphs. Applying the approach to hematopoietic development in the mouse embryo, we map the progression of mesoderm toward blood using single-cell gene expression analysis of 3,934 cells with blood-forming potential captured at four time points between E7.0 and E8.5. Transitions between individual cellular states are then used as input to develop a single-cell network synthesis toolkit to generate a computationally executable transcriptional regulatory network model of blood development. Several model predictions concerning the roles of Sox and Hox factors are validated experimentally. Our results demonstrate that single-cell analysis of a developing organ coupled with computational approaches can reveal the transcriptional programs that underpin organogenesis.
DOI: 10.1038/nn.3371
2013
Cited 354 times
Live imaging of astrocyte responses to acute injury reveals selective juxtavascular proliferation
DOI: 10.1371/journal.pgen.1002215
2011
Cited 332 times
Discovery of Sexual Dimorphisms in Metabolic and Genetic Biomarkers
Metabolomic profiling and the integration of whole-genome genetic association data has proven to be a powerful tool to comprehensively explore gene regulatory networks and to investigate the effects of genetic variation at the molecular level. Serum metabolite concentrations allow a direct readout of biological processes, and association of specific metabolomic signatures with complex diseases such as Alzheimer's disease and cardiovascular and metabolic disorders has been shown. There are well-known correlations between sex and the incidence, prevalence, age of onset, symptoms, and severity of a disease, as well as the reaction to drugs. However, most of the studies published so far did not consider the role of sexual dimorphism and did not analyse their data stratified by gender. This study investigated sex-specific differences of serum metabolite concentrations and their underlying genetic determination. For discovery and replication we used more than 3,300 independent individuals from KORA F3 and F4 with metabolite measurements of 131 metabolites, including amino acids, phosphatidylcholines, sphingomyelins, acylcarnitines, and C6-sugars. A linear regression approach revealed significant concentration differences between males and females for 102 out of 131 metabolites ($p < 3.8 \times 10^{-4}$; Bonferroni-corrected threshold). Sex-specific genome-wide association studies (GWAS) showed genome-wide significant differences in beta-estimates for SNPs in the CPS1 locus (carbamoyl-phosphate synthase 1, significance level: $p < 3.8 \times 10^{-10}$; Bonferroni-corrected threshold) for glycine. We showed that the metabolite profiles of males and females are significantly different and, furthermore, that specific genetic variants in metabolism-related genes depict sexual dimorphism. Our study provides new important insights into sex-specific differences of cell regulatory processes and underscores that studies should consider sex-specific effects in design and interpretation.
DOI: 10.1038/s41592-021-01358-2
2022
Cited 330 times
Squidpy: a scalable framework for spatial omics analysis
Spatial omics data are advancing the study of tissue organization and cellular communication at an unprecedented scale. Flexible tools are required to store, integrate and visualize the large diversity of spatial omics data. Here, we present Squidpy, a Python framework that brings together tools from omics and image analysis to enable scalable description of spatial molecular data, such as transcriptome or multivariate proteins. Squidpy provides efficient infrastructure and numerous analysis methods that make it possible to efficiently store, manipulate and interactively visualize spatial omics data. Squidpy is extensible and can be interfaced with a variety of already existing libraries for the scalable analysis of spatial omics data.
DOI: 10.1109/tnn.2005.849840
2005
Cited 329 times
Sparse Component Analysis and Blind Source Separation of Underdetermined Mixtures
In this letter, we solve the problem of identifying matrices $S \in \mathbb{R}^{n \times N}$ and $A \in \mathbb{R}^{m \times n}$ knowing only their product $X = AS$, under some conditions, expressed either in terms of $A$ and sparsity of $S$ (identifiability conditions), or in terms of $X$ (sparse component analysis (SCA) conditions). We present algorithms for such identification and illustrate them by examples.
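To make the identification problem concrete, the sketch below covers only the simplest sparsity case, in which every sample has exactly one active source. Each observed column of X is then a scaled copy of one column of A, so the column directions of A can be read off by grouping the sample directions. This is an illustrative special case with hypothetical values and names (`true_angles`, `estimate_directions`), not the algorithm of the letter, whose conditions are more general.

```python
import math
import random

# Hypothetical 2 x 3 mixing matrix (m = 2 mixtures, n = 3 sources): more
# sources than mixtures, so A cannot simply be inverted.
true_angles = [0.3, 1.2, 2.4]  # column directions of A, in radians
A = [[math.cos(t) for t in true_angles],
     [math.sin(t) for t in true_angles]]

# Sources with exactly one non-zero entry per sample (assumed sparsity).
rng = random.Random(1)
N = 300
S = [[0.0] * N for _ in range(3)]
for k in range(N):
    S[rng.randrange(3)][k] = rng.choice([-1, 1]) * rng.uniform(0.5, 2.0)

# Observed mixtures X = A S.
X = [[sum(A[i][j] * S[j][k] for j in range(3)) for k in range(N)]
     for i in range(2)]

def estimate_directions(X, tol=1e-6):
    """Group sample directions (angles modulo pi) and return one angle per group."""
    angles = sorted(math.atan2(X[1][k], X[0][k]) % math.pi
                    for k in range(len(X[0])))
    clusters = [[angles[0]]]
    for ang in angles[1:]:
        if ang - clusters[-1][-1] > tol:
            clusters.append([ang])   # gap found: start a new direction
        else:
            clusters[-1].append(ang)
    return [sum(c) / len(c) for c in clusters]

estimated_angles = estimate_directions(X)
# Estimated mixing matrix, defined only up to column scaling and permutation.
A_hat = [[math.cos(t) for t in estimated_angles],
         [math.sin(t) for t in estimated_angles]]
```

Taking angles modulo pi removes the sign ambiguity of each source value. With noisy data or several simultaneously active sources, a proper clustering or subspace-detection step would replace the fixed-tolerance grouping used here.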
DOI: 10.1038/s41592-021-01346-6
2022
Cited 321 times
CellRank for directed single-cell fate mapping
Computational trajectory inference enables the reconstruction of cell state dynamics from single-cell RNA sequencing experiments. However, trajectory inference requires that the direction of a biological process is known, largely limiting its application to differentiating systems in normal development. Here, we present CellRank (https://cellrank.org) for single-cell fate mapping in diverse scenarios, including regeneration, reprogramming and disease, for which direction is unknown. Our approach combines the robustness of trajectory inference with directional information from RNA velocity, taking into account the gradual and stochastic nature of cellular fate decisions, as well as uncertainty in velocity vectors. On pancreas development data, CellRank automatically detects initial, intermediate and terminal populations, predicts fate potentials and visualizes continuous gene expression trends along individual lineages. Applied to lineage-traced cellular reprogramming data, predicted fate probabilities correctly recover reprogramming outcomes. CellRank also predicts a new dedifferentiation trajectory during postinjury lung regeneration, including previously unknown intermediate cell states, which we confirm experimentally.
DOI: 10.1038/s41592-019-0494-8
2019
Cited 307 times
scGen predicts single-cell perturbation responses
DOI: 10.1038/ng.3527
2016
Cited 292 times
Epigenetic germline inheritance of diet-induced obesity and insulin resistance
DOI: 10.1038/s41592-018-0254-1
2018
Cited 292 times
A test metric for assessing single-cell RNA-seq batch correction
Single-cell transcriptomics is a versatile tool for exploring heterogeneous cell populations, but as with all genomics experiments, batch effects can hamper data integration and interpretation. The success of batch-effect correction is often evaluated by visual inspection of low-dimensional embeddings, which are inherently imprecise. Here we present a user-friendly, robust and sensitive k-nearest-neighbor batch-effect test (kBET; https://github.com/theislab/kBET) for quantification of batch effects. We used kBET to assess commonly used batch-regression and normalization approaches, and to quantify the extent to which they remove batch effects while preserving biological variability. We also demonstrate the application of kBET to data from peripheral blood mononuclear cells (PBMCs) from healthy donors to distinguish cell-type-specific inter-individual variability from changes in relative proportions of cell populations. This has important implications for future data-integration efforts, central to projects such as the Human Cell Atlas.
DOI: 10.15252/msb.202110798
2022
Cited 292 times
Ultra‐high sensitivity mass spectrometry quantifies single‐cell proteome changes upon perturbation
Single-cell technologies are revolutionizing biology but are today mainly limited to imaging and deep sequencing. However, proteins are the main drivers of cellular function and in-depth characterization of individual cells by mass spectrometry (MS)-based proteomics would thus be highly valuable and complementary. Here, we develop a robust workflow combining miniaturized sample preparation, very low flow-rate chromatography, and a novel trapped ion mobility mass spectrometer, resulting in a more than 10-fold improved sensitivity. We precisely and robustly quantify proteomes and their changes in single, FACS-isolated cells. Arresting cells at defined stages of the cell cycle by drug treatment retrieves expected key regulators. Furthermore, it highlights potential novel ones and allows cell phase prediction. Comparing the variability in more than 430 single-cell proteomes to transcriptome data revealed a stable-core proteome despite perturbation, while the transcriptome appears stochastic. Our technology can readily be applied to ultra-high sensitivity analyses of tissue material, posttranslational modifications, and small molecule studies from small cell counts to gain unprecedented insights into cellular heterogeneity in health and disease.
DOI: 10.1371/journal.pone.0074335
2013
Cited 291 times
Lessons Learned from Quantitative Dynamical Modeling in Systems Biology
Due to the high complexity of biological data it is difficult to disentangle cellular processes relying only on intuitive interpretation of measurements. A Systems Biology approach that combines quantitative experimental data with dynamic mathematical modeling promises to yield deeper insights into these processes. Nevertheless, with growing complexity and increasing amount of quantitative experimental data, building realistic and reliable mathematical models can become a challenging task: the quality of experimental data has to be assessed objectively, unknown model parameters need to be estimated from the experimental data, and numerical calculations need to be precise and efficient. Here, we discuss, compare and characterize the performance of computational methods throughout the process of quantitative dynamic modeling using two previously established examples, for which quantitative, dose- and time-resolved experimental data are available. In particular, we present an approach that determines the quality of experimental data in an efficient, objective and automated manner. Using this approach data generated by different measurement techniques and even in single replicates can be reliably used for mathematical modeling. For the estimation of unknown model parameters, the performance of different optimization algorithms was compared systematically. Our results show that deterministic derivative-based optimization employing the sensitivity equations in combination with a multi-start strategy based on Latin hypercube sampling outperforms the other methods by orders of magnitude in accuracy and speed. Finally, we investigated transformations that yield a more efficient parameterization of the model and therefore lead to a further enhancement in optimization performance. We provide a freely available open source software package that implements the algorithms and examples compared here.
DOI: 10.1523/jneurosci.3030-10.2010
2010
Cited 285 times
MicroRNA Loss Enhances Learning and Memory in Mice
Dicer-dependent noncoding RNAs, including microRNAs (miRNAs), play an important role in a modulation of translation of mRNA transcripts necessary for differentiation in many cell types. In vivo experiments using cell type-specific Dicer1 gene inactivation in neurons showed its essential role for neuronal development and survival. However, little is known about the consequences of a loss of miRNAs in adult, fully differentiated neurons. To address this question, we used an inducible variant of the Cre recombinase (tamoxifen-inducible CreERT2) under control of Camk2a gene regulatory elements. After induction of Dicer1 gene deletion in adult mouse forebrain, we observed a progressive loss of a whole set of brain-specific miRNAs. Animals were tested in a battery of both aversively and appetitively motivated cognitive tasks, such as Morris water maze, IntelliCage system, or trace fear conditioning. Compatible with rather long half-life of miRNAs in hippocampal neurons, we observed an enhancement of memory strength of mutant mice 12 weeks after the Dicer1 gene mutation, before the onset of neurodegenerative process. In acute brain slices, immediately after high-frequency stimulation of the Schaffer collaterals, the efficacy at CA3-to-CA1 synapses was higher in mutant than in control mice, whereas long-term potentiation was comparable between genotypes. This phenotype was reflected at the subcellular and molecular level by the elongated filopodia-like shaped dendritic spines and an increased translation of synaptic plasticity-related proteins, such as BDNF and MMP-9 in mutant animals. The presented work shows miRNAs as key players in the learning and memory process of mammals.
DOI: 10.1186/1752-0509-5-21
2011
Cited 285 times
Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data
With the advent of high-throughput targeted metabolic profiling techniques, the question of how to interpret and analyze the resulting vast amount of data becomes more and more important. In this work we address the reconstruction of metabolic reactions from cross-sectional metabolomics data, that is without the requirement for time-resolved measurements or specific system perturbations. Previous studies in this area mainly focused on Pearson correlation coefficients, which however are generally incapable of distinguishing between direct and indirect metabolic interactions. In our new approach we propose the application of a Gaussian graphical model (GGM), an undirected probabilistic graphical model estimating the conditional dependence between variables. GGMs are based on partial correlation coefficients, that is pairwise Pearson correlation coefficients conditioned against the correlation with all other metabolites. We first demonstrate the general validity of the method and its advantages over regular correlation networks with computer-simulated reaction systems. Then we estimate a GGM on data from a large human population cohort, covering 1020 fasting blood serum samples with 151 quantified metabolites. The GGM is much sparser than the correlation network, shows a modular structure with respect to metabolite classes, and is stable to the choice of samples in the data set. On the example of human fatty acid metabolism, we demonstrate for the first time that high partial correlation coefficients generally correspond to known metabolic reactions. This feature is evaluated both manually by investigating specific pairs of high-scoring metabolites, and then systematically on a literature-curated model of fatty acid synthesis and degradation.
Our method detects many known reactions along with possibly novel pathway interactions, representing candidates for further experimental examination. In summary, we demonstrate strong signatures of intracellular pathways in blood serum data, and provide a valuable tool for the unbiased reconstruction of metabolic reactions from large-scale metabolomics data sets.
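The difference between Pearson and partial correlation described above can be illustrated on a three-variable toy chain. The sketch assumes a hypothetical linear pathway A -> B -> C with Gaussian noise and uses the first-order partial correlation formula, which for three variables coincides with conditioning on all other variables. It is not the authors' pipeline, and the variable names and noise levels are invented for the example.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on z (first-order formula)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical chain A -> B -> C: C depends on A only through B.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(5000)]
b = [ai + rng.gauss(0, 0.5) for ai in a]
c = [bi + rng.gauss(0, 0.5) for bi in b]

r_ac = pearson(a, c)                  # marginal correlation, driven by the indirect path
r_ac_given_b = partial_corr(a, c, b)  # indirect link: close to zero once B is known
r_ab_given_c = partial_corr(a, b, c)  # direct link: remains clearly non-zero
```

A and C are strongly correlated even though they never interact directly, while their partial correlation given B nearly vanishes. With many metabolites, the same quantities are usually obtained from the inverse of the covariance matrix rather than from pairwise formulas.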
DOI: 10.1096/fj.11-198093
2012
Cited 280 times
The dynamic range of the human metabolome revealed by challenges
Metabolic challenge protocols, such as the oral glucose tolerance test, can uncover early alterations in metabolism preceding chronic diseases. Nevertheless, most metabolomics data accessible today reflect the fasting state. To analyze the dynamics of the human metabolome in response to environmental stimuli, we submitted 15 young healthy male volunteers to a highly controlled 4 d challenge protocol, including 36 h fasting, oral glucose and lipid tests, liquid test meals, physical exercise, and cold stress. Blood, urine, exhaled air, and breath condensate samples were analyzed on up to 56 time points by MS- and NMR-based methods, yielding 275 metabolic traits with a focus on lipids and amino acids. Here, we show that physiological challenges increased interindividual variation even in phenotypically similar volunteers, revealing metabotypes not observable in baseline metabolite profiles; volunteer-specific metabolite concentrations were consistently reflected in various biofluids; and readouts from a systematic model of β-oxidation (e.g., acetylcarnitine/palmitylcarnitine ratio) showed significant and stronger associations with physiological parameters (e.g., fat mass) than absolute metabolite concentrations, indicating that systematic models may aid in understanding individual challenge responses. Due to the multitude of analytical methods, challenges and sample types, our freely available metabolomics data set provides a unique reference for future metabolomics studies and for verification of systems biology models. Citation: Krug, S., Kastenmüller, G., Stückler, F., Rist, M. J., Skurk, T., Sailer, M., Raffler, J., Römisch-Margl, W., Adamski, J., Prehn, C., Frank, T., Engel, K-H., Hofmann, T., Luy, B., Zimmermann, R., Moritz, F., Schmitt-Kopplin, P., Krumsiek, J., Kremer, W., Huber, F., Oeh, U., Theis, F. J., Szymczak, W., Hauner, H., Suhre, K., Daniel, H. The dynamic range of the human metabolome revealed by challenges. FASEB J. 26, 2607-2619 (2012). www.fasebj.org
DOI: 10.1038/ncb2709
2013
Cited 259 times
Characterization of transcriptional networks in blood stem and progenitor cells using high-throughput single-cell gene expression analysis
Cellular decision-making is mediated by a complex interplay of external stimuli with the intracellular environment, in particular transcription factor regulatory networks. Here we have determined the expression of a network of 18 key haematopoietic transcription factors in 597 single primary blood stem and progenitor cells isolated from mouse bone marrow. We demonstrate that different stem/progenitor populations are characterized by distinctive transcription factor expression states, and through comprehensive bioinformatic analysis reveal positively and negatively correlated transcription factor pairings, including previously unrecognized relationships between Gata2, Gfi1 and Gfi1b. Validation using transcriptional and transgenic assays confirmed direct regulatory interactions consistent with a regulatory triad in immature blood stem cells, where Gata2 may function to modulate cross-inhibition between Gfi1 and Gfi1b. Single-cell expression profiling therefore identifies network states and allows reconstruction of network hierarchies involved in controlling stem cell fate choices, and provides a blueprint for studying both normal development and human disease.
DOI: 10.1186/gb-2010-11-1-r6
2010
Cited 257 times
PhenomiR: a knowledgebase for microRNA expression in diseases and biological processes
In recent years, microRNAs have been shown to play important roles in physiological as well as malignant processes. The PhenomiR database http://mips.helmholtz-muenchen.de/phenomir provides data from 542 studies that investigate deregulation of microRNA expression in diseases and biological processes as a systematic, manually curated resource. Using the PhenomiR dataset, we could demonstrate that, depending on disease type, independent information from cell culture studies contrasts with conclusions drawn from patient studies.
DOI: 10.1038/ncomms10256
2016
Cited 249 times
Label-free cell cycle analysis for high-throughput imaging flow cytometry
Imaging flow cytometry combines the high-throughput capabilities of conventional flow cytometry with single-cell imaging. Here we demonstrate label-free prediction of DNA content and quantification of the mitotic cell cycle phases by applying supervised machine learning to morphological features extracted from brightfield and the typically ignored darkfield images of cells from an imaging flow cytometer. This method facilitates non-destructive monitoring of cells avoiding potentially confounding effects of fluorescent stains while maximizing available fluorescence channels. The method is effective in cell cycle analysis for mammalian cells, both fixed and live, and accurately assesses the impact of a cell cycle mitotic phase blocking agent. As the same method is effective in predicting the DNA content of fission yeast, it is likely to have a broad application to other cell types.
DOI: 10.5936/csbj.201301009
2013
Cited 240 times
STATISTICAL METHODS FOR THE ANALYSIS OF HIGH-THROUGHPUT METABOLOMICS DATA
Metabolomics is a relatively new high-throughput technology that aims at measuring all endogenous metabolites within a biological sample in an unbiased fashion. The resulting metabolic profiles may be regarded as functional signatures of the physiological state, and have been shown to comprise effects of genetic regulation as well as environmental factors. This potential to connect genotypic to phenotypic information promises new insights and biomarkers for different research fields, including biomedical and pharmaceutical research. In the statistical analysis of metabolomics data, many techniques from other omics fields can be reused. However recently, a number of tools specific for metabolomics data have been developed as well. The focus of this mini review will be on recent advancements in the analysis of metabolomics data especially by utilizing Gaussian graphical models and independent component analysis.
DOI: 10.1038/s41587-021-01206-w
2022
Cited 229 times
A Python library for probabilistic analysis of single-cell omics data
DOI: 10.1038/s41467-017-00623-3
2017
Cited 228 times
Reconstructing cell cycle and disease progression using deep learning
We show that deep convolutional neural networks combined with nonlinear dimension reduction enable reconstructing biological processes based on raw image data. We demonstrate this by reconstructing the cell cycle of Jurkat cells and disease progression in diabetic retinopathy. In further analysis of Jurkat cells, we detect and separate a subpopulation of dead cells in an unsupervised manner and, in classifying discrete cell cycle stages, we reach a sixfold reduction in error rate compared to a recent approach based on boosting on image features. In contrast to previous methods, deep learning based predictions are fast enough for on-the-fly analysis in an imaging flow cytometer. The interpretation of information-rich, high-throughput single-cell data is a challenge requiring sophisticated computational tools. Here the authors demonstrate a deep convolutional neural network that can classify cell cycle status on-the-fly.
DOI: 10.1038/ncomms14836
2017
Cited 223 times
A BaSiC tool for background and shading correction of optical microscopy images
Quantitative analysis of bioimaging data is often skewed by both shading in space and background variation in time. We introduce BaSiC, an image correction method based on low-rank and sparse decomposition which solves both issues. In comparison to existing shading correction tools, BaSiC achieves high accuracy with significantly fewer input images, works for diverse imaging conditions and is robust against artefacts. Moreover, it can correct temporal drift in time-lapse microscopy data and thus improve continuous single-cell quantification. BaSiC requires no manual parameter setting and is available as a Fiji/ImageJ plugin.
DOI: 10.1186/s12943-015-0358-5
2015
Cited 220 times
Next-generation sequencing reveals novel differentially regulated mRNAs, lncRNAs, miRNAs, sdRNAs and a piRNA in pancreatic cancer
Previous studies identified microRNAs (miRNAs) and messenger RNAs with significantly different expression between normal pancreas and pancreatic cancer (PDAC) tissues. Due to technological limitations of microarrays and real-time PCR systems these studies focused on a fixed set of targets. Expression of other RNA classes such as long intergenic non-coding RNAs or sno-derived RNAs has rarely been examined in pancreatic cancer. Here, we analysed the coding and non-coding transcriptome of six PDAC and five control tissues using next-generation sequencing. Besides the confirmation of several deregulated mRNAs and miRNAs, miRNAs without previous implication in PDAC were detected: miR-802, miR-2114 or miR-561. SnoRNA-derived RNAs (e.g. sno-HBII-296B) and piR-017061, a piwi-interacting RNA, were found to be differentially expressed between PDAC and control tissues. In silico target analysis of miR-802 revealed potential binding sites in the 3′ UTR of TCF4, encoding a transcription factor that controls Wnt signalling genes. Overexpression of miR-802 in MiaPaCa pancreatic cancer cells reduced TCF4 protein levels. Using Massive Analysis of cDNA Ends (MACE) we identified differential expression of 43 lincRNAs, long intergenic non-coding RNAs, e.g. LINC00261 and LINC00152 as well as several natural antisense transcripts like HNF1A-AS1 and AFAP1-AS1. Differential expression was confirmed by qPCR on the mRNA/miRNA/lincRNA level and by immunohistochemistry on the protein level. Here, we report a novel lncRNA, sncRNA and mRNA signature of PDAC. In silico prediction of ncRNA targets allowed for assigning potential functions to differentially regulated RNAs.
DOI: 10.1007/s11306-015-0829-0
2015
Cited 218 times
Gender-specific pathway differences in the human serum metabolome
The susceptibility for various diseases as well as the response to treatments differ considerably between men and women. As a basis for a gender-specific personalized healthcare, an extensive characterization of the molecular differences between the two genders is required. In the present study, we conducted a large-scale metabolomics analysis of 507 metabolic markers measured in serum of 1756 participants from the German KORA F4 study (903 females and 853 males). One-third of the metabolites show significant differences between males and females. A pathway analysis revealed strong differences in steroid metabolism, fatty acids and further lipids, a large fraction of amino acids, oxidative phosphorylation, purine metabolism and gamma-glutamyl dipeptides. We then extended this analysis by a network-based clustering approach. Metabolite interactions were estimated using Gaussian graphical models to get an unbiased, fully data-driven metabolic network representation. This approach is not limited to possibly arbitrary pathway boundaries and can even include poorly or uncharacterized metabolites. The network analysis revealed several strongly gender-regulated submodules across different pathways. Finally, a gender-stratified genome-wide association study was performed to determine whether the observed gender differences are caused by dimorphisms in the effects of genetic polymorphisms on the metabolome. With only a single genome-wide significant hit, our results suggest that this scenario is not the case. In summary, we report an extensive characterization and interpretation of gender-specific differences of the human serum metabolome, providing a broad basis for future analyses.
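The Gaussian graphical model idea mentioned in this abstract rests on partial correlations: two metabolites are linked only if they stay correlated after conditioning on the others. The following is a minimal sketch of that idea, not the study's pipeline. It assumes just three simulated variables (hypothetical names `m1`, `m2`, `m3`), for which the partial correlation reduces to the standard first-order formula; the data and the chain structure are invented for illustration.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical simulated chain m1 -> m2 -> m3 (not real metabolite data).
random.seed(0)
n = 2000
m1 = [random.gauss(0, 1) for _ in range(n)]
m2 = [v + random.gauss(0, 0.5) for v in m1]  # m2 depends on m1
m3 = [v + random.gauss(0, 0.5) for v in m2]  # m3 depends on m2 only

marginal = pearson(m1, m3)              # strong: the indirect link shows up
conditional = partial_corr(m1, m3, m2)  # near zero: no direct m1-m3 edge

print(f"corr(m1, m3)       = {marginal:.2f}")
print(f"pcorr(m1, m3 | m2) = {conditional:.2f}")
```

In this toy chain the marginal correlation between `m1` and `m3` is high, while the partial correlation given `m2` is close to zero, so a network built from partial correlations would keep the edges m1-m2 and m2-m3 and drop m1-m3.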
DOI: 10.1093/bioinformatics/btv405
2015
Cited 215 times
Data2Dynamics: a modeling environment tailored to parameter estimation in dynamical systems
Modeling of dynamical systems using ordinary differential equations is a popular approach in the field of systems biology. Two of the most critical steps in this approach are to construct dynamical models of biochemical reaction networks for large datasets and complex experimental conditions and to perform efficient and reliable parameter estimation for model fitting. We present a modeling environment for MATLAB that pioneers these challenges. The numerically expensive parts of the calculations such as the solving of the differential equations and of the associated sensitivity system are parallelized and automatically compiled into efficient C code. A variety of parameter estimation algorithms as well as frequentist and Bayesian methods for uncertainty analysis have been implemented and used on a range of applications that lead to publications. The Data2Dynamics modeling environment is MATLAB based, open source and freely available at http://www.data2dynamics.org. Contact: andreas.raue@fdm.uni-freiburg.de. Supplementary data are available at Bioinformatics online.
DOI: 10.1371/journal.pbio.1001300
2012
Cited 195 times
Social Transfer of Pathogenic Fungus Promotes Active Immunisation in Ant Colonies
Due to the omnipresent risk of epidemics, insect societies have evolved sophisticated disease defences at the individual and colony level. An intriguing yet little understood phenomenon is that social contact to pathogen-exposed individuals reduces susceptibility of previously naive nestmates to this pathogen. We tested whether such social immunisation in Lasius ants against the entomopathogenic fungus Metarhizium anisopliae is based on active upregulation of the immune system of nestmates following contact to an infectious individual or passive protection via transfer of immune effectors among group members--that is, active versus passive immunisation. We found no evidence for involvement of passive immunisation via transfer of antimicrobials among colony members. Instead, intensive allogrooming behaviour between naive and pathogen-exposed ants before fungal conidia firmly attached to their cuticle suggested passage of the pathogen from the exposed individuals to their nestmates. By tracing fluorescence-labelled conidia we indeed detected frequent pathogen transfer to the nestmates, where they caused low-level infections as revealed by growth of small numbers of fungal colony forming units from their dissected body content. These infections rarely led to death, but instead promoted an enhanced ability to inhibit fungal growth and an active upregulation of immune genes involved in antifungal defences (defensin and prophenoloxidase, PPO). Contrarily, there was no upregulation of the gene cathepsin L, which is associated with antibacterial and antiviral defences, and we found no increased antibacterial activity of nestmates of fungus-exposed ants. This indicates that social immunisation after fungal exposure is specific, similar to recent findings for individual-level immune priming in invertebrates. 
Epidemiological modeling further suggests that active social immunisation is adaptive, as it leads to faster elimination of the disease and lower death rates than passive immunisation. Interestingly, humans have also utilised the protective effect of low-level infections to fight smallpox by intentional transfer of low pathogen doses ("variolation" or "inoculation").
DOI: 10.1038/nature18320
2016
Cited 180 times
Early myeloid lineage choice is not initiated by random PU.1 to GATA1 protein ratios
Live imaging and single-cell analyses are used to show that decision-making by differentiating haematopoietic stem cells between the megakaryocytic–erythroid and granulocytic–monocytic lineages is not initiated by stochastic switching between the lineage-specific transcription factors PU.1 and GATA1, which challenges the previous model of early myeloid lineage choice. How lineage-inducing transcription factors promote cell fate in the haematopoietic lineage is debated. Population level analyses have uncovered the existence of positive feedback and mutual inhibition, which suggested that lineage choice is driven stochastically by fluctuating antagonistic transcription factors. Timm Schroeder and colleagues use live imaging and single-cell analyses to demonstrate that the decision made by haematopoietic stem cells to differentiate along the megakaryocytic–erythroid or granulocytic–monocytic lineage does not depend on such stochastic switching between lineage-specific transcription factors PU.1 and GATA1, a conclusion that challenges the previous model of early myeloid lineage choice. The mechanisms underlying haematopoietic lineage decisions remain disputed. Lineage-affiliated transcription factors with the capacity for lineage reprogramming, positive auto-regulation and mutual inhibition have been described as being expressed in uncommitted cell populations. This led to the assumption that lineage choice is cell-intrinsically initiated and determined by stochastic switches of randomly fluctuating cross-antagonistic transcription factors. However, this hypothesis was developed on the basis of RNA expression data from snapshot and/or population-averaged analyses. Alternative models of lineage choice therefore cannot be excluded. Here we use novel reporter mouse lines and live imaging for continuous single-cell long-term quantification of the transcription factors GATA1 and PU.1 (also known as SPI1). 
We analyse individual haematopoietic stem cells throughout differentiation into megakaryocytic–erythroid and granulocytic–monocytic lineages. The observed expression dynamics are incompatible with the assumption that stochastic switching between PU.1 and GATA1 precedes and initiates megakaryocytic–erythroid versus granulocytic–monocytic lineage decision-making. Rather, our findings suggest that these transcription factors are only executing and reinforcing lineage choice once made. These results challenge the current prevailing model of early myeloid lineage choice.
DOI: 10.1242/dev.170506
2019
Cited 179 times
Concepts and limitations for learning developmental trajectories from single cell genomics
Single cell genomics has become a popular approach to uncover the cellular heterogeneity of progenitor and terminally differentiated cell types with great precision. This approach can also delineate lineage hierarchies and identify molecular programmes of cell-fate acquisition and segregation. Nowadays, tens of thousands of cells are routinely sequenced in single cell-based methods and even more are expected to be analysed in the future. However, interpretation of the resulting data is challenging and requires computational models at multiple levels of abstraction. In contrast to other applications of single cell sequencing, where clustering approaches dominate, developmental systems are generally modelled using continuous structures, trajectories and trees. These trajectory models carry the promise of elucidating mechanisms of development, disease and stimulation response at very high molecular resolution. However, their reliable analysis and biological interpretation requires an understanding of their underlying assumptions and limitations. Here, we review the basic concepts of such computational approaches and discuss the characteristics of developmental processes that can be learnt from trajectory models.
DOI: 10.1126/scitranslmed.3008946
2014
Cited 175 times
Intraindividual genome expression analysis reveals a specific molecular signature of psoriasis and eczema
Signatures from patients with both psoriasis and eczema contribute to understanding disease pathogenesis and diagnosis.
DOI: 10.1038/nmeth.4182
2017
Cited 171 times
Prospective identification of hematopoietic lineage choice by deep learning
Differentiation alters molecular properties of stem and progenitor cells, leading to changes in their shape and movement characteristics. We present a deep neural network that prospectively predicts lineage choice in differentiating primary hematopoietic progenitors using image patches from brightfield microscopy and cellular movement. Surprisingly, lineage choice can be detected up to three generations before conventional molecular markers are observable. Our approach allows identification of cells with differentially expressed lineage-specifying genes without molecular labeling.
DOI: 10.1038/s41576-023-00586-w
2023
Cited 166 times
Best practices for single-cell analysis across modalities
Recent advances in single-cell technologies have enabled high-throughput molecular profiling of cells across modalities and locations. Single-cell transcriptomics data can now be complemented by chromatin accessibility, surface protein expression, adaptive immune receptor repertoire profiling and spatial information. The increasing availability of single-cell data across modalities has motivated the development of novel computational methods to help analysts derive biological insights. As the field grows, it becomes increasingly difficult to navigate the vast landscape of tools and analysis steps. Here, we summarize independent benchmarking studies of unimodal and multimodal single-cell analysis across modalities to suggest comprehensive best-practice workflows for the most common analysis steps. Where independent benchmarks are not available, we review and contrast popular methods. Our article serves as an entry point for novices in the field of single-cell (multi-)omic analysis and guides advanced users to the most recent best practices.
DOI: 10.15252/msb.20209923
2021
Cited 162 times
Integrated intra‐ and intercellular signaling knowledge for multicellular omics analysis
Article. 22 March 2021. Open Access. Transparent process.
Authors: Dénes Türei (1; orcid.org/0000-0002-7249-9379), Alberto Valdeolivas (1; orcid.org/0000-0001-5482-9023), Lejla Gul (2), Nicolàs Palacio-Escat (1,3,4; orcid.org/0000-0002-7022-1437), Michal Klein (5; orcid.org/0000-0002-2433-6380), Olga Ivanova (1; orcid.org/0000-0002-9111-4593), Márton Ölbei (2,6; orcid.org/0000-0002-4903-6237), Attila Gábor (1; orcid.org/0000-0002-0776-1182), Fabian Theis (5,7; orcid.org/0000-0002-2419-1943), Dezső Módos (2,6; orcid.org/0000-0001-9412-6867), Tamás Korcsmáros (2,6; orcid.org/0000-0003-1717-996X) and Julio Saez-Rodriguez (1,3; corresponding author; orcid.org/0000-0002-8552-8976).
Affiliations: (1) Faculty of Medicine and Heidelberg University Hospital, Institute of Computational Biomedicine, Heidelberg University, Heidelberg, Germany; (2) Earlham Institute, Norwich, UK; (3) Faculty of Medicine, Joint Research Centre for Computational Biomedicine (JRC-COMBINE), RWTH Aachen University, Aachen, Germany; (4) Faculty of Biosciences, Heidelberg University, Heidelberg, Germany; (5) Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany; (6) Quadram Institute Bioscience, Norwich, UK; (7) Department of Mathematics, Technical University of Munich, Garching, Germany.
Corresponding author. Tel: +49 6221 5451334.
Molecular Systems Biology (2021) 17:e9923. https://doi.org/10.15252/msb.20209923
Abstract
Molecular knowledge of biological processes is a cornerstone in omics data analysis. Applied to single-cell data, such analyses provide mechanistic insights into individual cells and their interactions. However, knowledge of intercellular communication is scarce, scattered across resources, and not linked to intracellular processes. To address this gap, we combined over 100 resources covering interactions and roles of proteins in inter- and intracellular signaling, as well as transcriptional and post-transcriptional regulation. We added protein complex information and annotations on function, localization, and role in diseases for each protein. The resource is available for human, and via homology translation for mouse and rat. 
The data are accessible via OmniPath’s web service (https://omnipathdb.org/), a Cytoscape plug-in, and packages in R/Bioconductor and Python, providing access options for computational and experimental scientists. We created workflows with tutorials to facilitate the analysis of cell–cell interactions and affected downstream intracellular signaling processes. OmniPath provides a single access point to knowledge spanning intra- and intercellular processes for data analysis, as we demonstrate in applications studying SARS-CoV-2 infection and ulcerative colitis.
Synopsis
Over 100 resources are integrated into OmniPath, a comprehensive knowledge base of intra- and inter-cellular signaling. Workflows are provided and illustrated in case studies analyzing omics data in SARS-CoV-2 infection and ulcerative colitis. OmniPath includes 4,000,000 annotations for over 20,000 proteins. A new framework defining transmitter and receiver roles generalizes the concepts of ligand and receptor. Integrated analysis of intra- and intercellular signaling can be performed to study how cells affect each other in healthy and diseased conditions. Software tools and workflows in R and Python facilitate the analysis of bulk and single-cell omics data using tools such as CellPhoneDB, NicheNet and CARNIVAL.
Introduction
Cells process information by physical interactions of molecules. These interactions are organized into an ensemble of signaling pathways that are often represented as a network. This network determines the response of the cell under different physiological and disease conditions. In multicellular organisms, the behavior of each cell is regulated by higher levels of organization: the tissue and, ultimately, the organism. In tissues, multiple cells communicate to coordinate their behavior to maintain homeostasis. For example, cells may produce and sense extracellular matrix (ECM), and release enzymes acting on the ECM as well as ligands. 
These ligands are detected by receptors in the same or different cells, that in turn trigger intracellular pathways that control other processes, including the production of ligands and the physical binding to other cells. The totality of these processes mediates the intercellular communication in tissues. Thus, to understand physiological and pathological processes at the tissue level, we need to consider both the signaling pathways within each cell type as well as the communication between them. Since the end of the nineties, databases have been collecting information about signaling pathways (Xenarios et al, 2000). These databases provide a unified source of information in formats that users can browse, retrieve, and process. Signaling databases have become essential tools in systems biology and to analyze omics data. A few resources provide ligand–receptor interactions (Kirouac et al, 2010; Fazekas et al, 2013; Ramilowski et al, 2015; Armstrong et al, 2019; Efremova et al, 2020). However, their coverage is limited, they do not include some key players of intercellular communication such as matrix proteins or extracellular enzymes, and they are not integrated with intracellular processes. This is increasingly important as new techniques allow us to measure data from single cells, enabling the analysis of inter- and intracellular signaling. For example, the recent CellPhoneDB (Efremova et al, 2020) and ICELLNET (Noël et al, 2021) tools provide computational methods to prioritize the most likely intercellular connections from single-cell transcriptomics data, and NicheNet (Browaeys et al, 2019) expands this to intracellular gene regulation. A comprehensive resource of inter- and intracellular signaling knowledge would enhance and expedite these analyses. 
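The notion that ligands released by one cell are detected by receptors on the same or a different cell can be sketched in a few lines. This is a toy illustration only: the ligand–receptor pairs, gene names and cell types below are hypothetical, and the matching rule (both partners simply present in an expressed-gene set) is an assumption for illustration, not the scoring used by CellPhoneDB, ICELLNET, NicheNet or OmniPath.

```python
# Hypothetical ligand-receptor pairs; names are placeholders, not real genes.
LIGAND_RECEPTOR_PAIRS = [("LIG_A", "REC_A"), ("LIG_B", "REC_B")]

# Hypothetical sets of expressed genes per cell type.
EXPRESSED = {
    "fibroblast": {"LIG_A", "REC_B"},
    "t_cell": {"REC_A", "LIG_B", "REC_B"},
}


def candidate_links(pairs, expressed):
    """List sender -> receiver links where the sender expresses the ligand
    and the receiver expresses the matching receptor."""
    links = []
    for sender, sender_genes in expressed.items():
        for receiver, receiver_genes in expressed.items():
            for ligand, receptor in pairs:
                if ligand in sender_genes and receptor in receiver_genes:
                    # Same cell type on both ends counts as autocrine here.
                    kind = "autocrine" if sender == receiver else "paracrine"
                    links.append((sender, receiver, ligand, receptor, kind))
    return links


links = candidate_links(LIGAND_RECEPTOR_PAIRS, EXPRESSED)
for link in links:
    print(link)
```

A real analysis would replace the presence/absence sets with expression levels and a statistical ranking, and would draw the pairs from a curated resource rather than a hand-written list.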
To effectively study multicellular communication, a resource should (i) classify proteins by their roles in intercellular communication, (ii) connect them by interactions from the widest possible range of resources, and (iii) integrate all this information in a transparent and customizable way, where the users can select the resources to evaluate their quality and features, and adapt them to their context and analyses. Prompted by the lack of comprehensive efforts addressing principle (i), we built a database on top of OmniPath (Türei et al, 2016), a resource which has already shown the benefits of principles (ii) and (iii). The first version of OmniPath focused on literature curated intracellular signaling pathways. It has been used in many computational projects and omics studies. For example, to model cell senescence from phosphoarray data (An et al, 2020), or as part of a computational pipeline to predict the effect of microbial proteins on human genes (Andrighetti et al, 2020), and a community effort to integrate knowledge about the COVID-19 disease mechanism (Ostaszewski et al, 2020). The new OmniPath extends its scope to intercellular communication and its integration with intracellular signaling, providing prior knowledge for modeling and analysis methods. It combines 103 resources to build an integrated database of molecular interactions, enzyme-PTM (post-translational modification) relationships, protein complexes and annotations about intercellular communication, and other functional attributes of proteins. We demonstrate with two case studies that we provide a versatile resource for the analysis of single-cell and bulk omics data. Leveraging the intercellular communication knowledge in OmniPath, we present two examples where autocrine and paracrine signaling are key parts of pathomechanism. 
First, we studied the potential influence of ligands secreted in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection on the inflammatory response through autocrine signaling. We identified signaling mechanisms that may lead to the dysregulated inflammatory and immune response shown in severe cases. Second, we examined the rewiring of cellular communication in ulcerative colitis (UC) based on single-cell data from the colon. By analyzing downstream signaling from the intercellular interactions, we found pathways associated with the regulatory T cells targeted by myofibroblasts in UC.
Results
We used four major types of resources: (i) molecular interactions, (ii) enzyme-PTM relationships, (iii) protein complexes, and (iv) molecule annotations about function, localization, and other attributes (Fig 1A). The pypath Python package combined the resources from those four types to build four corresponding integrated databases. Using the annotations, pypath compiled a fifth database about the roles in intercellular communication (intercell; Fig 1B). The ensemble of these five databases is what we call OmniPath, combining data from 103 resources (Fig 1A and Dataset EV1).
Figure 1. The composition and workflow of OmniPath. Database contents with the respective number of resources in parentheses. Workflow and design: OmniPath is based on four major types of resources, and the pypath Python package combines the resources to build five databases. The databases are available through the database builder software pypath, the web resource at https://omnipathdb.org/, the R package OmnipathR, the Python client omnipath, the Cytoscape plug-in and can be exported to formats such as Biological Expression Language (BEL).
A focus on intercellular signaling
To create a database of intercellular communication, we defined the roles that proteins play in this process. Ligands and receptors are main players of intercellular communication.
Many other kinds of molecules have a great impact on the behavior of the cells, such as matrix proteins and transporters (Fig 2A). We defined eight major (Fig 2) and 17 minor generic functional categories of intercellular signaling roles (Datasets EV6 and EV10). We also defined ten locational categories (e.g., plasma membrane peripheral), using in addition structural resources and prediction methods to annotate the transmembrane, secreted and peripheral membrane proteins. Furthermore, we provide 994 specific categories (e.g., neurotrophin receptors). Each generic category can be accessed by resource (e.g., ligands from HGNC) or as the combination of all contributing resources (Fig EV4). To provide highly curated annotations, we checked every entry in each category manually against UniProt datasheets to exclude wrong annotations. Overall we defined 1,170 categories and provided 54,330 functional annotations about intercellular communication roles of 5,781 proteins. Figure 2. The composition and representation of the intercellular signaling network We assigned intercellular communication roles to proteins based on evidence from multiple resources. In all panels: —transmitter; —receiver. Schematic illustration of the intercellular communication roles and their possible connections. Cells are physically connected by proteins forming tight junctions (1), gap junctions (2), and other adhesion proteins (3); they release vesicles which can be taken up by other cells (4); some receptors form complexes (5) to detect secreted ligands (6); transporters might also be affected by factors released by other cells (8); enzymes released into the extracellular space act on ligands and the extracellular matrix (7, 9); cells release the components of the extracellular matrix and bind to the matrix by adhesion proteins (10). The main intercellular communication roles (x axis) and the major contributing resources (y axis). 
Size of the dots represents the number of proteins annotated to have a certain role in a given resource. The darker areas represent the overlaps (proteins annotated in more than one resource for the same role) while the lighter color denotes those unique to that resource. The intercellular communication network. The circle segments represent the eight main intercellular communication roles. The edges are proportional to the number of interactions in the OmniPath PPI network connecting proteins of one role to the other. Number of unique, directed transmitter–receiver (e.g., ligand–receptor) connections by resources. Bars on the right show the coverage of each resource on a textbook dataset of 131 well-known ligand–receptor interactions. Figure EV4. Example of the intercell query in the OmniPath web service Each category has a parent category and a database of origin. The scope of a category is either “generic” (e.g., ligand) or “specific” (e.g., interleukin). The aspect is either “locational” or “functional”. Further attributes show whether the protein is a signal transmitter or a receiver, and whether it is secreted, or a transmembrane or peripheral protein of the plasma membrane. We collected the proteins for each intercellular communication functional category using data from 27 resources (Fig 2B, Dataset EV6). Combining them with molecular interaction networks from 48 resources (Dataset EV2), we created a corpus of putative intercellular communication pathways (Fig 2C). To have a high coverage on the intercellular molecular interactions, we also included ten resources focusing on ligand–receptor interactions (Figs 3 and EV1). Figure 3. Quantitative description of the network, complex, and enzyme–PTM databases A–C. Networks by interaction types and the network datasets within the PPI network. (A) Number of nodes and interactions.
The light dots represent the shared nodes and edges (in more than one resource), while the dark ones show their total numbers. (B) Causality: number of connections by direction and effect sign. (C) Coverage of the networks on various groups of proteins. Dots show the percentage of proteins covered by network resources for the following groups: cancer driver genes from COSMIC and IntOGen, kinases from kinase.com, phosphatases from Phosphatome.net, receptors from the Human Plasma Membrane Receptome (HPMR) and transcription factors from the TF census. Gray bars show the number of proteins in the networks. The information for individual resources is in Figs EV1 and EV2, Appendix Fig S1. D–G. On each panel, the bottom rows represent the combined complex and enzyme–PTM databases contained in OmniPath (D, E). Number of complexes (D) and enzyme–PTM (E) interactions by resource. (F) Enzyme–PTM relationships by PTM type. (G) Enzyme–PTM interactions by their target. Light, medium, and dark dots represent the number of enzyme–PTM relationships targeting the enzyme itself, another protein within the same molecular complex or an independent protein, respectively. Figure EV1. Quantitative description of the PPI network by resource Number of nodes and interactions. The light dots represent the shared nodes and edges (in more than one resource), while the dark ones show their total numbers. Causality: number of connections by direction and effect sign. Coverage of the networks on various groups of proteins. Dots show the percentage of proteins covered by network resources for the following groups: cancer driver genes from COSMIC and IntOGen, kinases from kinase.com, phosphatases from Phosphatome.net, receptors from the Human Plasma Membrane Receptome (HPMR) and transcription factors from the TF census. Gray bars show the number of proteins in the networks.
Many of the proteins in intercellular communication work as parts of complexes. We therefore built a comprehensive database of protein complexes and inferred their intercellular communication roles: a complex belongs to a category if and only if all members of the complex belong to it. We obtained 14,348 unique, directed transmitter–receiver (e.g., ligand–receptor) connections, around seven times more than the largest of the resources providing this kind of data. We also mapped a textbook table (Cameron & Kelvin, 2013) of 131 cytokine–receptor interactions to the ligand–receptor resources. As the textbook contains well-known interactions, many of the resources cover more than 90% of them (Fig 2D). This large coverage is achieved by not only integrating ten ligand–receptor resources, but also complementing these with data from annotation and interaction resources. An essential feature of this novel resource is that it combines knowledge about intercellular and intracellular signaling (Table 1). Thus, using OmniPath one can, for example, easily analyze the intracellular pathways triggered by a given ligand or check the transcription factors (TFs) and microRNAs (miRNAs) regulating the expression of such ligands. Table 1. Qualitative comparison of ligand–receptor and integrative databases.
| Resource | Interactions | Directed interactions | Signs (positive/negative) | Transcriptional regulation | Intracellular pathways | Intercellular communication roles | Protein complexes | Integrative resource | Literature curated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baccin2019 (e) | yes | yes (a) | no | no | no | yes (f) | yes | yes | yes (g) |
| CellCellInteractions | yes | yes (a) | no | no | no | yes (l) | no | yes | no |
| CellPhoneDB | yes | yes (a) | no | no | no | yes (d) | yes | yes | yes |
| ConsensusPathDB | yes | no | no | yes | yes | no | no | yes | yes (g) |
| EMBRACE (e) | yes | yes (a) | no | no | no | yes | no | yes (k) | yes (g) |
| HPMR | yes | yes (a) | no | no | no | yes | no | no | yes |
| ICELLNET | yes | yes (a) | no | no | no | yes | yes | no | yes |
| iTALK (h) | yes | yes (a) | no | no | no | yes | no | yes | yes (g) |
| Kirouac2010 | yes | yes (a) | no | no | no | yes | no | no | yes |
| LRdb | yes | yes (a) | no | no | no | yes | no | yes | yes (g) |
| PathwayCommons | yes | yes (m) | no | yes | yes | no | yes | yes | yes (g) |
| Ramilowski2015 | yes | yes (a) | no | no | no | yes | no | yes | yes (g) |
| SignaLink | yes | yes | yes | yes (i) | yes | yes | no | yes (j) | yes (g) |
| OmniPath | yes | yes (b) | yes | yes | yes | yes (c) | yes | yes | yes (g) |
OmniPath combines resources to build a network with directions and effect signs, including intra- and intercellular signaling, transcriptional regulation, and annotates proteins as ligands or receptors. Here, we show which of these features are covered by other databases: those specialized in ligand–receptor interactions and two large integrative network databases (ConsensusPathDB and Pathway Commons).
(a) Implicit: if we assume always the ligand affects the receptor
(b) As in some of the constituent resources the directions are implicit, certain directions in the combined network are implicit
(c) Provides not only ligand and receptor annotation but further categories, for example adhesion, transporter, ECM, etc
(d) Apart from secreted (mostly ligand) and receptor provides a few further categories: integrin, collagen, transmembrane, peripheral, etc
(e) Data are for mouse, homology translation is necessary to derive human data
(f) For ligands, provides certain classification, e.g., cytokine, ECM, secreted, etc
(g) Only in part is literature curated
(h) Ligand–receptor interactions are classified as growth factor, cytokine, checkpoint, or other
(i) Contains transcriptional regulation but that part is not integrated by OmniPath
(j) OmniPath only integrates its original literature curation, not the secondary resources
(k) Only builds on Ramilowski et al
(l) Besides ligand and receptor only ECM
(m) Directionality information might be extracted from BioPAX.
OmniPath: an ensemble of five databases
The abovementioned intercellular database exists in OmniPath together with four further databases (Fig 1B), supporting an integrated analysis of inter- and intracellular signaling.
The network of molecular interactions
The network database part covers four major domains of molecular signaling: (i) protein–protein interactions (PPI), (ii) transcriptional regulation of protein-coding genes, (iii) miRNA–mRNA interactions, and (iv) transcriptional regulation of miRNA genes (TF-miRNA). We further differentiated the PPI data into four subsets based on the interaction mechanisms and the types of supporting evidence: (i) literature curated activity flow (directed and signed; corresponds to the original release of OmniPath; Türei et al, 2016), (ii) activity flow with no literature references, (iii) enzyme–PTM, and (iv) ligand–receptor interactions (Fig 3A–C).
Interaction data are extensively used for a variety of purposes: for building mechanistic models, deriving pathway and TF activities from transcriptomics data and graph-based analysis methods. In total, the resource contained 103,396 PPI interactions between 12,469 proteins from 38 original resources (Dataset EV2). The large number of unique interactions added by each resource underscores the importance of their integration (Figs EV1 and EV2, Appendix Fig S1). The interactions with effect signs, essential for mechanistic modeling, are provided by the activity flow resources (Appendix 1; Fig 3B). The combined PPI network covered 53% of the human proteome (SwissProt), with an enrichment of kinases and cancer driver genes (Fig 3C). The transcriptional regulation data in OmniPath were obtained from DoRothEA (Garcia-Alonso et al, 2019), a comprehensive resource of TF regulons integrating data from 18 sources. In addition, six literature curated resources were directly integrated into OmniPath (Dataset EV8). The miRNA–mRNA and TF–miRNA interactions were integrated from five and two literature curated resources, with 6,213 and 1,803 interactions, respectively. Combining multiple resources not only increases the coverage, but also improves quality. It makes it possible to select higher confidence records based on the number of resources and references. Cross-checking the interaction directions and effect signs between resources reveals contradictory information, which is either a sign of mistakes or reflects limitations of our data representation (Appendix 1; Appendix Fig S4). Overall, we included 61 network resources in OmniPath (Dataset EV2). Furthermore, pypath provides access to additional resources, including the Human Reference Interactome (Luck et al, 2020), ConsensusPathDB (Kamburov et al, 2013), Reactome (Jassal et al, 2020), ACSN (Kuperstein et al, 2015), and WikiPathways (Slenter et al, 2018).
Figure EV2. Quantitative description of the transcriptional network by resource A–C. Panels and notations are the same as on Fig EV1.
Enzyme-PTM relationships
In enzyme–PTM relationships, enzymes (e.g., kinases) alter specific residues of their substrates, producing so-called post-translational modifications (PTM). Enzyme–PTM relationships are essential for deriving networks from phosphoproteomics data or estimating kinase activities. We combined 11 resources of enzyme–PTM relationships mostly covering phosphorylation (94% of all) and dephosphorylations (3%) (Fig 3F). Overall, we included 39,201 enzyme–PTM relationships, 1,821 enzymes targeting 16,467 PTM sites (Fig 3E–G). Besides phosphorylation and dephosphorylation, only proteolytic cleavage and acetylation account for more than one hundred interactions. Most of the databases curated only phosphorylation, and DEPOD (Damle & Köhn, 2019) exclusively dephosphorylation. Only SIGNOR (Licata et al, 2020) and HPRD (Keshava Prasad et al, 2009) contained a large number of other modifications (Fig 3F). 60% of the interactions were described by only one resource, and 92% of them by only one literature reference (Fig 3E). Self-modifications (e.g., autophosphorylation) and modifications between members of the same complex comprised 4 and 18% of the interactions, respectively (Fig 3G).
Protein complexes
Many proteins operate in complexes, for example, receptors often detect ligands in complexes. To facilitate analyses taking into consideration complexes, we added to OmniPath a comprehensive collection of 22,005 protein complexes described by 12 resources from 4,077 articles (Fig 3D). A complex is defined by its combination of unique mem
DOI: 10.1016/j.stem.2015.05.014
2015
Cited 161 times
Transcriptional Mechanisms of Proneural Factors and REST in Regulating Neuronal Reprogramming of Astrocytes
Direct lineage reprogramming induces dramatic shifts in cellular identity, employing poorly understood mechanisms. Recently, we demonstrated that expression of Neurog2 or Ascl1 in postnatal mouse astrocytes generates glutamatergic or GABAergic neurons. Here, we take advantage of this model to study dynamics of neuronal cell fate acquisition at the transcriptional level. We found that Neurog2 and Ascl1 rapidly elicited distinct neurogenic programs with only a small subset of shared target genes. Within this subset, only NeuroD4 could by itself induce neuronal reprogramming in both mouse and human astrocytes, while co-expression with Insm1 was required for glutamatergic maturation. Cultured astrocytes gradually became refractory to reprogramming, in part by the repressor REST preventing Neurog2 from binding to the NeuroD4 promoter. Notably, in astrocytes refractory to Neurog2 activation, the underlying neurogenic program remained amenable to reprogramming by exogenous NeuroD4. Our findings support a model of temporal hierarchy for cell fate change during neuronal reprogramming.
DOI: 10.1038/nn.3963
2015
Cited 161 times
Fast clonal expansion and limited neural stem cell self-renewal in the adult subependymal zone
DOI: 10.1242/dev.173849
2019
Cited 159 times
Massive single-cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis
Deciphering mechanisms of endocrine cell induction, specification and lineage allocation in vivo will provide valuable insights into how the islets of Langerhans are generated. Currently, it is ill defined how endocrine progenitors segregate into different endocrine subtypes during development. Here, we generated a novel Neurogenin3 (Ngn3)-Venus fusion (NVF) reporter mouse line, that closely mirrors the transient endogenous Ngn3 protein expression. To define an in vivo roadmap of endocrinogenesis, we performed single-cell RNA-sequencing of 36,351 pancreatic epithelial and NVF+ cells during secondary transition. This allowed to time-resolve and distinguish Ngn3low endocrine progenitors, Ngn3high endocrine precursors, Fev+ endocrine lineage and hormone+ endocrine subtypes and delineate molecular programs during the stepwise lineage restriction steps. Strikingly, we identified 58 novel signature genes that show the same transient expression dynamics as Ngn3 in the 7,260 profiled Ngn3-expressing cells. The differential expression of these genes in endocrine precursors associated with their cell-fate allocation towards distinct endocrine cell types. Thus, the generation of an accurately regulated NVF reporter allowed us to temporally resolve endocrine lineage development to provide a fine-grained single-cell molecular profile of endocrinogenesis in vivo.
DOI: 10.1016/j.celrep.2014.12.032
2015
Cited 158 times
Stem-Cell-like Properties and Epithelial Plasticity Arise as Stable Traits after Transient Twist1 Activation
Master regulators of the epithelial-mesenchymal transition such as Twist1 and Snail1 have been implicated in invasiveness and the generation of cancer stem cells, but their persistent activity inhibits stem-cell-like properties and the outgrowth of disseminated cancer cells into macroscopic metastases. Here, we show that Twist1 activation primes a subset of mammary epithelial cells for stem-cell-like properties, which only emerge and stably persist following Twist1 deactivation. Consequently, when cells undergo a mesenchymal-epithelial transition (MET), they do not return to their original epithelial cell state, evidenced by acquisition of invasive growth behavior and a distinct gene expression profile. These data provide an explanation for how transient Twist1 activation may promote all steps of the metastatic cascade; i.e., invasion, dissemination, and metastatic outgrowth at distant sites.
DOI: 10.1038/nbt.3626
2016
Cited 156 times
Software tools for single-cell tracking and quantification of cellular and molecular properties
DOI: 10.1126/science.aaa2729
2015
Cited 154 times
Live imaging of adult neural stem cell behavior in the intact and injured zebrafish brain
Adult neural stem cells are the source for restoring injured brain tissue. We used repetitive imaging to follow single stem cells in the intact and injured adult zebrafish telencephalon in vivo and found that neurons are generated by both direct conversions of stem cells into postmitotic neurons and via intermediate progenitors amplifying the neuronal output. We observed an imbalance of direct conversion consuming the stem cells and asymmetric and symmetric self-renewing divisions, leading to depletion of stem cells over time. After brain injury, neuronal progenitors are recruited to the injury site. These progenitors are generated by symmetric divisions that deplete the pool of stem cells, a mode of neurogenesis absent in the intact telencephalon. Our analysis revealed changes in the behavior of stem cells underlying generation of additional neurons during regeneration.
DOI: 10.1038/s41587-021-01182-1
2022
Cited 153 times
Spatial components of molecular tissue biology
Methods for profiling RNA and protein expression in a spatially resolved manner are rapidly evolving, making it possible to comprehensively characterize cells and tissues in health and disease. To maximize the biological insights obtained using these techniques, it is critical to both clearly articulate the key biological questions in spatial analysis of tissues and develop the requisite computational tools to address them. Developers of analytical tools need to decide on the intrinsic molecular features of each cell that need to be considered, and how cell shape and morphological features are incorporated into the analysis. Also, optimal ways to compare different tissue samples at various length scales are still being sought. Grouping these biological problems and related computational algorithms into classes across length scales, thus characterizing common issues that need to be addressed, will facilitate further progress in spatial transcriptomics and proteomics.
DOI: 10.15252/msb.202110282
2021
Cited 139 times
RNA velocity—current challenges and future perspectives
Review, 26 August 2021, Open Access
Volker Bergen (1,2), Ruslan A Soldatov (3), Peter V Kharchenko (3) and Fabian J Theis (*,1,2; orcid.org/0000-0002-2419-1943)
1 Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany; 2 Department of Mathematics, Technical University of Munich, Munich, Germany; 3 Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. *Corresponding author. Tel: +49 89 3187 2211
Molecular Systems Biology (2021) 17: e10282, https://doi.org/10.15252/msb.202110282
Abstract
RNA velocity has enabled the recovery of directed dynamic information from single-cell transcriptomics by connecting measurements to the underlying kinetics of gene expression. This approach has opened up new ways of studying cellular dynamics. Here, we review the current state of RNA velocity modeling approaches, discuss various examples illustrating limitations and potential pitfalls, and provide guidance on how the ensuing challenges may be addressed. We then outline future directions on how to generalize the concept of RNA velocity to a wider variety of biological systems and modalities.
Background
A central challenge in studying cellular dynamics in single-cell genomics is that single-cell RNA-seq provides only static snapshots of cellular states at the moment of the measurement, instead of following cells over time. The concept of RNA velocity (La Manno et al, 2018) has unlocked new ways of studying cellular dynamics by granting access to not only the descriptive state of a cell, but also to its direction and speed of movement in transcriptome space, thereby enabling predictive models of cell dynamics. RNA velocity recovers directed information by distinguishing newly transcribed pre-mRNAs (unspliced) from mature mRNAs (spliced), which can be detected in standard single-cell RNA-seq protocols from the presence of introns. The change in mRNA abundance, termed RNA velocity, is inferred by a per-gene reaction model that relates the abundance of unspliced and spliced mRNA (Fig 1A).
Positive velocity indicates a recent increase in unspliced transcripts (so that their abundance is higher than expected in steady state), followed by up-regulation in spliced transcripts. Conversely, negative velocity indicates down-regulation (Fig 1B). The combination of velocities across genes is then used to estimate the future state of an individual cell (Fig 1C).
Figure 1. Current state of RNA velocity modeling (A) Transcription of pre-mRNAs, their conversion into spliced mRNAs, and eventual degradation. Current RNA velocity modeling approaches use basic reaction kinetics for each gene independently and formulate deterministic differential equations with linear dependencies, assuming constant rates. The system is decoupled across genes and does not account for transcriptional regulation. (B) The temporal response delay of pre-mRNA being spliced into mature mRNA manifests itself in the curvature in phase space and is leveraged to model and estimate RNA velocity for each gene. Velocity is obtained from the residual of the observed ratio to the inferred steady-state ratio, i.e., the ratio of degradation to splicing rate. (C) The combination of velocities across genes is used to extrapolate the future state of an individual cell.
Recent advances have extended the concept to dynamic populations and enabled inference of reaction rates, reconstruction of time, and detection of transiently expressed genes from the underlying kinetics (Bergen et al, 2020). It has been shown that a small subset of dynamical genes commonly informs the reconstruction of the entire velocity vector field. This observation illustrated that in most scenarios, only a small number of genes appear to obey the simple interpretable kinetics used by RNA velocity, which creates a major challenge in interpreting RNA velocity results.
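As a minimal sketch of the per-gene kinetics summarized in Fig 1A, the induction phase of a single gene can be written in closed form. The rate symbols alpha (transcription), beta (splicing) and gamma (degradation), the function names, and the initial condition u(0) = s(0) = 0 are assumptions made for illustration; none of this is taken from the velocyto or scVelo code.

```python
import math


def induction_state(alpha, beta, gamma, t):
    """Unspliced (u) and spliced (s) abundance at time t after induction.

    Closed-form solution of the linear, constant-rate system
        du/dt = alpha - beta * u
        ds/dt = beta * u - gamma * s
    with u(0) = s(0) = 0 (assumed starting point). Requires beta != gamma.
    """
    if math.isclose(beta, gamma):
        raise ValueError("this closed form assumes beta != gamma")
    exp_b = math.exp(-beta * t)
    exp_g = math.exp(-gamma * t)
    u = alpha / beta * (1.0 - exp_b)
    s = alpha / gamma * (1.0 - exp_g) + alpha / (gamma - beta) * (exp_g - exp_b)
    return u, s


def velocity(u, s, beta, gamma):
    """RNA velocity as the time derivative of spliced abundance."""
    return beta * u - gamma * s


# Hypothetical rates: during induction unspliced mRNA leads spliced mRNA,
# so the velocity is positive and decays towards zero at steady state.
early_u, early_s = induction_state(alpha=2.0, beta=1.0, gamma=0.5, t=1.0)
late_u, late_s = induction_state(alpha=2.0, beta=1.0, gamma=0.5, t=50.0)
early_v = velocity(early_u, early_s, beta=1.0, gamma=0.5)
late_v = velocity(late_u, late_s, beta=1.0, gamma=0.5)
```

At steady state the sketch gives u = alpha/beta and s = alpha/gamma, so the unspliced-to-spliced ratio equals gamma/beta, which is the degradation-to-splicing ratio named in the caption of Fig 1B.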
While RNA velocity has been taken up in a series of applications as summarized recently (Lederer & La Manno, 2020); here, we focus on its underlying modeling concepts, limitations, and possible extensions. In particular, we discuss issues that can lead to misspecification of transcriptional models and outline potential conceptual and technical model extensions that may resolve these limitations and generalize the concept of RNA velocity. Our documented case study can be found at: https://scvelo.org/perspectives. Current state, model assumptions, and potential pitfalls Currently, two modeling approaches exist that leverage expression kinetics to estimate RNA velocity—the originally proposed “steady-state” model velocyto and the subsequently extended dynamical model scVelo. The steady-state model (La Manno et al, 2018) estimates velocities as the deviation of the observed ratio of unspliced to spliced mRNA from an inferred steady-state ratio. The steady-state ratio is approximated with a linear regression on cells found in the lower and upper quantiles where they are expected to have reached steady-state expression levels. This model makes two central assumptions: a common splicing rate across genes and the presence of at least partial observation of the steady-state expression levels in the sampled data. Although providing robust estimation, these assumptions may lead to errors in velocity estimates and cellular states when they are violated, e.g., due to heterogeneous subpopulation or inability to observe the system near its steady state. The likelihood-based dynamical model, introduced recently, generalizes velocity estimation to transient systems (Bergen et al, 2020). While it relaxes the steady-state assumption, it remains that the kinetics are explained with a deterministic and fully decoupled system of linear differential equations with constant kinetic rate parameters. 
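The steady-state estimate described above can be sketched in a few lines for a single gene. This is an illustrative reimplementation under stated assumptions, not the velocyto code: the function name, the quantile defaults, the ranking of cells by spliced abundance, and the regression through the origin are all choices made here, and the splicing rate is fixed to 1 so that the fitted slope stands for the degradation-to-splicing ratio.

```python
def steady_state_velocity(unspliced, spliced, lower=0.05, upper=0.95):
    """Fit a steady-state ratio on extreme-quantile cells and return residuals.

    unspliced, spliced: per-cell abundances of one gene (equal-length lists).
    Returns (ratio, velocities), where velocities[i] = u_i - ratio * s_i.
    """
    n = len(spliced)
    if n == 0 or n != len(unspliced):
        raise ValueError("need two non-empty lists of equal length")

    # Rank cells by spliced abundance and keep the lower and upper tails,
    # where cells are assumed to sit close to steady state.
    order = sorted(range(n), key=lambda i: spliced[i])
    n_low = max(1, round(lower * n))
    n_high = max(1, round((1.0 - upper) * n))
    extreme = order[:n_low] + order[n - n_high:]

    # Least-squares slope of u on s through the origin.
    s_sq = sum(spliced[i] ** 2 for i in extreme)
    s_u = sum(spliced[i] * unspliced[i] for i in extreme)
    ratio = s_u / s_sq if s_sq > 0 else 0.0

    # Velocity is the residual of each cell from the steady-state line.
    velocities = [u - ratio * s for u, s in zip(unspliced, spliced)]
    return ratio, velocities
```

Cells above the fitted line receive positive velocities and cells below it negative ones. Because the slope is fitted only on the tails, the estimate depends on those tails actually containing steady-state cells, which is the second assumption the text names.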
Beyond the scope of computational modeling, the statistical power of the methods depends on the curvature in the phase portrait since a lack of curvature challenges current models to distinguish whether an up- or down-regulation is occurring. The overall curvature of deviation from the steady-state line in the phase portrait is mostly impacted by the ratios of splicing to degradation rates (Box 1), indicating that statistical inference is limited to genes where splicing is faster or comparable to degradation, while a small ratio would yield straight lines rather than an interpretable curvature. Note, that this lack of signal is highly gene-specific. Another source of ambiguity only revealing straight lines is the incomplete scope of observation of dynamic processes, which we frequently find in subpopulations because of partially observed expression kinetics, e.g., being upregulated only at the very end or downregulated at the very beginning of a process. To demonstrate potential pitfalls, we provide several examples that disclose different types of limitations of current modeling approaches (Fig 2). First, as described in the seminal works (La Manno et al, 2018; Bergen et al, 2020), some genes show multiple kinetic regimes across subpopulations and lineages (Fig 2A). These can be governed by variations in splicing to degradation rates ratios and manifest as multiple trajectories in phase space. Second, as recently shown in mouse gastrulation (Pijuan-Sala et al, 2019; Barile et al, 2021), a boost in expression has been observed in erythroid maturation, possibly induced by a change in transcription rate (Fig 2B). We made the same observation in human bone marrow CD34+ hematopoietic cells (Setty et al, 2019). This up-regulating boost in expression would incorrectly yield negative velocity estimates indicating down-regulation. Third, a common example of incomplete scope is the observation of only steady-state populations. 
Thus, we examined erroneously inferred directions in terminal cell types in PBMCs (Zheng et al, 2017), where we would not have expected any explicit cell type transition (Fig 2C). Genes not showing any transient states can be explained by high noise levels. However, in this example it is more likely that cells are mostly sampled in mature states, where mRNA levels have already equilibrated and intermediate states leading to these equilibria have not been sampled. Despite the lack of dynamic information, we still obtain arbitrary erroneous directions. To confirm that these directions indeed arise from distorted estimates and their projection, we show that the directions were also inferred even if using three top-likelihood selected genes only (NKG7, IGHM, and GNLY) all of which show noisy phase portraits without any indication of cell type transitions. Hence, the unexpected projected directions are likely due to velocities being estimated independently of noise levels and uncertainty in estimates not being propagated into the projection. A simulation of mature cell types further supports the possibility of false projections as projected arrows are obtained that are not seen in the ground-truth vector field (Fig 2C). Finally, we investigated potential issues in hematopoiesis, using cord blood CD34+ cells, where we obtain a direction reversal from what is biologically expected. In CD99 and CD44, we observe complex characteristics that cannot be resolved by current models: a simultaneous up- and down-regulation during their transition from HSCs toward different fates. In RBPMS, we find misleading concavity patterns where we would have expected a convex curve, causing a direction reversal not only gene-specific but even in the projected arrows (Fig 2D), which can be explained by time-dependent rates. Experimental approaches that started elucidating time-dependent mRNA turnover reveal frequent modulation of kinetic rates in time (Battich et al, 2020). 
Motivated by these examples, we explored how time-dependent kinetic rates shape the curvature of gene activation. Simulations show how time-dependent rates can reshape curvature patterns: Variable synthesis rates deflate curvature (Fig 3A); slowly decreasing degradation and increasing splicing rates inflate curvature, while slowly increasing degradation and decreasing splicing rates flip curvature (Fig 3B and C). Figure 2. Examples illustrating limitations of current RNA velocity models (A) A UMAP-based representation (left) and gene unspliced/spliced phase portraits (right) of Dentate Gyrus neurogenesis, adapted and reanalyzed from Bergen et al, 2020 (Suppl. Fig 11) and La Manno et al, 2018 (Suppl. Fig 7). These genes show multiple kinetic regimes across subpopulations and lineages, possibly governed by different kinetic rates, and manifested as multiple trajectories/slopes. For instance, the endothelial subpopulation in Tmsb10 yields positive velocity estimates indicating up-regulation, although it can be unambiguously estimated given only a slope distinct from the main granule lineage. To resolve these multiple regimes, it requires a model that identifies these regimes and allows for variable kinetic rates. (B) Erythroid maturation in mouse gastrulation (top) and human bone marrow CD34+ hematopoietic cells (bottom) that show transcriptional boosts in expression possibly induced by a change in transcription rate. Data from Setty et al (2019), Barile et al (2021). (C) Peripheral blood mononuclear cells (PBMCs) from Zheng et al (2017) with mature cell types. Arbitrary directions are projected onto the UMAP representation (left) even though velocity estimates are used from three genes only (right) that show no transient states. Expected would have been a noisy vector field that is not pointing into any particular direction. That shows the possibility of false projections that are not supported by gene-wise dynamics. 
Simulated data of mature cell types support this observation of possible false projections that are not seen in the ground-truth vector field. (D) Cord blood CD34+ hematopoietic cells with complex kinetics that show simultaneous up- and down-regulation during the transition from HSCs toward different fates of megakaryocyte/erythrocyte (MEPs), granulocyte/macrophage (GMPs), and early lymphocyte progenitors (ELP). RBPMS even shows misleading concavity patterns causing a direction reversal. The possibility of reversed directions can be explained by time-dependent degradation rates, as demonstrated using simulated data. CD34+ cord blood cell data are unpublished.

Figure 3. Time-variable kinetic rates shape curvature of gene activation
(A) Time-dependent kinetic rates shape the curvature patterns of gene activation. A slow increase in transcription rate rather than a stepwise activation deflates the curvature and thus decreases the statistical power. (B) A slow increase in splicing rates inflates the curvature while a slow decrease in splicing rates flips the curvature. That results in a convex curve, which yields negative velocities and gets incorrectly interpreted as down-regulation. In the worst case, this can also cause a direction reversal in the projected velocities. (C) The impact of time-dependent degradation rates is inverse to time-dependent splicing rates. A slow decrease in degradation rates inflates the curvature while a slow increase in degradation rates flips the curvature.

Box 1: Kinetic signal (overall curvature) is determined by the ratio of splicing and degradation rate, and the rate of transcription convergence

Consider the differential equation

$$\frac{du}{dt} = \alpha - \beta u, \qquad \frac{ds}{dt} = \delta u - \gamma s,$$

where the splicing rate parameters $\beta$ and $\delta$ are treated differently for generality to account for technical effects such as amplification biases. The analytical solution is given by

$$u(t) = u_0 e^{-\beta t} + \frac{\alpha}{\beta}\left(1 - e^{-\beta t}\right),$$

$$s(t) = s_0 e^{-\gamma t} + \frac{\delta}{\beta}\frac{\alpha}{\gamma}\left(1 - e^{-\gamma t}\right) + \frac{\delta}{\beta}\,\frac{\alpha - \beta u_0}{\gamma - \beta}\left(e^{-\gamma t} - e^{-\beta t}\right).$$

The kinetic signal is given by the concavity of the residuals for up-regulation (and by their convexity for down-regulation). Assuming $s_0 = u_0 = 0$, the residuals are given by

$$r(t) = u - \frac{\gamma}{\delta}\, s = \frac{\alpha}{\gamma - \beta}\left(e^{-\beta t} - e^{-\gamma t}\right).$$

The overall deviation from the equilibrium line is given by integration over the residuals

$$C = \int_0^\infty r(t)\, ds(t) = \int_0^\infty r(t)\, \frac{ds}{dt}\, dt = \frac{1}{2}\,\frac{\beta}{\gamma + \beta}\cdot\frac{\alpha}{\beta}\cdot\frac{\delta\alpha}{\beta\gamma} = \frac{1}{2}\,\frac{\beta}{\gamma + \beta}\; s_{\mathrm{steady}}\, u_{\mathrm{steady}}.$$

When allowing a time-dependent, gradually increasing transcription rate $\alpha(t) = \alpha\left(1 - e^{-\lambda_\alpha t}\right)$, the overall curvature is given by

$$C_{\lambda_\alpha} = C \cdot \left(1 - \frac{\beta\gamma}{(\beta + \lambda_\alpha)(\gamma + \lambda_\alpha)}\right).$$

These equations have three important implications:
(1) $\frac{\beta}{\gamma + \beta}$ is the kinetic characteristic of statistical power, which notably depends only on the unbiased rate parameters of splicing and degradation, ranging from 0 (straight line) to 1 (maximally pronounced curvature).
(2) $s_{\mathrm{steady}}\, u_{\mathrm{steady}}$ is the detection power, which is important for practical settings as noise levels can be regarded as a function of expression levels.
(3) A gradual increase in synthesis rate through $\lambda_\alpha$ deflates the curvature pattern.

Conceptual extensions and future directions

Most of the challenges can be addressed with conceptual model extensions. Here, we will describe possible extensions to account for more complex kinetics, stochasticity, multivariate gene regulation, and multi-modal omics readouts (Fig 4).

Figure 4. Conceptual future directions and model extensions
(A) Modulations of transcription, splicing, and degradation rates by more complex mechanisms, including transcriptional bursts, alternative splicing events, and regulation of mRNA stability, suggest kinetic model extensions such as modeling time- and state-variable kinetic rates.
(B) Stochastic variability may be leveraged to capture the bursting nature of transcription, to improve parameter identifiability and to identify other sources of heterogeneity in kinetic rates that can be informative during cell fate decisions, when epigenetic priming or environmental signals guide cellular decisions. (C) The gene expression model can be extended to not only describe cell-state transitions, but also regulatory interactions along these transitions. (D) In addition to exonic and intronic signals, other molecular moieties can be incorporated into the model, such as protein measurements, metabolically labeled mRNAs, cytoplasmic mRNA, or chromatin state. The statistical signal as defined by the curvature is mostly determined by the ratio of rates at which the expression levels of the two modalities decay (ratio of splicing and degradation for RNA velocity), which may be improved through incorporation of other moieties.

Extended kinetic models for gene dynamics

Transcriptional kinetics are currently modeled as simple first-order equations with constant kinetic rate parameters. The fact that only a subset of genes follows simple kinetics is partly due to modulations in transcription, splicing, and degradation rates by more complex mechanisms. Kinetic rates can be dynamically regulated as demonstrated in neurogenesis and hematopoiesis (Fig 2A and B). In particular, recent metabolic RNA-labeling experiments, which quantify preexisting and labeled newly synthesized transcripts at a single-cell level, uncovered diverse behaviors of kinetic rates during in vitro differentiation of intestinal stem cells and cell cycle (Battich et al, 2020). Variable kinetic rates either between cell states or during a dynamic process can lead to phase portraits that have a misleading interpretation through the lens of existing RNA velocity models.
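The effect of a gradually increasing transcription rate on the overall curvature (Box 1) can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' simulation code: the function names, parameter values, integration horizon, and the fixed-step RK4 scheme are all chosen here for demonstration. It integrates the two kinetic equations from u = s = 0, accumulates the integral of r ds, and compares the result with the closed-form expressions.

```python
import math

def curvature_closed_form(alpha, beta, gamma, delta, lam=None):
    """Closed-form overall curvature C; lam=None means stepwise activation."""
    c = 0.5 * beta / (gamma + beta) * (alpha / beta) * (delta * alpha / (beta * gamma))
    if lam is None:
        return c
    return c * (1.0 - beta * gamma / ((beta + lam) * (gamma + lam)))

def curvature_numeric(alpha, beta, gamma, delta, lam=None, t_max=60.0, steps=30000):
    """Integrate du/dt and ds/dt with RK4 and accumulate C = int r (ds/dt) dt."""
    h = t_max / steps

    def rhs(t, u, s):
        # Transcription rate: constant, or alpha * (1 - exp(-lam * t)).
        a = alpha if lam is None else alpha * (1.0 - math.exp(-lam * t))
        return a - beta * u, delta * u - gamma * s

    def integrand(u, s):
        r = u - gamma / delta * s            # residual from the steady-state line
        return r * (delta * u - gamma * s)   # r * ds/dt

    t, u, s = 0.0, 0.0, 0.0
    total, prev = 0.0, integrand(0.0, 0.0)
    for _ in range(steps):
        k1u, k1s = rhs(t, u, s)
        k2u, k2s = rhs(t + h / 2, u + h / 2 * k1u, s + h / 2 * k1s)
        k3u, k3s = rhs(t + h / 2, u + h / 2 * k2u, s + h / 2 * k2s)
        k4u, k4s = rhs(t + h, u + h * k3u, s + h * k3s)
        u += h / 6 * (k1u + 2 * k2u + 2 * k3u + k4u)
        s += h / 6 * (k1s + 2 * k2s + 2 * k3s + k4s)
        t += h
        cur = integrand(u, s)
        total += 0.5 * h * (prev + cur)      # trapezoidal rule
        prev = cur
    return total

# Illustrative (hypothetical) rates
params = dict(alpha=2.0, beta=1.0, gamma=0.5, delta=1.0)
c_step = curvature_numeric(**params)
c_slow = curvature_numeric(**params, lam=0.3)
```

With these illustrative rates, the numerical curvature for stepwise activation agrees with the closed form, and the slowly converging transcription rate yields a smaller value, in line with the deflation described for Fig 3A.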
We expect extensions of RNA velocity kinetic modeling that account for dynamic changes in kinetic rates (Fig 4A). These models will improve the quality of RNA velocity predictions, when accounting for alternative processes that modulate the transcription machinery, splicing, and mRNA stability. Additionally, such state-variable models will provide insights into transcriptional and post-transcriptional regulatory processes that control gene expression dynamics. The latter may also enable kinetics to be modeled in time series designs. If the underlying kinetic rate parameters are state-dependent, thus discretely changing, it should be possible to identify them upon classifying cells into their kinetic regimes. Identification of time-variable rates, however, will require additional constraints such as a pseudotime prior, optimal transport with marginal constraints in time course measurements (Schiebinger et al, 2019), or some other form of regularization. Finally, statistical quantification of changes in kinetic rates of analogous cell types under different conditions (e.g., health vs. disease) will allow us to identify condition-specific dynamics. Stochastic models for cell-specific dynamics Expression kinetics are inherently stochastic, driven by random biophysical interactions involved in the activity of the RNA synthesis and turnover machinery. The randomness of such biomolecular interactions coupled with the seemingly contradictory aspect of precise coordination allows cells to explore broader regimes, e.g., to differentiate toward multiple fates. Such mechanisms include the bursting nature of transcription, which indicates stochastic synthesis rates. Similarly, the noise induced by small copy numbers of a given transcript in a cell and the limited amount of material available per cell contribute to variations across cells and, consequently, variations in cellular decision making. 
While in systems biology, it has been shown that these may be leveraged for better model identification (Munsky et al, 2009) or in the modeling of cellular decision making using diffusion processes (Haghverdi et al, 2016), this stochasticity is currently ignored in RNA velocity modeling: The models describe the kinetics by deterministic differential equations, which do not allow to identify other sources of heterogeneity in kinetic rates, such as those imposed by external factors or unmeasured internal cell properties (Hahl & Kremling, 2016). These sources of heterogeneity can have important implications and may be informative during cell fate decisions (Raj & van Oudenaarden, 2008), when epigenetic priming or environmental signals guide different cellular decisions of transcriptionally similar cells, e.g., at decision forks (Soldatov et al, 2019). While RNA velocity provides a local estimate of cellular kinetics, global cell fate trajectories may be inferred through Markov chain transitions along the expression manifold (La Manno et al, 2018; Bergen et al, 2020) or between cellular states (preprint: Lange et al, 2020), which we expect to further improve when explored at the level of stochastic kinetic modeling. In the future, we are expecting non-deterministic models of RNA velocity, thus allowing improved detection rates to account for cell type-specific or even cell-specific kinetic rates (Fig 4B). The resulting more accurate single-cell estimates will further enable us to move from a deterministic limit to an estimated distribution of possible directions of a cell in an observed state, e.g., to facilitate cell fate bifurcation analysis. Such stochastic, cell-specific models, combined with the inference of cell division and death rates, will further enable dynamic inference over large expression manifolds and a better understanding of transitions between cellular states. 
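To make the contrast with deterministic equations concrete, the following minimal sketch simulates the simplest stochastic counterpart of the splicing model with Gillespie's algorithm. It is an illustration only: it assumes a constant transcription rate (no bursting), unit stoichiometry, and hypothetical rate values, and the function names are introduced here rather than taken from any RNA velocity package.

```python
import random

def gillespie_splicing(alpha, beta, gamma, t_end, seed=0):
    """Simulate counts (u, s) with reactions: 0 -> u, u -> s, s -> 0."""
    rng = random.Random(seed)
    t, u, s = 0.0, 0, 0
    times, us, ss = [0.0], [0], [0]
    while True:
        rates = (alpha, beta * u, gamma * s)   # propensities of the three reactions
        total = sum(rates)
        t += rng.expovariate(total)            # waiting time to the next event
        if t >= t_end:
            break
        x = rng.random() * total               # choose which reaction fires
        if x < rates[0]:
            u += 1                             # transcription
        elif x < rates[0] + rates[1]:
            u -= 1                             # splicing
            s += 1
        else:
            s -= 1                             # degradation
        times.append(t)
        us.append(u)
        ss.append(s)
    return times, us, ss

def time_average(times, values, t_start, t_end):
    """Time-weighted mean of a piecewise-constant trajectory on [t_start, t_end]."""
    total = 0.0
    for i, v in enumerate(values):
        left = max(times[i], t_start)
        right = min(times[i + 1] if i + 1 < len(times) else t_end, t_end)
        if right > left:
            total += v * (right - left)
    return total / (t_end - t_start)

# Hypothetical rates; the deterministic steady state is u = alpha/beta, s = alpha/gamma.
times, us, ss = gillespie_splicing(alpha=10.0, beta=1.0, gamma=0.5, t_end=1000.0, seed=1)
mean_u = time_average(times, us, 100.0, 1000.0)
mean_s = time_average(times, ss, 100.0, 1000.0)
```

The long-run averages fluctuate around the deterministic steady state, while individual trajectories show the copy-number noise that deterministic models ignore. A bursting model would replace the constant transcription propensity with a promoter that switches between active and inactive states.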
Multivariate models toward system dynamics Dynamic changes in gene expression are orchestrated by transcriptional and post-transcriptional regulations. As shown in the example of erythroid maturation from gastrulation and human bone marrow, a transcriptional boost in expression can be induced by some upstream regulators (Fig 2B). At the current stage, the model for transcriptional dynamics is fully decoupled; i.e., each gene is treated independently, and regulatory relationships are ignored. The dynamical gene expression model can be extended to a multivariate model that describes not only cell-state transitions, but also regulatory interactions along these transitions. Regulatory events can be observed statistically in expression changes along pseudotime. To describe these events, the expression patterns of target genes can be modeled as a function of transcription factor activities, ideally treated as a nonlinear system, for instance, using Hill kinetics. A comprehensive evaluation of network modeling algorithms demonstrates that none of the currently available methods are capable of accurately recovering network structures from single-cell expression data alone, and the effort of inferring gene regulatory networks is still in its infancy (Pratapa et al, 2020). A recent analysis, however, indicates that the inclusion of RNA velocity information enables at least partial recovery of a regulatory network compared with pseudotime-based approaches (Qiu et al, 2020). It opens an avenue to generative approaches that model the known mRNA velocities as a function of expression state to infer the underlying gene regulatory network (Fig 4C). Using learned networks, we can generate new trajectories and testable hypotheses from transcription factor activity, for instance, to understand perturbational responses. 
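As a minimal sketch of the Hill-type coupling mentioned above, the transcription rate of a target gene can be written as a saturating function of transcription factor activity. The function names and parameter values below are hypothetical and serve only to illustrate the functional form; they do not correspond to a published model.

```python
import numpy as np

def hill_activation(tf_activity, alpha_max, K, n):
    """Transcription rate of a target gene activated by a transcription factor.

    alpha_max: maximal transcription rate, K: activity at half-maximal rate,
    n: Hill coefficient controlling the steepness of the response.
    """
    x = np.asarray(tf_activity, dtype=float)
    return alpha_max * x**n / (K**n + x**n)

def hill_repression(tf_activity, alpha_max, K, n):
    """Transcription rate of a target gene repressed by a transcription factor."""
    x = np.asarray(tf_activity, dtype=float)
    return alpha_max * K**n / (K**n + x**n)

# Illustrative values: the target's unspliced steady state alpha(x) / beta
# follows the activity of its regulator instead of being a constant.
tf = np.linspace(0.0, 10.0, 101)
alpha_target = hill_activation(tf, alpha_max=10.0, K=2.0, n=4)
u_steady_target = alpha_target / 1.0   # beta = 1.0 (assumed splicing rate)
```

Replacing the constant transcription rate with such a function couples the genes, so the model is no longer fully decoupled and the parameters of all interacting genes have to be inferred jointly.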
Finally, an ultimate multivariate approach would jointly model the unknown RNA velocities and the underlying regulatory network from observed expression states and interpretable models of expression kinetics. Although efficient inference of the coupled system may quickly become challenging, such a joint model allows us to better understand fate decisions and reveal regulatory mechanisms of lineage priming. Furthermore, technological advances and the inclusion of new functional genomic layers, such as transcription factor binding, regulatory sequence motifs, chromatin modifications, and intermediaries such as RNA polymerase activity, hold great promise. These additional readouts will provide informative priors on the regulatory network and extend specifications of kinetic models. Multi-modal omics models RNA velocity is grounded in connecting measurements to an underlying mechanism (mRNA splicing), with two modalities representing the current and future state. In addition to exonic and intronic signals, other omics and molecular moieties can be leveraged if such measurements are available in an unbiased manner (Lederer & La Manno, 2020). Exploring other modalities becomes particularly crucial for systems, where the transcriptional dynamics of mRNA splicing does not provide sufficient signal, e.g., if splicing rate is relatively small as opposed to a large degradation rate (Box 1, Fig 4D). This issue of insufficient signal presents a challenge for the current mRNA splicing models, but may be resolvable, for instance, through analysis of other modalities, e.g., using protein dynamics, where we could expect the kinetic characteristic of statistical power (Box 1) to increase from 0.5 to 0.8 (Fig 4D), when assuming a fivefold half-life in proteins as opposed to RNA. 
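The quoted increase from 0.5 to 0.8 can be reproduced with a short calculation, under two assumptions made here for illustration: the baseline corresponds to equal rates for the two modalities, and a fivefold longer half-life corresponds to a fivefold smaller decay rate of the downstream modality.

```python
import math

def kinetic_characteristic(upstream_rate, downstream_rate):
    """Kinetic characteristic beta / (gamma + beta) from Box 1.

    upstream_rate plays the role of beta (splicing), downstream_rate the role
    of gamma (degradation); for other modalities these are the analogous rates.
    """
    return upstream_rate / (downstream_rate + upstream_rate)

def decay_rate_from_half_life(half_life):
    """First-order decay rate for a given half-life."""
    return math.log(2.0) / half_life

rna_rate = decay_rate_from_half_life(1.0)       # arbitrary time unit
protein_rate = decay_rate_from_half_life(5.0)   # fivefold half-life (assumption)

baseline = kinetic_characteristic(rna_rate, rna_rate)
with_protein = kinetic_characteristic(rna_rate, protein_rate)
```

Under these assumptions the characteristic rises from 1/2 to 5/6, i.e., roughly 0.8, independent of the time unit chosen for the half-lives.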
For moieties such as capped, polyadenylated, and degraded transcript fragments or protein abundance, the model extension is straightforward upon revising the underlying assumptions and moiety-specific statistical model while ensuring reliable quantification. Experimental information on the molecular compartments such as separation of nuclear vs. cytoplasmic balance (Xia et al, 2019) using spatially resolved MERFISH protocol can also be incorporated into the model. Furthermore, models can be extended to incorporate epigenetic and regulatory information based on single-cell chromatin accessibility or other epigenetic data (Ma et al, 2020). Ultimately, velocity estimation relies on accurate quantification of abundances. Experiments indicate that intronic reads are only noisy approximations of nascent transcription (Erhard et al, 2019) and approaches for improving this quantification would be helpful. On the experimental side, relative abundances can be directly inferred using in vitro metabolic labeling (Erhard et al, 2019; Battich et al, 2020; Cao et al, 2020). This additional readout can be included in the dynamical model, incorporating varying labeling lengths as additional priors. It may also be possible to boost the detection of intronic molecules or reduce background from non-coding and antisense RNAs through improved preprocessing steps. On the computational side, additional structural features of the reads and gene-specific models of spliced vs. unspliced read patterns may improve the signal-to-noise ratio (Fig 4D). Technical challenges and extensions Here, we outline technical challenges that impact the modeling, such as normalization, batch effects, and gene selection, and in parts discuss how to address them. Cell size normalization Current RNA velocity approaches provide normalization by size factors proportional to the count depth per cell, and variations of such. 
However, cell size also reflects the natural extension of the reservoir of RNA transcription. It is not entirely clear how to best account for the cell count depth, whether to normalize intronic and exonic matrices to matrix-specific factors, to shared factors, or even to not normalize at all. More generally, we should investigate how changes in global cellular parameters, such as splicing efficiency or abundance of RNA polymerases, affect the kinetic models. Normalization by cell size is a simple way to remove the effects of count sampling, but it can also distort these effects in a non-trivial manner. Adequate preprocessing and ideally the inclusion of these effects into the model are crucial for accurate velocity estimates. Estimation from single-nucleus data Transcriptional measurements from individual nuclei enable the analysis of tissues where whole cell isolation is challenging (Slyper et al, 2020). The physical iso
DOI: 10.1101/2021.12.16.473007
2021
Cited 127 times
anndata: Annotated data
Summary anndata is a Python package for handling annotated data matrices in memory and on disk (github.com/theislab/anndata), positioned between pandas and xarray. anndata offers a broad range of computationally efficient features including, among others, sparse data support, lazy operations, and a PyTorch interface. Statement of need Generating insight from high-dimensional data matrices typically works through training models that annotate observations and variables via low-dimensional representations. In exploratory data analysis, this involves iterative training and analysis using original and learned annotations and task-associated representations. anndata offers a canonical data structure for book-keeping these, which is neither addressed by pandas (McKinney, 2010), nor xarray (Hoyer & Hamman, 2017), nor commonly-used modeling packages like scikit-learn (Pedregosa et al., 2011).
DOI: 10.1038/s41467-021-27150-6
2021
Cited 111 times
scCODA is a Bayesian model for compositional single-cell data analysis
Abstract Compositional changes of cell types are main drivers of biological processes. Their detection through single-cell experiments is difficult due to the compositionality of the data and low sample sizes. We introduce scCODA ( https://github.com/theislab/scCODA ), a Bayesian model addressing these issues enabling the study of complex cell type effects in disease, and other stimuli. scCODA demonstrated excellent detection performance, while reliably controlling for false discoveries, and identified experimentally verified cell type changes that were missed in original analyses.
DOI: 10.1038/s41591-023-02327-2
2023
Cited 110 times
An integrated cell atlas of the lung in health and disease
Abstract Single-cell technologies have transformed our understanding of human tissues. Yet, studies typically capture only a limited number of donors and disagree on cell type definitions. Integrating many single-cell datasets can address these limitations of individual studies and capture the variability present in the population. Here we present the integrated Human Lung Cell Atlas (HLCA), combining 49 datasets of the human respiratory system into a single atlas spanning over 2.4 million cells from 486 individuals. The HLCA presents a consensus cell type re-annotation with matching marker genes, including annotations of rare and previously undescribed cell types. Leveraging the number and diversity of individuals in the HLCA, we identify gene modules that are associated with demographic covariates such as age, sex and body mass index, as well as gene modules changing expression along the proximal-to-distal axis of the bronchial tree. Mapping new data to the HLCA enables rapid data annotation and interpretation. Using the HLCA as a reference for the study of disease, we identify shared cell states across multiple lung diseases, including SPP1 + profibrotic monocyte-derived macrophages in COVID-19, pulmonary fibrosis and lung carcinoma. Overall, the HLCA serves as an example for the development and use of large-scale, cross-dataset organ atlases within the Human Cell Atlas.
DOI: 10.1186/s13059-021-02519-4
2021
Cited 91 times
Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape
Abstract Recent years have seen a revolution in single-cell RNA-sequencing (scRNA-seq) technologies, datasets, and analysis methods. Since 2016, the scRNA-tools database has cataloged software tools for analyzing scRNA-seq data. With the number of tools in the database passing 1000, we provide an update on the state of the project and the field. This data shows the evolution of the field and a change of focus from ordering cells on continuous trajectories to integrating multiple samples and making use of reference datasets. We also find that open science practices reward developers with increased recognition and help accelerate the field.
DOI: 10.1038/s41467-021-21352-8
2021
Cited 84 times
Deep learning the collisional cross sections of the peptide universe from a million experimental values
Abstract The size and shape of peptide ions in the gas phase are an under-explored dimension for mass spectrometry-based proteomics. To investigate the nature and utility of the peptide collisional cross section (CCS) space, we measure more than a million data points from whole-proteome digests of five organisms with trapped ion mobility spectrometry (TIMS) and parallel accumulation-serial fragmentation (PASEF). The scale and precision (CV < 1%) of our data is sufficient to train a deep recurrent neural network that accurately predicts CCS values solely based on the peptide sequence. Cross section predictions for the synthetic ProteomeTools peptides validate the model within a 1.4% median relative error (R > 0.99). Hydrophobicity, proportion of prolines and position of histidines are main determinants of the cross sections in addition to sequence-specific interactions. CCS values can now be predicted for any peptide and organism, forming a basis for advanced proteomics workflows that make full use of the additional information.
DOI: 10.1126/science.abo1984
2022
Cited 82 times
Pathogenic variants damage cell composition and single cell transcription in cardiomyopathies
Pathogenic variants in genes that cause dilated cardiomyopathy (DCM) and arrhythmogenic cardiomyopathy (ACM) convey high risks for the development of heart failure through unknown mechanisms. Using single-nucleus RNA sequencing, we characterized the transcriptome of 880,000 nuclei from 18 control and 61 failing, nonischemic human hearts with pathogenic variants in DCM and ACM genes or idiopathic disease. We performed genotype-stratified analyses of the ventricular cell lineages and transcriptional states. The resultant DCM and ACM ventricular cell atlas demonstrated distinct right and left ventricular responses, highlighting genotype-associated pathways, intercellular interactions, and differential gene expression at single-cell resolution. Together, these data illuminate both shared and distinct cellular and molecular architectures of human heart failure and suggest candidate therapeutic targets.
DOI: 10.1101/2022.03.10.483747
2022
Cited 64 times
An integrated cell atlas of the human lung in health and disease
ABSTRACT Organ- and body-scale cell atlases have the potential to transform our understanding of human biology. To capture the variability present in the population, these atlases must include diverse demographics such as age and ethnicity from both healthy and diseased individuals. The growth in both size and number of single-cell datasets, combined with recent advances in computational techniques, for the first time makes it possible to generate such comprehensive large-scale atlases through integration of multiple datasets. Here, we present the integrated Human Lung Cell Atlas (HLCA) combining 46 datasets of the human respiratory system into a single atlas spanning over 2.2 million cells from 444 individuals across health and disease. The HLCA contains a consensus re-annotation of published and newly generated datasets, resolving under- or misannotation of 59% of cells in the original datasets. The HLCA enables recovery of rare cell types, provides consensus marker genes for each cell type, and uncovers gene modules associated with demographic covariates and anatomical location within the respiratory system. To facilitate the use of the HLCA as a reference for single-cell lung research and allow rapid analysis of new data, we provide an interactive web portal to project datasets onto the HLCA. Finally, we demonstrate the value of the HLCA reference for interpreting disease-associated changes. Thus, the HLCA outlines a roadmap for the development and use of organ-scale cell atlases within the Human Cell Atlas.
DOI: 10.1038/s41587-022-01467-z
2022
Cited 59 times
Modeling intercellular communication in tissues using spatial graphs of cells
Abstract Models of intercellular communication in tissues are based on molecular profiles of dissociated cells, are limited to receptor–ligand signaling and ignore spatial proximity in situ. We present node-centric expression modeling, a method based on graph neural networks that estimates the effects of niche composition on gene expression in an unbiased manner from spatial molecular profiling data. We recover signatures of molecular processes known to underlie cell communication.
DOI: 10.1038/s41587-023-01733-8
2023
Cited 56 times
The scverse project provides a computational ecosystem for single-cell omics data analysis
DOI: 10.1289/ehp9413
2022
Cited 48 times
Effect of Atmospheric Aging on Soot Particle Toxicity in Lung Cell Models at the Air–Liquid Interface: Differential Toxicological Impacts of Biogenic and Anthropogenic Secondary Organic Aerosols (SOAs)
Secondary organic aerosols (SOAs) formed from anthropogenic or biogenic gaseous precursors in the atmosphere substantially contribute to the ambient fine particulate matter [PM ≤2.5 μm in aerodynamic diameter (PM2.5)] burden, which has been associated with adverse human health effects. However, there is only limited evidence on their differential toxicological impact. We aimed to discriminate toxicological effects of aerosols generated by atmospheric aging on combustion soot particles (SPs) of gaseous biogenic (β-pinene) or anthropogenic (naphthalene) precursors in two different lung cell models exposed at the air-liquid interface (ALI). Mono- or cocultures of lung epithelial cells (A549) and endothelial cells (EA.hy926) were exposed at the ALI for 4 h to different aerosol concentrations of a photochemically aged mixture of primary combustion SP and β-pinene (SOAβPIN-SP) or naphthalene (SOANAP-SP). The internally mixed soot/SOA particles were comprehensively characterized in terms of their physical and chemical properties. We conducted toxicity tests to determine cytotoxicity, intracellular oxidative stress, primary and secondary genotoxicity, as well as inflammatory and angiogenic effects. We observed considerable toxicity-related outcomes in cells treated with either SOA type. Greater adverse effects were measured for SOANAP-SP compared with SOAβPIN-SP in both cell models, whereas the nano-sized soot cores alone showed only minor effects. At the functional level, we found that SOANAP-SP augmented the secretion of malondialdehyde and interleukin-8 and may have induced the activation of endothelial cells in the coculture system. This activation was confirmed by comet assay, suggesting secondary genotoxicity and greater angiogenic potential. 
Chemical characterization of PM revealed distinct qualitative differences in the composition of the two secondary aerosol types. In this study using A549 and EA.hy926 cells exposed at ALI, SOA compounds had greater toxicity than primary SPs. Photochemical aging of naphthalene was associated with the formation of more oxidized, more aromatic SOAs with a higher oxidative potential and toxicity compared with β-pinene. Thus, we conclude that the influence of atmospheric chemistry on the chemical PM composition plays a crucial role for the adverse health outcome of emissions. https://doi.org/10.1289/EHP9413.
DOI: 10.15252/msb.202211517
2023
Cited 45 times
Predicting cellular responses to complex perturbations in high‐throughput screens
Abstract Recent advances in multiplexed single‐cell transcriptomics experiments facilitate the high‐throughput study of drug and genetic perturbations. However, an exhaustive exploration of the combinatorial perturbation space is experimentally unfeasible. Therefore, computational methods are needed to predict, interpret, and prioritize perturbations. Here, we present the compositional perturbation autoencoder (CPA), which combines the interpretability of linear models with the flexibility of deep‐learning approaches for single‐cell response modeling. CPA learns to in silico predict transcriptional perturbation response at the single‐cell level for unseen dosages, cell types, time points, and species. Using newly generated single‐cell drug combination data, we validate that CPA can predict unseen drug combinations while outperforming baseline models. Additionally, the architecture's modularity enables incorporating the chemical representation of the drugs, allowing the prediction of cellular response to completely unseen drugs. Furthermore, CPA is also applicable to genetic combinatorial screens. We demonstrate this by imputing in silico 5,329 missing combinations (97.6% of all possibilities) in a single‐cell Perturb‐seq experiment with diverse genetic interactions. We envision CPA will facilitate efficient experimental design and hypothesis generation by enabling in silico response prediction at the single‐cell level and thus accelerate therapeutic applications using single‐cell technologies.
DOI: 10.1038/s41556-022-01072-x
2023
Cited 27 times
Biologically informed deep learning to query gene programs in single-cell atlases
The increasing availability of large-scale single-cell atlases has enabled the detailed description of cell states. In parallel, advances in deep learning allow rapid analysis of newly generated query datasets by mapping them into reference atlases. However, existing data transformations learned to map query data are not easily explainable using biologically known concepts such as genes or pathways. Here we propose expiMap, a biologically informed deep-learning architecture that enables single-cell reference mapping. ExpiMap learns to map cells into biologically understandable components representing known 'gene programs'. The activity of each cell for a gene program is learned while simultaneously refining them and learning de novo programs. We show that expiMap compares favourably to existing methods while bringing an additional layer of interpretability to integrative single-cell analysis. Furthermore, we demonstrate its applicability to analyse single-cell perturbation responses in different tissues and species and resolve responses of patients who have coronavirus disease 2019 to different treatments across cell types.
DOI: 10.1101/2023.02.13.528102
2023
Cited 21 times
Optimizing Xenium In Situ data utility by quality assessment and best practice analysis workflows
The Xenium In Situ platform is a new spatial transcriptomics product commercialized by 10X Genomics capable of mapping hundreds of transcripts in situ at a subcellular resolution. Given the multitude of commercially available spatial transcriptomics technologies, recommendations in choice of platform and analysis guidelines are increasingly important. Herein, we explore eight preview Xenium datasets of the mouse brain and two of human breast cancer by comparing scalability, resolution, data quality, capacities and limitations with eight other spatially resolved transcriptomics technologies. In addition, we benchmarked the performance of multiple open source computational tools when applied to Xenium datasets in tasks including cell segmentation, segmentation-free analysis, selection of spatially variable genes and domain identification, among others. This study serves as the first independent analysis of the performance of Xenium, and provides best-practices and recommendations for analysis of such datasets.
DOI: 10.1186/1752-0509-3-98
2009
Cited 200 times
Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling
The understanding of regulatory and signaling networks has long been a core objective in Systems Biology. Knowledge about these networks is mainly of qualitative nature, which allows the construction of Boolean models, where the state of a component is either 'off' or 'on'. While often able to capture the essential behavior of a network, these models can never reproduce detailed time courses of concentration levels. Nowadays however, experiments yield more and more quantitative data. An obvious question therefore is how qualitative models can be used to explain and predict the outcome of these experiments. In this contribution we present a canonical way of transforming Boolean into continuous models, where the use of multivariate polynomial interpolation allows transformation of logic operations into a system of ordinary differential equations (ODE). The method is standardized and can readily be applied to large networks. Other, more limited approaches to this task are briefly reviewed and compared. Moreover, we discuss and generalize existing theoretical results on the relation between Boolean and continuous models. As a test case a logical model is transformed into an extensive continuous ODE model describing the activation of T-cells. We discuss how parameters for this model can be determined such that quantitative experimental results are explained and predicted, including time-courses for multiple ligand concentrations and binding affinities of different ligands. This shows that from the continuous model we may obtain biological insights not evident from the discrete one. The presented approach will facilitate the interaction between modeling and experiments. Moreover, it provides a straightforward way to apply quantitative analysis methods to qualitatively described systems.
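To make the idea in the abstract above concrete, here is a minimal sketch of one way a Boolean update rule can be extended to continuous values by multilinear interpolation and then turned into an ODE. It is not the authors' code. The relaxation form dx_i/dt = (B_i(x) - x_i) / tau, the time constant `tau`, the explicit Euler integrator, the function names and the three-node toy network are all assumptions made for illustration.

```python
from itertools import product


def multilinear(bool_fn, n):
    """Extend bool_fn: {0,1}^n -> {0,1} to [0,1]^n by multilinear interpolation."""
    # Keep only the corners of the unit cube where the Boolean rule is true.
    corners = [c for c in product((0, 1), repeat=n) if bool_fn(*c)]

    def f(x):
        total = 0.0
        for corner in corners:
            weight = 1.0
            for xi, ci in zip(x, corner):
                # Factor is xi if the corner bit is 1, otherwise (1 - xi).
                weight *= xi if ci else (1.0 - xi)
            total += weight
        return total

    return f


def simulate(update_fns, x0, tau=1.0, dt=0.01, steps=2000):
    """Explicit Euler integration of dx_i/dt = (f_i(x) - x_i) / tau (assumed form)."""
    x = list(x0)
    for _ in range(steps):
        dx = [(f(x) - xi) / tau for f, xi in zip(update_fns, x)]
        x = [xi + dt * d for xi, d in zip(x, dx)]
    return x


# Continuous version of a two-input AND gate.
f_and = multilinear(lambda a, b: a and b, 2)

# Hypothetical toy network: a and b hold their values, c = a AND b.
rules = [
    multilinear(lambda a, b, c: a, 3),
    multilinear(lambda a, b, c: b, 3),
    multilinear(lambda a, b, c: a and b, 3),
]
final = simulate(rules, [1.0, 1.0, 0.0])
```

At the corners of the unit cube the interpolated function agrees with the Boolean rule, and in between it varies smoothly, so with both inputs held at 1 the third component of the toy network relaxes from 0 towards 1.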
DOI: 10.1371/journal.pgen.1003005
2012
Cited 169 times
Mining the Unknown: A Systems Approach to Metabolite Identification Combining Genetic and Metabolic Information
Recent genome-wide association studies (GWAS) with metabolomics data linked genetic variation in the human genome to differences in individual metabolite levels. A strong relevance of this metabolic individuality for biomedical and pharmaceutical research has been reported. However, a considerable amount of the molecules currently quantified by modern metabolomics techniques are chemically unidentified. The identification of these "unknown metabolites" is still a demanding and intricate task, limiting their usability as functional markers of metabolic processes. As a consequence, previous GWAS largely ignored unknown metabolites as metabolic traits for the analysis. Here we present a systems-level approach that combines genome-wide association analysis and Gaussian graphical modeling with metabolomics to predict the identity of the unknown metabolites. We apply our method to original data of 517 metabolic traits, of which 225 are unknowns, and genotyping information on 655,658 genetic variants, measured in 1,768 human blood samples. We report previously undescribed genotype-metabotype associations for six distinct gene loci (SLC22A2, COMT, CYP3A5, CYP2C18, GBA3, UGT3A1) and one locus not related to any known gene (rs12413935). Overlaying the inferred genetic associations, metabolic networks, and knowledge-based pathway information, we derive testable hypotheses on the biochemical identities of 106 unknown metabolites. As a proof of principle, we experimentally confirm nine concrete predictions. We demonstrate the benefit of our method for the functional interpretation of previous metabolomics biomarker studies on liver detoxification, hypertension, and insulin resistance. Our approach is generic in nature and can be directly transferred to metabolomics data from different experimental platforms.
DOI: 10.1371/journal.pone.0022649
2011
Cited 151 times
Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network
Hematopoiesis is an ideal model system for stem cell biology with advanced experimental access. A systems view on the interactions of core transcription factors is important for understanding differentiation mechanisms and dynamics. In this manuscript, we construct a Boolean network to model myeloid differentiation, specifically from common myeloid progenitors to megakaryocytes, erythrocytes, granulocytes and monocytes. By interpreting the hematopoietic literature and translating experimental evidence into Boolean rules, we implement binary dynamics on the resulting 11-factor regulatory network. Our network contains interesting functional modules and a concatenation of mutual antagonistic pairs. The state space of our model is a hierarchical, acyclic graph, typifying the principles of myeloid differentiation. We observe excellent agreement between the steady states of our model and microarray expression profiles of two different studies. Moreover, perturbations of the network topology correctly reproduce reported knockout phenotypes in silico. We predict previously uncharacterized regulatory interactions and alterations of the differentiation process, and line out reprogramming strategies.
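As a rough illustration of what finding the steady states of a Boolean network involves, the sketch below enumerates every state of a small network and keeps those that the update rules map to themselves. It is not the 11-factor model from the paper. The two-node mutual-antagonism network, the node names `A` and `B`, and the function name are hypothetical.

```python
from itertools import product


def steady_states(rules):
    """Return all states that the Boolean update rules leave unchanged."""
    names = sorted(rules)
    fixed = []
    for values in product((False, True), repeat=len(names)):
        state = dict(zip(names, values))
        # Apply every rule to the current state.
        nxt = {name: bool(rules[name](state)) for name in names}
        if nxt == state:
            fixed.append(state)
    return fixed


# Hypothetical mutually antagonistic pair: each factor represses the other.
toy_rules = {
    "A": lambda s: not s["B"],
    "B": lambda s: not s["A"],
}
found = steady_states(toy_rules)
```

Exhaustive enumeration visits 2^n states, which is practical for networks of roughly this size (2^11 = 2,048 states for 11 factors) but not for much larger ones.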
DOI: 10.1016/j.sigpro.2011.10.007
2012
Cited 150 times
The signal separation evaluation campaign (2007–2010): Achievements and remaining challenges
We present the outcomes of three recent evaluation campaigns in the field of audio and biomedical source separation. These campaigns have witnessed a boom in the range of applications of source separation systems in the last few years, as shown by the increasing number of datasets from 1 to 9 and the increasing number of submissions from 15 to 34. We first discuss their impact on the definition of a reference evaluation methodology, together with shared datasets and software. We then present the key results obtained over almost all datasets. We conclude by proposing directions for future research and evaluation, based in particular on the ideas raised during the related panel discussion at the Ninth International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2010).
DOI: 10.1371/journal.pone.0015422
2010
Cited 143 times
The Structure of Borders in a Small World
Territorial subdivisions and geographic borders are essential for understanding phenomena in sociology, political science, history, and economics. They influence the interregional flow of information and cross-border trade and affect the diffusion of innovation and technology. However, it is unclear if existing administrative subdivisions that typically evolved decades ago still reflect the most plausible organizational structure of today. The complexity of modern human communication, the ease of long-distance movement, and increased interaction across political borders complicate the operational definition and assessment of geographic borders that optimally reflect the multi-scale nature of today's human connectivity patterns. What border structures emerge directly from the interplay of scales in human interactions is an open question. Based on a massive proxy dataset, we analyze a multi-scale human mobility network and compute effective geographic borders inherent to human mobility patterns in the United States. We propose two computational techniques for extracting these borders and for quantifying their strength. We find that effective borders only partially overlap with existing administrative borders, and show that some of the strongest mobility borders exist in unexpected regions. We show that the observed structures cannot be generated by gravity models for human traffic. Finally, we introduce the concept of link significance that clarifies the observed structure of effective borders. Our approach represents a novel type of quantitative, comparative analysis framework for spatially embedded multi-scale interaction networks in general and may yield important insight into a multitude of spatiotemporal phenomena generated by human activity.
DOI: 10.1186/1741-7007-6-37
2008
Cited 143 times
Erythropoietin enhances hippocampal long-term potentiation and memory
Background: Erythropoietin (EPO) improves cognition of human subjects in the clinical setting by as yet unknown mechanisms. We developed a mouse model of robust cognitive improvement by EPO to obtain the first clues of how EPO influences cognition, and how it may act on hippocampal neurons to modulate plasticity. Results: We show here that a 3-week treatment of young mice with EPO enhances long-term potentiation (LTP), a cellular correlate of learning processes in the CA1 region of the hippocampus. This treatment concomitantly alters short-term synaptic plasticity and synaptic transmission, shifting the balance of excitatory and inhibitory activity. These effects are accompanied by an improvement of hippocampus-dependent memory, persisting for 3 weeks after termination of EPO injections, and are independent of changes in hematocrit. Networks of EPO-treated primary hippocampal neurons develop lower overall spiking activity but enhanced bursting in discrete neuronal assemblies. At the level of developing single neurons, EPO treatment reduces the typical increase in excitatory synaptic transmission without changing the number of synaptic boutons, consistent with prolonged functional silencing of synapses. Conclusion: We conclude that EPO improves hippocampus-dependent memory by modulating plasticity, synaptic connectivity and activity of memory-related neuronal networks. These mechanisms of action of EPO have to be further exploited for treating neuropsychiatric diseases.
DOI: 10.1093/bioinformatics/btv257
2015
Cited 141 times
Reconstructing gene regulatory dynamics from high-dimensional single-cell snapshot data
Motivation: High-dimensional single-cell snapshot data are becoming widespread in the systems biology community, as a means to understand biological processes at the cellular level. However, as temporal information is lost with such data, mathematical models have been limited to capturing only static features of the underlying cellular mechanisms. Results: Here, we present a modular framework that makes it possible to recover the temporal behaviour from single-cell snapshot data and reverse engineer the dynamics of gene expression. The framework combines a dimensionality reduction method with a cell time-ordering algorithm to generate pseudo time-series observations. These are in turn used to learn transcriptional ODE models and do model selection on structural network features. We apply it to synthetic data and then to real hematopoietic stem cell data, to reconstruct gene expression dynamics during differentiation pathways and infer the structure of a key gene regulatory network. Availability and implementation: C++ and Matlab code available at https://www.helmholtz-muenchen.de/fileadmin/ICB/software/inferenceSnapshot.zip. Contact: fabian.theis@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online.
DOI: 10.1371/journal.pcbi.1005331
2017
Cited 135 times
Scalable Parameter Estimation for Genome-Scale Biochemical Reaction Networks
Mechanistic mathematical modeling of biochemical reaction networks using ordinary differential equation (ODE) models has improved our understanding of small- and medium-scale biological processes. While the same should in principle hold for large- and genome-scale processes, the computational methods for the analysis of ODE models which describe hundreds or thousands of biochemical species and reactions are missing so far. While individual simulations are feasible, the inference of the model parameters from experimental data is computationally too intensive. In this manuscript, we evaluate adjoint sensitivity analysis for parameter estimation in large scale biochemical reaction networks. We present the approach for time-discrete measurement and compare it to state-of-the-art methods used in systems and computational biology. Our comparison reveals a significantly improved computational efficiency and a superior scalability of adjoint sensitivity analysis. The computational complexity is effectively independent of the number of parameters, enabling the analysis of large- and genome-scale models. Our study of a comprehensive kinetic model of ErbB signaling shows that parameter estimation using adjoint sensitivity analysis requires a fraction of the computation time of established methods. The proposed method will facilitate mechanistic modeling of genome-scale cellular processes, as required in the age of omics.
DOI: 10.1038/ncb3237
2015
Cited 132 times
Network plasticity of pluripotency transcription factors in embryonic stem cells
DOI: 10.1186/s12943-015-0401-6
2015
Cited 131 times
Erratum to: Next-generation sequencing reveals novel differentially regulated mRNAs, lncRNAs, miRNAs, sdRNAs and a piRNA in pancreatic cancer
Previous studies identified microRNAs (miRNAs) and messenger RNAs with significantly different expression between normal pancreas and pancreatic cancer (PDAC) tissues. Due to technological limitations of microarrays and real-time PCR systems these studies focused on a fixed set of targets. Expression of other RNA classes such as long intergenic non-coding RNAs or sno-derived RNAs has rarely been examined in pancreatic cancer. Here, we analysed the coding and non-coding transcriptome of six PDAC and five control tissues using next-generation sequencing. Besides the confirmation of several deregulated mRNAs and miRNAs, miRNAs without previous implication in PDAC were detected: miR-802, miR-2114 or miR-561. SnoRNA-derived RNAs (e.g. sno-HBII-296B) and piR-017061, a piwi-interacting RNA, were found to be differentially expressed between PDAC and control tissues. In silico target analysis of miR-802 revealed potential binding sites in the 3′ UTR of TCF4, encoding a transcription factor that controls Wnt signalling genes. Overexpression of miR-802 in MiaPaCa pancreatic cancer cells reduced TCF4 protein levels. Using Massive Analysis of cDNA Ends (MACE) we identified differential expression of 43 lincRNAs, long intergenic non-coding RNAs, e.g. LINC00261 and LINC00152 as well as several natural antisense transcripts like HNF1A-AS1 and AFAP1-AS1. Differential expression was confirmed by qPCR on the mRNA/miRNA/lincRNA level and by immunohistochemistry on the protein level. Here, we report a novel lncRNA, sncRNA and mRNA signature of PDAC. In silico prediction of ncRNA targets allowed for assigning potential functions to differentially regulated RNAs.
DOI: 10.15252/embr.201439937
2015
Cited 128 times
Atrx promotes heterochromatin formation at retrotransposons
More than 50% of mammalian genomes consist of retrotransposon sequences. Silencing of retrotransposons by heterochromatin is essential to ensure genomic stability and transcriptional integrity. Here, we identified a short sequence element in intracisternal A particle (IAP) retrotransposons that is sufficient to trigger heterochromatin formation. We used this sequence in a genome-wide shRNA screen and identified the chromatin remodeler Atrx as a novel regulator of IAP silencing. Atrx binds to IAP elements and is necessary for efficient heterochromatin formation. In addition, Atrx facilitates a robust and largely inaccessible heterochromatin structure as Atrx knockout cells display increased chromatin accessibility at retrotransposons and non-repetitive heterochromatic loci. In summary, we demonstrate a direct role of Atrx in the establishment and robust maintenance of heterochromatin.
DOI: 10.1186/1471-2164-11-224
2010
Cited 125 times
Intronic microRNAs support their host genes by mediating synergistic and antagonistic regulatory effects
MicroRNA-mediated control of gene expression via translational inhibition has substantial impact on cellular regulatory mechanisms. About 37% of mammalian microRNAs appear to be located within introns of protein coding genes, linking their expression to the promoter-driven regulation of the host gene. In our study we investigate this linkage towards a relationship beyond transcriptional co-regulation. Using measures based on both annotation and experimental data, we show that intronic microRNAs tend to support their host genes by regulation of target gene expression with significantly correlated expression patterns. We used expression data of three differentiating cell types and compared gene expression profiles of host and target genes. Many microRNA target genes show expression patterns significantly correlated with the expressions of the microRNA host genes. By calculating functional similarities between host and predicted microRNA target genes based on GO annotations, we confirm that many microRNAs link host and target gene activity in an either synergistic or antagonistic manner. These two regulatory effects may result from fine tuning of target gene expression functionally related to the host or knock-down of remaining opponent target gene expression. This finding allows extending the common practice of mapping large scale gene expression data to protein associated genes with functionality of co-expressed intronic microRNAs.
DOI: 10.1186/1471-2105-14-297
2013
Cited 124 times
An automatic method for robust and fast cell detection in bright field images from high-throughput microscopy
In recent years, high-throughput microscopy has emerged as a powerful tool to analyze cellular dynamics in an unprecedentedly high resolved manner. The amount of data that is generated, for example in long-term time-lapse microscopy experiments, requires automated methods for processing and analysis. Available software frameworks are well suited for high-throughput processing of fluorescence images, but they often do not perform well on bright field image data that varies considerably between laboratories, setups, and even single experiments. In this contribution, we present a fully automated image processing pipeline that is able to robustly segment and analyze cells with ellipsoid morphology from bright field microscopy in a high-throughput, yet time efficient manner. The pipeline comprises two steps: (i) Image acquisition is adjusted to obtain optimal bright field image quality for automatic processing. (ii) A concatenation of fast performing image processing algorithms robustly identifies single cells in each image. We applied the method to a time-lapse movie consisting of ∼315,000 images of differentiating hematopoietic stem cells over 6 days. We evaluated the accuracy of our method by comparing the number of identified cells with manual counts. Our method is able to segment images with varying cell density and different cell types without parameter adjustment and clearly outperforms a standard approach. By computing population doubling times, we were able to identify three growth phases in the stem cell population throughout the whole movie, and validated our result with cell cycle times from single cell tracking. Our method allows fully automated processing and analysis of high-throughput bright field microscopy data. The robustness of cell detection and fast computation time will support the analysis of high-content screening experiments, on-line analysis of time-lapse experiments as well as development of methods to automatically track single-cell genealogies.
DOI: 10.1103/physreve.83.046127
2011
Cited 118 times
Vertex centralities in input-output networks reveal the structure of modern economies
Input-output tables describe the flows of goods and services between the sectors of an economy. These tables can be interpreted as weighted directed networks. At the usual level of aggregation, they contain nodes with strong self-loops and are almost completely connected. We derive two measures of node centrality that are well suited for such networks. Both are based on random walks and have interpretations as the propagation of supply shocks through the economy. Random walk centrality reveals the vertices most immediately affected by a shock. Counting betweenness identifies the nodes where a shock lingers longest. The two measures differ in how they treat self-loops. We apply both to data from a wide set of countries and uncover salient characteristics of the structures of these national economies. We further validate our indices by clustering according to sectors' centralities. This analysis reveals geographical proximity and similar developmental status.
DOI: 10.3892/ijo.2017.3834
2017
Cited 118 times
AURKA, DLGAP5, TPX2, KIF11 and CKAP5: Five specific mitosis-associated genes correlate with poor prognosis for non-small cell lung cancer patients
The growth of a tumor depends to a certain extent on an increase in mitotic events. Key steps during mitosis are the regulated assembly of the spindle apparatus and the separation of the sister chromatids. The microtubule-associated protein Aurora kinase A phosphorylates DLGAP5 in order to correctly segregate the chromatids. Its activity and recruitment to the spindle apparatus is regulated by TPX2. KIF11 and CKAP5 control the correct arrangement of the microtubules and prevent their degradation. In the present study, we investigated the role of these five molecules in non-small cell lung cancer (NSCLC). We analyzed the expression of the five genes in a large cohort of NSCLC patients (n=362) by quantitative real-time PCR. Each of the genes was highly overexpressed in the tumor tissues compared to corresponding normal lung tissue. The correlation of the expression of the individual genes depended on the histology. An increased expression of AURKA, DLGAP5, TPX2, KIF11 and CKAP5 was associated with poor overall survival (P=0.001-0.065). AURKA was a significant prognostic marker using multivariate analyses (P=0.006). Immunofluorescence studies demonstrated that the five mitosis-associated proteins co-localized with the spindle apparatus during cell division. Taken together, our data demonstrate that the expression of the mitosis-associated genes AURKA, DLGAP5, TPX2, KIF11 and CKAP5 is associated with the prognosis of NSCLC patients.
DOI: 10.1016/j.bpj.2008.10.068
2009
Cited 115 times
Blind Source Separation Techniques for the Decomposition of Multiply Labeled Fluorescence Images
Methods of blind source separation are used in many contexts to separate composite data sets according to their sources. Multiply labeled fluorescence microscopy images represent such sets, in which the sources are the individual labels. Their distributions are the quantities of interest and have to be extracted from the images. This is often challenging, since the recorded emission spectra of fluorescent dyes are environment- and instrument-specific. We have developed a nonnegative matrix factorization (NMF) algorithm to detect and separate spectrally distinct components of multiply labeled fluorescence images. It operates on spectrally resolved images and delivers both the emission spectra of the identified components and images of their abundance. We tested the proposed method using biological samples labeled with up to four spectrally overlapping fluorescent labels. In most cases, NMF accurately decomposed the images into contributions of individual dyes. However, the solutions are not unique when spectra overlap strongly or when images are diffuse in their structure. To arrive at satisfactory results in such cases, we extended NMF to incorporate preexisting qualitative knowledge about spectra and label distributions. We show how data acquired through excitations at two or three different wavelengths can be integrated and that multiple excitations greatly facilitate the decomposition. By allowing reliable decomposition in cases where the spectra of the individual labels are not known or are known only inaccurately, the proposed algorithms greatly extend the range of questions that can be addressed with quantitative microscopy.
DOI: 10.1186/1471-2105-11-233
2010
Cited 112 times
Odefy - From discrete to continuous models
Phenomenological information about regulatory interactions is frequently available and can be readily converted to Boolean models. Fully quantitative models, on the other hand, provide detailed insights into the precise dynamics of the underlying system. In order to connect discrete and continuous modeling approaches, methods for the conversion of Boolean systems into systems of ordinary differential equations have been developed recently. As biological interaction networks have steadily grown in size and complexity, a fully automated framework for the conversion process is desirable. We present Odefy, a MATLAB- and Octave-compatible toolbox for the automated transformation of Boolean models into systems of ordinary differential equations. Models can be created from sets of Boolean equations or graph representations of Boolean networks. Alternatively, the user can import Boolean models from the CellNetAnalyzer toolbox, GINSim and the PBN toolbox. The Boolean models are transformed to systems of ordinary differential equations by multivariate polynomial interpolation and optional application of sigmoidal Hill functions. Our toolbox contains basic simulation and visualization functionalities for both the Boolean and the continuous models. For further analyses, models can be exported to SQUAD, GNA, MATLAB script files, the SB toolbox, SBML and R script files. Odefy contains a user-friendly graphical user interface for convenient access to the simulation and exporting functionalities. We illustrate the validity of our transformation approach as well as the usage and benefit of the Odefy toolbox for two biological systems: a mutual inhibitory switch known from stem cell differentiation and a regulatory network giving rise to a specific spatial expression pattern at the mid-hindbrain boundary. Odefy provides an easy-to-use toolbox for the automatic conversion of Boolean models to systems of ordinary differential equations. It can be efficiently connected to a variety of input and output formats for further analysis and investigations. The toolbox is open-source and can be downloaded at http://cmb.helmholtz-muenchen.de/odefy.
DOI: 10.1242/dev.123554
2015
Cited 112 times
Quantification of regenerative potential in primary human mammary epithelial cells
We present an organoid regeneration assay in which freshly isolated human mammary epithelial cells are cultured in adherent or floating collagen gels, corresponding to a rigid or compliant matrix environment. In both conditions, luminal progenitors form spheres, whereas basal cells generate branched ductal structures. In compliant but not rigid collagen gels, branching ducts form alveoli at their tips, express basal and luminal markers at correct positions, and display contractility, which is required for alveologenesis. Thereby, branched structures generated in compliant collagen gels resemble terminal ductal-lobular units (TDLUs), the functional units of the mammary gland. Using the membrane metallo-endopeptidase CD10 as a surface marker enriches for TDLU formation and reveals the presence of stromal cells within the CD49f(hi)/EpCAM(-) population. In summary, we describe a defined in vitro assay system to quantify cells with regenerative potential and systematically investigate their interaction with the physical environment at distinct steps of morphogenesis.
DOI: 10.1186/1741-7015-11-60
2013
Cited 112 times
Effects of smoking and smoking cessation on human serum metabolite profile: results from the KORA cohort study
Metabolomics helps to identify links between environmental exposures and intermediate biomarkers of disturbed pathways. We previously reported variations in phosphatidylcholines in male smokers compared with non-smokers in a cross-sectional pilot study with a small sample size, but knowledge of the reversibility of smoking effects on metabolite profiles is limited. Here, we extend our metabolomics study with a large prospective study including female smokers and quitters. Using a targeted metabolomics approach, we quantified 140 metabolite concentrations for 1,241 fasting serum samples in the population-based Cooperative Health Research in the Region of Augsburg (KORA) human cohort at two time points: baseline survey conducted between 1999 and 2001 and follow-up after seven years. Metabolite profiles were compared among groups of current smokers, former smokers and never smokers, and were further assessed for their reversibility after smoking cessation. Changes in metabolite concentrations from baseline to the follow-up were investigated in a longitudinal analysis comparing current smokers, never smokers and smoking quitters, who were current smokers at baseline but former smokers by the time of follow-up. In addition, we constructed protein-metabolite networks with smoking-related genes and metabolites. We identified 21 smoking-related metabolites in the baseline investigation (18 in men and six in women, with three overlaps) enriched in amino acid and lipid pathways, which were significantly different between current smokers and never smokers. Moreover, 19 out of the 21 metabolites were found to be reversible in former smokers. In the follow-up study, 13 reversible metabolites in men were measured, of which 10 were confirmed to be reversible in male quitters. Protein-metabolite networks are proposed to explain the consistent reversibility of smoking effects on metabolites. We showed that smoking-related changes in human serum metabolites are reversible after smoking cessation, consistent with the known cardiovascular risk reduction. The metabolites identified may serve as potential biomarkers to evaluate the status of smoking cessation and characterize smoking-related diseases.
DOI: 10.1007/s00125-014-3362-1
2014
Cited 111 times
Feature ranking of type 1 diabetes susceptibility genes improves prediction of type 1 diabetes
More than 40 regions of the human genome confer susceptibility for type 1 diabetes and could be used to establish population screening strategies. The aim of our study was to identify weighted sets of SNP combinations for type 1 diabetes prediction. We applied multivariable logistic regression and Bayesian feature selection to the Type 1 Diabetes Genetics Consortium (T1DGC) dataset with genotyping of HLA plus 40 SNPs within other type 1 diabetes-associated gene regions in 4,574 cases and 1,207 controls. We tested the weighted models in an independent validation set (765 cases, 423 controls), and assessed their performance in 1,772 prospectively followed children. The inclusion of 40 non-HLA gene SNPs significantly improved the prediction of type 1 diabetes over that provided by HLA alone (p = 3.1 × 10^-25), with a receiver operating characteristic AUC of 0.87 in the T1DGC set, and 0.84 in the validation set. Feature selection identified HLA plus nine SNPs from the PTPN22, INS, IL2RA, ERBB3, ORMDL3, BACH2, IL27, GLIS3 and RNLS genes that could achieve similar prediction accuracy as the total SNP set. Application of this ten SNP model to prospectively followed children was able to improve risk stratification over that achieved by HLA genotype alone. We provided a weighted risk model with selected SNPs that could be considered for recruitment of infants into studies of early type 1 diabetes natural history or appropriately safe prevention.
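The receiver operating characteristic AUC quoted above can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The following is a minimal sketch of that interpretation only; the function name and the toy scores are hypothetical and are not taken from the study, which used its own weighted SNP models.

```python
def roc_auc(case_scores, control_scores):
    """Fraction of (case, control) pairs in which the case scores higher.

    Ties count as half a win. This pairwise form is equivalent to the
    area under the ROC curve, but it is O(n*m) and meant for illustration.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical risk scores, for illustration only.
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2]
auc = roc_auc(cases, controls)  # 8 of the 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.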
DOI: 10.1098/rsta.2011.0544
2013
Cited 105 times
Joining forces of Bayesian and frequentist methodology: a study for inference in the presence of non-identifiability
Increasingly complex applications involve large datasets in combination with nonlinear and high-dimensional mathematical models. In this context, statistical inference is a challenging issue that calls for pragmatic approaches that take advantage of both Bayesian and frequentist methods. The elegance of Bayesian methodology is founded in the propagation of information content provided by experimental data and prior assumptions to the posterior probability distribution of model predictions. However, for complex applications, experimental data and prior assumptions potentially constrain the posterior probability distribution insufficiently. In these situations, Bayesian Markov chain Monte Carlo sampling can be infeasible. From a frequentist point of view, insufficient experimental data and prior assumptions can be interpreted as non-identifiability. The profile-likelihood approach offers to detect and to resolve non-identifiability by experimental design iteratively. Therefore, it allows one to better constrain the posterior probability distribution until Markov chain Monte Carlo sampling can be used securely. Using an application from cell biology, we compare both methods and show that a successive application of the two methods facilitates a realistic assessment of uncertainty in model predictions.
DOI: 10.1186/1471-2105-13-120
2012
Cited 102 times
On the hypothesis-free testing of metabolite ratios in genome-wide and metabolome-wide association studies
Genome-wide association studies (GWAS) with metabolic traits and metabolome-wide association studies (MWAS) with traits of biomedical relevance are powerful tools to identify the contribution of genetic, environmental and lifestyle factors to the etiology of complex diseases. Hypothesis-free testing of ratios between all possible metabolite pairs in GWAS and MWAS has proven to be an innovative approach in the discovery of new biologically meaningful associations. The p-gain statistic was introduced as an ad-hoc measure to determine whether a ratio between two metabolite concentrations carries more information than the two corresponding metabolite concentrations alone. So far, only a rule of thumb was applied to determine the significance of the p-gain. Here we explore the statistical properties of the p-gain through simulation of its density and by sampling of experimental data. We derive critical values of the p-gain for different levels of correlation between metabolite pairs and show that B/(2*α) is a conservative critical value for the p-gain, where α is the level of significance and B the number of tested metabolite pairs. We show that the p-gain is a well-defined measure that can be used to identify statistically significant metabolite ratios in association studies and provide a conservative significance cut-off for the p-gain for use in future association studies with metabolic traits.
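The cut-off described above can be sketched in a few lines. The abstract does not spell out the p-gain formula; the sketch assumes the commonly used definition, in which the smaller of the two single-metabolite p-values is divided by the p-value of the ratio. All function names and the example p-values are hypothetical.

```python
def p_gain(p_a: float, p_b: float, p_ratio: float) -> float:
    """Assumed definition: min single-metabolite p-value / ratio p-value."""
    return min(p_a, p_b) / p_ratio


def critical_p_gain(n_pairs: int, alpha: float = 0.05) -> float:
    """Conservative critical value B / (2 * alpha) from the abstract."""
    return n_pairs / (2 * alpha)


def ratio_is_significant(p_a, p_b, p_ratio, n_pairs, alpha=0.05) -> bool:
    """True if the ratio's p-gain exceeds the conservative cut-off."""
    return p_gain(p_a, p_b, p_ratio) > critical_p_gain(n_pairs, alpha)


# Hypothetical example: 100 tested metabolite pairs at alpha = 0.05.
threshold = critical_p_gain(100, 0.05)                 # 100 / 0.1
gain = p_gain(1e-5, 1e-3, 1e-12)                       # about 1e7
significant = ratio_is_significant(1e-5, 1e-3, 1e-12, 100)
```

With these illustrative numbers the p-gain is far above the cut-off, so the ratio would be reported as carrying more information than either metabolite alone.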
DOI: 10.1016/j.molmet.2017.06.021
2017
Cited 99 times
Systematic single-cell analysis provides new insights into heterogeneity and plasticity of the pancreas
Diabetes mellitus is characterized by loss or dysfunction of insulin-producing β-cells in the pancreas, resulting in failure of blood glucose regulation and devastating secondary complications. Thus, β-cells are currently the prime target for cell-replacement and regenerative therapy. Triggering endogenous repair is a promising strategy to restore β-cell mass and normoglycemia in diabetic patients. Potential strategies include targeting specific β-cell subpopulations to increase proliferation or maturation. Alternatively, transdifferentiation of pancreatic islet cells (e.g. α- or δ-cells), extra-islet cells (acinar and ductal cells), hepatocytes, or intestinal cells into insulin-producing cells might improve glycemic control. To this end, it is crucial to systematically characterize and unravel the transcriptional program of all pancreatic cell types at the molecular level in homeostasis and disease. Furthermore, it is necessary to better determine the underlying mechanisms of β-cell maturation, maintenance, and dysfunction in diabetes, to identify and molecularly profile endocrine subpopulations with regenerative potential, and to translate the findings from mice to man. Recent approaches in single-cell biology started to illuminate heterogeneity and plasticity in the pancreas that might be targeted for β-cell regeneration in diabetic patients. This review discusses recent literature on single-cell analysis including single-cell RNA sequencing, single-cell mass cytometry, and flow cytometry of pancreatic cell types in the context of mechanisms of endogenous β-cell regeneration. We discuss new findings on the regulation of postnatal β-cell proliferation and maturation. We highlight how single-cell analysis recapitulates described principles of functional β-cell heterogeneity in animal models and adds new knowledge on the extent of β-cell heterogeneity in humans as well as its role in homeostasis and disease.
Furthermore, we summarize the findings on cell subpopulations with regenerative potential that might enable the formation of new β-cells in the diseased state. Finally, we review new data on the transcriptional program and function of rare pancreatic cell types and their implication in diabetes. Novel single-cell technologies offer high molecular resolution of cellular heterogeneity within the pancreas and provide information on processes and factors that govern β-cell homeostasis, proliferation, and maturation. Eventually, these technologies might lead to the characterization of cells with regenerative potential and unravel disease-associated changes in gene expression to identify cellular and molecular targets for therapy.
DOI: 10.1007/s00285-013-0711-5
2013
Cited 98 times
Method of conditional moments (MCM) for the Chemical Master Equation
DOI: 10.1371/journal.pone.0040009
2012
Cited 97 times
Body Fat Free Mass Is Associated with the Serum Metabolite Profile in a Population-Based Study
To characterise the influence of the fat free mass on the metabolite profile in serum samples from participants of the population-based KORA (Cooperative Health Research in the Region of Augsburg) S4 study. Analyses were based on metabolite profiles from 965 participants of the S4 and 890 weight-stable subjects of its seven-year follow-up study (KORA F4). 190 different serum metabolites were quantified in a targeted approach including amino acids, acylcarnitines, phosphatidylcholines (PCs), sphingomyelins and hexose. Associations between metabolite concentrations and the fat free mass index (FFMI) were analysed using adjusted linear regression models. To draw conclusions on enzymatic reactions, intra-metabolite class ratios were explored. Pairwise relationships among metabolites were investigated and illustrated by means of Gaussian graphical models (GGMs). We found 339 significant associations between FFMI and various metabolites in KORA S4. Among the most prominent associations (p-values 4.75 × 10^-16 to 8.95 × 10^-6) with higher FFMI were increasing concentrations of the branched-chain amino acids (BCAAs), ratios of BCAAs to glucogenic amino acids, and carnitine concentrations. For various PCs, a decrease in chain length or in saturation of the fatty acid moieties could be observed with increasing FFMI, as well as an overall shift from acyl-alkyl PCs to diacyl PCs. These findings were reproduced in KORA F4. The established GGMs supported the regression results and provided a comprehensive picture of the relationships between metabolites. In a sub-analysis, most of the discovered associations did not exist in obese subjects in contrast to non-obese subjects, possibly indicating derangements in skeletal muscle metabolism. A set of serum metabolites strongly associated with FFMI was identified and a network explaining the relationships among metabolites was established.
These results offer a novel and more complete picture of the FFMI effects on serum metabolites in a data-driven network.
DOI: 10.1101/2020.05.22.111161
2020
Cited 96 times
Benchmarking atlas-level data integration in single-cell genomics
Cell atlases often include samples that span locations, labs, and conditions, leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. Choosing a data integration method is a challenge due to the difficulty of defining integration success. Here, we benchmark 38 method and preprocessing combinations on 77 batches of gene expression, chromatin accessibility, and simulation data from 23 publications, altogether representing >1.2 million cells distributed in nine atlas-level integration tasks. Our integration tasks span several common sources of variation such as individuals, species, and experimental labs. We evaluate methods according to scalability, usability, and their ability to remove batch effects while retaining biological variation. Using 14 evaluation metrics, we find that highly variable gene selection improves the performance of data integration methods, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, BBKNN, Scanorama, and scVI perform well, particularly on complex integration tasks; Seurat v3 performs well on simpler tasks with distinct biological signals; and methods that prioritize batch removal perform best for ATAC-seq data integration. Our freely available reproducible Python module can be used to identify optimal data integration methods for new data, benchmark new methods, and improve method development.
DOI: 10.1038/s41587-019-0088-0
2019
Cited 92 times
Inferring population dynamics from single-cell RNA-sequencing time series data
Recent single-cell RNA-sequencing studies have suggested that cells follow continuous transcriptomic trajectories in an asynchronous fashion during development. However, observations of cell flux along trajectories are confounded with population size effects in snapshot experiments and are therefore hard to interpret. In particular, changes in proliferation and death rates can be mistaken for cell flux. Here we present pseudodynamics, a mathematical framework that reconciles population dynamics with the concepts underlying developmental trajectories inferred from time-series single-cell data. Pseudodynamics models population distribution shifts across trajectories to quantify selection pressure, population expansion, and developmental potentials. Applying this model to time-resolved single-cell RNA-sequencing of T-cell and pancreatic beta cell maturation, we characterize proliferation and apoptosis rates and identify key developmental checkpoints, data inaccessible to existing approaches. A new computational method allows key developmental checkpoints and important parameters of population dynamics to be inferred from single-cell RNA-sequencing time series data.
DOI: 10.1371/journal.pgen.1005274
2015
Cited 92 times
The Human Blood Metabolome-Transcriptome Interface
Biological systems consist of multiple organizational levels all densely interacting with each other to ensure function and flexibility of the system. Simultaneous analysis of cross-sectional multi-omics data from large population studies is a powerful tool to comprehensively characterize the underlying molecular mechanisms on a physiological scale. In this study, we systematically analyzed the relationship between fasting serum metabolomics and whole blood transcriptomics data from 712 individuals of the German KORA F4 cohort. Correlation-based analysis identified 1,109 significant associations between 522 transcripts and 114 metabolites summarized in an integrated network, the 'human blood metabolome-transcriptome interface' (BMTI). Bidirectional causality analysis using Mendelian randomization did not yield any statistically significant causal associations between transcripts and metabolites. A knowledge-based interpretation and integration with a genome-scale human metabolic reconstruction revealed systematic signatures of signaling, transport and metabolic processes, i.e. metabolic reactions mainly belonging to lipid, energy and amino acid metabolism. Moreover, the construction of a network based on functional categories illustrated the cross-talk between the biological layers at a pathway level. Using a transcription factor binding site enrichment analysis, this pathway cross-talk was further confirmed at a regulatory level. Finally, we demonstrated how the constructed networks can be used to gain novel insights into molecular mechanisms associated with intermediate clinical traits. Overall, our results demonstrate the utility of a multi-omics integrative approach to understand the molecular mechanisms underlying both normal physiology and disease.
DOI: 10.1038/mi.2015.110
2016
Cited 90 times
Interleukin-4 and interferon-γ orchestrate an epithelial polarization in the airways
Interferon-γ (IFN-γ) and interleukin-4 (IL-4) are key effector cytokines for the differentiation of T helper type 1 and 2 (Th1 and Th2) cells. Both cytokines induce fate-decisive transcription factors such as GATA3 and TBX21 that antagonize the polarized development of opposite phenotypes by direct regulation of each other's expression along with many other target genes. Although it is well established that mesenchymal cells directly respond to Th1 and Th2 cytokines, the nature of antagonistic differentiation programs in airway epithelial cells is only partially understood. In this study, primary normal human bronchial epithelial cells (NHBEs) were exposed to IL-4, IFN-γ, or both and genome-wide transcriptome analysis was performed. The study uncovers an antagonistic regulation pattern of IL-4 and IFN-γ in NHBEs, translating the Th1/Th2 antagonism directly in epithelial gene regulation. IL-4- and IFN-γ-induced transcription factor hubs form clusters, present in antagonistically and polarized gene regulation networks. Furthermore, the IL-4-dependent induction of IL-24 observed in rhinitis patients was downregulated by IFN-γ, and therefore IL-24 represents a potential biomarker of allergic inflammation and a Th2 polarized condition of the epithelium.
DOI: 10.1038/s41467-019-10461-0
2019
Cited 89 times
Integrated analysis of environmental and genetic influences on cord blood DNA methylation in new-borns
Epigenetic processes, including DNA methylation (DNAm), are among the mechanisms allowing integration of genetic and environmental factors to shape cellular function. While many studies have investigated either environmental or genetic contributions to DNAm, few have assessed their integrated effects. Here we examine the relative contributions of prenatal environmental factors and genotype on DNA methylation in neonatal blood at variably methylated regions (VMRs) in 4 independent cohorts (overall n = 2365). We use Akaike's information criterion to test which factors best explain variability of methylation in the cohort-specific VMRs: several prenatal environmental factors (E), genotypes in cis (G), or their additive (G + E) or interaction (GxE) effects. Genetic and environmental factors in combination best explain DNAm at the majority of VMRs. The CpGs best explained by either G, G + E or GxE are functionally distinct. The enrichment of genetic variants from GxE models in GWAS for complex disorders supports their importance for disease risk.
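The model comparison described above (G, E, G + E and GxE ranked by Akaike's information criterion) can be illustrated on simulated data. This is a minimal sketch under stated assumptions, not the study's pipeline: it uses ordinary least squares with a Gaussian likelihood, a single genotype and a single environmental variable, and a GxE model that includes both main effects plus their product. All names and the simulated values are hypothetical.

```python
import numpy as np


def gaussian_aic(y, X):
    """AIC of an ordinary least-squares fit, up to an additive constant."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    # k regression coefficients plus one parameter for the noise variance.
    return n * np.log(rss / n) + 2 * (k + 1)


def compare_models(y, g, e):
    """Return the AIC of the G, E, G+E and GxE models for one site."""
    ones = np.ones_like(y)
    designs = {
        "G": np.column_stack([ones, g]),
        "E": np.column_stack([ones, e]),
        "G+E": np.column_stack([ones, g, e]),
        "GxE": np.column_stack([ones, g, e, g * e]),
    }
    return {name: gaussian_aic(y, X) for name, X in designs.items()}


# Simulated example: methylation depends additively on genotype and exposure.
rng = np.random.default_rng(0)
n = 500
g = rng.integers(0, 3, size=n).astype(float)   # genotype coded 0/1/2
e = rng.normal(size=n)                         # standardised exposure
y = 0.5 + 1.0 * g + 1.0 * e + rng.normal(scale=0.5, size=n)

aics = compare_models(y, g, e)
best = min(aics, key=aics.get)                 # model with the lowest AIC
```

Lower AIC indicates a better trade-off between fit and model complexity, so the extra interaction term in GxE is only favoured when it reduces the residual variance enough to offset its penalty.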
DOI: 10.1038/s41586-020-2882-8
2020
Cited 87 times
Inhibition of LTβR signalling activates WNT-induced regeneration in lung
Lymphotoxin β-receptor (LTβR) signalling promotes lymphoid neogenesis and the development of tertiary lymphoid structures [1,2], which are associated with severe chronic inflammatory diseases that span several organ systems [3–6]. How LTβR signalling drives chronic tissue damage particularly in the lung, the mechanism(s) that regulate this process, and whether LTβR blockade might be of therapeutic value have remained unclear. Here we demonstrate increased expression of LTβR ligands in adaptive and innate immune cells, enhanced non-canonical NF-κB signalling, and enriched LTβR target gene expression in lung epithelial cells from patients with smoking-associated chronic obstructive pulmonary disease (COPD) and from mice chronically exposed to cigarette smoke. Therapeutic inhibition of LTβR signalling in young and aged mice disrupted smoking-related inducible bronchus-associated lymphoid tissue, induced regeneration of lung tissue, and reverted airway fibrosis and systemic muscle wasting. Mechanistically, blockade of LTβR signalling dampened epithelial non-canonical activation of NF-κB, reduced TGFβ signalling in airways, and induced regeneration by preventing epithelial cell death and activating WNT/β-catenin signalling in alveolar epithelial progenitor cells. These findings suggest that inhibition of LTβR signalling represents a viable therapeutic option that combines prevention of tertiary lymphoid structures [1] and inhibition of apoptosis with tissue-regenerative strategies. Blockade of lymphotoxin β-receptor (LTβR) signalling restores WNT signalling and epithelial repair in a model of chronic obstructive pulmonary disease.
DOI: 10.1371/journal.pcbi.1005030
2016
Cited 82 times
Inference for Stochastic Chemical Kinetics Using Moment Equations and System Size Expansion
Quantitative mechanistic models are valuable tools for disentangling biochemical pathways and for achieving a comprehensive understanding of biological systems. However, to be quantitative the parameters of these models have to be estimated from experimental data. In the presence of significant stochastic fluctuations this is a challenging task as stochastic simulations are usually too time-consuming and a macroscopic description using reaction rate equations (RREs) is no longer accurate. In this manuscript, we therefore consider moment-closure approximation (MA) and the system size expansion (SSE), which approximate the statistical moments of stochastic processes and tend to be more precise than macroscopic descriptions. We introduce gradient-based parameter optimization methods and uncertainty analysis methods for MA and SSE. Efficiency and reliability of the methods are assessed using simulation examples as well as by an application to data for Epo-induced JAK/STAT signaling. The application revealed that even if merely population-average data are available, MA and SSE improve parameter identifiability in comparison to RRE. Furthermore, the simulation examples revealed that the resulting estimates are more reliable for an intermediate volume regime. In this regime the estimation error is reduced and we propose methods to determine the regime boundaries. These results illustrate that inference using MA and SSE is feasible and possesses a high sensitivity.
DOI: 10.1101/2020.12.22.423933
2020
Cited 81 times
Ultra-high sensitivity mass spectrometry quantifies single-cell proteome changes upon perturbation
Single-cell technologies are revolutionizing biology but are today mainly limited to imaging and deep sequencing [1–3]. However, proteins are the main drivers of cellular function and in-depth characterization of individual cells by mass spectrometry (MS)-based proteomics would thus be highly valuable and complementary [4,5]. Chemical labeling-based single-cell approaches introduce hundreds of cells into the MS, but direct analysis of single cells has not yet reached the necessary sensitivity, robustness and quantitative accuracy to answer biological questions [6,7]. Here, we develop a robust workflow combining miniaturized sample preparation, very low flow-rate chromatography and a novel trapped ion mobility mass spectrometer, resulting in a more than ten-fold improved sensitivity. We accurately and robustly quantify proteomes and their changes in single, FACS-isolated cells. Arresting cells at defined stages of the cell cycle by drug treatment retrieves expected key regulators such as CDK2NA, the E2 ubiquitin ligase UBE2S, DNA topoisomerases TOP2A/B and the chromatin regulator HMGA1. Furthermore, it highlights potential novel ones and allows cell phase prediction. Comparing the variability in more than 430 single-cell proteomes to transcriptome data revealed a stable core proteome despite perturbation, while the transcriptome appears volatile. This emphasizes substantial regulation of translation and sets the stage for its elucidation at the single cell level. Our technology can readily be applied to ultra-high sensitivity analyses of tissue material [8], posttranslational modifications and small molecule studies to gain unprecedented insights into cellular heterogeneity in health and disease.
DOI: 10.15252/embj.201899518
2018
Cited 76 times
Metabolic regulation of pluripotency and germ cell fate through α‐ketoglutarate
An intricate link is becoming apparent between metabolism and cellular identities. Here, we explore the basis for such a link in an in vitro model for early mouse embryonic development: from naïve pluripotency to the specification of primordial germ cells (PGCs). Using single-cell RNA-seq with statistical modelling and modulation of energy metabolism, we demonstrate a functional role for oxidative mitochondrial metabolism in naïve pluripotency. We link mitochondrial tricarboxylic acid cycle activity to IDH2-mediated production of alpha-ketoglutarate and through it, the activity of key epigenetic regulators. Accordingly, this metabolite has a role in the maintenance of naïve pluripotency as well as in PGC differentiation, likely through preserving a particular histone methylation status underlying the transient state of developmental competence for the PGC fate. We reveal a link between energy metabolism and epigenetic control of cell state transitions during a developmental trajectory towards germ cell specification, and establish a paradigm for stabilizing fleeting cellular states through metabolic modulation.