
Babak Alipanahi

Here are all the papers by Babak Alipanahi that you can download and read on OA.mg.

DOI: 10.1038/nbt.3300
2015
Cited 2,386 times
Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning
Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.
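The 'mutation map' idea in the abstract above can be illustrated with a minimal Python sketch. It assumes a toy position weight matrix as the scoring model in place of DeepBind's trained network; the matrix values and the names `PWM`, `score`, and `mutation_map` are hypothetical and are not part of the DeepBind tool.

```python
# Toy mutation map: score change for every single-nucleotide substitution.
# Assumption: a small position weight matrix (PWM) stands in for the trained model.
BASES = "ACGT"

# Hypothetical 3-position PWM: PWM[i][base] is the score contribution at position i.
PWM = [
    {"A": 1.0, "C": -0.5, "G": -0.5, "T": 0.0},
    {"A": -1.0, "C": 2.0, "G": 0.0, "T": -1.0},
    {"A": 0.0, "C": 0.0, "G": 1.5, "T": -1.5},
]


def score(seq):
    """Best PWM match score over all windows of the sequence."""
    width = len(PWM)
    return max(
        sum(PWM[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def mutation_map(seq):
    """Return {(position, new_base): score change} for each substitution."""
    reference = score(seq)
    changes = {}
    for pos, ref_base in enumerate(seq):
        for base in BASES:
            if base != ref_base:
                mutated = seq[:pos] + base + seq[pos + 1:]
                changes[(pos, base)] = score(mutated) - reference
    return changes
```

Calling `mutation_map("ACG")` returns one entry per possible substitution; negative values mark changes that weaken the predicted binding under this toy model.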
DOI: 10.1126/science.1254806
2015
Cited 1,067 times
The human splicing code reveals new insights into the genetic determinants of disease
Predicting defects in RNA splicing. Most eukaryotic messenger RNAs (mRNAs) are spliced to remove introns. Splicing generates uninterrupted open reading frames that can be translated into proteins. Splicing is often highly regulated, generating alternative spliced forms that code for variant proteins in different tissues. RNA-binding proteins that bind specific sequences in the mRNA regulate splicing. Xiong et al. develop a computational model that predicts splicing regulation for any mRNA sequence (see the Perspective by Guigó and Valcárcel). They use this to analyze more than half a million mRNA splicing sequence variants in the human genome. They are able to identify thousands of known disease-causing mutations, as well as many new disease candidates, including 17 new autism-linked genes.
DOI: 10.1101/gr.177790.114
2014
Cited 542 times
Widespread intron retention in mammals functionally tunes transcriptomes
Alternative splicing (AS) of precursor RNAs is responsible for greatly expanding the regulatory and functional capacity of eukaryotic genomes. Of the different classes of AS, intron retention (IR) is the least well understood. In plants and unicellular eukaryotes, IR is the most common form of AS, whereas in animals, it is thought to represent the least prevalent form. Using high-coverage poly(A)(+) RNA-seq data, we observe that IR is surprisingly frequent in mammals, affecting transcripts from as many as three-quarters of multiexonic genes. A highly correlated set of cis features comprising an "IR code" reliably discriminates retained from constitutively spliced introns. We show that IR acts widely to reduce the levels of transcripts that are less or not required for the physiology of the cell or tissue type in which they are detected. This "transcriptome tuning" function of IR acts through both nonsense-mediated mRNA decay and nuclear sequestration and turnover of IR transcripts. We further show that IR is linked to a cross-talk mechanism involving localized stalling of RNA polymerase II (Pol II) and reduced availability of spliceosomal components. Collectively, the results implicate a global checkpoint-type mechanism whereby reduced recruitment of splicing components coupled to Pol II pausing underlies widespread IR-mediated suppression of inappropriately expressed transcripts.
DOI: 10.1038/nature12270
2013
Cited 290 times
MBNL proteins repress ES-cell-specific alternative splicing and reprogramming
This study identifies MBNL proteins as negative regulators of alternative splicing events that are differentially regulated between ES cells and other cell types; several lines of evidence show that these proteins repress an ES cell alternative splicing program and the reprogramming of somatic cells to induced pluripotent stem cells. Ben Blencowe and colleagues identify the muscleblind-like RNA binding proteins MBNL1 and MBNL2 as negative regulators of alternative splicing events that are differentially regulated between embryonic stem cells and other cell types. Several lines of evidence show that they are involved in the regulation of embryonic-stem-cell-like alternative splicing patterns. The authors also identify a regulatory role during the reprogramming of fibroblasts to induced pluripotent stem (iPS) cells. Previous investigations of the core gene regulatory circuitry that controls the pluripotency of embryonic stem (ES) cells have largely focused on the roles of transcription, chromatin and non-coding RNA regulators. Alternative splicing represents a widely acting mode of gene regulation, yet its role in regulating ES-cell pluripotency and differentiation is poorly understood. Here we identify the muscleblind-like RNA binding proteins, MBNL1 and MBNL2, as conserved and direct negative regulators of a large program of cassette exon alternative splicing events that are differentially regulated between ES cells and other cell types. Knockdown of MBNL proteins in differentiated cells causes switching to an ES-cell-like alternative splicing pattern for approximately half of these events, whereas overexpression of MBNL proteins in ES cells promotes differentiated-cell-like alternative splicing patterns. Among the MBNL-regulated events is an ES-cell-specific alternative splicing switch in the forkhead family transcription factor FOXP1 that controls pluripotency. Consistent with a central and negative regulatory role for MBNL proteins in pluripotency, their knockdown significantly enhances the expression of key pluripotency genes and the formation of induced pluripotent stem cells during somatic cell reprogramming.
DOI: 10.1038/npjgenmed.2015.12
2016
Cited 285 times
Whole-genome sequencing expands diagnostic utility and improves clinical management in paediatric medicine
The standard of care for first-tier clinical investigation of the aetiology of congenital malformations and neurodevelopmental disorders is chromosome microarray analysis (CMA) for copy-number variations (CNVs), often followed by gene(s)-specific sequencing searching for smaller insertion–deletions (indels) and single-nucleotide variant (SNV) mutations. Whole-genome sequencing (WGS) has the potential to capture all classes of genetic variation in one experiment; however, the diagnostic yield for mutation detection of WGS compared to CMA, and other tests, needs to be established. In a prospective study we utilised WGS and comprehensive medical annotation to assess 100 patients referred to a paediatric genetics service and compared the diagnostic yield versus standard genetic testing. WGS identified genetic variants meeting clinical diagnostic criteria in 34% of cases, representing a fourfold increase in diagnostic rate over CMA alone (8%; P value = 1.42E−05) and a more than twofold increase over CMA plus targeted gene sequencing (13%; P value = 0.0009). WGS identified all rare clinically significant CNVs that were detected by CMA. In 26 patients, WGS revealed indel and missense mutations presenting in a dominant (63%) or a recessive (37%) manner. We found four subjects with mutations in at least two genes associated with distinct genetic disorders, including two cases harbouring a pathogenic CNV and SNV. When considering medically actionable secondary findings in addition to primary WGS findings, 38% of patients would benefit from genetic counselling. Clinical implementation of WGS as a primary test will provide a higher diagnostic yield than conventional genetic testing and potentially reduce the time required to reach a genetic diagnosis.
DOI: 10.1109/jproc.2015.2494198
2016
Cited 202 times
Machine Learning in Genomic Medicine: A Review of Computational Problems and Data Sets
In this paper, we provide an introduction to machine learning tasks that address important problems in genomic medicine. One of the goals of genomic medicine is to determine how variations in the DNA of individuals can affect the risk of different diseases, and to find causal explanations so that targeted therapies can be designed. Here we focus on how machine learning can help to model the relationship between DNA and the quantities of key molecules in the cell, with the premise that these quantities, which we refer to as cell variables, may be associated with disease risks. Modern biology allows high-throughput measurement of many such cell variables, including gene expression, splicing, and proteins binding to nucleic acids, which can all be treated as training targets for predictive models. With the growing availability of large-scale data sets and advanced computational techniques such as deep learning, researchers can help to usher in a new era of effective genomic medicine.
DOI: 10.1038/npjgenmed.2016.27
2016
Cited 186 times
Genome-wide characteristics of de novo mutations in autism
De novo mutations (DNMs) are important in Autism Spectrum Disorder (ASD), but so far analyses have mainly been on the ~1.5% of the genome encoding genes. Here, we performed whole genome sequencing (WGS) of 200 ASD parent-child trios and characterized germline and somatic DNMs. We confirmed that the majority of germline DNMs (75.6%) originated from the father, and these increased significantly with paternal age only (p = 4.2 × 10^-10). However, when clustered DNMs (those within 20 kb) were found in ASD, not only did they mostly originate from the mother (p = 7.7 × 10^-13), but they could also be found adjacent to de novo copy number variations (CNVs) where the mutation rate was significantly elevated (p = 2.4 × 10^-24). By comparing with DNMs detected in controls, we found a significant enrichment of predicted damaging DNMs in ASD cases (p = 8.0 × 10^-9; OR = 1.84), of which 15.6% (p = 4.3 × 10^-3) and 22.5% (p = 7.0 × 10^-5) were in non-coding or genic non-coding regions, respectively. The non-coding elements most enriched for DNM were untranslated regions of genes, boundaries involved in exon-skipping and DNase I hypersensitive regions. Using microarrays and a novel outlier detection test, we also found aberrant methylation profiles in 2/185 (1.1%) of ASD cases. These same individuals carried independently identified DNMs in the ASD risk and epigenetic genes DNMT3A and ADNP. Our data begin to characterize different genome-wide DNMs and highlight the contribution of non-coding variants to the etiology of ASD.
DOI: 10.1038/ng.2980
2014
Cited 153 times
Brain-expressed exons under purifying selection are enriched for de novo mutations in autism spectrum disorder
DOI: 10.1038/s41591-020-0893-5
2020
Cited 82 times
The effect of LRRK2 loss-of-function variants in humans
Human genetic variants predicted to cause loss-of-function of protein-coding genes (pLoF variants) provide natural in vivo models of human gene inactivation and can be valuable indicators of gene function and the potential toxicity of therapeutic inhibitors targeting these genes. Gain-of-kinase-function variants in LRRK2 are known to significantly increase the risk of Parkinson's disease, suggesting that inhibition of LRRK2 kinase activity is a promising therapeutic strategy. While preclinical studies in model organisms have raised some on-target toxicity concerns, the biological consequences of LRRK2 inhibition have not been well characterized in humans. Here, we systematically analyze pLoF variants in LRRK2 observed across 141,456 individuals sequenced in the Genome Aggregation Database (gnomAD), 49,960 exome-sequenced individuals from the UK Biobank and over 4 million participants in the 23andMe genotyped dataset. After stringent variant curation, we identify 1,455 individuals with high-confidence pLoF variants in LRRK2. Experimental validation of three variants, combined with previous work, confirmed reduced protein levels in 82.5% of our cohort. We show that heterozygous pLoF variants in LRRK2 reduce LRRK2 protein levels but that these are not strongly associated with any specific phenotype or disease state. Our results demonstrate the value of large-scale genomic databases and phenotyping of human loss-of-function carriers for target validation in drug discovery.
DOI: 10.1016/j.neuron.2018.04.014
2018
Cited 68 times
Common Variant Burden Contributes to the Familial Aggregation of Migraine in 1,589 Families
Complex traits, including migraine, often aggregate in families, but the underlying genetic architecture behind this is not well understood. The aggregation could be explained by rare, penetrant variants that segregate according to Mendelian inheritance or by the sufficient polygenic accumulation of common variants, each with an individually small effect, or a combination of the two hypotheses. In 8,319 individuals across 1,589 migraine families, we calculated migraine polygenic risk scores (PRS) and found a significantly higher common variant burden in familial cases (n = 5,317, OR = 1.76, 95% CI = 1.71–1.81, p = 1.7 × 10^−109) compared to population cases from the FINRISK cohort (n = 1,101, OR = 1.32, 95% CI = 1.25–1.38, p = 7.2 × 10^−17). The PRS explained 1.6% of the phenotypic variance in the population cases and 3.5% in the familial cases (including 2.9% for migraine without aura, 5.5% for migraine with typical aura, and 8.2% for hemiplegic migraine). The results demonstrate a significant contribution of common polygenic variation to the familial aggregation of migraine.
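For readers unfamiliar with polygenic risk scores, a minimal Python sketch of the usual weighted-sum form follows. It assumes per-variant effect sizes (for example, log odds ratios from a GWAS) and effect-allele dosages of 0, 1, or 2; the function name and the example numbers are hypothetical, and the abstract does not specify the exact scoring pipeline the study used.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of effect-allele dosages.

    dosages: number of effect alleles carried at each variant (0, 1, or 2).
    effect_sizes: per-variant weights, assumed here to be GWAS log odds ratios.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * w for d, w in zip(dosages, effect_sizes))


# Illustrative values only: three variants with made-up weights.
example_score = polygenic_risk_score([0, 1, 2], [0.5, 0.25, 0.125])
```

In practice such raw scores are typically standardized against a reference population before comparing groups.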
DOI: 10.1038/s41597-020-0401-2
2020
Cited 67 times
Fox Insight collects online, longitudinal patient-reported outcomes and genetic data on Parkinson’s disease
Fox Insight is an online, longitudinal health study of people with and without Parkinson’s disease with targeted enrollment set to at least 125,000 individuals. Fox Insight data is a rich data set facilitating discovery, validation, and reproducibility in Parkinson’s disease research. The dataset is generated through routine longitudinal assessments (health and medical questionnaires evaluated at regular cycles), one-time questionnaires about environmental exposure and healthcare preferences, and genetic data collection. Qualified researchers can explore, analyze, and download patient-reported outcomes (PROs) data and Parkinson’s disease-related genetic variants at https://foxden.michaeljfox.org. The full Fox Insight genetic data set, including approximately 600,000 single nucleotide polymorphisms (SNPs), can be requested separately with institutional review and is described outside of this data descriptor.
DOI: 10.1038/s41467-019-08546-x
2019
Cited 64 times
Correspondence between cerebral glucose metabolism and BOLD reveals relative power and cost in human brain
The correspondence between cerebral glucose metabolism (indexing energy utilization) and synchronous fluctuations in blood oxygenation (indexing neuronal activity) is relevant for neuronal specialization and is affected by brain disorders. Here, we define novel measures of relative power (rPWR, extent of concurrent energy utilization and activity) and relative cost (rCST, extent that energy utilization exceeds activity), derived from FDG-PET and fMRI. We show that resting-state networks have distinct energetic signatures and that the brain could be classified into major bilateral segments based on rPWR and rCST. While medial-visual and default-mode networks have the highest rPWR, frontoparietal networks have the highest rCST. rPWR and rCST estimates are generalizable to other indexes of energy supply and neuronal activity, and are sensitive to neurocognitive effects of acute and chronic alcohol exposure. rPWR and rCST are informative metrics for characterizing brain pathology and alternative energy use, and may provide new multimodal biomarkers of neuropsychiatric disorders.
DOI: 10.1038/s41467-020-20246-5
2021
Cited 49 times
Disease risk scores for skin cancers
We trained and validated risk prediction models for the three major types of skin cancer (basal cell carcinoma (BCC), squamous cell carcinoma (SCC), and melanoma) on a cross-sectional and longitudinal dataset of 210,000 consented research participants who responded to an online survey covering personal and family history of skin cancer, skin susceptibility, and UV exposure. We developed a primary disease risk score (DRS) that combined all 32 identified genetic and non-genetic risk factors. Top percentile DRS was associated with an up to 13-fold increase (odds ratio per standard deviation increase >2.5) in the risk of developing skin cancer relative to the middle DRS percentile. To derive lifetime risk trajectories for the three skin cancers, we developed a second, age-independent disease score, called DRSA. Using incident cases, we demonstrated that DRSA could be used in early detection programs for identifying high-risk asymptomatic individuals and predicting when they are likely to develop skin cancer. High DRSA scores were not only associated with earlier disease diagnosis (by up to 14 years), but also with more severe and recurrent forms of skin cancer.
DOI: 10.1038/s41467-020-20368-w
2021
Cited 41 times
A comprehensive re-assessment of the association between vitamin D and cancer susceptibility using Mendelian randomization
Previous Mendelian randomization (MR) studies on 25-hydroxyvitamin D (25(OH)D) and cancer have typically adopted a handful of variants and found no relationship between 25(OH)D and cancer; however, issues of horizontal pleiotropy cannot be reliably addressed. Using a larger set of variants associated with 25(OH)D (74 SNPs, up from 6 previously), we perform a unified MR analysis to re-evaluate the relationship between 25(OH)D and ten cancers. Our findings are broadly consistent with previous MR studies indicating no relationship, apart from ovarian cancers (OR 0.89; 95% C.I.: 0.82 to 0.96 per 1 SD change in 25(OH)D concentration) and basal cell carcinoma (OR 1.16; 95% C.I.: 1.04 to 1.28). However, after adjustment for pigmentation-related variables in a multivariable MR framework, the BCC findings were attenuated. Here we report that lower 25(OH)D is unlikely to be a causal risk factor for most cancers, with our study providing more precise confidence intervals than previously possible.
DOI: 10.1038/s41588-023-01372-4
2023
Cited 13 times
Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models
DOI: 10.1038/s41467-021-27930-0
2022
Cited 22 times
DeepNull models non-linear covariate effects to improve phenotypic prediction and association power
Genome-wide association studies (GWASs) examine the association between genotype and phenotype while adjusting for a set of covariates. Although the covariates may have non-linear or interactive effects, due to the challenge of specifying the model, GWAS often neglect such terms. Here we introduce DeepNull, a method that identifies and adjusts for non-linear and interactive covariate effects using a deep neural network. In analyses of simulated and real data, we demonstrate that DeepNull maintains tight control of the type I error while increasing statistical power by up to 20% in the presence of non-linear and interactive effects. Moreover, in the absence of such effects, DeepNull incurs no loss of power. When applied to 10 phenotypes from the UK Biobank (n = 370K), DeepNull discovered more hits (+6%) and loci (+7%), on average, than conventional association analyses, many of which are biologically plausible or have previously been reported. Finally, DeepNull improves upon linear modeling for phenotypic prediction (+23% on average).
DOI: 10.1093/bioinformatics/btp225
2009
Cited 67 times
PICKY: a novel SVD-based NMR spectra peak picking method
Picking peaks from experimental NMR spectra is a key unsolved problem for automated NMR protein structure determination. Such a process is a prerequisite for resonance assignment, nuclear Overhauser enhancement (NOE) distance restraint assignment, and structure calculation tasks. Manual or semi-automatic peak picking, which is currently the predominant approach in NMR labs, is tedious, time consuming and costly. We introduce new ideas, including noise-level estimation, component forming and sub-division, singular value decomposition (SVD)-based peak picking and peak pruning and refinement. PICKY is developed as an automated peak picking method. Different from the previous research on peak picking, we provide a systematic study of the proposed method. PICKY is tested on 32 real 2D and 3D spectra of eight target proteins, and achieves an average of 88% recall and 74% precision. PICKY is efficient. It takes PICKY on average 15.7 s to process an NMR spectrum. More important than these numbers, PICKY actually works in practice. We feed peak lists generated by PICKY to IPASS for resonance assignment, feed IPASS assignment to SPARTA for fragments generation, and feed SPARTA fragments to FALCON for structure calculation. This results in high-resolution structures of several proteins, for example, TM1112, at 1.25 Å. PICKY is available upon request. The peak lists of PICKY can be easily loaded by SPARKY to enable a better interactive strategy for rapid peak picking.
DOI: 10.1534/g3.115.021345
2015
Cited 42 times
Whole-Genome Sequencing Suggests Schizophrenia Risk Mechanisms in Humans with 22q11.2 Deletion Syndrome
Chromosome 22q11.2 microdeletions impart a high but incomplete risk for schizophrenia. Possible mechanisms include genome-wide effects of DGCR8 haploinsufficiency. In a proof-of-principle study to assess the power of this model, we used high-quality, whole-genome sequencing of nine individuals with 22q11.2 deletions and extreme phenotypes (schizophrenia, or no psychotic disorder at age >50 years). The schizophrenia group had a greater burden of rare, damaging variants impacting protein-coding neurofunctional genes, including genes involved in neuron projection (nominal P = 0.02, joint burden of three variant types). Variants in the intact 22q11.2 region were not major contributors. Restricting to genes affected by a DGCR8 mechanism tended to amplify between-group differences. Damaging variants in highly conserved long intergenic noncoding RNA genes also were enriched in the schizophrenia group (nominal P = 0.04). The findings support the 22q11.2 deletion model as a threshold-lowering first hit for schizophrenia risk. If applied to a larger and thus better-powered cohort, this appears to be a promising approach to identify genome-wide rare variants in coding and noncoding sequence that perturb gene networks relevant to idiopathic schizophrenia. Similarly designed studies exploiting genetic models may prove useful to help delineate the genetic architecture of other complex phenotypes.
DOI: 10.1038/s41467-020-14451-5
2020
Cited 37 times
Genomic analysis of male puberty timing highlights shared genetic basis with hair colour and lifespan
The timing of puberty is highly variable and is associated with long-term health outcomes. To date, understanding of the genetic control of puberty timing is based largely on studies in women. Here, we report a multi-trait genome-wide association study for male puberty timing with an effective sample size of 205,354 men. We find moderately strong genomic correlation in puberty timing between sexes (rg = 0.68) and identify 76 independent signals for male puberty timing. Implicated mechanisms include an unexpected link between puberty timing and natural hair colour, possibly reflecting common effects of pituitary hormones on puberty and pigmentation. Earlier male puberty timing is genetically correlated with several adverse health outcomes and Mendelian randomization analyses show a genetic association between male puberty timing and shorter lifespan. These findings highlight the relationships between puberty timing and health outcomes, and demonstrate the value of genetic studies of puberty timing in both sexes.
DOI: 10.1038/s41531-019-0077-5
2019
Cited 35 times
The Parkinson’s phenome—traits associated with Parkinson’s disease in a broadly phenotyped cohort
In order to systematically describe the Parkinson’s disease phenome, we performed a series of 832 cross-sectional case-control analyses in a large database. Responses to 832 online survey-based phenotypes including diseases, medications, and environmental exposures were analyzed in 23andMe research participants. For each phenotype, survey respondents were used to construct a cohort of Parkinson’s disease cases and age-matched and sex-matched controls, and an association test was performed using logistic regression. Cohorts included a median of 3899 Parkinson’s disease cases and 49,808 controls, all of European ancestry. Highly correlated phenotypes were removed and the novelty of each significant association was systematically assessed (assigned to one of four categories: known, likely, unclear, or novel). Parkinson’s disease diagnosis was associated with 122 phenotypes. We replicated 27 known associations and found 23 associations with a strong a priori link to a known association. We discovered 42 associations that have not previously been reported. Migraine, obsessive-compulsive disorder, and seasonal allergies were associated with Parkinson’s disease and tend to occur decades before the typical age of diagnosis for Parkinson’s disease. The phenotypes that currently comprise the Parkinson’s disease phenome have mostly been explored in relatively small purpose-built studies. Using a single large dataset, we have successfully reproduced many of these established associations and have extended the Parkinson’s disease phenome by discovering novel associations. Our work paves the way for studies of these associated phenotypes that explore shared molecular mechanisms with Parkinson’s disease, infer causal relationships, and improve our ability to identify individuals at high risk of Parkinson’s disease.
DOI: 10.1089/cmb.2012.0089
2013
Cited 32 times
Determining Protein Structures from NOESY Distance Constraints by Semidefinite Programming
Contemporary practical methods for protein nuclear magnetic resonance (NMR) structure determination use molecular dynamics coupled with a simulated annealing schedule. The objective of these methods is to minimize the error of deviating from the nuclear Overhauser effect (NOE) distance constraints. However, the corresponding objective function is highly nonconvex and, consequently, difficult to optimize. Euclidean distance matrix (EDM) methods based on semidefinite programming (SDP) provide a natural framework for these problems. However, the high complexity of SDP solvers and the often noisy distance constraints provide major challenges to this approach. The main contribution of this article is a new SDP formulation for the EDM approach that overcomes these two difficulties. We model the protein as a set of intersecting two- and three-dimensional cliques. Then, we adapt and extend a technique called semidefinite facial reduction to reduce the SDP problem size to approximately one quarter of the size of the original problem. The reduced SDP problem can be solved approximately 100 times faster, and it is also more resistant to numerical problems from erroneous and inexact distance bounds.
DOI: 10.1016/j.ajhg.2017.10.001
2017
Cited 31 times
Multiethnic GWAS Reveals Polygenic Architecture of Earlobe Attachment
The genetic basis of earlobe attachment has been a matter of debate since the early 20th century, such that geneticists argue both for and against polygenic inheritance. Recent genetic studies have identified a few loci associated with the trait, but large-scale analyses are still lacking. Here, we performed a genome-wide association study of lobe attachment in a multiethnic sample of 74,660 individuals from four cohorts (three with the trait scored by an expert rater and one with the trait self-reported). Meta-analysis of the three expert-rater-scored cohorts revealed six associated loci harboring numerous candidate genes, including EDAR, SP5, MRPS22, ADGRG6 (GPR126), KIAA1217, and PAX9. The large self-reported 23andMe cohort recapitulated each of these six loci. Moreover, meta-analysis across all four cohorts revealed a total of 49 significant (p < 5 × 10^-8) loci. Annotation and enrichment analyses of these 49 loci showed strong evidence of genes involved in ear development and syndromes with auricular phenotypes. RNA sequencing data from both human fetal ear and mouse second branchial arch tissue confirmed that genes located among associated loci showed evidence of expression. These results provide strong evidence for the polygenic nature of earlobe attachment and offer insights into the biological basis of normal and abnormal ear development.
DOI: 10.1093/hmg/ddz294
2019
Cited 27 times
Insights into the genetic basis of retinal detachment
Retinal detachment (RD) is a serious and common condition, but genetic studies to date have been hampered by the small size of the assembled cohorts. In the UK Biobank data set, where RD was ascertained by self-report or hospital records, genetic correlations between RD and high myopia or cataract operation were, respectively, 0.46 (SE = 0.08) and 0.44 (SE = 0.07). These correlations are consistent with known epidemiological associations. Through meta-analysis of genome-wide association studies using UK Biobank RD cases (N = 3 977) and two cohorts, each comprising ~1 000 clinically ascertained rhegmatogenous RD patients, we uncovered 11 genome-wide significant association signals. These are near or within ZC3H11B, BMP3, COL22A1, DLG5, PLCE1, EFEMP2, TYR, FAT3, TRIM29, COL2A1 and LOXL1. Replication in the 23andMe data set, where RD is self-reported by participants, firmly establishes six RD risk loci: FAT3, COL22A1, TYR, BMP3, ZC3H11B and PLCE1. Based on the genetic associations with eye traits described to date, the first two specifically impact risk of a RD, whereas the last four point to shared aetiologies with macular condition, myopia and glaucoma. Fine-mapping prioritized the lead common missense variant (TYR S192Y) as causal variant at the TYR locus and a small set of credible causal variants at the FAT3 locus. The larger study size presented here, enabled by resources linked to health records or self-report, provides novel insights into RD aetiology and underlying pathological pathways.
DOI: 10.1038/nbt.2657
2013
Cited 26 times
Network cleanup
DOI: 10.1016/j.neuron.2018.08.029
2018
Cited 19 times
Common Variant Burden Contributes to the Familial Aggregation of Migraine in 1,589 Families
(Neuron 98, 743–753.e1–e4; May 16, 2018) In the original publication of this paper, the middle initial of Michael D. Ferrari’s name was inadvertently left out. This has since been corrected online. The authors apologize for the error. Original article: Common Variant Burden Contributes to the Familial Aggregation of Migraine in 1,589 Families. Gormley et al., Neuron, May 3, 2018. In brief: Gormley et al. use polygenic risk scores to show that common variation, captured by genome-wide association studies, in combination contributes to the aggregation of migraine in families. The results may have similar implications for other complex traits in general.
DOI: 10.1038/s41467-022-31473-3
2022
Cited 8 times
The genetic architecture of pneumonia susceptibility implicates mucin biology and a relationship with psychiatric illness
Pneumonia remains one of the leading causes of death worldwide. In this study, we use genome-wide meta-analysis of lifetime pneumonia diagnosis (N = 391,044) to identify four association signals outside of the previously implicated major histocompatibility complex region. Integrative analyses and finemapping of these signals support clinically tractable targets, including the mucin MUC5AC and tumour necrosis factor receptor superfamily member TNFRSF1A. Moreover, we demonstrate widespread evidence of genetic overlap with pneumonia susceptibility across the human phenome, including particularly significant correlations with psychiatric phenotypes that remain significant after testing differing phenotype definitions for pneumonia or genetically conditioning on smoking behaviour. Finally, we show how polygenic risk could be utilised for precision treatment formulation or drug repurposing through pneumonia risk scores constructed using variants mapped to pathways with known drug targets. In summary, we provide insights into the genetic architecture of pneumonia susceptibility and genetics informed targets for drug development or repositioning.
DOI: 10.1158/1538-7445.am2024-3678
2024
Abstract 3678: Beyond detection: AI-based classification of breast cancer invasiveness using cell-free orphan non-coding RNAs
Abstract Background: Approximately 1 in 8 women will be impacted by breast cancer in their lifetimes. Earlier detection of breast cancer through screening has improved survival. Liquid biopsies have the potential to complement existing screening methods by enabling earlier detection and differentiating invasive cancer (IBC) from ductal carcinoma in-situ (DCIS). We previously demonstrated high sensitivity and specificity for early detection of IBC by using a blood-based liquid biopsy platform to analyze a novel category of cancer-associated small RNAs, termed orphan RNAs (oncRNAs). Here, we developed a test that could not only detect the presence of cancer, but also classify the invasiveness of breast cancer. Methods: We utilized The Cancer Genome Atlas (TCGA) small RNA profiles to discover a library of 20,538 oncRNAs that were significantly enriched among 1,103 breast tumors compared to 349 controls from normal tissues spanning multiple tissue sites, limited to female samples. The diagnostic performance of these oncRNAs was assessed in an independent cohort of serum samples from 708 women, including 380 breast cancer patients (221 IBC and 159 DCIS; mean age: 58.0 ± 13.4 years) and 328 age-matched controls (mean age: 58.4 ± 13.7 years). We sequenced the small RNA content from 1 ml of serum from these patients at an average depth of 21.6 million 50-bp single-end reads. We detected 19,736 (96%) of the breast cancer-specific oncRNA library within at least one sample in this cohort. We then trained a multi-class generative AI model using 5-fold cross-validation to predict IBC, DCIS, and absence of breast cancer (IBC or DCIS). Results: Our oncRNA-based generative AI model achieved an overall AUC of 0.95 (95% CI: 0.94-0.97) for prediction of breast cancer versus cancer-free controls. At 90% specificity, overall model sensitivity is 90.0% (86.5%-92.8%). 
For DCIS and stage I IBC, the model has a sensitivity of 88.1% (82.0%-92.7%) and 90.4% (81.9%-95.8%) respectively, both at 90% specificity. In the second step, restricting to samples flagged as cancer, we observed an overall AUC of 0.9 (0.87-0.93) and sensitivity of 62.4% (55.3%-69.1%) at 90% specificity for discriminating against invasive breast cancer. Conclusions: We have demonstrated the potential utility of oncRNAs as the foundation for a liquid biopsy platform for sensitive and accurate early detection of breast cancer. Our liquid biopsy assay has the potential to complement standard of care by not only detecting breast cancer but also differentiating IBC from DCIS. Citation Format: Mehran Karimzadeh, Taylor B. Cavazos, Nae-Chyun Chen, Noura K. Tbeileh, David Siegel, Amir Momen-Roknabadi, Jennifer Yen, Jeremy Ku, Selina Chen, Diana Corti, Alice Huang, Dang Nguyen, Rose Hanna, Ti Lam, Seda Kilinc, Philip Murzynowski, Jieyang Wang, Xuan Zhao, Andy Pohl, Babak Behsaz, Helen Li, Lisa Fish, Kim H. Chau, Marra S. Francis, Laura J. Van't Veer, Laura J. Esserman, Patrick A. Arensdorf, Hani Goodarzi, Fereydoun Hormozdiari, Babak Alipanahi. Beyond detection: AI-based classification of breast cancer invasiveness using cell-free orphan non-coding RNAs [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 3678.
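The operating point quoted in this abstract (sensitivity at 90% specificity) can be illustrated with a minimal sketch. The function name, the thresholding rule, and the toy scores are assumptions made for illustration; they are not the authors' model or data.

```python
import math


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.90):
    """Sensitivity at the lowest threshold that still reaches the target specificity.

    A sample is called positive when its score is strictly above the threshold.
    """
    controls = sorted(control_scores)
    # Number of controls that must be called negative (rounded to avoid float noise).
    n_negative = math.ceil(round(specificity * len(controls), 9))
    # Threshold sits on the n_negative-th smallest control score.
    threshold = controls[n_negative - 1] if n_negative > 0 else float("-inf")
    true_positives = sum(1 for score in case_scores if score > threshold)
    return true_positives / len(case_scores)


# Toy scores (hypothetical): 10 controls, 5 cases.
controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.95]
cases = [0.30, 0.50, 0.60, 0.70, 0.90]
print(sensitivity_at_specificity(cases, controls, 0.90))
```

With these toy scores, nine of ten controls fall at or below the threshold, and four of five cases score above it.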
DOI: 10.1158/1538-7445.am2024-5215
2024
Abstract 5215: Orphan noncoding RNA (oncRNA) liquid biopsy assay is prognostic for survival in patients with triple-negative breast cancer (TNBC) and residual disease
Abstract Introduction: Residual disease (RD) after neoadjuvant systemic therapy is associated with high recurrence risk in patients with TNBC. oncRNAs are a novel category of small RNAs that are largely absent in healthy tissue but enriched in tumors and can be detected using a blood-based assay. This study investigated impact of an oncRNA recurrence risk model on outcomes in TNBC patients with RD. Methods: Study population included stage I-III TNBC patients with RD and available end-of-treatment (EOT) serum samples who were enrolled in a multisite prospective cohort study. EOT samples were collected after completion of all curative treatment (local/systemic). Small RNAs were isolated from EOT serum, sequenced at average depth of 76.5 ± 12.5 million 50bp single end reads, and annotated using a bespoke bioinformatics pipeline to identify oncRNAs. Cancer risk scores were generated using an oncRNA based tumor detection artificial intelligence model trained on 451 treatment naive breast cancer samples and 470 samples from individuals without known cancer diagnosis. Score cutpoint for high vs low recurrence risk was determined through ROC analysis. Impact of EOT oncRNA risk category on event free survival (EFS) and overall survival (OS) was estimated by Kaplan Meier method and compared by log rank test followed by Cox regression. Residual cancer burden (RCB) was determined according to classification by Symmans et al. Results: oncRNA isolation/score generation was successful for 79 out of 80 TNBC patients with RD and available EOT serum sample. Median age was 48 years and 39% had node positive disease. RCB class distribution was as follows: RCB I=27%, RCB II=49%, RCB III=18%. Training set (n=39) was used to define oncRNA risk score cutpoint. In the testing set (n=40), 38% were classified as oncRNA high-risk and 62% as oncRNA low-risk. oncRNA risk category was not associated with baseline T stage, nodal status, or RCB class. 
oncRNA high-risk status was associated with lower EFS and OS; 3y EFS was 47% and 73% (HR 2.76, 95% CI 0.92-9.24, p=0.058) and OS 53% and 76% (HR 3.78, 95% CI 1.19-12.00, p=0.016) in high and low risk groups, respectively. In multivariable analysis including oncRNA risk status, T stage, nodal status, and RCB class, oncRNA high risk status retained significant association with lower EFS (HR 7.70, 95% CI 1.33-44.64, p=0.023) and OS (HR 7.99, 95% CI 1.36-47.00, p=0.022). Conclusion: EOT oncRNA liquid biopsy assay was independently prognostic for outcomes in TNBC patients with RD. More than half of patients in the oncRNA high-risk group suffered an EFS event by 3 years. oncRNA risk score has potential to provide prognostic utility complementary to clinicopathologic characteristics in patients with TNBC. These findings should be confirmed in other TNBC studies and may provide insights for patient stratification/selection in RD adjuvant therapy intensification trials. Citation Format: Rachel Yoder, Jennifer Yen, Mehran Karimzadeh, Joshua M. Staley, India Fernandez, Adam C. Heinrich, Fereydoun Hormozdiari, Jeffrey Gregg, Andrew K. Godwin, Irene Acerbi, Raaj Trivedi, Babak Alipanahi, Shane R. Stecklein, Priyanka Sharma. Orphan noncoding RNA (oncRNA) liquid biopsy assay is prognostic for survival in patients with triple-negative breast cancer (TNBC) and residual disease [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2024; Part 1 (Regular Abstracts); 2024 Apr 5-10; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2024;84(6_Suppl):Abstract nr 5215.
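The abstract above estimates EFS and OS with the Kaplan-Meier method. A minimal sketch of that estimator follows; the function name and the toy follow-up data are illustrative assumptions, not the study's code or data.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up duration for each patient
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events at time t
        leaving = 0   # everyone leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            # Product-limit update: multiply by the fraction surviving time t.
            survival *= 1.0 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy data (hypothetical): the patient with follow-up 2 is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Censored patients reduce the number at risk without producing a step in the curve, which is why the second step in the toy example is computed over two patients rather than three.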
DOI: 10.1158/1538-7445.sabcs23-po2-13-08
2024
Abstract PO2-13-08: Cell-free orphan noncoding RNAs and AI enable early detection of invasive breast cancer and ductal carcinoma in-situ
Abstract Background: Earlier detection of breast cancer through mammography screening has reduced disease-specific mortality; however, confounding issues such as technical challenges, breast density, and tumor size can result in false negatives and ultimately later stage diagnosis. Next generation liquid biopsy has the potential to complement mammography and enable earlier detection for more women. We have previously demonstrated high sensitivity and specificity for early detection of invasive breast cancer (IBC) by utilizing a novel category of cancer-associated small RNAs, termed orphan noncoding RNAs (oncRNAs), through a liquid biopsy platform. Here, we further improve the ability to detect breast cancer in a larger, multi-source cohort through an AI-driven approach and demonstrate potential for detection of ductal carcinoma in-situ (DCIS). Methods: We utilized The Cancer Genome Atlas (TCGA) small RNA-seq database to discover a library of 20,538 oncRNAs, through a female-specific analysis, that were significantly enriched among 1,103 breast tumors compared to 349 normal tissue samples spanning multiple tissue sites. The diagnostic performance of these oncRNAs was assessed in an independent cohort of archived serum samples from 663 female individuals, sourced from Indivumed (Hamburg, Germany), Proteogenex (Inglewood, CA), and MT Group (Los Angeles, CA), including 279 breast cancer patients of various stages (221 IBC and 58 DCIS; mean age: 57.0 ± 13.8 years; ever-smoker: 25.8%) and 304 age-matched controls (mean age: 58.5 ± 13.9 years; ever-smoker: 23.4%) without breast cancer. All samples were collected between 2010–2022 at time of diagnosis for breast cancer patients. We sequenced the small RNA content of these samples at an average depth of 25.28 ± 9.37 million 50-bp single-end reads. We detected 18,025 (87.8%) unique breast cancer-specific oncRNA species within at least one sample from the study cohort. 
We then trained a generative AI model using 5-fold cross-validation to predict cancer status for all samples. Results: Our oncRNA-based model achieved an overall AUC of 0.95 (95% CI, 0.93–0.97) for prediction of IBC versus cancer-free controls with a sensitivity of 0.87 (0.82–0.91) at 90% specificity. We observed high sensitivities, also at 90% specificity, across all tumor stages and tumor sizes (Table 1). Sensitivities for the earliest stage and smallest tumor size were 0.87 (0.78–0.93) and 0.81 (0.61–0.93) for Stage I (n=83) and T1a–b (>1 mm to ≤10 mm; n=26), respectively. Additionally, in a small single-source cohort, we also saw high model accuracy and sensitivity for DCIS, which we aim to confirm in additional cohorts. While our overall cancer cohort primarily consisted of individuals with luminal breast cancer, our model had high sensitivities across all breast cancer subtypes at 0.90 (0.84–0.94), 0.73 (0.59–0.85), and 0.86 (0.42–1.0) for luminal (n=181), HER2 positive (n=49), and triple negative (n=7), respectively. Conclusions: We further demonstrate the potential utility of oncRNAs as a blood-based biomarker using an AI algorithm for sensitive and accurate early detection of breast cancer in a large cohort. Additionally, we have shown that this oncRNA-based assay performs well in detecting small, early-stage invasive breast tumors, with potential to detect precursors of breast cancer. Table 1: Model sensitivity in breast cancer by tumor stage and size. For each tumor stage and size, as defined by the AJCC 7th Edition breast cancer staging system, sensitivity and 95% Pearson-Clopper confidence intervals (CI) are reported at 90% specificity for the number of samples (N). 
Citation Format: Noura Tbeileh, Taylor Cavazos, Mehran Karimzadeh, Jeffrey Wang, Alice Huang, Dung Ngoc Lam, Seda Kilinc, Jieyang Wang, Xuan Zhao, Andy Pohl, Helen Li, Lisa Fish, Kimberly Chau, Marra Francis, Lee Schwartzberg, Patrick Arensdorf, Hani Goodarzi, Fereydoun Hormozdiari, Babak Alipanahi. Cell-free orphan noncoding RNAs and AI enable early detection of invasive breast cancer and ductal carcinoma in-situ [abstract]. In: Proceedings of the 2023 San Antonio Breast Cancer Symposium; 2023 Dec 5-9; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2024;84(9 Suppl):Abstract nr PO2-13-08.
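The Table 1 caption in this abstract reports Pearson-Clopper intervals, more commonly written Clopper-Pearson: the exact binomial confidence interval for a proportion such as a sensitivity. A minimal standard-library sketch, solving for the bounds by bisection, is given below. The function names and the example counts are assumptions for illustration and are not taken from the study.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(func, target, increasing):
    """Bisection on [0, 1] for a monotone function of p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes out of n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2, which increases with p.
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: P(X <= k | p) = alpha / 2, which decreases with p.
    upper = 1.0 if k == n else _solve(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


# Hypothetical counts: 8 detected out of 10 cases.
print(clopper_pearson(8, 10))
```

Because the interval is exact rather than a normal approximation, it stays inside [0, 1] and remains usable for small groups.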
DOI: 10.1142/s0219720011005276
2011
Cited 23 times
ERROR TOLERANT NMR BACKBONE RESONANCE ASSIGNMENT AND AUTOMATED STRUCTURE GENERATION
Error tolerant backbone resonance assignment is the cornerstone of the NMR structure determination process. Although a variety of assignment approaches have been developed, none works sufficiently well on noisy fully automatically picked peaks to enable the subsequent automatic structure determination steps. We have designed an integer linear programming (ILP) based assignment system (IPASS) that has enabled fully automatic protein structure determination for four test proteins. IPASS employs probabilistic spin system typing based on chemical shifts and secondary structure predictions. Furthermore, IPASS extracts connectivity information from the inter-residue information and the (automatically picked) 15N-edited NOESY peaks, which are then used to fix reliable fragments. When applied to automatically picked peaks for real proteins, IPASS achieves an average precision and recall of 82% and 63%, respectively. In contrast, the next best method, MARS, achieves an average precision and recall of 77% and 36%, respectively. The assignments generated by IPASS are then fed into our protein structure calculation system, FALCON-NMR, to determine the 3D structures without human intervention. The final models have backbone RMSDs of 1.25 Å, 0.88 Å, 1.49 Å, and 0.67 Å to the reference native structures for proteins TM1112, CASKIN, VRAR, and HACS1, respectively. The web server is publicly available at .
DOI: 10.1016/j.patrec.2011.02.002
2011
Cited 21 times
Guided Locally Linear Embedding
Nonlinear dimensionality reduction is the problem of retrieving a low-dimensional representation of a manifold that is embedded in a high-dimensional observation space. Locally Linear Embedding (LLE), a prominent dimensionality reduction technique, is an unsupervised algorithm; as such, it is not possible to guide it toward modes of variability that may be of particular interest. This paper proposes a supervised variation of LLE. Similar to LLE, it retrieves a low-dimensional global coordinate system that faithfully represents the embedded manifold. Unlike LLE, however, it produces an embedding in which predefined modes of variation are preserved. This can improve several supervised learning tasks including pattern recognition, regression, and data visualization.
DOI: 10.1186/s12864-016-3121-4
2016
Cited 16 times
Does conservation account for splicing patterns?
Alternative mRNA splicing is critical to proteomic diversity and tissue and species differentiation. Exclusion of cassette exons, also called exon skipping, is the most common type of alternative splicing in mammals. We present a computational model that predicts absolute (though not tissue-differential) percent-spliced-in of cassette exons more accurately than previous models, despite not using any 'hand-crafted' biological features such as motif counts. We achieve nearly identical performance using only the conservation score (mammalian phastCons) of each splice junction normalized by average conservation over 100 bp of the corresponding flanking intron, demonstrating that conservation is an unexpectedly powerful indicator of alternative splicing patterns. Using this method, we provide evidence that intronic splicing regulation occurs predominantly within 100 bp of the alternative splice sites and that conserved elements in this region are, as expected, functioning as splicing regulators. We show that among conserved cassette exons, increased conservation of flanking introns is associated with reduced inclusion. We also propose a new definition of intronic splicing regulatory elements (ISREs) that is independent of conservation, and show that most ISREs do not match known binding sites or splicing factors despite being predictive of percent-spliced-in. These findings suggest that one mechanism for the evolutionary transition from constitutive to alternative splicing is the emergence of cis-acting splicing inhibitors. The association of our ISREs with differences in splicing suggests the existence of novel RNA-binding proteins and/or novel splicing roles for known RNA-binding proteins.
DOI: 10.1101/2024.04.09.24304531
2024
Deep generative AI models analyzing circulating orphan non-coding RNAs enable accurate detection of early-stage non-small cell lung cancer
Abstract Liquid biopsies have the potential to revolutionize cancer care through non-invasive early detection of tumors, when the disease can be more effectively managed and cured. Developing a robust liquid biopsy test requires collecting high-dimensional data from a large number of blood samples across heterogeneous groups of patients. We propose that the generative capability of variational auto-encoders enables learning a robust and generalizable signature of blood-based biomarkers that capture true biological signals while removing spurious confounders (e.g., library size, zero-inflation, and batch effects). In this study, we analyzed orphan non-coding RNAs (oncRNAs) from serum samples of 1,050 individuals diagnosed with non-small cell lung cancer (NSCLC) at various stages, as well as sex-, age-, and BMI-matched controls to evaluate the potential use of deep generative models. We demonstrated that our multi-task generative AI model, Orion, surpassed commonly used methods in both overall performance and generalizability to held-out datasets. Orion achieved an overall sensitivity of 92% (95% CI: 85%–97%) at 90% specificity for cancer detection across all stages, outperforming the sensitivity of other methods such as support vector machine (SVM) classifier, ElasticNet, or XGBoost on held-out validation datasets by more than ∼30%.
DOI: 10.1142/s0219720010004987
2010
Cited 20 times
PROTEIN SECONDARY STRUCTURE PREDICTION USING NMR CHEMICAL SHIFT DATA
Accurate determination of protein secondary structure from chemical shift information is a key step for NMR tertiary structure determination. Relatively little work has been done on this subject. There needs to be a systematic investigation of algorithms that are (a) robust for large datasets; (b) easily extendable to new (dynamic) databases; and (c) approaching the limit of accuracy. We introduce new approaches that use the k-nearest neighbor algorithm for the basic prediction and the BCJR algorithm to smooth the predictions and to combine predictions based on chemical shifts with predictions based on sequence information only. Our new system, SUCCES, improves on the accuracy of all existing methods on a large dataset of 805 proteins (86% Q3 accuracy, and 92.6% accuracy when the boundary residues are ignored), and it is easily extendable to any new dataset without requiring any new training. The software is publicly available at .
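The Q3 figure quoted above is the usual three-state per-residue accuracy: the fraction of residues whose predicted state matches the reference state. A minimal sketch, assuming the common helix (H), strand (E), and coil (C) alphabet; the function name and the toy strings are illustrative and this is not the SUCCES implementation.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state equals the observed state.

    Both arguments are equal-length strings over a three-state alphabet,
    assumed here to be H (helix), E (strand), and C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy example (hypothetical): one mismatch at the last residue.
print(q3_accuracy("HHHEECCC", "HHHEECCH"))
```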
DOI: 10.1200/jco.2023.41.16_suppl.3051
2023
Detection of early-stage cancers using circulating orphan non-coding RNAs in blood.
Abstract 3051. Background: Orphan non-coding RNAs (oncRNAs) are a novel category of small RNAs (smRNAs) that are present in tumors and largely absent in healthy tissue. We investigated the utility of oncRNAs extracted from serum for early cancer detection across seven cancer types. Methods: We collected 2,882 serum samples from individuals with known cancers of the bladder (n=152), breast (220), colon and rectum (141), kidney (283), lung (281), pancreas (287), and stomach (280) as well as donors with no history of cancer (1,238). We used 0.5 mL serum aliquots to generate and sequence smRNA libraries at an average depth of 20 million 50-bp single-end reads. Samples were split into age-, sex-, and smoking status-matched training (1,232 cancer; 922 control) and validation (412 cancer; 316 control) cohorts. A large catalog of oncRNAs specific to each cancer was created using tumor and adjacent normal samples from The Cancer Genome Atlas (TCGA) smRNA-seq database. Using TCGA-derived oncRNAs, we trained a machine learning model to predict cancer presence and tissue of origin (TOO) in a 5-fold cross validation setup using our training cohort. For the validation cohort, we averaged the predictions from the five training cohort models. Results: The model ROC-AUC for detecting cancer was 0.95 (95% CI: 0.94–0.95 for training and 0.94–0.97 for validation cohorts). Sensitivities for detecting cancer at 95% specificity were 0.74 (0.70–0.76) for early stage (I/II) and 0.80 (0.76–0.84) for late stage (III/IV) cancers in the training cohort, and 0.77 (0.71–0.81) and 0.81 (0.73–0.87) in the validation cohort. Sensitivities of detection for each cancer type are shown. For samples with cancer and TOO predictions, our top 1 and top 2 TOO accuracy was 0.76 (0.68–0.84) and 0.83 (0.76–0.90) for the validation set. Conclusions: These results demonstrate that oncRNAs detected in serum can be used for accurate, early detection, and localization of multiple cancers. [Table: see text]
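Two steps in this abstract lend themselves to a short illustration: averaging the predictions of the five cross-validation models for a validation sample, and scoring tissue-of-origin calls by top-1 and top-2 accuracy. The sketch below assumes each model outputs one probability per class; the function names and toy numbers are hypothetical and not taken from the study.

```python
def average_predictions(per_model_probs):
    """Average the class-probability vectors that several models give one sample."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(model[c] for model in per_model_probs) / n_models
            for c in range(n_classes)]


def top_k_accuracy(prob_rows, true_labels, k):
    """Fraction of samples whose true class is among the k highest-scoring classes."""
    hits = 0
    for probs, truth in zip(prob_rows, true_labels):
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy data (hypothetical): three samples, three candidate tissues.
rows = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]
labels = [0, 2, 1]
print(top_k_accuracy(rows, labels, 1), top_k_accuracy(rows, labels, 2))
```

Top-2 accuracy can never be lower than top-1 accuracy on the same predictions, which matches the ordering of the two figures reported in the abstract.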
DOI: 10.1016/j.jtho.2023.09.258
2023
MA19.09 AI-Based Early Detection and Subtyping of Non-Small Cell Lung Cancer from Blood Samples Using Orphan Noncoding RNAs
DOI: 10.22362/ijcert/2023/v10/i11/v10i114
2023
A Machine Learning-based Approach for Detecting Fake News in Online Media
DOI: 10.1101/2020.12.14.20224378
2020
Cited 5 times
Genome-wide association studies of LRRK2 modifiers of Parkinson's disease
Objective: The aim of this study was to search for genes/variants that modify the effect of LRRK2 mutations in terms of penetrance and age-at-onset of Parkinson's disease. Methods: We performed the first genome-wide association study of penetrance and age-at-onset of Parkinson's disease in LRRK2 mutation carriers (776 cases and 1,103 non-cases at their last evaluation). Cox proportional hazard models and linear mixed models were used to identify modifiers of penetrance and age-at-onset of LRRK2 mutations, respectively. We also investigated whether a polygenic risk score derived from a published genome-wide association study of Parkinson's disease was able to explain variability in penetrance and age-at-onset in LRRK2 mutation carriers. Results: A variant located in the intronic region of CORO1C on chromosome 12 (rs77395454; P-value=2.5E-08, beta=1.27, SE=0.23, risk allele: C) met genome-wide significance for the penetrance model. A region on chromosome 3, within a previously reported linkage peak for Parkinson's disease susceptibility, showed suggestive associations in both models (penetrance top variant: P-value=1.1E-07; age-at-onset top variant: P-value=9.3E-07). A polygenic risk score derived from publicly available Parkinson's disease summary statistics was a significant predictor of penetrance, but not of age-at-onset. Interpretation: This study suggests that variants within or near CORO1C may modify the penetrance of LRRK2 mutations. In addition, common Parkinson's disease associated variants collectively increase the penetrance of LRRK2 mutations.
DOI: 10.1049/iet-com:20070193
2008
Cited 6 times
Channel estimation for time-hopping pulse position modulation ultra-wideband communication systems
Ultra-wideband (UWB) communication systems are used in indoor environments with dense multi-path characteristics. Therefore channel estimation has an important role in the receiver of these systems. A new approach for data-aided (DA) and non-data-aided (NDA) channel estimation is proposed, which is called the pulse compression (PC) method. This method is useful for UWB systems employing time-hopping pulse position modulation. The PC method requires only some basic operations such as sampling, overlap-add and finite impulse response filtering. The PC method, in both DA and NDA scenarios, in spite of its low complexity, outperforms the maximum-likelihood (ML) method in channel parameters estimation. The bit error rate (BER) of the DA method, in single-user scenario, performs as well as the ML method, and in multi-user scenario, in the worst case, there is only 0.5 dB loss compared with the ML method. In the case of NDA scenario, the proposed method outperforms the NDA-ML method, that is, in the single-user scenario about 4 dB gain at the BER of 10⁻³ is observed. In multi-user scenario, it outperforms significantly the NDA-ML method, and its performance loss in comparison with the perfect channel knowledge scenario is about 3 dB at the BER of 10⁻³.
DOI: 10.1109/icics.2005.1689053
2006
Cited 6 times
A New Approach for UWB Channel Estimation
In this paper, a new channel estimation approach is proposed for ultra-wideband (UWB) communication systems. Performance of UWB systems employing RAKE receivers highly depends on the channel knowledge. Because UWB systems operate in dense multipath environments, the channel estimation task becomes more complicated. Some channel estimation methods have been proposed for UWB systems. Here a very simple and cost effective channel estimation method is presented. We develop a data-aided estimation of channel parameters in UWB systems employing time-hopping pulse-position-modulation (TH-PPM). This method can be used in time-hopping pulse-amplitude-modulation (TH-PAM) and direct-sequence (DS) spread-spectrum systems as well. Computer simulations show that the proposed method is competitive with the maximum likelihood (ML) method, in both single-user and multi-user scenarios, with less computational complexity and the ability to extract the exact parameters of the impulse response of the channel. The latter property does not exist in the ML method in dense multipath environments.
DOI: 10.1186/1748-7188-8-5
2013
Cited 3 times
Protein Structure Idealization: How accurately is it possible to model protein structures with dihedral angles?
Abstract Background: Previous studies show that the same type of bond lengths and angles fit Gaussian distributions well with small standard deviations on high resolution protein structure data. The mean values of these Gaussian distributions have been widely used as ideal bond lengths and angles in bioinformatics. However, we are not aware of any research done to evaluate how accurately we can model protein structures with dihedral angles and ideal bond lengths and angles. Here, we introduce the protein structure idealization problem. We focus on the protein backbone structure idealization. We describe a fast O(nm/ε) dynamic programming algorithm to find an idealized protein backbone structure that is approximately optimal according to our scoring function. The scoring function evaluates not only the free energy, but also the similarity with the target structure. Thus, the idealized protein structures found by our algorithm are guaranteed to be protein-like and close to the target protein structure. We have implemented our protein structure idealization algorithm and idealized the high resolution protein structures with low sequence identities of the CULLPDB_PC30_RES1.6_R0.25 data set. We demonstrate that idealized backbone structures always exist with small changes and significantly better free energy. We also applied our algorithm to refine protein pseudo-structures determined in NMR experiments.
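The idealization problem above describes a backbone by its dihedral angles, with bond lengths and bond angles held at ideal values. As background, here is a minimal sketch of how one dihedral angle is computed from four consecutive atom positions using the standard atan2 formulation. The function names and coordinates are illustrative assumptions; the paper's dynamic programming algorithm is not reproduced here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle, in degrees, defined by four points in 3D."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (hypothetical): cis, trans, and a quarter turn.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Using atan2 rather than an arccosine of the normals keeps the sign of the angle and avoids numerical trouble near 0 and 180 degrees.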
DOI: 10.1007/978-3-642-29627-7_1
2012
Protein Structure by Semidefinite Facial Reduction
All practical contemporary protein NMR structure determination methods use molecular dynamics coupled with a simulated annealing schedule. The objective of these methods is to minimize the error of deviating from the NOE distance constraints. However, this objective function is highly nonconvex and, consequently, difficult to optimize. Euclidean distance geometry methods based on semidefinite programming (SDP) provide a natural formulation for this problem. However, complexity of SDP solvers and ambiguous distance constraints are major challenges to this approach. The contribution of this paper is to provide a new SDP formulation of this problem that overcomes these two issues for the first time. We model the protein as a set of intersecting two- and three-dimensional cliques, then we adapt and extend a technique called semidefinite facial reduction to reduce the SDP problem size to approximately one quarter of the size of the original problem. The reduced SDP problem can not only be solved approximately 100 times faster, but is also resistant to numerical problems from having erroneous and inexact distance bounds.
DOI: 10.1158/1538-7445.sabcs22-p1-05-18
2023
Abstract P1-05-18: Orphan non-coding RNAs for early detection of breast cancer with liquid biopsy
Abstract Background: Early detection of breast cancer is crucial for optimal patient outcomes but cannot always be accomplished based on symptoms or screening mammography. Biomarker-based screening could aid early detection of breast cancer by improving sensitivity and specificity. Exai Bio has developed a novel liquid biopsy technology that detects and analyzes small non-coding RNAs that are cancer specific, termed orphan non-coding RNAs (oncRNAs). Previous work in patients with diagnosed breast cancer demonstrated that changes in oncRNAs in serum reflected treatment response and event-free survival. In this study, we developed an assay that measures oncRNAs in serum to detect breast cancer across the range of tumor stages and sizes. Methods: Previously, a library of ~260,000 oncRNAs from 32 different cancers was compiled based on smRNA sequences found in tumor tissues and largely absent in tumor-adjacent normal tissues from The Cancer Genome Atlas (TCGA). To refine this library for applications in serum, we sequenced smRNA in 31 control serum samples. These smRNA sequences were filtered from the larger library, reducing its size to 250,332 oncRNAs. The diagnostic performance of these oncRNAs was then assessed in an independent cohort of archived serum samples from 96 female patients with clinically diagnosed, untreated breast cancer and 95 age- and sex-matched individuals with no known history of cancer. We sequenced smRNAs at an average depth of 17.7 million 50-bp single-end reads per sample. Of the 250,332 oncRNAs in our library, 171,981 (68.7%) were detected in our independent study cohort. An ensemble of logistic regression models was trained with 5-fold cross-validation, using only those oncRNAs yielding an odds ratio >1 and observed in >6% of samples within each training set. Results: The cohort of 96 breast cancer patients and 95 matched controls had mean ages of 59.4 and 56.3 years, respectively. 
Area under the receiver operating characteristic curve (AUC) for detecting breast cancer was 0.94 (95% CI, 0.85–0.96). Sensitivities for detecting breast cancer at 95% specificity ranged from 0.75 to 0.87 among the four breast cancer stages, including a sensitivity of 0.81 for tumor stage I (Table 1); and from 0.67 to 0.87 among the four main TNM T categories (Table 2). Sensitivities at 95% specificity were relatively high for small tumors, at 0.75 (95% CI, 0.40–0.97) for T1b (>5 mm to ≤10 mm; n = 9) and 0.80 (0.68–0.94) for T1c (>10 mm to ≤20 mm; n = 37). Conclusions: We have demonstrated the potential value of an oncRNA-based liquid biopsy assay by showing that oncRNAs can be used to detect breast cancer in serum samples with high sensitivity, and that detection requires fewer reads than are needed with other platforms. Moreover, we found that this oncRNA-based assay performed well in detecting early-stage breast cancer and small tumors. This suggests that an oncRNA-based liquid biopsy assay may be beneficial for early detection of breast cancer. Table 1. Model sensitivity by tumor stage. For the indicated numbers of cases (N), sensitivity and Pearson-Clopper 95% confidence intervals are reported for tumor detection by the oncRNA-based model at 95% specificity by tumor stage, as defined by the AJCC 7th Edition breast cancer staging system. Table 2. Model sensitivity by tumor size. For the indicated numbers of cases (N), sensitivity and Pearson-Clopper 95% confidence intervals are reported for tumor detection by the oncRNA-based model at 95% specificity by TNM T category, as defined by the AJCC 7th Edition breast cancer staging system. Citation Format: Taylor B. Cavazos, Jeffrey Wang, Oluwadamilare I. Afolabi, Alice Huang, Dung Ngoc Lam, Seda Kilinc, Jieyang Wang, Lisa Fish, Xuan Zhao, Andy Pohl, Helen Li, Kimberly H. Chau, Patrick A. Arensdorf, Fereydoun Hormozdiari, Hani Goodarzi, Babak Alipanahi. 
Orphan non-coding RNAs for early detection of breast cancer with liquid biopsy [abstract]. In: Proceedings of the 2022 San Antonio Breast Cancer Symposium; 2022 Dec 6-10; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2023;83(5 Suppl):Abstract nr P1-05-18.
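The abstract above reports exact binomial (Pearson-Clopper, more commonly written Clopper-Pearson) 95% confidence intervals for sensitivity. As a rough illustration of how such an interval can be computed, here is a minimal Python sketch using only the standard library. The function names and the example counts are hypothetical and are not taken from the study; the bounds are found by bisection on the binomial CDF rather than through a beta quantile function.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a proportion of k successes in n."""

    def solve(f, target):
        # f is monotonically decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Hypothetical counts for illustration only: 8 of 10 cases detected.
low, high = clopper_pearson(8, 10)
print(f"sensitivity 0.80, 95% CI ({low:.2f}, {high:.2f})")
```

With small groups the exact interval is wide, which is consistent with the broad interval quoted for the T1b category.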
DOI: 10.1158/1538-7445.am2023-5711
2023
Abstract 5711: Blood-based early detection of non-small cell lung cancer using orphan noncoding RNAs
Abstract Background: Orphan non-coding RNAs (oncRNAs) are a novel category of small non-coding RNAs that are present in the tumor tissue and blood of people with cancer and largely absent in people without cancer. To examine the potential of using oncRNAs for early cancer detection via liquid biopsy, we assessed the oncRNA content of serum from people with and without non-small cell lung cancer (NSCLC) and developed a prediction model for NSCLC. Methods: A total of 540 serum samples were obtained from Indivumed (Hamburg, Germany) and MT Group (Los Angeles, CA) and divided into cohort A for training (150 NSCLC cases, 219 controls; female: 30.7%/36.1%; mean age: 67.9 ± 8.9/62.4 ± 9.2; ever-smoker: 95.3%/26.9%, respectively) and cohort B for internal validation (88 NSCLC cases, 83 controls; female: 40.9%/54.2%; mean age: 62.7 ± 9.2/54.1 ± 12.4; ever-smoker: 89.8%/6.0%, respectively). We used RNA isolated from 0.5 mL of serum to generate and sequence libraries at an average depth of 18.5 ± 6.5 million 50-bp single-end reads using next-generation sequencing. Previously, we created a large catalog of NSCLC oncRNAs found in 999 NSCLC tumor tissues and largely absent in 679 normal samples from The Cancer Genome Atlas (TCGA) smRNA-seq database. This catalog was distilled by removing smRNA species found in the serum of an independent cohort of 31 non-cancer donors to yield a final NSCLC oncRNA catalog of 81,004 distinct oncRNA species. This distilled catalog was the reference for identifying NSCLC oncRNAs in the present study. Using oncRNA data we generated by sequencing samples from cohort A, we trained a logistic regression model for predicting NSCLC presence. The model was then validated in cohort B. 
Results: From the 540 samples sequenced, we detected 64,379 oncRNAs from the distilled TCGA oncRNA catalog in at least one sample across both cohorts (A: 55,650, B: 47,539). Using 5-fold cross-validation, the AUC of the logistic regression model was 0.95 (95% CI: 0.93-0.97) for the training cohort, and was 0.98 (0.97-0.99) for the validation cohort. Sensitivities for detecting NSCLC at 95% specificity were 0.78 (0.69-0.86) for early stage (I/II) cancer and 0.75 (0.60-0.87) for late stage (III/IV) cancer in the training cohort, and 0.92 (0.83-0.98) and 1.0 (0.85-1.0), respectively, in the validation cohort. Conclusion: These results demonstrate the potential for accurate, sensitive, and early detection of NSCLC through sequencing the oncRNA content of a routine blood draw. The performance of the model trained on one cohort and internally validated in a separate cohort supports the generalizability of this approach in detecting NSCLC. Citation Format: Mehran Karimzadeh, Jeffrey Wang, Aiden Sababi, Oluwadamilare I. Afolabi, Dung Ngoc Lam, Alice Huang, Diana R. Corti, Kristle C. Garcia, Seda Kilinc, Xuan Zhao, Jieyang Wang, Taylor B. Cavazos, Patrick Arensdorf, Kimberly H. Chau, Helen Li, Hani Goodarzi, Lisa Fish, Fereydoun Hormozdiari, Babak Alipanahi. Blood-based early detection of non-small cell lung cancer using orphan noncoding RNAs [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2023; Part 1 (Regular and Invited Abstracts); 2023 Apr 14-19; Orlando, FL. Philadelphia (PA): AACR; Cancer Res 2023;83(7_Suppl):Abstract nr 5711.
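The results above are quoted as sensitivity at 95% specificity. One common way to obtain such a figure, sketched here in Python under assumptions of our own, is to pick the score threshold from the control samples so that at least 95% of controls fall at or below it, then count the fraction of cases scoring above it. The function name and the toy scores are hypothetical, and the abstract does not state exactly how its threshold was chosen.

```python
import math


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.95):
    """Return (sensitivity, threshold) with the threshold fixed on controls.

    A sample is called positive when its score is strictly above the threshold.
    The threshold is the smallest control score such that at least
    `specificity` of the controls score at or below it.
    """
    ranked = sorted(control_scores)
    # Small epsilon guards against floating-point overshoot in the product.
    needed = math.ceil(specificity * len(ranked) - 1e-9)
    threshold = ranked[max(needed, 1) - 1]
    detected = sum(score > threshold for score in case_scores)
    return detected / len(case_scores), threshold


# Toy model scores for illustration only (not study data).
controls = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
cases = [0.20, 0.50, 0.60, 0.95]
sens, thr = sensitivity_at_specificity(cases, controls, specificity=0.9)
print(f"threshold={thr}, sensitivity={sens}")
```

Fixing the threshold on controls keeps the false-positive rate comparable across subgroups, so stage-specific sensitivities can be read against the same operating point.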
DOI: 10.1200/jco.2023.41.16_suppl.e15651
2023
Consensus molecular subtypes and orphan non-coding RNAs in colorectal cancer.
e15651 Background: Colorectal cancer (CRC) is a highly heterogeneous disease with variable response to different treatment strategies. The consensus molecular subtype (CMS) classification system has been proposed as a gene expression-based framework aimed at capturing the genetic and molecular heterogeneity in CRCs and potentially stratifying patients into clinically relevant subgroups for treatment selection. We previously demonstrated that orphan non-coding RNAs (oncRNAs), a novel class of cancer-enriched small non-coding RNAs (smRNA), exhibit cancer-specific expression. We also showed that oncRNAs can be used in a liquid biopsy assay for accurate early-stage detection of CRCs. We hypothesize that oncRNAs may be informative of CRC subtype-specific transcriptional signatures as defined by the CMS groups. Methods: To assess the effectiveness of oncRNAs for categorization of CMS group, we examined smRNA-seq data from The Cancer Genome Atlas (TCGA) for colon and rectal adenocarcinoma tumors (n = 504), consisting of 77 CMS1, 216 CMS2, 71 CMS3, and 140 CMS4 tumors. CRC-specific oncRNA expression profiles were generated for all tumors in the TCGA cohort. We then trained a machine learning model on oncRNA expression profiles to predict CMS groups using a 5-fold cross-validation setup. For our validation cohort, we applied our model to an independent CRC cohort (n = 348) consisting of 56 CMS1, 101 CMS2, 59 CMS3, and 132 CMS4 CRC samples. Final CMS predictions for the validation cohort were made by averaging across the five training models. Results: Within TCGA testing folds, the model's micro-averaged ROC-AUC for CMS classification was 0.879 (95% CI: 0.86–0.90). In the validation cohort, micro-averaged AUC remained high at 0.915 (0.90–0.93). Sensitivities and AUCs for each subtype are provided in Table 1.
The highest model-predicted likelihood for each CMS subtype had an accuracy of 73.6% (68.6%–78.1%), and the two most probable predictions had an accuracy of 91.1% (87.6%–93.9%). Conclusions: This study demonstrates the utility of oncRNA expression profiling to predict CMS for colorectal cancer. The generalizability of our machine learning model suggests that oncRNAs are informative of the biological differences across distinct CMS groups and may be used as a proxy for gene expression signatures to stratify colorectal cancers. Because smRNAs are more likely to be secreted into blood, we posit that an oncRNA-based liquid biopsy may open new opportunities to monitor patients and track changes in their tumor biology in real-time throughout their course of disease and treatment. [Table: see text]
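The abstract above describes averaging predictions across the five cross-validation models and then reporting accuracy for the single most likely subtype and for the two most probable subtypes. A minimal Python sketch of those two steps follows. The function names and the toy probabilities are hypothetical, and the abstract does not say whether probabilities or some other model output were averaged, so averaging class probabilities is an assumption here.

```python
def average_probabilities(per_model_probs):
    """Average class probabilities over models.

    per_model_probs[m][i][c] is model m's probability that sample i is class c.
    """
    n_models = len(per_model_probs)
    n_samples = len(per_model_probs[0])
    n_classes = len(per_model_probs[0][0])
    return [
        [
            sum(per_model_probs[m][i][c] for m in range(n_models)) / n_models
            for c in range(n_classes)
        ]
        for i in range(n_samples)
    ]


def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for row, label in zip(probs, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy example: two models, three samples, four classes (stand-ins for CMS1-CMS4).
model_a = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
model_b = [[0.5, 0.3, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1], [0.1, 0.1, 0.5, 0.3]]
labels = [0, 0, 3]

avg = average_probabilities([model_a, model_b])
print(top_k_accuracy(avg, labels, k=1), top_k_accuracy(avg, labels, k=2))
```

Top-2 accuracy is always at least as high as top-1 accuracy, which matches the direction of the two figures quoted in the abstract.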
DOI: 10.1101/2023.12.01.569227
2023
High-throughput ML-guided design of diverse single-domain antibodies against SARS-CoV-2
Abstract Treating rapidly evolving pathogenic diseases such as COVID-19 requires a therapeutic approach that accommodates the emergence of viral variants over time. Our machine learning (ML)-guided sequence design platform combines high-throughput experiments with ML to generate highly diverse single-domain antibodies (VHHs) that bind and neutralize SARS-CoV-1 and SARS-CoV-2. Crucially, the model, trained using binding data against early SARS-CoV variants, accurately captures the relationship between VHH sequence and binding activity across a broad swathe of sequence space. We discover ML-designed VHHs that exhibit considerable cross-reactivity and successfully neutralize targets not seen during training, including the Delta and Omicron BA.1 variants of SARS-CoV-2. Our ML-designed VHHs include thousands of variants 4-15 mutations from the parent sequence with significantly improved activity, demonstrating that ML-guided sequence design can successfully navigate vast regions of sequence space to unlock and future-proof potential therapeutics against rapidly evolving pathogens.
DOI: 10.1038/s42003-020-0880-x
2020
Author Correction: Genome-wide association study of knee pain identifies associations with GDF5 and COL27A1 in UK Biobank
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
DOI: 10.1038/s41467-021-22411-w
2021
Author Correction: Genome-wide association study of depression phenotypes in UK Biobank identifies variants in excitatory synaptic pathways
A Correction to this paper has been published: https://doi.org/10.1038/s41467-021-22411-w
2011
New Approaches to Protein NMR Automation
The three-dimensional structure of a protein molecule is the key to understanding its biological and physiological properties. A major problem in bioinformatics is to efficiently determine the three-dimensional structures of query proteins. Protein NMR structure determination is one of the main experimental methods and is comprised of: (i) protein sample production and isotope labelling, (ii) collecting NMR spectra, and (iii) analysis of the spectra to produce the protein structure. In protein NMR, the three-dimensional structure is determined by exploiting a set of distance restraints between spatially proximate atoms. Currently, no practical automated protein NMR method exists that is without human intervention. We first propose a complete automated protein NMR pipeline, which can efficiently be used to determine the structures of moderate sized proteins. Second, we propose a novel and efficient semidefinite programming-based (SDP) protein structure determination method. The proposed automated protein NMR pipeline consists of three modules: (i) an automated peak picking method, called PICKY, (ii) a backbone chemical shift assignment method, called IPASS, and (iii) a protein structure determination method, called FALCON-NMR. When tested on four real protein data sets, this pipeline can produce structures with reasonable accuracies, starting from NMR spectra. This general method can be applied to other macromolecule structure determination methods. For example, a promising application is RNA NMR-assisted secondary structure determination. In the second part of this thesis, due to the shortcomings of FALCON-NMR, we propose a novel SDP-based protein structure determination method from NMR data, called SPROS. Most of the existing prominent protein NMR structure determination methods are based on molecular dynamics coupled with a simulated annealing schedule.
In these methods, an objective function representing the error between observed and given distance restraints is minimized; these objective functions are highly non-convex and difficult to optimize. Euclidean distance geometry methods based on SDP provide a natural formulation for realizing a three-dimensional structure from a set of given distance constraints. However, the complexity of the SDP solvers increases cubically with the input matrix size, i.e., the number of atoms in the protein, and the number of constraints. In fact, the complexity of SDP solvers is a major obstacle in their applicability to the protein NMR problem. To overcome these limitations, the SPROS method models the protein molecule as a set of intersecting two- and three-dimensional cliques. We adapt and extend a technique called semidefinite facial reduction for the SDP matrix size reduction, which makes the SDP problem size approximately one quarter of the original problem. The reduced problem is solved nearly one hundred times faster and is more robust against numerical problems. Reasonably accurate results were obtained when SPROS was applied to a set of 20 real protein data…
DOI: 10.1109/icc.2006.255385
2006
A Blind Channel Estimation Technique for TH-PPM UWB Systems
Because UWB systems are designed to operate in environments with dense multipath characteristics (i.e., indoor environments), the channel estimation task is much more involved than in conventional systems. In this paper, we present a new blind channel estimation technique for Time-Hopping Pulse-Position-Modulation (TH-PPM) systems. Our proposed method is based on the Pulse Compression (PC) technique introduced in [1]. In spite of its low complexity, the proposed blind channel estimation method outperforms the Non-Data Aided Maximum Likelihood (NDA-ML) technique. Our simulation results show a promising gain of 4 dB around a Bit Error Rate (BER) of 10⁻³, compared to NDA-ML. The performance loss compared to Perfect Channel Knowledge (PCK) is only 1.5 dB at this BER.
DOI: 10.1109/wocn.2005.1436029
2005
A novel soft handoff algorithm for fair network resources distribution
In this paper, we propose a novel and simple soft handoff algorithm that promotes fair distribution of network resources among users. Because the soft handoff mechanism is centralized, no additional computational complexity is imposed on the mobile sets. Because a fuzzy inference system (FIS) adapts the thresholds of our algorithm, it releases traffic channels at high traffic load, increasing the carried traffic. This paper compares the performance of our algorithm with the IS-95A soft handoff algorithm using the following performance indicators: outage probability, new call blocking probability, handoff call blocking probability, expected number of base stations in the active set, expected number of changes in the active set, and network carried traffic. Across these indicators, our algorithm tends to outperform IS-95A soft handoff, especially in new call blocking probability and network carried traffic.
DOI: 10.1090/conm/622/12429
2014
Manifold unfolding by Isometric Patch Alignment with an application in protein structure determination
DOI: 10.1007/978-3-642-33122-0_22
2012
How Accurately Can We Model Protein Structures with Dihedral Angles?
Previous studies show that bond lengths and angles of the same type fit Gaussian distributions well, with small standard deviations, on high-resolution protein structure data. The mean values of these Gaussian distributions have been widely used as ideal bond lengths and angles in bioinformatics. However, we are not aware of any work evaluating how accurately protein structures can be modeled with dihedral angles and ideal bond lengths and angles. In this paper, we first introduce the protein structure idealization problem. Then, we develop a fast O(nm / ε) dynamic programming algorithm to find an approximately optimal idealized protein backbone structure according to our scoring function. Consequently, we demonstrate that idealized backbone structures always exist with small changes and significantly better free energy. We also apply our algorithm to refine protein pseudo-structures determined in NMR experiments.
DOI: 10.1158/1538-7445.am2022-3353
2022
Abstract 3353: Discovery and validation of orphan noncoding RNA profiles across multiple cancers in TCGA and two independent cohorts
Abstract Small non-coding RNAs (sncRNAs) have established roles as post-transcriptional regulators of cancer pathogenesis. We recently reported a novel and previously unannotated class of cancer-specific sncRNAs in breast cancer and demonstrated that breast cancer cells exploit a specific sncRNA to promote cancer metastasis. However, the extent to which these sncRNAs, which we have collectively termed orphan non-coding RNAs (oncRNAs), are present in other cancer types is unknown. To address this question and define a high-confidence set of oncRNAs, we used smRNA-seq data from 6 cancer sites (breast, colorectal, kidney, liver, lung, and stomach) and their corresponding normal tissues from The Cancer Genome Atlas (TCGA; 4,445 cancer, 431 normal) and identified a total of 144,695 oncRNAs that are significantly present in cancer and largely absent in normal tissue (Fisher’s Exact Test and Benjamini-Hochberg correction, FDR < 0.1). To evaluate if this set of TCGA-derived oncRNAs could be validated in independent datasets, we examined smRNA-seq data from two large independent cohorts comprising these same cancer and normal tissue types (Indivumed, Hamburg, Germany). Cohort A consists of 4,024 samples (2,245 cancer, 1,779 normal) and cohort B consists of 2,874 samples (2,063 cancer; 811 normal). oncRNAs in these cohorts were annotated following the same procedure used for TCGA data. TCGA-derived oncRNAs were considered validated in the independent cohorts if they were present in a significantly higher number of cancer samples compared to adjacent normal tissue samples. In cohort A, 140,191 (96.9%) of TCGA-derived oncRNAs were detected in at least one sample, of which 74,634 (51.6%) were validated as oncRNAs. In cohort B, 140,147 (96.9%) oncRNAs were observed and 68,366 (47.2%) were validated. The degree of overlap between the validated oncRNAs in each cohort was significant, with 54,294 (37.5%) overlapping oncRNAs (hypergeometric test, P=0).
We also found that oncRNAs are informative of cancer tissue of origin, demonstrating the existence of consistent cancer-specific oncRNA expression profiles in independent studies. Using the TCGA-derived oncRNAs as features, we trained an eXtreme Gradient Boosting (XGB) model on TCGA data to classify cancer samples by the 6 tissues of origin. The TCGA-trained model showed high performance when evaluated on both cohorts A and B, achieving accuracies of 91.5% (95% CI: 90.3%-92.7%) and 96% (94.7%-97%), respectively. For comparison, this model achieved an accuracy of 96% (94.5%-97.2%) on held-out TCGA data (80/20 train/test split). Our results show a robust validation of TCGA-derived oncRNAs in external, independently sourced and processed cancer tissue cohorts across a heterogeneous set of cancer sites. Our machine learning model also demonstrates that oncRNA profiles can be used to predict cancer tissue of origin with high generalizability and accuracy. Citation Format: Jeffrey Wang, Helen Li, Lisa Fish, Kimberly H. Chau, Patrick Arensdorf, Hani Goodarzi, Babak Alipanahi. Discovery and validation of orphan noncoding RNA profiles across multiple cancers in TCGA and two independent cohorts [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 3353.
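The abstract above selects oncRNAs with Fisher's Exact Test followed by Benjamini-Hochberg correction at FDR < 0.1. The following Python sketch shows one way these two steps fit together, using only the standard library. The function names and the toy counts are hypothetical, and the one-sided ("more often present in cancer") form of the test is our assumption, since the abstract does not state the sidedness.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: cancer samples with the RNA      b: cancer samples without it
    c: normal samples with the RNA      d: normal samples without it
    Returns P(X >= a) under the hypergeometric null.
    """
    n_total = a + b + c + d
    n_present = a + c
    n_cancer = a + b
    numerator = sum(
        comb(n_present, x) * comb(n_total - n_present, n_cancer - x)
        for x in range(a, min(n_present, n_cancer) + 1)
    )
    return numerator / comb(n_total, n_cancer)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy counts for three candidate RNAs (illustration only, not study data).
tables = [(40, 60, 1, 49), (10, 90, 5, 45), (25, 75, 2, 48)]
p_vals = [fisher_exact_greater(*t) for t in tables]
q_vals = benjamini_hochberg(p_vals)
selected = [i for i, q in enumerate(q_vals) if q < 0.1]
print(selected)
```

Exact integer binomials keep the small example transparent; at the scale of hundreds of thousands of candidate sequences a vectorized statistics library would be the more practical choice.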
DOI: 10.1038/s41467-022-33222-y
2022
Author Correction: Genome-wide association and epidemiological analyses reveal common genetic origins between uterine leiomyomata and endometriosis
DOI: 10.1101/414854
2018
Genome-wide association studies of impulsive personality traits (BIS-11 and UPPSP) and drug Experimentation in up to 22,861 adult research participants
Abstract Background: Impulsive personality traits are complex heritable traits that are governed by frontal-subcortical circuits and are associated with numerous neuropsychiatric disorders, particularly drug abuse. Methods: In collaboration with the genetics company 23andMe, Inc., we performed several genome-wide association studies (GWAS) on measures of impulsive personality traits (the short version of the UPPSP Impulsive Behavior Scale, and the Barratt Impulsiveness Scale [BIS-11]) and drug experimentation (the number of drug classes an individual has tried in their lifetime) in up to 22,861 male and female adult research participants of European ancestry. Results: Impulsive personality traits and drug experimentation showed SNP-heritabilities that ranged from 5 to 11%. Genetic variants in the CADM2 locus were significantly associated with the UPPSP Sensation Seeking subscale (P = 8.3 × 10⁻⁹, rs139528938) and showed a suggestive association with drug experimentation (P = 3.0 × 10⁻⁷, rs2163971; r² = 0.68 with rs139528938); CADM2 has been previously associated with measures of risky behaviors and self-reported risk tolerance, cannabis initiation, alcohol consumption, as well as information speed processing, body mass index (BMI) variation and obesity. Furthermore, genetic variants in the CACNA1I locus were significantly associated with the UPPSP Negative Urgency subscale (P = 3.8 × 10⁻⁸, rs199694726). Multiple subscales from both UPPSP and BIS showed strong genetic correlations (>0.5) with drug experimentation and other substance use traits measured in independent cohorts, including smoking initiation, and lifetime cannabis use. Several UPPSP and BIS subscales were genetically correlated with attention-deficit/hyperactivity disorder (r_g = 0.30-0.51, p < 8.69 × 10⁻³), supporting their validity as endophenotypes. Conclusions: Our findings demonstrate a role for common genetic contributions to individual differences in impulsivity.
Furthermore, our study is the first to provide a genetic dissection of the relationship between different types of impulsive personality traits and various psychiatric disorders.
DOI: 10.48550/arxiv.2011.13012
2020
Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology
Genome-wide association studies (GWAS) require accurate cohort phenotyping, but expert labeling can be costly, time-intensive, and variable. Here we develop a machine learning (ML) model to predict glaucomatous optic nerve head features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; $P\leq5\times10^{-8}$) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 92 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR, with select loci near genes involved in neuronal and synaptic biology or known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR and primary open-angle glaucoma in the independent EPIC-Norfolk cohort.