
Iván Olier

Here are all the papers by Iván Olier that you can download and read on OA.mg.

DOI: 10.3390/w12071885
2020
Cited 133 times
Urban Water Demand Prediction for a City That Suffers from Climate Change and Population Growth: Gauteng Province Case Study
The proper management of a municipal water system is essential to sustain cities and support the water security of societies. Estimating urban water demand has always been a challenging task for managers of water utilities and policymakers. This paper applies a novel methodology that includes data pre-processing and an Artificial Neural Network (ANN) optimized with the Backtracking Search Algorithm (BSA-ANN) to estimate monthly water demand from previous water consumption. Historical data of monthly water consumption in the Gauteng Province, South Africa, for the period 2007–2016, were selected for the creation and evaluation of the methodology. Data pre-processing techniques played a crucial role in enhancing the quality of the data before creating the prediction model. The BSA-ANN model yielded the best result, with a root mean square error and a coefficient of efficiency of 0.0099 megaliters and 0.979, respectively. Moreover, it proved more efficient and reliable than an ANN optimized with the Crow Search Algorithm (CSA-ANN), based on the scale of error. Overall, this paper presents a new application for the hybrid BSA-ANN model that can successfully predict water demand with high accuracy in a city that suffers heavily from the impact of climate change and population growth.
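As a rough illustration of the prediction setup described above (not the paper's BSA-ANN, which has no standard library implementation), the Python sketch below trains a plain neural network on lagged monthly consumption and reports the two metrics the paper uses, RMSE and the coefficient of efficiency. The `demand` series, the 12-month lag window and the train/test split are illustrative assumptions; a metaheuristic such as BSA would sit where the hyperparameters are fixed here, for example via a search wrapper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
demand = rng.random(120)  # placeholder for 120 months (2007-2016) of consumption

# lag features: predict month t from the previous 12 months
lags = 12
X = np.column_stack([demand[i:len(demand) - lags + i] for i in range(lags)])
y = demand[lags:]
split = int(0.8 * len(y))

ann = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
ann.fit(X[:split], y[:split])
pred = ann.predict(X[split:])

rmse = mean_squared_error(y[split:], pred) ** 0.5
# Nash-Sutcliffe coefficient of efficiency, the second metric reported in the paper
ce = 1 - np.sum((y[split:] - pred) ** 2) / np.sum((y[split:] - y[split:].mean()) ** 2)
print(f"RMSE={rmse:.4f}, CE={ce:.3f}")
```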
DOI: 10.1371/journal.pone.0099825
2014
Cited 150 times
ClinicalCodes: An Online Clinical Codes Repository to Improve the Validity and Reproducibility of Research Using Electronic Medical Records
Lists of clinical codes are the foundation for research undertaken using electronic medical records (EMRs). If clinical code lists are not available, reviewers are unable to determine the validity of research, full study replication is impossible, researchers are unable to make effective comparisons between studies, and the construction of new code lists is subject to much duplication of effort. Despite this, the publication of clinical codes is rarely if ever a requirement for obtaining grants, validating protocols, or publishing research. In a representative sample of 450 EMR primary research articles indexed on PubMed, we found that only 19 (5.1%) were accompanied by a full set of published clinical codes and 32 (8.6%) stated that code lists were available on request. To help address these problems, we have built an online repository where researchers using EMRs can upload and download lists of clinical codes. The repository will enable clinical researchers to better validate EMR studies, build on previous code lists and compare disease definitions across studies. It will also assist health informaticians in replicating database studies, tracking changes in disease definitions or clinical coding practice through time and sharing clinical code information across platforms and data sources as research objects.
DOI: 10.1007/s10334-008-0146-y
2008
Cited 124 times
Multiproject–multicenter evaluation of automatic brain tumor classification by magnetic resonance spectroscopy
Automatic brain tumor classification by MRS has been under development for more than a decade. Nonetheless, to our knowledge, there are no published evaluations of predictive models with unseen cases that are subsequently acquired in different centers. The multicenter eTUMOUR project (2004–2009), which builds upon previous expertise from the INTERPRET project (2000–2002), has allowed such an evaluation to take place. A total of 253 pairwise classifiers for glioblastoma, meningioma, metastasis, and low-grade glial diagnosis were inferred based on 211 SV short TE INTERPRET MR spectra obtained at 1.5 T (PRESS or STEAM, 20–32 ms) and automatically pre-processed. Afterwards, the classifiers were tested with 97 spectra, which were subsequently compiled during eTUMOUR. In our results based on subsequently acquired spectra, accuracies of around 90% were achieved for most of the pairwise discrimination problems. The exception was the glioblastoma versus metastasis discrimination, which was below 78%. A clearer definition of metastases may be obtained by other approaches, such as MRSI + MRI. The prediction of the tumor type from in-vivo MRS is possible using classifiers developed from previously acquired data, in different hospitals with different instrumentation under the same acquisition protocols. This methodology may find application in assisting the diagnosis of new brain tumor cases and in the quality control of multicenter MRS databases.
DOI: 10.1136/bmjopen-2015-009010
2015
Cited 98 times
Inequalities in physical comorbidity: a longitudinal comparative cohort study of people with severe mental illness in the UK
Little is known about the prevalence of comorbidity in people with severe mental illness (SMI) in UK primary care. We calculated the prevalence of SMI by UK country, English region and deprivation quintile, antipsychotic and antidepressant medication prescription rates for people with SMI, and prevalence rates of common comorbidities in people with SMI compared with people without SMI. Retrospective cohort study from 2000 to 2012, using 627 general practices contributing to the Clinical Practice Research Datalink, a UK primary care database. Each identified case (346,551) was matched for age, sex and general practice with 5 randomly selected control cases (1,732,755) with no diagnosis of SMI at each yearly time point. Prevalence rates were calculated for 16 conditions. SMI rates were highest in Scotland and in more deprived areas. Rates increased in England, Wales and Northern Ireland over time, with the largest increase in Northern Ireland (0.48% in 2000/2001 to 0.69% in 2011/2012). Annual prevalence rates of all conditions were higher in people with SMI compared with those without SMI. The discrepancy between the prevalence of those with and without SMI increased over time for most conditions. A greater increase in the mean number of additional conditions was observed in the SMI population over the study period (0.6 in 2000/2001 to 1.0 in 2011/2012) compared with those without SMI (0.5 in 2000/2001 to 0.6 in 2011/2012). For both groups, most conditions were more prevalent in more deprived areas, whereas for the SMI group conditions such as hypothyroidism, chronic kidney disease and cancer were more prevalent in more affluent areas. Our findings highlight the health inequalities faced by people with SMI. The provision of appropriate and timely health prevention, promotion and monitoring activities to reduce these health inequalities is needed, especially in deprived areas.
DOI: 10.3390/w12061628
2020
Cited 96 times
A Novel Methodology for Prediction Urban Water Demand by Wavelet Denoising and Adaptive Neuro-Fuzzy Inference System Approach
Accurate and reliable urban water demand prediction is imperative for providing the basis to design, operate, and manage water systems, especially under scarcity of natural water resources. A new methodology combining discrete wavelet transform (DWT) with an adaptive neuro-fuzzy inference system (ANFIS) is proposed to predict monthly urban water demand based on several intervals of historical water consumption. This ANFIS model is evaluated against a hybrid crow search algorithm and artificial neural network (CSA-ANN), since these methods have been successfully used recently to tackle a range of engineering optimization problems. The study outcomes reveal that (1) data preprocessing is essential for denoising raw time series and choosing the model inputs to render the highest model performance; (2) both methodologies, ANFIS and CSA-ANN, are statistically equivalent and capable of accurately predicting monthly urban water demand, based on several statistical metrics such as the coefficient of efficiency (0.974 and 0.971, respectively). This study could help policymakers manage extensions of urban water systems in response to increasing demand with low decision-related risk.
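The DWT denoising step highlighted in the abstract can be sketched with PyWavelets; the ANFIS model itself has no standard Python implementation, so only the preprocessing half is shown. The wavelet family ('db4'), the decomposition level and the universal soft threshold are illustrative choices, not taken from the paper.

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 12 * np.pi, 120)) + 0.3 * rng.standard_normal(120)

# multilevel discrete wavelet decomposition of the raw monthly series
coeffs = pywt.wavedec(series, 'db4', level=2)

# soft-threshold the detail coefficients (universal threshold from the finest level)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745
thr = sigma * np.sqrt(2 * np.log(len(series)))
coeffs[1:] = [pywt.threshold(c, thr, mode='soft') for c in coeffs[1:]]

# reconstruct the denoised series to feed the downstream prediction model
denoised = pywt.waverec(coeffs, 'db4')[:len(series)]
```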
DOI: 10.1016/j.ejps.2021.105765
2021
Cited 55 times
Chitosan nanoparticles for enhancing drugs and cosmetic components penetration through the skin
Chitosan nanoparticles (CT NPs) have attractive biomedical applications due to their unique properties. The present research aimed to develop chitosan nanoparticles to be used as skin delivery systems for cosmetic components and drugs and to track their penetration behaviour through pig skin. CT NPs were prepared by the ionic gelation technique using sodium tripolyphosphate (TPP) and Acacia as crosslinkers. The particle sizes of the NPs appeared to depend on the molecular weight of chitosan and the concentrations of both chitosan and crosslinkers. CT NPs were positively charged, as demonstrated by their Zeta potential values. The formation of the nanoparticles was confirmed by FTIR and DSC. Both SEM and TEM micrographs showed that both CT-Acacia and CT:TPP NPs were smooth, spherical in shape and distributed uniformly, with a size range of 200 nm to 300 nm. The CT:TPP NPs retained an average of 98% of the added water over a 48-hour period. CT-Acacia NPs showed high moisture absorption but lower moisture retention capacity, which indicates their competency to entrap polar actives in cosmetics and release the encapsulated actives in low-polarity skin conditions. Cytotoxicity studies using the MTT assay showed that CT NPs made using TPP or Acacia crosslinkers were similarly non-toxic to human dermal fibroblast cells. Cellular uptake of the NPs was observed using live-cell imaging microscopy, demonstrating the strong cellular internalisation of CT:TPP NPs and CT-Acacia NPs. Confocal laser scanning microscopy revealed that CT NPs of 530 nm particle size containing fluorescein sodium salt as a marker were able to penetrate through the pig skin and gather in the dermis layer. These results show that CT NPs have the ability to deliver actives and cosmetic components through the skin and to be used as cosmetic and dermal drug delivery systems.
DOI: 10.1093/cvr/cvab169
2021
Cited 41 times
How machine learning is impacting research in atrial fibrillation: implications for risk prediction and future management
There has been an exponential growth of artificial intelligence (AI) and machine learning (ML) publications aimed at advancing our understanding of atrial fibrillation (AF), which has been mainly driven by the confluence of two factors: the advances in deep neural networks (DeepNNs) and the availability of large, open access databases. It is observed that most of the attention has centred on applying ML for detecting AF, particularly using electrocardiograms (ECGs) as the main data modality. Nearly a third of these studies used DeepNNs to minimize or eliminate the need for transforming the ECGs to extract features prior to ML modelling; however, we did not observe a significant advantage in following this approach. We also found a fraction of studies using other data modalities, and others centred on aims such as risk prediction and AF management. From the clinical perspective, AI/ML can help expand the utility of AF detection and risk prediction, especially for patients with additional comorbidities. Embedding AI/ML-based detection and risk prediction into applications and smart mobile health (mHealth) technology would enable 'real time' dynamic assessments. AI/ML could also adapt to treatment changes over time, as well as to incident risk factors. Incorporation of a dynamic AI/ML model into mHealth technology would facilitate 'real time' assessment of stroke risk, facilitating mitigation of modifiable risk factors (e.g. blood pressure control). Overall, this would lead to an improvement in clinical care for patients with AF.
DOI: 10.1093/ehjqcco/qcw025
2016
Cited 71 times
Impact of co-morbid burden on mortality in patients with coronary heart disease, heart failure, and cerebrovascular accident: a systematic review and meta-analysis
We sought to investigate the prognostic impact of co-morbid burden as defined by the Charlson Co-morbidity Index (CCI) in patients with a range of prevalent cardiovascular diseases. We searched MEDLINE and EMBASE to identify studies that evaluated the impact of CCI on mortality in patients with cardiovascular disease. A random-effects meta-analysis was undertaken to evaluate the impact of CCI on mortality in patients with coronary heart disease (CHD), heart failure (HF), and cerebrovascular accident (CVA). A total of 11 studies of acute coronary syndrome (ACS), 2 of stable coronary disease, 5 of percutaneous coronary intervention (PCI), 13 of HF, and 4 of CVA met the inclusion criteria. An increase in CCI score per point was significantly associated with a greater risk of mortality in patients with ACS [pooled relative risk ratio (RR) 1.33; 95% CI 1.15-1.54], PCI (RR 1.21; 95% CI 1.12-1.31), stable coronary artery disease (RR 1.38; 95% CI 1.29-1.48), and HF (RR 1.21; 95% CI 1.13-1.29), but not CVA. A CCI score of >2 significantly increased the risk of mortality in ACS (RR 2.52; 95% CI 1.58-4.04), PCI (RR 3.36; 95% CI 2.14-5.29), HF (RR 1.76; 95% CI 1.65-1.87), and CVA (RR 3.80; 95% CI 1.20-12.01). Increasing co-morbid burden as defined by CCI is associated with a significant increase in risk of mortality in patients with underlying CHD, HF, and CVA. CCI provides a simple way of predicting adverse outcomes in patients with cardiovascular disease and should be incorporated into decision-making processes when counselling patients.
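For readers unfamiliar with random-effects pooling, here is a minimal DerSimonian-Laird implementation of the kind of meta-analysis reported above. The three (RR, 95% CI) tuples are made up for illustration; they are not the studies in the review.

```python
import numpy as np

# (RR, lower CI, upper CI) per study -- illustrative values only
studies = [(1.30, 1.10, 1.54), (1.25, 1.05, 1.49), (1.40, 1.18, 1.66)]
y = np.log([rr for rr, lo, hi in studies])               # log relative risks
se = [(np.log(hi) - np.log(lo)) / (2 * 1.96) for rr, lo, hi in studies]
v = np.array(se) ** 2                                    # within-study variances

w = 1 / v                                                # fixed-effect weights
q = np.sum(w * (y - np.sum(w * y) / np.sum(w)) ** 2)     # Cochran's Q
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(y) - 1)) / c)                  # between-study variance

w_re = 1 / (v + tau2)                                    # random-effects weights
pooled = np.sum(w_re * y) / np.sum(w_re)
se_pooled = np.sqrt(1 / np.sum(w_re))
rr = np.exp(pooled)
ci = np.exp([pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled])
print(f"pooled RR {rr:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")
```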
DOI: 10.1007/s10994-017-5685-x
2017
Cited 65 times
Meta-QSAR: a large-scale application of meta-learning to drug design and discovery
We investigate the learning of quantitative structure activity relationships (QSARs) as a case-study of meta-learning. This application area is of the highest societal importance, as it is a key step in the development of new medicines. The standard QSAR learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (e.g. inhibition of the target), learn a predictive mapping from molecular representation to activity. Although almost every type of machine learning method has been applied to QSAR learning, there is no agreed single best way of learning QSARs, and therefore the problem area is well-suited to meta-learning. We first carried out the most comprehensive comparison to date of machine learning methods for QSAR learning: 18 regression methods and 3 molecular representations, applied to more than 2700 QSAR problems. (These results have been made publicly available on OpenML and represent a valuable resource for testing novel meta-learning methods.) We then investigated the utility of algorithm selection for QSAR problems. We found that this meta-learning approach outperformed the best individual QSAR learning method (random forests using a molecular fingerprint representation) by up to 13%, on average. We conclude that meta-learning outperforms base-learning methods for QSAR learning, and as this investigation is one of the most extensive comparisons of base- and meta-learning methods ever made, it provides evidence for the general effectiveness of meta-learning over base-learning.
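A minimal sketch of the algorithm-selection idea: per task (drug target), cross-validate a set of candidate regressors and keep the winner. The synthetic `make_regression` tasks and the three candidate learners below are stand-ins for the 2700 QSAR problems and 18 methods compared in the paper.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# stand-ins for QSAR tasks: one (X, y) per drug target
tasks = [make_regression(n_samples=200, n_features=50, noise=10, random_state=s)
         for s in range(3)]
candidates = {'rf': RandomForestRegressor(random_state=0), 'ridge': Ridge(), 'svr': SVR()}

for t, (X, y) in enumerate(tasks):
    # algorithm selection: score each candidate by cross-validation, keep the best
    scores = {name: cross_val_score(m, X, y, cv=5, scoring='r2').mean()
              for name, m in candidates.items()}
    best = max(scores, key=scores.get)
    print(f"task {t}: selected {best} (CV r2 = {scores[best]:.3f})")
```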
DOI: 10.1136/heartjnl-2017-312366
2018
Cited 51 times
Association of different antiplatelet therapies with mortality after primary percutaneous coronary intervention
Prasugrel and ticagrelor both reduce ischaemic endpoints in high-risk acute coronary syndromes, compared with clopidogrel. However, the comparative outcomes of these two newer drugs in the context of primary percutaneous coronary intervention (PCI) for ST-elevation myocardial infarction (STEMI) remain unclear. We sought to examine this question using the British Cardiovascular Intervention Society national database in patients undergoing primary PCI for STEMI. Data from January 2007 to December 2014 were used to compare use of P2Y12 antiplatelet drugs in primary PCI in >89,000 patients. Statistical modelling, involving propensity matching, multivariate logistic regression (MLR) and proportional hazards modelling, was used to study the association of different antiplatelet drug use with all-cause mortality. In our main MLR analysis, prasugrel was associated with significantly lower mortality than clopidogrel at both 30 days (OR 0.87, 95% CI 0.78 to 0.97, P=0.014) and 1 year (OR 0.89, 95% CI 0.82 to 0.97, P=0.011) post PCI. Ticagrelor was not associated with any significant differences in mortality compared with clopidogrel at either 30 days (OR 1.07, 95% CI 0.95 to 1.21, P=0.237) or 1 year (OR 1.058, 95% CI 0.96 to 1.16, P=0.247). Finally, ticagrelor was associated with significantly higher mortality than prasugrel at both time points (30 days OR 1.22, 95% CI 1.03 to 1.44, P=0.020; 1 year OR 1.19, 95% CI 1.04 to 1.35, P=0.01). In a cohort of over 89,000 patients undergoing primary PCI for STEMI in the UK, prasugrel is associated with lower 30-day and 1-year mortality than clopidogrel and ticagrelor. Given that an adequately powered comparative randomised trial is unlikely to be performed, these data may have implications for routine care.
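A toy sketch of the propensity-matching step used in this kind of registry analysis: fit a treatment model on covariates, then match each treated patient to the nearest control on the propensity score. All data here are simulated, and the real analysis adjusts for far more case-mix variables.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
n = 2000
X = rng.standard_normal((n, 5))                          # hypothetical case-mix covariates
treated = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))     # drug choice depends on covariates
death = rng.random(n) < 0.05                             # simulated outcome

# propensity score: probability of receiving the treatment given covariates
ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

# 1:1 nearest-neighbour matching of treated patients to controls on the score
nn = NearestNeighbors(n_neighbors=1).fit(ps[~treated].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
controls = np.where(~treated)[0][idx.ravel()]

print("treated mortality:", death[treated].mean(),
      "matched control mortality:", death[controls].mean())
```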
DOI: 10.1128/ec.05029-11
2011
Cited 55 times
A Genomewide Screen for Tolerance to Cationic Drugs Reveals Genes Important for Potassium Homeostasis in Saccharomyces cerevisiae
Potassium homeostasis is crucial for living cells. In the yeast Saccharomyces cerevisiae, the uptake of potassium is driven by the electrochemical gradient generated by the Pma1 H+-ATPase, and this process represents a major consumer of the gradient. We considered that any mutation resulting in an alteration of the electrochemical gradient could give rise to anomalous sensitivity to any cationic drug independently of its toxicity mechanism. Here, we describe a genomewide screen for mutants that present altered tolerance to hygromycin B, spermine, and tetramethylammonium. Two hundred twenty-six mutant strains displayed altered tolerance to all three drugs (202 hypersensitive and 24 hypertolerant), and more than 50% presented a strong or moderate growth defect at a limiting potassium concentration (1 mM). Functional groups such as protein kinases and phosphatases, intracellular trafficking, transcription, or cell cycle and DNA processing were enriched. Essentially, our screen has identified a substantial number of genes that were not previously described to play a direct or indirect role in potassium homeostasis. A subset of 27 representative mutants was selected and subjected to diverse biochemical tests that, in some cases, allowed us to postulate the basis for the observed phenotypes.
DOI: 10.1136/bmjopen-2015-008650
2015
Cited 48 times
Primary care consultation rates among people with and without severe mental illness: a UK cohort study using the Clinical Practice Research Datalink
Little is known about service utilisation by patients with severe mental illness (SMI) in UK primary care. We examined their consultation rate patterns and whether they were impacted by the introduction of the Quality and Outcomes Framework (QOF) in 2004. Retrospective cohort study using individual patient data collected from 2000 to 2012 from 627 general practices contributing to the Clinical Practice Research Datalink, a large UK primary care database. SMI cases (346,551) were matched to 5 individuals without SMI (1,732,755) on age, gender and general practice. Consultation rates were calculated for both groups, across 3 types: face-to-face (primary outcome), telephone and other (not only consultations but including administrative tasks). Poisson regression analyses were used to identify predictors of consultation rates and calculate adjusted consultation rates. Interrupted time-series analysis was used to quantify the effect of the QOF. Over the study period, face-to-face consultations in primary care remained relatively stable in the matched control group (between 4.5 and 4.9 per annum) but increased for people with SMI (8.8-10.9). Women and older patients consulted more frequently in the SMI and the matched control groups, across all 3 consultation types. Following the introduction of the QOF, there was an increase in the annual trend of face-to-face consultations for people with SMI (average increase of 0.19 consultations per patient per year, 95% CI 0.02 to 0.36), which was not observed for the control group (estimates across groups statistically different, p=0.022). The introduction of the QOF was associated with increases in the frequency of monitoring and in the average number of reported comorbidities for patients with SMI. This suggests that the QOF scheme successfully incentivised practices to improve their monitoring of the mental and physical health of this group of patients.
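The interrupted time-series analysis mentioned above can be sketched as a Poisson GLM with a level and trend change at the intervention year; with statsmodels the segmented regression is a few lines. The yearly counts and person-years below are invented, with 2004 (the QOF year) as the break point.

```python
import numpy as np
import statsmodels.api as sm

years = np.arange(2000, 2013)
# hypothetical face-to-face consultation counts and person-years at risk
consults = np.array([8.8, 9.0, 9.1, 9.3, 9.6, 9.9, 10.1, 10.3,
                     10.5, 10.6, 10.8, 10.9, 10.9]) * 1000
patients = np.full(len(years), 1000)

time = years - years[0]                       # underlying secular trend
post = (years >= 2004).astype(int)            # level change after the QOF
time_post = np.maximum(0, years - 2004)       # trend change after the QOF
X = sm.add_constant(np.column_stack([time, post, time_post]))

model = sm.GLM(consults, X, family=sm.families.Poisson(),
               offset=np.log(patients)).fit()
print(model.params[3])   # change in annual trend following the intervention
```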
DOI: 10.1093/eurjpc/zwae008
2024
Unlocking the potential of artificial intelligence in sports cardiology: does it have a role in evaluating athlete’s heart?
The integration of artificial intelligence (AI) technologies is evolving in different fields of cardiology and in particular in sports cardiology. Artificial intelligence offers significant opportunities to enhance risk assessment, diagnosis, treatment planning, and monitoring of athletes. This article explores the application of AI in various aspects of sports cardiology, including imaging techniques, genetic testing, and wearable devices. The use of machine learning and deep neural networks enables improved analysis and interpretation of complex datasets. However, ethical and legal dilemmas must be addressed, including informed consent, algorithmic fairness, data privacy, and intellectual property issues. The integration of AI technologies should complement the expertise of physicians, allowing for a balanced approach that optimizes patient care and outcomes. Ongoing research and collaborations are vital to harness the full potential of AI in sports cardiology and advance our management of cardiovascular health in athletes.
DOI: 10.1186/1471-2105-11-581
2010
Cited 48 times
The INTERPRET Decision-Support System version 3.0 for evaluation of Magnetic Resonance Spectroscopy data from human brain tumours and other abnormal brain masses
Proton Magnetic Resonance (MR) Spectroscopy (MRS) is a widely available technique for those clinical centres equipped with MR scanners. Unlike the rest of MR-based techniques, MRS yields not images but spectra of metabolites in the tissues. In pathological situations, the MRS profile changes, and this has been particularly well described for brain tumours. However, radiologists are frequently not familiar with the interpretation of MRS data and, for this reason, the usefulness of decision-support systems (DSS) in MRS data analysis has been explored. This work presents the INTERPRET DSS version 3.0, analysing the improvements made since its first release in 2002. Version 3.0 is aimed to be a program that, first, can be easily used with any new case from any MR scanner manufacturer and, second, improves the initial analysis capabilities of the first version. The main improvements are an embedded database, user accounts, more diagnostic discrimination capabilities and the possibility to analyse data acquired under additional data acquisition conditions. Other improvements include a customisable graphical user interface (GUI). Most diagnostic problems included have been addressed through a pattern-recognition based approach, in which classifiers based on linear discriminant analysis (LDA) were trained and tested. The INTERPRET DSS 3.0 allows radiologists, medical physicists, biochemists or, generally speaking, any person with a minimum knowledge of what an MR spectrum is, to enter their own SV raw data, acquired at 1.5 T, and to analyse them. The system is expected to help in the categorisation of MR spectra from abnormal brain masses.
DOI: 10.3390/a16040181
2023
Cited 4 times
How to Open a Black Box Classifier for Tabular Data
A lack of transparency in machine learning models can limit their application. We show that analysis of variance (ANOVA) methods can extract interpretable predictive models from them. This is possible because ANOVA decompositions represent multivariate functions as sums of functions of fewer variables. Retaining the terms in the ANOVA summation involving functions of only one or two variables provides an efficient method to open black box classifiers. The proposed method builds generalised additive models (GAMs) by applying L1-regularised logistic regression to the component terms retained from the ANOVA decomposition of the logit function. The resulting GAMs are derived using two alternative measures, Dirac and Lebesgue. Both measures produce functions that are smooth and consistent. The term 'partial responses in structured models' (PRiSM) describes the family of models that are derived from black box classifiers by application of ANOVA decompositions. We demonstrate their interpretability and performance for the multilayer perceptron, support vector machines and gradient-boosting machines applied to synthetic data and several real-world data sets, namely Pima Diabetes, German Credit Card, and Statlog Shuttle from the UCI repository. The GAMs are shown to be compliant with the basic principles of a formal framework for interpretability.
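A simplified sketch of the first-order part of this construction, under several assumptions: the one-variable ANOVA components are estimated as partial dependence of the black box's logit (a gradient-boosting machine here, whose decision_function is on the logit scale), two-variable terms are omitted, and the paper's Dirac and Lebesgue measures are not reproduced. The components then feed an L1-regularised logistic regression, giving a GAM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

def component(j, grid, X):
    """First-order component: partial dependence of the logit on variable j."""
    out = []
    for v in grid:
        Xm = X.copy()
        Xm[:, j] = v
        out.append(black_box.decision_function(Xm).mean())  # logit scale for GBM
    return np.array(out)

Z = np.empty_like(X, dtype=float)
for j in range(X.shape[1]):
    grid = np.quantile(X[:, j], np.linspace(0.05, 0.95, 20))
    g = component(j, grid, X)
    Z[:, j] = np.interp(X[:, j], grid, g - g.mean())  # centred component per sample

# GAM: L1-regularised logistic regression on the retained one-variable terms
gam = LogisticRegression(penalty='l1', solver='liblinear').fit(Z, y)
```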
DOI: 10.1093/ehjqcco/qcad030
2023
Cited 4 times
Relationship between remnant cholesterol and risk of heart failure in participants with diabetes mellitus
Evidence about the association between calculated remnant cholesterol (RC) and risk of heart failure (HF) in participants with diabetes mellitus (DM) remains sparse and limited. We included a total of 22,230 participants with DM from the UK Biobank for analyses. Participants were categorized into three groups based on their baseline RC measures: low (with a mean RC of 0.41 mmol/L), moderate (0.66 mmol/L), and high (1.04 mmol/L). Cox proportional hazards models were used to evaluate the relationship between RC groups and HF risk. We performed a discordance analysis to evaluate whether RC was associated with HF risk independently of low-density lipoprotein cholesterol (LDL-C). During a mean follow-up period of 11.5 years, a total of 2232 HF events were observed. The moderate RC group was significantly associated with a 15% increased risk of HF when compared with the low RC group (hazard ratio [HR] = 1.15, 95% confidence interval [CI]: 1.01–1.32), while the high RC group was associated with a 23% higher HF risk (HR = 1.23, 95% CI: 1.05–1.43). There was a significant relationship between RC as a continuous measure and increased HF risk (P < 0.01). The association between RC and risk of HF was stronger in participants with an HbA1c level ≥ 53 mmol/mol when compared with HbA1c < 53 mmol/mol (P for interaction = 0.02). Results from discordance analyses showed that RC was significantly related to HF risk independent of LDL-C measures. In conclusion, elevated RC was significantly associated with risk of HF in patients with DM. Moreover, RC was significantly related to HF risk independent of LDL-C measures. These findings may highlight the importance of RC management to HF risk in patients with DM.
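The Cox model used above is straightforward to reproduce with the lifelines package; the tiny data frame below (columns `time`, `hf_event`, and dummy variables for the moderate and high RC groups) is purely hypothetical.

```python
import pandas as pd
from lifelines import CoxPHFitter  # assumption: lifelines is installed

# hypothetical analysis frame: follow-up time (years), HF event flag, RC group dummies
df = pd.DataFrame({
    'time':        [11.5, 10.2, 9.8, 11.5, 7.3, 11.5, 8.9, 11.0],
    'hf_event':    [0, 1, 1, 0, 1, 0, 1, 0],
    'rc_moderate': [0, 0, 1, 1, 0, 0, 0, 1],
    'rc_high':     [0, 0, 0, 0, 1, 1, 1, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col='time', event_col='hf_event')
cph.print_summary()   # exp(coef) gives hazard ratios vs. the low-RC reference group
```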
DOI: 10.1136/bmjopen-2014-004952
2014
Cited 30 times
Can analyses of electronic patient records be independently and externally validated? The effect of statins on the mortality of patients with ischaemic heart disease: a cohort study with nested case–control analysis
To conduct a fully independent and external validation of a research study based on one electronic health record database, using a different electronic database sampling the same population. Using the Clinical Practice Research Datalink (CPRD), we replicated a published investigation into the effects of statins in patients with ischaemic heart disease (IHD) by a different research team using QResearch. We replicated the original methods and analysed all-cause mortality using: (1) a cohort analysis and (2) a case-control analysis nested within the full cohort. Electronic health record databases containing longitudinal patient consultation data from large numbers of general practices distributed throughout the UK. CPRD data for 34,925 patients with IHD from 224 general practices were compared to previously published results from QResearch for 13,029 patients from 89 general practices. The study period was from January 1996 to December 2003. We successfully replicated the methods of the original study very closely. In a cohort analysis, risk of death was lower by 55% for patients on statins, compared with 53% for QResearch (adjusted HR 0.45, 95% CI 0.40 to 0.50; vs 0.47, 95% CI 0.41 to 0.53). In case-control analyses, patients on statins had a 31% lower odds of death, compared with 39% for QResearch (adjusted OR 0.69, 95% CI 0.63 to 0.75; vs OR 0.61, 95% CI 0.52 to 0.72). Results were also close for individual statins. Database differences in population characteristics and in data definitions, recording, quality and completeness had a minimal impact on key statistical outputs. The results uphold the validity of research using CPRD and QResearch by providing independent evidence that both datasets produce very similar estimates of treatment effect, leading to the same clinical and policy decisions. Together with other non-independent replication studies, there is a nascent body of evidence for wider validity.
DOI: 10.3389/fmed.2022.915224
2022
Cited 10 times
In-Hospital Mortality of Sepsis Differs Depending on the Origin of Infection: An Investigation of Predisposing Factors
Sepsis is a heterogeneous syndrome characterized by a variety of clinical features. Analysis of large clinical datasets may serve to define groups of sepsis with different risks of adverse outcomes. Clinical experience supports the concept that prognosis, treatment, severity, and time course of sepsis vary depending on the source of infection. We analyzed a large publicly available database to test this hypothesis. In addition, we developed prognostic models for the three main types of sepsis: pulmonary, urinary, and abdominal sepsis. We used logistic regression with routinely available clinical data for mortality prediction in each of these groups. The data were extracted from the eICU collaborative research database, a multi-center intensive care unit database with over 200,000 admissions. Sepsis cohorts were defined using admission diagnosis codes. We used univariate and multivariate analyses to establish factors relevant for outcome prediction in all three cohorts of sepsis (pulmonary, urinary and abdominal). For logistic regression, input variables were automatically selected using a sequential forward search algorithm over 10 dataset instances. Receiver operating characteristic curves were generated for each model and compared with established prognostication tools (APACHE IV and SOFA). A total of 3,958 sepsis admissions were included in the analysis. Sepsis in-hospital mortality differed depending on the cause of infection: abdominal 18.93%, pulmonary 19.27%, and renal 12.81%. A higher average heart rate was associated with increased mortality risk. An increased average Mean Arterial Pressure (MAP) was associated with a reduced mortality risk across all sepsis groups. Results from the LR models found significant factors that were relevant for specific sepsis groups. Our models outperformed APACHE IV and SOFA scores, with AUCs between 0.63 and 0.74. Predictive power decreased over time, with the best results achieved for data extracted during the first 24 h of admission. Mortality varied significantly between the three sepsis groups. We also demonstrate that factors of importance show considerable heterogeneity depending on the source of infection. The factors influencing in-hospital mortality vary depending on the source of sepsis, which may explain why most sepsis trials have failed to identify an effective treatment. The source of infection should therefore be considered when assessing mortality risk. Planning of sepsis treatment trials may benefit from risk stratification based on the source of infection.
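scikit-learn's SequentialFeatureSelector implements the kind of forward search described above; a minimal sketch follows, with a synthetic dataset and an arbitrary number of retained predictors standing in for the routinely collected clinical variables.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for routinely available clinical variables and mortality labels
X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
                           random_state=0)

lr = LogisticRegression(max_iter=1000)
# forward search: greedily add the variable that most improves cross-validated AUC
sfs = SequentialFeatureSelector(lr, n_features_to_select=8, direction='forward',
                                scoring='roc_auc', cv=10)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the retained predictors
```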
DOI: 10.1016/j.neunet.2008.05.013
2008
Cited 36 times
Advances in clustering and visualization of time series using GTM through time
Most of the existing research on multivariate time series concerns supervised forecasting problems. In comparison, little research has been devoted to their exploration through unsupervised clustering and visualization. In this paper, the capabilities of Generative Topographic Mapping Through Time, a model with foundations in probability theory, that performs simultaneous time series clustering and visualization, are assessed in detail. Focus is placed on the visualization of the evolution of signal regimes and the exploration of sudden transitions, for which a novel identification index is defined. The interpretability of time series clustering results may become extremely difficult, even in exploratory visualization, for high dimensional datasets. Here, we define and test an unsupervised time series relevance determination method, fully integrated in the Generative Topographic Mapping Through Time model, that can be used as a basis for time series selection. This method should ease the interpretation of time series clustering results.
DOI: 10.1016/j.ijcard.2016.11.050
2017
Cited 25 times
Review of early hospitalisation after percutaneous coronary intervention
Percutaneous coronary intervention (PCI) is the most common modality of revascularization in patients with coronary artery disease. Understanding the readmission rates and reasons for readmission after PCI is important because readmissions are a quality of care indicator, in addition to being a burden to patients and healthcare services. A literature review was performed. Relevant studies are described by narrative synthesis, with the use of tables to summarize study results. Data suggest that 30-day readmissions are not uncommon. The rate of readmission after PCI is highly influenced by the cohort and the healthcare system studied, with 30-day readmission rates reported to be between 4.7% and 15.6%. Studies consistently report that a majority of readmissions within 30 days are due to cardiac-related or complication-related disorders. Female sex, peripheral vascular disease, diabetes mellitus, renal failure and non-elective PCI are predictive of readmission. Studies also suggest that there is a greater risk of mortality among patients who are readmitted compared to those who are not. Readmission after PCI is common and its rate is highly influenced by the type of cohort studied. There is clear evidence that the majority of readmissions within 30 days are cardiac related. While there are many predictors of readmission following PCI, it is not known whether targeting patients with modifiable predictors could prevent or reduce the rates of readmission.
DOI: 10.1093/ehjqcco/qcad038
2023
Cited 3 times
Relationship between remnant cholesterol and risk of heart failure in participants with diabetes mellitus: Reply
Journal article (Reply) by Ruoting Wang, Hertzel C Gerstein, Harriette G C Van Spall, Gregory Y H Lip, Ivan Olier, Sandra Ortega-Martorell, Lehana Thabane, Zebing Ye, and Guowei Li. European Heart Journal - Quality of Care and Clinical Outcomes, Volume 9, Issue 5, August 2023, Page 547. Received: 26 June 2023; Accepted: 29 June 2023; Published: 12 July 2023.
DOI: 10.1186/1471-2105-11-106
2010
Cited 31 times
SpectraClassifier 1.0: a user friendly, automated MRS-based classifier-development system
SpectraClassifier (SC) is a Java solution for designing and implementing Magnetic Resonance Spectroscopy (MRS)-based classifiers. The main goal of SC is to allow users with minimum background knowledge of multivariate statistics to perform a fully automated pattern recognition analysis. SC incorporates feature selection (greedy stepwise approach, either forward or backward) and feature extraction (PCA). Fisher Linear Discriminant Analysis is the method of choice for classification. Classifier evaluation is performed through various methods: display of the confusion matrix of the training and testing datasets; K-fold cross-validation, leave-one-out and bootstrapping, as well as Receiver Operating Characteristic (ROC) curves. SC is composed of the following modules: Classifier design, Data exploration, Data visualisation, Classifier evaluation, Reports, and Classifier history. It is able to read low-resolution in-vivo MRS (single-voxel and multi-voxel) and high-resolution tissue MRS (HRMAS), processed with existing tools (jMRUI, INTERPRET, 3DiCSI or TopSpin). In addition, to facilitate exchanging data between applications, a standard format capable of storing all the information needed for a dataset was developed. Each functionality of SC has been specifically validated with real data, with the purpose of bug-testing and methods validation. Data from the INTERPRET project were used. SC is user-friendly software designed to fulfil the needs of potential users in the MRS community. It accepts all kinds of pre-processed MRS data types and classifies them semi-automatically, allowing spectroscopists to concentrate on interpretation of results with the use of its visualisation tools.
DOI: 10.1016/j.neuroimage.2013.04.046
2013
Cited 24 times
A switching multi-scale dynamical network model of EEG/MEG
We introduce a new generative model of the Encephalography (EEG/MEG) data, the inversion of which allows for inferring the locations and temporal evolution of the underlying sources as well as their dynamical interactions. The proposed Switching Mesostate Space Model (SMSM) builds on the multi-scale generative model for EEG/MEG by Daunizeau and Friston (2007). SMSM inherits the assumptions that (1) bioelectromagnetic activity is generated by a set of distributed sources, (2) the dynamics of these sources can be modelled as random fluctuations about a small number of mesostates, and (3) the number of mesostates engaged by a cognitive task is small. Additionally, four generalising assumptions are now included: (4) the mesostates interact according to a full Dynamical Causal Network (DCN) that can be estimated; (5) the dynamics of the mesostates can switch between multiple approximately linear operating regimes; (6) each operating regime remains stable over finite periods of time (temporal clusters); and (7) the total number of times the mesostates' dynamics can switch is small. The proposed model adds, therefore, a level of flexibility by accommodating complex brain processes that cannot be characterised by purely linear and stationary Gaussian dynamics. Importantly, the SMSM furnishes a new interpretation of the EEG/MEG data in which the source activity may have multiple discrete modes of behaviour, each with approximately linear dynamics. This is modelled by assuming that the connection strengths of the underlying mesoscopic DCN are time-dependent but piecewise constant, i.e. they can undergo discrete changes over time. A Variational Bayes inversion scheme is derived to estimate all the parameters of the model by maximising a (Negative Free Energy) lower bound on the model evidence. This bound is used to select among different model choices that are defined by the number of mesostates as well as by the number of stationary linear regimes. The full model is compared to a simplified version that uses no dynamical assumptions as well as to a standard EEG inversion technique. The comparison is carried out using an extensive set of simulations, and the application of SMSM to a real data set is also demonstrated. Our results show that for experimental situations in which we have some a priori belief that there are multiple approximately linear dynamical regimes, the proposed SMSM provides a natural modelling tool.
DOI: 10.1002/nbm.3521
2016
Cited 19 times
MRSI-based molecular imaging of therapy response to temozolomide in preclinical glioblastoma using source analysis
Characterization of glioblastoma (GB) response to treatment is a key factor for improving patients' survival and prognosis. MRI and magnetic resonance spectroscopic imaging (MRSI) provide morphologic and metabolic profiles of GB but usually fail to produce unequivocal biomarkers of response. The purpose of this work is to provide proof of concept of the ability of a semi-supervised signal source extraction methodology to produce images with robust recognition of response to temozolomide (TMZ) in a preclinical GB model. A total of 38 female C57BL/6 mice were used in this study. The semi-supervised methodology extracted the required sources from a training set consisting of MRSI grids from eight GL261 GBs treated with TMZ, and six control untreated GBs. Three different sources (normal brain parenchyma, actively proliferating GB and GB responding to treatment) were extracted and used for calculating nosologic maps representing the spatial response to treatment. These results were validated with an independent test set (7 control and 17 treated cases) and correlated with histopathology. Major differences between the responder and non-responder sources were mainly related to the resonances of mobile lipids (MLs) and polyunsaturated fatty acids in MLs (0.9, 1.3 and 2.8 ppm). Responding tumors showed significantly lower mitotic (3.3 ± 2.9 versus 14.1 ± 4.2 mitoses/field) and proliferation rates (29.8 ± 10.3 versus 57.8 ± 5.4%) than control untreated cases. The methodology described in this work is able to produce nosological images of response to TMZ in GL261 preclinical GBs and suitably correlates with the histopathological analysis of tumors. A similar strategy could be devised for monitoring response to treatment in patients.
DOI: 10.48550/arxiv.2405.00102
2024
Applying machine learning to Galactic Archaeology: how well can we recover the origin of stars in Milky Way-like galaxies?
We present several machine learning (ML) models developed to efficiently separate stars formed in-situ in Milky Way-type galaxies from those that were formed externally and later accreted. These models, which include examples from artificial neural networks, decision trees and dimensionality reduction techniques, are trained on a sample of disc-like, Milky Way-mass galaxies drawn from the ARTEMIS cosmological hydrodynamical zoom-in simulations. We find that the input parameters which provide an optimal performance for these models consist of a combination of stellar positions, kinematics, chemical abundances ([Fe/H] and [α/Fe]) and photometric properties. Models from all categories perform similarly well, with area under the precision-recall curve (PR-AUC) scores of ≃0.6. Beyond a galactocentric radius of 5 kpc, models retrieve >90% of accreted stars, with a sample purity close to 60%; however, the purity can be increased by adjusting the classification threshold. For one model, we also include host galaxy-specific properties in the training, to account for the variability of accretion histories of the hosts; however, this does not lead to an improvement in performance. The ML models can identify accreted stars even in regions heavily dominated by the in-situ component (e.g., in the disc), and perform well on an unseen suite of simulations (the Auriga simulations). The general applicability bodes well for application of such methods to observational data to identify accreted substructures in the Milky Way without the need to resort to selection cuts for minimising the contamination from in-situ stars.
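A short sketch of the evaluation reported above: compute the PR-AUC, then raise the decision threshold until the sample purity (precision) reaches 60%, as the abstract describes. The labels and scores below are simulated, not ARTEMIS data.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(3)
y_true = rng.random(5000) < 0.3                 # 1 = accreted star (illustrative prevalence)
scores = np.clip(y_true * 0.3 + rng.random(5000) * 0.7, 0, 1)  # classifier scores

precision, recall, thresholds = precision_recall_curve(y_true, scores)
print("PR-AUC:", auc(recall, precision))

# raise the decision threshold until sample purity (precision) reaches 60%
ok = precision[:-1] >= 0.6                      # precision[:-1] aligns with thresholds
print("lowest threshold giving 60% purity:",
      thresholds[ok][0] if ok.any() else None)
```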
DOI: 10.1371/journal.pone.0146715
2016
Cited 18 times
Modelling Conditions and Health Care Processes in Electronic Health Records: An Application to Severe Mental Illness with the Clinical Practice Research Datalink
The use of Electronic Health Records databases for medical research has become mainstream. In the UK, increasing use of Primary Care Databases is largely driven by almost complete computerisation and uniform standards within the National Health Service. Electronic Health Records research often begins with the development of a list of clinical codes with which to identify cases with a specific condition. We present a methodology and accompanying Stata and R commands (pcdsearch/Rpcdsearch) to help researchers in this task. We present severe mental illness as an example. We used the Clinical Practice Research Datalink, a UK Primary Care Database in which clinical information is largely organised using Read codes, a hierarchical clinical coding system. Pcdsearch is used to identify potentially relevant clinical codes and/or product codes from word-stubs and code-stubs suggested by clinicians. The returned code-lists are reviewed and codes relevant to the condition of interest are selected. The final code-list is then used to identify patients. We identified 270 Read codes linked to SMI and used them to identify cases in the database. We observed that our approach identified cases that would have been missed with a simpler approach using SMI registers defined within the UK Quality and Outcomes Framework. We described a framework to help researchers of Electronic Health Records databases identify patients with a particular condition or matching certain clinical criteria. The method is invariant to coding system or database and can be used with SNOMED CT, ICD or other medical classification code-lists.
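The core of the pcdsearch approach (matching clinician-suggested word-stubs against a code dictionary) is easy to emulate; below is a hypothetical pandas sketch with invented Read codes and stubs, since the actual commands are distributed for Stata and R.

```python
import pandas as pd

# hypothetical code dictionary: one row per clinical code with its description
lookup = pd.DataFrame({
    'read_code': ['E10..', 'E11..', 'Eu20.', 'Eu31.'],
    'description': ['Schizophrenic disorders', 'Affective psychoses',
                    '[X]Schizophrenia', '[X]Bipolar affective disorder'],
})

stubs = ['schizo', 'bipolar', 'psychos']        # clinician-suggested word-stubs
pattern = '|'.join(stubs)
candidates = lookup[lookup['description'].str.contains(pattern, case=False,
                                                       regex=True)]
# clinicians then review `candidates` and keep only the codes relevant to SMI
print(candidates)
```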
DOI: 10.1371/journal.pone.0171784
2017
Cited 18 times
rEHR: An R package for manipulating and analysing Electronic Health Record data
Research with structured Electronic Health Records (EHRs) is expanding as data becomes more accessible, analytic methods advance, and the scientific validity of such studies is increasingly accepted. However, data science methodology to enable the rapid searching/extraction, cleaning and analysis of these large, often complex, datasets is less well developed. In addition, commonly used software is inadequate, resulting in bottlenecks in research workflows and in obstacles to increased transparency and reproducibility of the research. Preparing a research-ready dataset from EHRs is a complex and time-consuming task requiring substantial data science skills, even for simple designs. In addition, certain aspects of the workflow are computationally intensive, for example extraction of longitudinal data and matching controls to a large cohort, which may take days or even weeks to run using standard software. The rEHR package simplifies and accelerates the process of extracting ready-for-analysis datasets from EHR databases. It has a simple import function to a database backend that greatly accelerates data access times. A set of generic query functions allows users to extract data efficiently without needing detailed knowledge of SQL queries. Longitudinal data extractions can also be made in a single command, making use of parallel processing. The package also contains functions for cutting data by time-varying covariates, matching controls to cases, unit conversion and construction of clinical code lists. There are also functions to synthesise dummy EHR data. The package has been tested with one of the largest primary care EHRs, the Clinical Practice Research Datalink (CPRD), but allows for a common interface to other EHRs. This simplified and accelerated workflow for EHR data extraction results in simpler, cleaner scripts that are more easily debugged, shared and reproduced.
DOI: 10.1016/j.jcin.2017.07.049
2017
Cited 18 times
Impact of Access Site Practice on Clinical Outcomes in Patients Undergoing Percutaneous Coronary Intervention Following Thrombolysis for ST-Segment Elevation Myocardial Infarction in the United Kingdom
This study sought to examine the relationship between access site practice and clinical outcomes in patients requiring percutaneous coronary intervention (PCI) following thrombolysis for ST-segment elevation myocardial infarction (STEMI). Transradial access (TRA) is associated with better outcomes in patients requiring PCI for STEMI. A significant proportion of STEMI patients may receive thrombolysis before undergoing PCI in many countries across the world. There are limited data on access site practice and its associated outcomes in this cohort of patients. The authors used the British Cardiovascular Intervention Society dataset to investigate the outcomes of patients undergoing PCI following thrombolysis between 2007 and 2014. Patients were divided into TRA and transfemoral access groups depending on the access site used. Multiple logistic regression and propensity score matching were used to study the association of access site with in-hospital and long-term mortality, major bleeding, and access site-related complications. A total of 10,209 patients received thrombolysis and PCI during the study period. TRA was used in 48% (n = 4,959) of patients; 3.3% (n = 336) of patients died in hospital, 1.6% (n = 165) experienced major bleeding, 4.2% (n = 437) experienced major adverse cardiac events (MACE), and 4.6% (n = 468) experienced 30-day mortality. After multivariate adjustment, TRA was associated with significantly reduced odds of in-hospital mortality (odds ratio [OR]: 0.59; 95% confidence interval [CI]: 0.42 to 0.83; p = 0.002), major bleeding (OR: 0.45; 95% CI: 0.31 to 0.66; p < 0.001), MACE (OR: 0.72; 95% CI: 0.55 to 0.94; p = 0.01), and 30-day mortality (OR: 0.72; 95% CI: 0.55 to 0.94; p = 0.01). TRA is associated with decreased odds of bleeding complications, mortality, and MACE in patients undergoing PCI following thrombolysis and should be the preferred access site choice in this cohort of patients.
DOI: 10.1186/s13321-019-0392-1
2019
Cited 15 times
Multi-task learning with a natural metric for quantitative structure activity relationship learning
The goal of quantitative structure activity relationship (QSAR) learning is to learn a function that, given the structure of a small molecule (a potential drug), outputs the predicted activity of the compound. We employed multi-task learning (MTL) to exploit commonalities in drug targets and assays. We used datasets containing curated records about the activity of specific compounds on drug targets provided by ChEMBL. In total, 1091 assays were analysed. As a baseline, a single-task learning approach that trains a random forest to predict drug activity for each drug target individually was considered. We then carried out feature-based and instance-based MTL to predict drug activities. We introduced a natural metric of evolutionary distance between drug targets as a measure of task relatedness. Instance-based MTL significantly outperformed both feature-based MTL and the base learner on 741 of the 1091 drug targets. Feature-based MTL won on 179 occasions and the base learner performed best on 171 drug targets. We conclude that MTL QSAR is improved by incorporating the evolutionary distance between targets. These results indicate that QSAR learning can be performed effectively, even if little data is available for specific drug targets, by leveraging what is known about similar drug targets.
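One simple reading of instance-based MTL is: pool training instances from related targets, down-weighting them by task distance. The sketch below does exactly that with a sample-weighted random forest; the two synthetic tasks and the exp(-distance) weighting are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-ins for two related drug targets (main task + auxiliary task)
X_main, y_main = make_regression(n_samples=100, n_features=20, noise=15, random_state=0)
X_aux, y_aux = make_regression(n_samples=300, n_features=20, noise=15, random_state=1)

distance = 0.4                  # stand-in for the evolutionary distance between targets
X = np.vstack([X_main, X_aux])
y = np.concatenate([y_main, y_aux])
# borrowed instances contribute less the further away their task is
weights = np.concatenate([np.ones(len(y_main)),
                          np.full(len(y_aux), np.exp(-distance))])

model = RandomForestRegressor(random_state=0).fit(X, y, sample_weight=weights)
```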
DOI: 10.1073/pnas.2108013118
2021
Cited 12 times
Transformational machine learning: Learning how to learn from many related scientific problems
Almost all machine learning (ML) is based on representing examples using intrinsic features. When there are multiple related ML problems (tasks), it is possible to transform these features into extrinsic features by first training ML models on other tasks and letting them each make predictions for each example of the new task, yielding a novel representation. We call this transformational ML (TML). TML is very closely related to, and synergistic with, transfer learning, multitask learning, and stacking. TML is applicable to improving any nonlinear ML method. We tested TML using the most important classes of nonlinear ML: random forests, gradient boosting machines, support vector machines, k-nearest neighbors, and neural networks. To ensure the generality and robustness of the evaluation, we utilized thousands of ML problems from three scientific domains: drug design, predicting gene expression, and ML algorithm selection. We found that TML significantly improved the predictive performance of all the ML methods in all the domains (4 to 50% average improvements) and that TML features generally outperformed intrinsic features. Use of TML also enhances scientific understanding through explainable ML. In drug design, we found that TML provided insight into drug target specificity, the relationships between drugs, and the relationships between target proteins. TML leads to an ecosystem-based approach to ML, where new tasks, examples, predictions, and so on synergistically interact to improve performance. To contribute to this ecosystem, all our data, code, and our ∼50,000 ML models have been fully annotated with metadata, linked, and openly published using Findability, Accessibility, Interoperability, and Reusability principles (∼100 Gbytes).
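A compact sketch of the TML recipe described above: train one model per prior task on intrinsic features, then represent each example of a new task by the vector of those models' predictions, and learn on that extrinsic representation. Synthetic tasks stand in for the thousands of problems used in the paper.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# one model per previous task, trained on intrinsic features
prior_tasks = [make_regression(n_samples=200, n_features=30, noise=5, random_state=s)
               for s in range(10)]
prior_models = [RandomForestRegressor(n_estimators=50, random_state=s).fit(X, y)
                for s, (X, y) in enumerate(prior_tasks)]

# new task: replace intrinsic features with the prior models' predictions
X_new, y_new = make_regression(n_samples=200, n_features=30, noise=5, random_state=99)
X_tml = np.column_stack([m.predict(X_new) for m in prior_models])  # extrinsic features

final = RandomForestRegressor(random_state=0).fit(X_tml, y_new)
```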
DOI: 10.1016/j.ahj.2021.04.001
2021
Cited 11 times
Renin-angiotensin system inhibitors effect before and during hospitalization in COVID-19 outcomes: Final analysis of the international HOPE COVID-19 (Health Outcome Predictive Evaluation for COVID-19) registry
The use of Renin-Angiotensin system inhibitors (RASi) in patients with coronavirus disease 2019 (COVID-19) has been questioned because both share a target receptor site. HOPE-COVID-19 (NCT04334291) is an international investigator-initiated registry. Patients are eligible when discharged after an in-hospital stay with COVID-19, dead or alive. Here, we analyze the impact of previous and continued in-hospital treatment with RASi on all-cause mortality and the development of in-stay complications. We included 6503 patients, over 18 years of age, from Spain and Italy with data on their RASi status. Of those, 36.8% were receiving a RASi before admission. RASi patients were older, more frequently male, with more comorbidities, and frailer. Their probability of death and ICU admission was higher. However, after adjustment, these differences disappeared. Regarding in-hospital RASi use, those who continued the treatment were younger, with balanced comorbidities but with less severe COVID-19. Raw mortality and secondary events were less frequent with RASi. After adjustment, patients receiving RASi still presented significantly better outcomes, with less mortality, ICU admissions, respiratory insufficiency, need for mechanical ventilation or prone positioning, sepsis, SIRS and renal failure (p<0.05 for all). However, we did not find differences regarding in-hospital use of RASi and the development of heart failure. Historic RASi use at admission is not related to a worse adjusted prognosis in hospitalized COVID-19 patients, although it identifies a high-risk population. In this setting, the in-hospital prescription of RASi is associated with improved survival and fewer short-term complications.
DOI: 10.1186/s12874-023-02058-5
2023
Causal inference and observational data
Abstract Observational studies using causal inference frameworks can provide a feasible alternative to randomized controlled trials. Advances in statistics, machine learning, and access to big data facilitate unraveling complex causal relationships from observational data across healthcare, social sciences, and other fields. However, challenges like evaluating models and bias amplification remain.
DOI: 10.1002/nbm.5095
2024
Early pseudoprogression and progression lesions in glioblastoma patients are both metabolically heterogeneous
The standard treatment in glioblastoma includes maximal safe resection followed by concomitant radiotherapy plus chemotherapy and adjuvant temozolomide. The first follow-up study to evaluate treatment response is performed 1 month after concomitant treatment, when contrast-enhancing regions may appear that can correspond to true progression or pseudoprogression. We retrospectively evaluated 31 consecutive patients at the first follow-up after concomitant treatment to check whether the metabolic pattern assessed with multivoxel MRS was predictive of treatment response 2 months later. We extracted the underlying metabolic patterns of the contrast-enhancing regions with a blind-source separation method and mapped them over the reference images. Pattern heterogeneity was calculated using entropy, and the association between patterns and outcomes was measured with Cramér's V. We identified three distinct metabolic patterns (proliferative, necrotic, and responsive), which were associated with status 2 months later. Individually, 70% of the patients showed metabolically heterogeneous patterns in the contrast-enhancing regions. Metabolic heterogeneity was not related to the regions' size, and only stable patients were less heterogeneous than the rest. Contrast-enhancing regions are also metabolically heterogeneous 1 month after concomitant treatment. This could explain the reported difficulty in finding robust pseudoprogression biomarkers.
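For reference, the association measure named here, Cramér's V, is easy to compute from a contingency table of pattern versus later status; the table values below are invented purely for illustration.

```python
# Cramér's V for the association between metabolic pattern and outcome.
# The contingency table is made up for illustration, not the study's data.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[8, 2],    # rows: proliferative / necrotic / responsive
                  [3, 7],    # columns: e.g. progression / response
                  [2, 9]])
chi2, _, _, _ = chi2_contingency(table)
n = table.sum()
k = min(table.shape) - 1                 # smaller dimension minus one
cramers_v = np.sqrt(chi2 / (n * k))
print(f"Cramér's V = {cramers_v:.2f}")
```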
DOI: 10.1016/j.jelectrocard.2024.03.005
2024
Impact of ECG data format on the performance of machine learning models for the prediction of myocardial infarction
Background We aim to determine which electrocardiogram (ECG) data format is optimal for ML modelling, in the context of myocardial infarction prediction. We also address the auxiliary objective of evaluating the viability of using digitised ECG signals for ML modelling. Methods Two ECG arrangements, displaying 10 s and 2.5 s of data for each lead, were used. For each arrangement, conservative and speculative data cohorts were generated from the PTB-XL dataset. All ECGs were represented in three different data formats: Signal ECGs, Image ECGs, and Extracted Signal ECGs, with 8358 and 11,621 ECGs in the conservative and speculative cohorts, respectively. ML models were trained using the three data formats in both data cohorts. Results For ECGs that contained 10 s of data, Signal and Extracted Signal ECGs were optimal and statistically similar, with AUCs [95% CI] of 0.971 [0.961, 0.981] and 0.974 [0.965, 0.984], respectively, for the conservative cohort, and 0.931 [0.918, 0.945] and 0.919 [0.903, 0.934], respectively, for the speculative cohort. For ECGs that contained 2.5 s of data, the Image ECG format was optimal, with AUCs of 0.960 [0.948, 0.973] and 0.903 [0.886, 0.920] for the conservative and speculative cohorts, respectively. Conclusion When available, Signal ECG data should be preferred for ML modelling. Otherwise, the optimal format depends on the data arrangement within the ECG: if the Image ECG contains 10 s of data for each lead, the Extracted Signal ECG is optimal; if it only contains 2.5 s, the Image ECG data is optimal for ML performance.
DOI: 10.1186/s12937-024-00925-5
2024
Relationship between trajectories of dietary iron intake and risk of type 2 diabetes mellitus: evidence from a prospective cohort study
The association between dietary iron intake and the risk of type 2 diabetes mellitus (T2DM) remains inconsistent. In this study, we aimed to investigate the relationship between trajectories of dietary iron intake and risk of T2DM. This study comprised a total of 61,115 participants without prior T2DM from the UK Biobank database. We used the group-based trajectory model (GBTM) to identify different dietary iron intake trajectories. Cox proportional hazards models were used to evaluate the relationship between trajectories of dietary iron intake and risk of T2DM. During a mean follow-up of 4.8 years, a total of 677 T2DM events were observed. Four trajectory groups of dietary iron intake were characterized by the GBTM: trajectory group 1 (with a mean dietary iron intake of 10.9 mg/day), 2 (12.3 mg/day), 3 (14.1 mg/day), and 4 (17.6 mg/day). Trajectory group 3 was significantly associated with a 38% decreased risk of T2DM when compared with trajectory group 1 (hazard ratio [HR] = 0.62, 95% confidence interval [CI]: 0.49-0.79), while group 4 was significantly associated with a 30% risk reduction (HR = 0.70, 95% CI: 0.54-0.91). Significant effect modifications by obesity (p = 0.04) and history of cardiovascular disease (p < 0.01) were found for the relationship between trajectories of dietary iron intake and the risk of T2DM. We found that trajectories of dietary iron intake were significantly associated with the risk of T2DM, with the lowest T2DM risk observed in trajectory group 3, with a mean iron intake of 14.1 mg/day. These findings may highlight the importance of adequate dietary iron intake for T2DM prevention from a public health perspective. Further studies to assess the relationship between dietary iron intake and risk of T2DM are needed, as well as intervention studies to mitigate the risks of T2DM associated with dietary iron changes.
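As a rough illustration of the modelling step, the sketch below fits a Cox proportional hazards model to trajectory-group indicators with the lifelines library; the data and column names are synthetic placeholders, not UK Biobank variables.

```python
# Hedged sketch: Cox model of time to T2DM on trajectory-group dummies
# (group 1 as reference), mirroring the analysis described. Synthetic data.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "follow_up_years": rng.exponential(4.8, n),   # time to T2DM or censoring
    "t2dm_event": rng.integers(0, 2, n),          # 1 = incident T2DM
    "traj_group_2": rng.integers(0, 2, n),        # dummies vs. group 1
    "traj_group_3": rng.integers(0, 2, n),
    "traj_group_4": rng.integers(0, 2, n),
    "age": rng.normal(56, 8, n),                  # example covariate
})
cph = CoxPHFitter()
cph.fit(df, duration_col="follow_up_years", event_col="t2dm_event")
print(cph.summary[["exp(coef)", "exp(coef) lower 95%", "exp(coef) upper 95%"]])
```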
DOI: 10.1371/journal.pone.0083773
2013
Cited 18 times
A Novel Semi-Supervised Methodology for Extracting Tumor Type-Specific MRS Sources in Human Brain Data
The clinical investigation of human brain tumors often starts with a non-invasive imaging study, providing information about the tumor extent and location, but little insight into the biochemistry of the analyzed tissue. Magnetic Resonance Spectroscopy can complement imaging by supplying a metabolic fingerprint of the tissue. This study analyzes single-voxel magnetic resonance spectra, which represent signal information in the frequency domain. Given that a single voxel may contain a heterogeneous mix of tissues, signal source identification is a relevant challenge for the problem of tumor type classification from the spectroscopic signal. Non-negative matrix factorization techniques have recently shown their potential for the identification of meaningful sources from brain tissue spectroscopy data. In this study, we use a convex variant of these methods that is capable of handling negatively-valued data and generating sources that can be interpreted as tumor class prototypes. A novel approach to convex non-negative matrix factorization is proposed, in which prior knowledge about class information is utilized in model optimization. Class-specific information is integrated into this semi-supervised process by setting the metric of a latent variable space where the matrix factorization is carried out. The reported experimental study comprises 196 cases from different tumor types drawn from two international, multi-center databases. The results indicate that the proposed approach outperforms a purely unsupervised process by achieving near perfect correlation of the extracted sources with the mean spectra of the tumor types. It also improves tissue type classification. We show that source extraction by unsupervised matrix factorization benefits from the integration of the available class information, so operating in a semi-supervised learning manner, for discriminative source identification and brain tumor labeling from single-voxel spectroscopy data. We are confident that the proposed methodology has wider applicability for biomedical signal processing.
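As a simplified illustration of the decomposition step, the sketch below runs standard non-negative matrix factorization with scikit-learn. Note that the paper uses a convex, semi-supervised NMF variant that also handles negatively-valued spectra, which plain NMF does not; this snippet only conveys the X ≈ WH source-extraction idea.

```python
# Plain NMF as a stand-in for the paper's convex, semi-supervised variant:
# decompose a matrix of spectra into a few sources and per-spectrum weights.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((196, 512))       # 196 spectra x 512 points (synthetic, non-negative)

model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)       # mixing weights per spectrum (196 x 4)
H = model.components_            # extracted sources / class prototypes (4 x 512)
```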
DOI: 10.1038/s41598-022-23817-2
2022
Cited 6 times
Enhanced survival prediction using explainable artificial intelligence in heart transplantation
The most limiting factor in heart transplantation is the lack of donor organs. With enhanced prediction of outcome, it may be possible to increase the life-years from the organs that become available. Applications of machine learning to tabular data, typical of clinical decision support, pose the practical question of interpretation, which has technical and potential ethical implications. In particular, there is an issue of principle about the predictability of complex data and whether this is inherent in the data or strongly dependent on the choice of machine learning model, leading to the so-called accuracy-interpretability trade-off. We model 1-year mortality in heart transplantation data with a self-explaining neural network, which is benchmarked against a deep learning model on the same development data, in an external validation study with two data sets: (1) UNOS transplants in 2017-2018 (n = 4750), for which the self-explaining and deep learning models are comparable in their AUROC, 0.628 [0.602, 0.654] vs. 0.635 [0.609, 0.662]; and (2) Scandinavian transplants during 1997-2018 (n = 2293), showing good calibration with AUROCs of 0.626 [0.588, 0.665] and 0.634 [0.570, 0.698], respectively, with and without missing data (n = 982). This shows that for tabular data, predictive models can be transparent and capture important nonlinearities, retaining full predictive performance.
DOI: 10.1038/s41598-022-17894-6
2022
Cited 5 times
Breast cancer patient characterisation and visualisation using deep learning and fisher information networks
Breast cancer is the most commonly diagnosed female malignancy globally, with better survival rates if diagnosed early. Mammography is the gold standard in screening programmes for breast cancer, but despite technological advances, high error rates are still reported. Machine learning techniques, and in particular deep learning (DL), have been successfully used for breast cancer detection and classification. However, the added complexity that makes DL models so successful reduces their ability to explain which features are relevant to the model, or whether the model is biased. The main aim of this study is to propose a novel visualisation to help characterise breast cancer patients using Fisher Information Networks (FINs) on features extracted from mammograms using a DL model. In the proposed visualisation, patients are mapped out according to their similarities and can be used to study new patients in a 'patient-like-me' approach. When applied to the CBIS-DDSM dataset, the methodology proved competitive, showing that it can (i) facilitate the analysis and decision-making process in breast cancer diagnosis with the assistance of the FIN visualisations and 'patient-like-me' analysis, and (ii) help improve diagnostic accuracy and reduce overdiagnosis by identifying the most likely diagnosis based on clinical similarities with neighbouring patients.
DOI: 10.1186/s12933-022-01644-z
2022
Cited 5 times
Association between metabolically healthy obesity and risk of atrial fibrillation: taking physical activity into consideration
Abstract The modifying effect of physical activity (PA) on the association between metabolic status and atrial fibrillation (AF) in obesity remains unknown. We aimed to investigate the independent and joint associations of metabolic status and PA with the risk of AF in an obese population. Based on data from the UK Biobank study, we used Cox proportional hazards models for the analyses. Metabolic status was categorized into metabolically healthy obesity (MHO) and metabolically unhealthy obesity (MUO). PA was categorized into four groups according to the level of moderate-to-vigorous PA (MVPA): none, low, medium, and high. A total of 119,424 obese participants were included in the analyses. MHO was significantly associated with a 35% reduced AF risk compared with MUO (HR = 0.65, 95% CI: 0.57–0.73). No significant modification by PA of the AF risk among individuals with MHO was found. Among the MUO participants, individuals with medium and high PA had significantly lower AF risk compared with no MVPA (HR = 0.84, 95% CI: 0.74–0.95, and HR = 0.87, 95% CI: 0.78–0.96 for medium and high PA, respectively). As the severity of MUO increased, the modification of AF risk by PA increased accordingly. To conclude, MHO was significantly associated with a reduced risk of AF when compared with MUO in obese participants. PA could significantly modify the relationship between metabolic status and risk of AF among MUO participants, with the benefits of PA on reduced AF risk being particularly marked as MUO severity increased.
DOI: 10.1007/s10852-008-9088-7
2008
Cited 16 times
Variational Bayesian Generative Topographic Mapping
DOI: 10.1186/s12859-015-0796-5
2015
Cited 9 times
From raw data to data-analysis for magnetic resonance spectroscopy – the missing link: jMRUI2XML
Magnetic resonance spectroscopy provides metabolic information about living tissues in a non-invasive way. However, there are only a few multi-centre clinical studies, mostly performed on a single scanner model or data format, as there is no flexible way of documenting and exchanging processed magnetic resonance spectroscopy data in digital format. This is because the DICOM standard for spectroscopy deals with unprocessed data. This paper proposes a plugin tool developed for jMRUI, namely jMRUI2XML, to tackle the latter limitation. jMRUI is a software tool for magnetic resonance spectroscopy data processing that is widely used in the magnetic resonance spectroscopy community and has evolved into a plugin platform allowing for the implementation of novel features. jMRUI2XML is a Java solution that facilitates common preprocessing of magnetic resonance spectroscopy data across multiple scanners. Its main characteristics are: 1) it automates magnetic resonance spectroscopy preprocessing, and 2) it can be a platform for outputting exchangeable magnetic resonance spectroscopy data. The plugin works with any kind of data that can be opened by jMRUI and outputs in extensible markup language (XML) format. Data processing templates can be generated and saved for later use. The output format opens the way for easy data sharing, due to the documentation of the preprocessing parameters and the intrinsic anonymization, for example for performing pattern recognition analysis on multi-centre/multi-manufacturer magnetic resonance spectroscopy data. jMRUI2XML provides a self-contained and self-descriptive format accounting for the most relevant information needed for exchanging magnetic resonance spectroscopy data in digital form, as well as for automating its processing. This allows for tracking the procedures the data has undergone, which makes the proposed tool especially useful when performing pattern recognition analysis. Moreover, this work constitutes a first proposal for a minimum amount of information that should accompany any processed magnetic resonance spectrum, towards the goal of achieving better transferability of magnetic resonance spectroscopy studies.
DOI: 10.3389/fcvm.2022.897709
2022
Cited 4 times
Development of a Risk Prediction Model for New Episodes of Atrial Fibrillation in Medical-Surgical Critically Ill Patients Using the AmsterdamUMCdb
The occurrence of atrial fibrillation (AF) represents clinical deterioration in acutely unwell patients and leads to increased morbidity and mortality. Prediction of the development of AF allows early intervention. Using the AmsterdamUMCdb, clinically relevant variables from patients admitted in sinus rhythm were extracted over the full duration of the ICU stay or until the first recorded AF episode occurred. Multiple logistic regression was performed to identify risk factors for AF. Input variables were automatically selected by a sequential forward search algorithm using cross-validation. We developed three different models: for the overall cohort, for ventilated patients, and for non-ventilated patients. 16,144 out of 23,106 admissions met the inclusion criteria. 2,374 (12.8%) patients had at least one AF episode during their ICU stay. Univariate analysis revealed that a higher percentage of AF patients were older than 70 years (60% versus 32%) and died in ICU (23.1% versus 7.1%) compared to non-AF patients. Multivariate analysis revealed age to be the dominant risk factor for developing AF, with a doubling of age leading to a 10-fold increased risk. Our logistic regression models showed excellent performance, with AUC-ROC > 0.82 and > 0.91 in the ventilated and non-ventilated cohorts, respectively. Increasing age was the dominant risk factor for the development of AF in both ventilated and non-ventilated critically ill patients. In non-ventilated patients, the risk for development of AF was significantly higher than in ventilated patients. Further research is warranted to identify the role of ventilatory settings on risk for AF in critical illness and to optimise predictive models.
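A minimal sketch of the modelling recipe described, logistic regression with cross-validated sequential forward feature selection, using scikit-learn on synthetic data; the feature count and class balance are assumptions, not AmsterdamUMCdb variables.

```python
# Logistic regression with forward feature search under cross-validation,
# as in the abstract. Synthetic, imbalanced data (~13% AF rate assumed).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, n_informative=6,
                           weights=[0.87, 0.13], random_state=0)
lr = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(lr, n_features_to_select=8,
                                direction="forward", cv=5, scoring="roc_auc")
sfs.fit(X, y)
model = lr.fit(X[:, sfs.get_support()], y)   # refit on the selected inputs
```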
DOI: 10.3390/jcdd9110382
2022
Cited 4 times
The Athlete’s Heart and Machine Learning: A Review of Current Implementations and Gaps for Future Research
Intense training exercise regimes cause physiological changes within the heart to help cope with the increased stress, known as the "athlete's heart". These changes can mask pathological changes, making them harder to diagnose and increasing the risk of an adverse cardiac outcome. This paper reviews which machine learning (ML) techniques are being used within athlete's heart research and how they are being implemented, and assesses the uptake of these techniques within this area of research. Searches were carried out on the Scopus and PubMed online databases and a scoping review was conducted on the studies which were identified. Twenty-eight studies were included within the review, with ML being directly referenced within 16 (57%). A total of 12 different techniques were used, with the most popular being artificial neural networks and the most common implementation being to perform classification tasks. The review also highlighted the subgroups of interest: predictive modelling, reviews, and wearables, with most of the studies being attributed to the predictive modelling subgroup. The most common type of data used was the electrocardiogram (ECG), with echocardiograms being used the second most often. The results show that over the last 11 years there has been a growing desire to leverage ML techniques to help further the understanding of the athlete's heart, whether by expanding the knowledge of the physiological changes or by improving the accuracy of models to help improve treatments and disease management.
DOI: 10.3390/cancers15154002
2023
Tracking Therapy Response in Glioblastoma Using 1D Convolutional Neural Networks
Background: Glioblastoma (GB) is a malignant brain tumour that is challenging to treat, often relapsing even after aggressive therapy. Evaluating therapy response relies on magnetic resonance imaging (MRI) following the Response Assessment in Neuro-Oncology (RANO) criteria. However, early assessment is hindered by phenomena such as pseudoprogression and pseudoresponse. Magnetic resonance spectroscopy (MRS/MRSI) provides metabolomics information but is underutilised due to a lack of familiarity and standardisation. Methods: This study explores the potential of spectroscopic imaging (MRSI) in combination with several machine learning approaches, including one-dimensional convolutional neural networks (1D-CNNs), to improve therapy response assessment. Preclinical GB models (GL261 tumour-bearing mice) were studied for method optimisation and validation. Results: The proposed 1D-CNN models successfully identify different regions of tumours sampled by MRSI, i.e., normal brain (N), control/unresponsive tumour (T), and tumour responding to treatment (R). Class activation maps using Grad-CAM enabled the study of the key areas relevant to the models, providing model explainability. The generated colour-coded maps showing the N, T and R regions were highly accurate (according to Dice scores) when compared against ground truth and outperformed our previous method. Conclusions: The proposed methodology may provide new and better opportunities for therapy response assessment, potentially providing earlier hints of tumour relapsing stages.
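A compact Keras example of the kind of 1D-CNN described, classifying individual MRSI voxel spectra into the N, T and R classes; the spectral length and layer sizes are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative 1D-CNN for voxel-wise spectrum classification (N / T / R).
import tensorflow as tf

n_points = 512                         # spectral points per voxel (assumed)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_points, 1)),
    tf.keras.layers.Conv1D(16, kernel_size=9, activation="relu"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),   # N, T, R
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```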
DOI: 10.3389/fmed.2023.1230854
2023
Sepsis-induced coagulopathy is associated with new episodes of atrial fibrillation in patients admitted to critical care in sinus rhythm
Sepsis is a life-threatening disease commonly complicated by activation of coagulation and immune pathways. Sepsis-induced coagulopathy (SIC) is associated with micro- and macrothrombosis, but its relation to other cardiovascular complications remains less clear. In this study we explored associations between SIC and the occurrence of atrial fibrillation (AF) in patients admitted to the Intensive Care Unit (ICU) in sinus rhythm. We also aimed to identify predictive factors for the development of AF in patients with and without SIC. Data were extracted from the publicly available AmsterdamUMCdb database. Patients with sepsis and documented sinus rhythm on admission to ICU were included. Patients were stratified into those who fulfilled the criteria for SIC and those who did not. Following univariate analysis, logistic regression models were developed to describe the association between routinely documented demographics and blood results and the development of at least one episode of AF. Machine learning methods (gradient boosting machines and random forest) were applied to define the predictive importance of factors contributing to the development of AF. Age was the strongest predictor for the development of AF in patients with and without SIC. The routine coagulation tests activated partial thromboplastin time (aPTT) and international normalized ratio (INR), together with C-reactive protein (CRP) as a marker of inflammation, were also associated with AF occurrence in SIC-positive and SIC-negative patients. Cardiorespiratory parameters (oxygen requirements and heart rate) also showed predictive potential. Higher INR, elevated CRP, increased heart rate, and more severe respiratory failure are risk factors for the occurrence of AF in critical illness, suggesting an association between cardiac, respiratory, and immune and coagulation pathways. However, age was the most dominant factor in predicting the first episodes of AF in patients admitted in sinus rhythm with and without SIC.
DOI: 10.1109/ijcnn.2008.4633841
2008
Cited 12 times
A variational formulation for GTM through time
Generative topographic mapping (GTM) is a latent variable model that, in its original version, was conceived to provide clustering and visualization of multivariate, real-valued, i.i.d. data. It was also extended to deal with non-i.i.d. data such as multivariate time series in a variant called GTM through time (GTM-TT), defined as a constrained hidden Markov model (HMM). In this paper, we provide the theoretical foundations of the reformulation of GTM-TT within the variational Bayesian framework and provide an illustrative example of its application. This approach handles the presence of noise in the time series, helping to avert the problem of data overfitting.
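For reference, standard GTM models the data density as a constrained Gaussian mixture whose centres are the images of a regular latent grid under a smooth mapping (following the original formulation by Bishop, Svensén and Williams; the notation below is ours, not this paper's):

```latex
% GTM likelihood for a data point t in R^D: K latent grid points z_k are
% mapped into data space by y(z; W), each defining an isotropic Gaussian.
p(\mathbf{t} \mid \mathbf{W}, \beta) =
  \frac{1}{K} \sum_{k=1}^{K}
  \left( \frac{\beta}{2\pi} \right)^{D/2}
  \exp\!\left( -\frac{\beta}{2}
    \left\lVert \mathbf{y}(\mathbf{z}_k; \mathbf{W}) - \mathbf{t} \right\rVert^{2} \right)
```

GTM-TT reuses these constrained mixture components as the emission distributions of a hidden Markov model, and the variational Bayesian treatment discussed here places priors over the mapping parameters to regularize the fit.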
2010
Cited 9 times
Kernel Generative Topographic Mapping
A kernel version of Generative Topographic Mapping, a model of the manifold learning family, is defined in this paper. Its ability to adequately model non-i.i.d. data is illustrated in a problem concerning the identification of protein subfamilies from protein sequences.
DOI: 10.1016/j.neucom.2010.12.006
2011
Cited 6 times
A variational Bayesian approach for the robust analysis of the cortical silent period from EMG recordings of brain stroke patients
Transcranial magnetic stimulation (TMS) is a powerful tool for the calculation of parameters related to the intracortical excitability and inhibition of the motor cortex. The cortical silent period (CSP) is one such parameter that corresponds to the suppression of muscle activity for a short period after a muscle response to TMS. The duration of the CSP is known to be correlated with the prognosis of brain stroke patients' motor ability. Current methods for the estimation of the CSP duration are very sensitive to the presence of noise. A variational Bayesian formulation of a manifold-constrained hidden Markov model is applied in this paper to the segmentation of a set of multivariate time series (MTS) of electromyographic recordings corresponding to stroke patients and control subjects. A novel index of variability associated with this model is defined and applied to the detection of the silent period interval of the signal and to the estimation of its duration. This model and its associated index are shown to behave robustly in the presence of noise and to provide more reliable estimations than the current standard in clinical practice.
DOI: 10.1007/11494669_96
2005
Cited 10 times
Comparative Assessment of the Robustness of Missing Data Imputation Through Generative Topographic Mapping
The incompleteness of data is a most common source of uncertainty in real-world Data Mining applications. The management of this uncertainty is, therefore, a task of paramount importance for the data analyst. Many methods have been developed for missing data imputation, but few of them approach the problem of imputation as part of a general data density estimation scheme. Amongst the latter, a method for imputing and visualizing multivariate missing data using Generative Topographic Mapping was recently presented. This model and some of its extensions are tested under different experimental conditions. Its performance is compared to that of other missing data imputation techniques, thus assessing its relative capabilities and limitations.
DOI: 10.1007/978-3-642-35686-5_12
2012
Cited 5 times
Complementing Kernel-Based Visualization of Protein Sequences with Their Phylogenetic Tree
The world of pharmacology is becoming increasingly dependent on the advances in the fields of genomics and proteomics. This dependency brings about the challenge of finding robust methods to analyze the complex data they generate. In this brief paper, we focus on the analysis of a specific type of proteins, the G protein-coupled receptors, which are the target for over 15% of current drugs. We describe a kernel method of the manifold learning family for the analysis and intuitive visualization of their protein amino acid symbolic sequences. This method is shown to reveal the grouping structure of the sequences in a way that closely resembles the corresponding phylogenetic trees.
DOI: 10.1016/j.amjcard.2018.06.016
2018
Cited 5 times
Changes in Periprocedural Bleeding Complications Following Percutaneous Coronary Intervention in The United Kingdom Between 2006 and 2013 (from the British Cardiovascular Interventional Society)
Major bleeding is a common complication after percutaneous coronary intervention (PCI), although little is known about how bleeding rates have changed over time and what has driven this. We analyzed all patients who underwent PCI in England and Wales from 2006 to 2013. Multivariate analyses using logistic regression models were performed to identify predictors of bleeding and potential factors influencing bleeding trends over time. 545,604 participants who had PCI in England and Wales between 2006 and 2013 were included in the analyses. Overall bleeding rates decreased from 7.0 (CI 6.2 to 7.8) per 1,000 procedures in 2006 to 5.5 (CI 4.7 to 6.2) per 1,000 in 2013. Increasing age, female sex, GPIIb/IIIa inhibitor use, and circulatory support were independently associated with an increased risk of bleeding complications, whereas radial access and vascular closure device use were independently associated with decreases in risk. Decreases in bleeding rates over time were associated with radial access site and changes in pharmacology, but this was offset by a greater proportion of ACS cases and adverse patient clinical demographics. In conclusion, major bleeding complications after PCI have decreased due to changes in access site practice and decreased usage of GPIIb/IIIa inhibitors, but this is offset by the increase in patients with a higher propensity to bleed. Changes in access site practice nationally have the potential to significantly reduce major bleeding after PCI.
DOI: 10.1109/iceis.2006.1703217
2006
Cited 7 times
Capturing the Dynamics of Multivariate Time Series Through Visualization Using Generative Topographic Mapping Through Time
Most of the existing research on time series concerns supervised forecasting problems. In comparison, little research has been devoted to unsupervised methods for the visual exploration of multivariate time series. In this paper, the capabilities of the generative topographic mapping through time, a model with solid foundations in probability theory that performs simultaneous time series data clustering and visualization, are assessed in detail in several experiments. The focus is placed on the detection of atypical data, the visualization of the evolution of signal regimes, and the exploration of sudden transitions, for which a novel identification index is defined.
DOI: 10.1371/journal.pone.0270652
2022
Externally validated models for first diagnosis and risk of progression of knee osteoarthritis
Objective We develop and externally validate two models for use with radiological knee osteoarthritis (KOA): a diagnostic model for KOA and a prognostic model of time to onset of KOA. Model development and optimisation used data from the Osteoarthritis Initiative (OAI), and external validation for both models was by application to data from the Multicenter Osteoarthritis Study (MOST). Materials and methods The diagnostic model at first presentation comprises subjects in the OAI with and without KOA (n = 2006), modelled with multivariate logistic regression. The prognostic sample involves 5-year follow-up of subjects presenting without clinical KOA (n = 1155), modelled with Cox regression. In both instances the models used training data sets of n = 1353 and 1002 subjects, and optimisation used test data sets of n = 1354 and 1003. The external validation data sets for the diagnostic and prognostic models comprised n = 2006 and n = 1155 subjects, respectively. Results The classification performance of the diagnostic model on the test data has an AUC of 0.748 (0.721–0.774), and 0.670 (0.631–0.708) in external validation. The survival model has concordance scores of 0.74 (0.7325–0.7439) for the OAI test set and 0.72 (0.7190–0.7373) in external validation. The survival approach stratified the population into two risk cohorts. The separation between the cohorts remains when the model is applied to the validation data. Discussion The models produced are interpretable, with app interfaces that implement nomograms. The apps may be used for stratification and for patient education on the impact of modifiable risk factors. The externally validated results, obtained by application to data from a substantial prospective observational study, show the robustness of models for the likelihood of presenting with KOA at an initial assessment based on risk factors identified by the OAI protocol, and the stratification of risk for developing KOA in the next five years. Conclusion Modelling clinical KOA from OAI data validates well for the MOST data set. Both risk models identified key factors for differentiation of the target population from commonly available variables. With this analysis there is potential to improve the clinical management of patients.
DOI: 10.1093/mnras/stac2760
2022
Machine learning-based search for cataclysmic variables within <i>Gaia</i> Science Alerts
Wide-field time domain facilities detect transient events in large numbers through difference imaging. For example, the Zwicky Transient Facility produces alerts for hundreds of thousands of transient events per night, a rate set to be dwarfed by the upcoming Vera C. Rubin Observatory. The automation provided by machine learning (ML) is therefore necessary to classify these events and select the most interesting sources for follow-up observations. Cataclysmic variables (CVs) are a transient class that are numerous, bright, and nearby, providing excellent laboratories for the study of accretion and binary evolution. Here we focus on our use of ML to identify CVs from photometric data of transient sources published by the Gaia Science Alerts (GSA) program – a large, easily accessible resource, not fully explored with ML. Use of light-curve feature extraction techniques and source metadata from the Gaia survey resulted in a random forest model capable of distinguishing CVs from supernovae, active galactic nuclei, and young stellar objects with a 92 per cent precision score and an 85 per cent hit rate. Of 13 280 sources within GSA without an assigned transient classification, our model predicts the CV class for ∼2800. Spectroscopic observations are underway to classify a statistically significant sample of these targets to validate the performance of the model. This work puts us on a path towards the classification of rare CV subtypes from future wide-field surveys such as the Legacy Survey of Space and Time.
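As a sketch of the classification stage, the snippet below trains a random forest on a table of light-curve features plus survey metadata; every feature name here is invented for illustration and does not reflect the study's actual feature set.

```python
# Random forest over (invented) light-curve features and survey metadata,
# distinguishing CVs from other transient classes. Synthetic labels/data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "amplitude": rng.random(n),               # light-curve features
    "skewness": rng.normal(size=n),
    "rise_time": rng.exponential(5.0, n),
    "bp_rp_colour": rng.normal(1.0, 0.5, n),  # survey metadata
    "parallax": rng.exponential(1.0, n),
})
y = rng.choice(["CV", "SN", "AGN", "YSO"], size=n)

clf = RandomForestClassifier(n_estimators=300, class_weight="balanced",
                             random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="precision_macro").mean())
```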
DOI: 10.1109/ijcnn.2008.4634005
2008
Cited 4 times
On the benefits for model regularization of a variational formulation of GTM
Generative topographic mapping (GTM) is a manifold learning model for the simultaneous visualization and clustering of multivariate data. It was originally formulated as a constrained mixture of distributions, for which the adaptive parameters were determined by maximum likelihood (ML), using the expectation-maximization (EM) algorithm. In this formulation, GTM is prone to data overfitting unless a regularization mechanism is included. The theoretical principles of variational GTM, an approximate method that provides a full Bayesian treatment to a Gaussian process (GP)-based variation of the GTM, were recently introduced as an alternative way to control data overfitting. In this paper we assess in some detail the generalization capabilities of variational GTM and compare them with those of alternative regularization approaches in terms of test log-likelihood, using several artificial and real datasets.
DOI: 10.1371/journal.pone.0220809
2019
Cited 3 times
Embedding MRI information into MRSI data source extraction improves brain tumour delineation in animal models
Glioblastoma is the most frequent malignant intra-cranial tumour. Magnetic resonance imaging is the modality of choice in diagnosis, aggressiveness assessment, and follow-up. However, there are examples where it lacks diagnostic accuracy. Magnetic resonance spectroscopy enables the identification of molecules present in the tissue, providing a precise metabolomic signature. Previous research shows that combining imaging and spectroscopy information results in more accurate outcomes and superior diagnostic value. This study proposes a method to combine them, which builds upon a previous methodology whose main objective is to guide the extraction of sources. To this aim, prior knowledge about class-specific information is integrated into the methodology by setting the metric of a latent variable space where Non-negative Matrix Factorisation is performed. The former methodology, which only used spectroscopy and involved combining spectra from different subjects, was adapted to use selected areas of interest that arise from segmenting the T2-weighted image. Results showed that embedding imaging information into the source extraction (the proposed semi-supervised analysis) improved the quality of the tumour delineation, as compared to that obtained without this information (unsupervised analysis). Both approaches were applied to pre-clinical data, involving thirteen brain tumour-bearing mice, and tested against histopathological data. Across results from twenty-eight images, the proposed Semi-Supervised Source Extraction (SSSE) method greatly outperformed the unsupervised one, as well as an alternative semi-supervised approach from the literature, with differences being statistically significant. SSSE has proven successful in the delineation of the tumour, while bringing benefits such as 1) not restricting the metabolomics-based prediction to the image-segmented area, 2) the ability to deal with signal-to-noise issues, 3) the opportunity to answer specific questions by allowing researchers/radiologists to define areas of interest that guide the source extraction, 4) the creation of an intra-subject model that avoids contamination from inter-subject overlaps, and 5) the extraction of meaningful, good-quality sources that add interpretability, conferring validation and a better understanding of each case.
DOI: 10.1007/978-3-030-50146-4_43
2020
Cited 3 times
Explaining the Neural Network: A Case Study to Model the Incidence of Cervical Cancer
Neural networks are frequently applied to medical data. We describe how complex and imbalanced data can be modelled with simple but accurate neural networks that are transparent to the user. In the case of a data set on cervical cancer with 753 observations after excluding missing values, and 32 covariates, with a prevalence of 73 cases (9.69%), we explain how model selection can be applied to the Multi-Layer Perceptron (MLP) by deriving a representation using a General Additive Neural Network. The model achieves an AUROC of 0.621 CI [0.519, 0.721] for predicting positive diagnosis with Schiller's test. This is comparable with the performance obtained by a deep learning network with an AUROC of 0.667 [1]. Instead of using all covariates, the Partial Response Network (PRN) involves just 2 variables, namely the number of years on hormonal contraceptives and the number of years using an IUD, in a fully explained model. This is consistent with an additive non-linear statistical approach, the Sparse Additive Model [2], which estimates non-linear components in a logistic regression classifier using the backfitting algorithm applied to an ANOVA functional expansion. This paper shows how the PRN, applied to a challenging classification task, can provide insights into the influential variables, in this case correlated with the incidence of cervical cancer, so reducing the number of unnecessary variables to be collected for screening. It does so by exploiting the efficiency of sparse statistical models to select features from an ANOVA decomposition of the MLP, in the process deriving a fully interpretable model.
2007
Cited 3 times
A variational Bayesian formulation for GTM: Theoretical foundations
DOI: 10.1007/978-3-319-07695-9_5
2014
Probability Ridges and Distortion Flows: Visualizing Multivariate Time Series Using a Variational Bayesian Manifold Learning Method
Time-dependent natural phenomena and artificial processes can often be quantitatively expressed as multivariate time series (MTS). As in any other process of knowledge extraction from data, the analyst can benefit from the exploration of the characteristics of MTS through data visualization. This visualization often becomes difficult to interpret when MTS are modelled using nonlinear techniques. Despite their flexibility, nonlinear models can be rendered useless if such interpretability is lacking. In this brief paper, we model MTS using Variational Bayesian Generative Topographic Mapping Through Time (VB-GTM-TT), a variational Bayesian variant of a constrained hidden Markov model of the manifold learning family defined for MTS visualization. We aim to increase its interpretability by taking advantage of two results of the probabilistic definition of the model: the explicit estimation of probabilities of transition between states described in the visualization space and the quantification of the nonlinear mapping distortion.
DOI: 10.1109/cidm.2014.7008653
2014
Semi-supervised source extraction methodology for the nosological imaging of glioblastoma response to therapy
Glioblastomas are among the most aggressive brain tumors. Their usually poor prognosis is due to the heterogeneity of their response to treatment and the lack of early and robust biomarkers to decide whether the tumor is responding to therapy. In this work, we propose the use of a semi-supervised methodology for source extraction to identify the sources representing tumor response to therapy, untreated/unresponsive tumor, and normal brain, and to create nosological images of the response to therapy based on those sources. Fourteen mice were used to calculate the sources, and an independent test set of eight mice was used to further evaluate the proposed approach. The preliminary results obtained indicate that it was possible to discriminate between responding and untreated/unresponsive areas of the tumor, and that the color-coded images allowed convenient tracking of response, especially throughout the course of therapy.
2011
A probabilistic approach to the visual exploration of G Protein-Coupled Receptor sequences
The study of G protein-coupled receptors (GPCRs) is of great interest in pharmaceutical research, but only a few of their 3D structures are known at present. In contrast, their amino acid sequences are known and accessible. Sequence analysis can provide new insight into GPCR function. Here, we use a kernel-based statistical machine learning model for the visual exploration of GPCR functional groups from their sequences. This is based on the rich information provided by the model regarding the probability of each sequence belonging to a certain receptor group.
DOI: 10.1016/j.jinf.2023.02.020
2023
The association of epicardial adipose tissue volume and density with coronary calcium in HIV-positive and HIV-negative patients
We sought to assess and compare the association of epicardial adipose tissue (EAT) with cardiovascular disease (CVD) in HIV-positive and HIV-negative groups. Using existing clinical databases, we analyzed 700 patients (195 HIV-positive, 505 HIV-negative). CVD was quantified by the presence of coronary calcification from both dedicated cardiac computed tomography (CT) and non-dedicated CT of the thorax. EAT was quantified using dedicated software. The HIV-positive group had a lower mean age (49.2 versus 57.8, p < 0.005), a higher proportion of male sex (75.9% versus 48.1%, p < 0.005), and lower rates of coronary calcification (29.2% versus 58.2%, p < 0.005). Mean EAT volume was also lower in the HIV-positive group (68 mm3 versus 118.3 mm3, p < 0.005). Multiple linear regression demonstrated that EAT volume was associated with hepatosteatosis (HS) in the HIV-positive group but not in the HIV-negative group after adjustment for BMI (p < 0.005 versus p = 0.066). In the multivariate analysis, after adjustment for CVD risk factors, age, sex, statin use, and body mass index (BMI), EAT volume and hepatosteatosis were significantly associated with coronary calcification (odds ratio [OR] 1.14, p < 0.005 and OR 3.17, p < 0.005, respectively). In the HIV-negative group, the only significant association with EAT volume after adjustment was total cholesterol (OR 0.75, p = 0.012). We demonstrated a strong and significant independent association between EAT volume and coronary calcium, after adjustment, in the HIV-positive group but not in the HIV-negative group. This result hints at differences in the mechanistic drivers of atherosclerosis between HIV-positive and HIV-negative groups.
DOI: 10.1007/s42452-023-05554-x
2023
Mapping the global free expression landscape using machine learning
Abstract Freedom of expression is a core human right, yet the forces that seek to suppress it have intensified, increasing the need to develop tools that can measure the rates of freedom globally. In this study, we propose a novel freedom of expression index to gain a nuanced and data-led understanding of the level of censorship across the globe. For this, we used an unsupervised, probabilistic machine learning method, to model the status of the free expression landscape. This index seeks to provide legislators and other policymakers, activists and governments, and non-governmental and intergovernmental organisations, with tools to better inform policy or action decisions. The global nature of the proposed index also means it can become a vital resource/tool for engagement with international and supranational bodies.
DOI: 10.1093/mnras/stad3768
2023
Machine Learning applications for Cataclysmic Variable discovery in the ZTF alert stream
Cataclysmic variables (CVs) encompass a diverse array of accreting white dwarf binary systems. Each class of CV represents a snapshot along an evolutionary journey, one with the potential to trigger a type Ia supernova event. The study of CVs offers valuable insights into binary evolution and accretion physics, with the rarest examples potentially providing the deepest insights. However, the escalating number of detected transients, coupled with our limited capacity to investigate them all, poses challenges in identifying such rarities. Machine learning (ML) plays a pivotal role in addressing this issue by facilitating the categorisation of each detected transient into its respective transient class. Leveraging these techniques, we have developed a two-stage pipeline tailored to the ZTF transient alert stream. The first stage is an alerts filter aimed at removing non-CVs, while the second is an ML classifier produced using XGBoost, achieving a macro average AUC score of 0.92 for distinguishing between CV classes. By utilising the Generative Topographic Mapping algorithm with classifier posterior probabilities as input, we obtain representations indicating that CV evolutionary factors play a role in classifier performance, while the associated feature maps present a potent tool for identifying the features deemed most relevant for distinguishing between classes. Implementation of the pipeline in June 2023 yielded 51 intriguing candidates that are yet to be reported as CVs or classified with further granularity. Our classifier represents a significant step in the discovery and classification of different CV classes, a domain of research still in its infancy.
DOI: 10.3390/a17010006
2023
Predicting Decompensation Risk in Intensive Care Unit Patients Using Machine Learning
Patients in Intensive Care Units (ICU) face the threat of decompensation, a rapid decline in health associated with a high risk of death. This study focuses on creating and evaluating machine learning (ML) models to predict decompensation risk in ICU patients. It proposes a novel approach using patient vitals and clinical data within a specified timeframe to forecast decompensation risk sequences. The study implemented and assessed long short-term memory (LSTM) and hybrid convolutional neural network (CNN)-LSTM architectures, along with traditional ML algorithms as baselines. Additionally, it introduced a novel decompensation score based on the predicted risk, validated through principal component analysis (PCA) and k-means analysis for risk stratification. The results showed that, with PPV = 0.80, NPV = 0.96 and AUC-ROC = 0.90, CNN-LSTM had the best performance when predicting decompensation risk sequences. The decompensation score’s effectiveness was also confirmed (PPV = 0.83 and NPV = 0.96). SHAP plots were generated for the overall model and two risk strata, illustrating variations in feature importance and their associations with the predicted risk. Notably, this study represents the first attempt to predict a sequence of decompensation risks rather than single events, a critical advancement given the challenge of early decompensation detection. Predicting a sequence facilitates early detection of increased decompensation risk and pace, potentially leading to saving more lives.
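A minimal Keras sketch of a hybrid CNN-LSTM that maps a window of vitals to a sequence of future decompensation risks, as the abstract describes; the input shapes and layer sizes are assumptions, not the study's configuration.

```python
# Illustrative CNN-LSTM: 48 time steps of 12 vitals in, 6 risk steps out.
import tensorflow as tf

window, n_vitals, horizon = 48, 12, 6
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, n_vitals)),
    tf.keras.layers.Conv1D(32, kernel_size=5, activation="relu",
                           padding="same"),     # local temporal patterns
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.LSTM(64),                   # longer-range dynamics
    tf.keras.layers.Dense(horizon, activation="sigmoid"),  # risk per step
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```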
2010
Segmentation of EMG time series using a variational Bayesian approach for the robust estimation of cortical silent periods
A variational Bayesian formulation for a manifold-constrained hidden Markov model is used in this paper to segment a set of multivariate time series of electromyographic recordings corresponding to stroke patients and control subjects. An index of variability associated with this model is defined and applied to the robust detection of the silent period interval of the signal. Accuracy in the estimation of the duration of this interval is paramount to assess the rehabilitation of stroke patients. The transcranial magnetic stimulation (TMS) of the cerebral motor cortex can evoke waves in the electromyographic (EMG) recording of muscle activity. Cortical stimulation can elicit excitatory as well as inhibitory effects. One of the latter is called the cortical silent period (CSP). When TMS is delivered over the motor cortex while the subjects maintain voluntary muscle contraction, the CSP is a pause in ongoing EMG activity that follows the motor-evoked potential. The duration of the CSP is an important parameter to gauge the recovery of stroke patients and to provide them with a prognosis. It is known (1) that the shortening of the CSP in the affected side is related to an increase of its excitability, indicating an improvement of the motor function of the patients. The measurement of the CSP is sometimes troublesome due to the nature of the signal. The existing measurement methods are yet imprecise and are known to yield a significant error due to the sensitivity to noise of this kind of data (2). The main purpose of this study is to provide an accurate technique for CSP estimation based on a multivariate time series (MTS) segmentation process that behaves robustly in the presence of noise. For this, we resort to a manifold-constrained hidden Markov model (HMM). The formulation of this model within a variational Bayesian framework imbues it with regularization properties that minimize the negative effect of the presence of noise in the EMG MTS. A novel index of variability (IV) is defined for this model. It is capable of providing reliable estimates of the CSP duration by pinpointing its offset time with precision.
DOI: 10.1007/11875581_5
2006
Cited 3 times
Time Series Relevance Determination Through a Topology-Constrained Hidden Markov Model
Most of the existing research on multivariate time series concerns supervised forecasting problems. In comparison, little research has been devoted to unsupervised methods for the visual exploration of this type of data. The interpretability of time series clustering results may be difficult, even in exploratory visualization, for high dimensional datasets. In this paper, we define and test an unsupervised time series relevance determination method for Generative Topographic Mapping Through Time, a topology-constrained Hidden Markov Model that performs simultaneous time series data clustering and visualization. This relevance determination method can be used as a basis for time series selection, and should ease the interpretation of the time series clustering results.
2008
Variational bayesian algorithms for generative topographic mapping and its extensions
This thesis is the result of our general interest in the study of the Generative Topographic Mapping (GTM), a non-linear latent variable model originally proposed as a probabilistic alternative to the well-known Self-Organizing Maps to visualize and cluster high dimensional data by extracting their low-dimensional hidden inherent structures. Over the last decade, it has been extended to tackle other data problems, and it has been applied in a wide variety of areas. The standard GTM (for multivariate static data) as well as the GTM Through Time (for multivariate time series) make use of the maximum likelihood framework through the expectation-maximization (EM) algorithm to compute the local-optimum values of their parameters. However, this approximation is often too drastic to handle the high-dimensional, multi-modal and strongly correlated data that can be encountered, therefore risking data overfitting. In this thesis, we first present an exhaustive set of experiments to assess the capabilities of the standard GTM Through Time model for clustering and visualization of multivariate time series. A novel index of variability that allows measuring the degree of variability of a subsequence is introduced, and an unsupervised Time Series Relevance Determination method is proposed. The latter allows ranking the time series of a dataset in terms of their relevance for the clustering of subsequences. The fully Bayesian modelling of the GTM, as well as its implementation through a variational Bayesian approach, namely the Variational Bayesian GTM (VBGTM), constitute novel and significant contributions of this thesis. Despite the fact that the risk of data overfitting in standard GTM was known from inception, this thesis, to the best of our knowledge, provides the first complete solution to this problem. The Variational Bayesian GTM-TT (VBGTM-TT), an elegant solution to deal with the data overfitting problem in GTM Through Time for the analysis of multivariate time series, is also defined. The high risk of data overfitting due to the elevated number of free parameters was one of the most relevant weaknesses of the original GTM Through Time model. Several experiments using artificial and real data show that the proposed VBGTM and VBGTM-TT models outperform their standard counterparts, GTM and GTM-TT, respectively, in terms of generalization capabilities and data visualization. Finally, some of the many possible novel extensions of GTM to be developed within the proposed variational Bayesian framework, focusing on unsupervised relevance determination in some detail, are also outlined in the closing chapters of the thesis.
DOI: 10.1007/978-3-030-19642-4_30
2019
Classifying and Grouping Mammography Images into Communities Using Fisher Information Networks to Assist the Diagnosis of Breast Cancer
The aim of this paper is to build a computer based clinical decision support tool using a semi-supervised framework, the Fisher Information Network (FIN), for visualization of a set of mammographic images. The FIN organizes the images into a similarity network from which, for any new image, reference images that are closely related can be identified. This enables clinicians to review not just the reference images but also ancillary information e.g. about response to therapy. The Fisher information metric defines a Riemannian space where distances reflect similarity with respect to a given probability distribution. This metric is informed about generative properties of data, and hence assesses the importance of directions in space of parameters. It automatically performs feature relevance detection. This approach focusses on the interpretability of the model from the standpoint of the clinical user. Model predictions were validated using the prevalence of classes in each of the clusters identified by the FIN.
DOI: 10.1109/ijcnn.2019.8852074
2019
Benchmarking multi-task learning in predictive models for drug discovery
Being able to predict the activity of chemical compounds against a drug target (e.g. a protein) in the preliminary stages of drug development is critical. In drug discovery, this is known as Quantitative Structure Activity Relationships (QSARs). Datasets for QSARs are often ill-posed for traditional machine learning to provide meaningful insights (e.g. very high dimensionality). Here, we propose a multi-task learning (MTL) approach to enrich the original QSAR datasets with the hope of improving overall QSAR performance. The proposed approach, henceforth named MTL-AT, increases the size of the usable data through an assistant task: a supplementary dataset formed by compounds automatically extracted from other possibly related tasks. The main novelty in our MTL-AT approach is the addition of control for data leakage. We tested MTL-AT in two drug discovery scenarios: 1) using 100 unrelated QSAR datasets, and 2) using 20 QSAR datasets that are related to the same protein family. Results were compared against an equivalent single-task learning (STL) approach. MTL-AT outperformed STL in 45 tasks of scenario 1 and in 12 tasks of scenario 2. MTL-AT was the best overall method in both scenarios, with the smallest datasets yielding the greatest performance improvement from multi-task learning. These results show that implementing multi-task learning with QSAR data has promise, but more investigation is required to test its applicability to certain features in datasets to make it suitable for widespread use in the drug discovery area. To the best of our knowledge, this is the first study that benchmarks the use of MTL on a large number of small datasets, which represents a more realistic scenario in drug development.
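The core MTL-AT move, enlarging a small training set with an assistant task drawn from related tasks while keeping test compounds out of the assistant pool, can be sketched as follows; everything here is synthetic, and the real pipeline's leakage filter and molecular descriptors are more involved.

```python
# Sketch of MTL-AT-style data enrichment for a small QSAR task.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_main = rng.random((80, 100))        # small main QSAR task (descriptors)
y_main = rng.random(80)
X_train, X_test, y_train, y_test = train_test_split(
    X_main, y_main, random_state=0)

# Assistant task: compounds from possibly related tasks. In practice this
# pool is filtered against the test set to control leakage; the synthetic
# pools here are disjoint by construction.
X_assist = rng.random((400, 100))
y_assist = rng.random(400)

X_aug = np.vstack([X_train, X_assist])
y_aug = np.concatenate([y_train, y_assist])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_aug, y_aug)
print(model.score(X_test, y_test))    # evaluate on held-out main-task data
```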
DOI: 10.1109/icdabi51230.2020.9325691
2020
Using MLP partial responses to explain in-hospital mortality in ICU
In this paper we propose to use partial responses derived from an initial multilayer perceptron (MLP) to build an explanatory risk prediction model of in-hospital mortality in intensive care units (ICU). Traditionally, MLPs deliver higher performance than linear models such as multivariate logistic regression (MLR). However, MLPs interlink input variables in such a complex way that it is not straightforward to explain how the outcome is influenced by inputs and/or input interactions. In this paper, we hypothesized that in some scenarios, such as when the data noise is significant or when the data are only marginally non-linear, we could uncover slightly more complex associations by obtaining MLP partial responses, that is, by varying one variable at a time while keeping the rest constant. Overall, we found that, although the MLR and MLP in-hospital mortality models performed equivalently, the MLP could explain non-linear associations that the MLR had otherwise considered non-significant. We consider that, although deeming higher-order interactions to be disposable noise could be a strong assumption, building explanatory models on MLP partial responses could still be more informative than relying on MLR alone.
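The notion of a partial response can be sketched in a few lines: sweep one input over its observed range while holding the others at their means, and read off the model's predicted probability. The code below is a hypothetical toy version, not the paper's ICU model.

```python
# Toy sketch of extracting a partial response from a trained MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500) > 0).astype(int)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X, y)

def partial_response(model, X, j, n_grid=50):
    """Predicted probability as feature j sweeps its range, others at means."""
    grid = np.linspace(X[:, j].min(), X[:, j].max(), n_grid)
    base = np.tile(X.mean(axis=0), (n_grid, 1))
    base[:, j] = grid
    return grid, model.predict_proba(base)[:, 1]

grid, pr = partial_response(mlp, X, j=1)   # reveals the quadratic effect of x1
```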
DOI: 10.1097/qai.0000000000002721
2021
Associations of Hepatosteatosis With Cardiovascular Disease in HIV-Positive and HIV-Negative Patients: The Liverpool HIV–Heart Project
Background: Hepatosteatosis (HS) has been associated with cardiovascular disorders in the general population. We sought to investigate whether HS is a marker of cardiovascular disease (CVD) risk in HIV-positive individuals, given that metabolic syndrome is implicated in the increasing CVD burden in this population. Aims: To investigate the association of HS with CVD in HIV-positive and HIV-negative individuals. Methods and results: We analyzed computed tomography (CT) images of 1306 subjects, of whom 209 (16%) were HIV-positive and 1097 (84%) HIV-negative. CVD was quantified by the presence of coronary calcification from both dedicated cardiac CT and nondedicated thorax CT. HS was diagnosed from CT data sets in those with noncontrast dedicated cardiac CT and those with venous phase liver CT using previously validated techniques. Previous liver ultrasound was also assessed for the presence of HS. The HIV-positive group had a lower mean age (P < 0.005), a higher proportion of male sex (P < 0.005), and more current smokers (P < 0.005). The HIV-negative group had higher proportions of hypertension (P < 0.005), type II diabetes (P = 0.032), dyslipidemia (P < 0.005), statin use (P = 0.008), and HS (P = 0.018). The prevalence of coronary calcification was not significantly different between the groups. Logistic regression (LR) demonstrated that in the HIV-positive group, increasing age [odds ratio (OR): 1.15, P < 0.005], male sex (OR: 3.37, P = 0.022), and HS (OR: 3.13, P = 0.005) were independently associated with CVD. In the HIV-negative group, increasing age (OR: 1.11, P < 0.005), male sex (OR: 2.97, P < 0.005), current smoking (OR: 1.96, P < 0.005), and dyslipidemia (OR: 1.66, P = 0.03) were independently associated with CVD. Using a machine learning random forest algorithm to assess variable importance, the top 3 variables in the HIV-positive group were age, HS, and male sex; in the HIV-negative group, they were age, hypertension, and male sex. The LR models predicted CVD well, with mean areas under the receiver operating characteristic curve (AUC) for the HIV-positive and HIV-negative cohorts of 0.831 [95% confidence interval (CI): 0.713 to 0.928] and 0.786 (95% CI: 0.735 to 0.836), respectively. The random forest models outperformed the LR models, with mean AUCs in the HIV-positive and HIV-negative populations of 0.877 (95% CI: 0.775 to 0.959) and 0.828 (95% CI: 0.780 to 0.873), respectively, the differences between the two methods being statistically significant. Conclusion: In contrast to the general population, HS is a strong and independent predictor of CVD in HIV-positive individuals. This suggests that the excess CVD risk seen in this patient group may be attributable to metabolic dysfunction. Assessment of HS may help accurate quantification of CVD risk in HIV-positive patients.
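As an illustration of the kind of model comparison reported above (LR vs. random forest AUC with confidence intervals), the hedged sketch below uses synthetic data and simple bootstrap percentile intervals; names and settings are assumptions, not the study's pipeline.

```python
# Toy sketch: compare LR and RF discrimination with bootstrap AUC intervals.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + np.abs(X[:, 1]) + rng.normal(size=1000) > 1).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def boot_auc(model, n_boot=500):
    """Fit once, then bootstrap the test-set AUC to get a percentile CI."""
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_te), len(y_te))
        if len(np.unique(y_te[idx])) == 2:          # need both classes present
            aucs.append(roc_auc_score(y_te[idx], p[idx]))
    return np.percentile(aucs, [2.5, 50, 97.5])

print("LR:", boot_auc(LogisticRegression(max_iter=1000)))
print("RF:", boot_auc(RandomForestClassifier(random_state=0)))
```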
2007
A variational formulation for GTM through time: Theoretical foundations
DOI: 10.1109/cidm.2014.7008654
2014
Automatic relevance source determination in human brain tumors using Bayesian NMF
The clinical management of brain tumors is very sensitive; thus, their non-invasive characterization is often preferred. Non-negative Matrix Factorization (NMF) techniques have been successfully applied in the context of neuro-oncology to extract the underlying source signals that explain different tumor tissue types, although these techniques have always required the number of sources to be specified in advance. In the current study, we estimate the number of relevant sources for a set of discrimination problems involving brain tumors and normal brain. For this, we propose to start by calculating a deliberately high number of sources using Bayesian NMF and automatically discarding the irrelevant ones during the iterative matrix decomposition process, hence obtaining a reduced range of interpretable solutions. The real data used in this study come from a widely tested human brain tumor database. Simulated data resembling the real data were also generated to validate the hypothesis against a ground truth. The results obtained suggest that the proposed approach is able to provide a small range of meaningful solutions to the problem of source extraction in human brain tumors.
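A toy sketch of the underlying idea, under stated assumptions: run NMF with deliberately many sources, apply an L1-style shrinkage, and discard sources whose relevance collapses during the iterations. This uses plain multiplicative updates rather than the paper's Bayesian NMF.

```python
# Hedged sketch: NMF with more sources than needed plus ARD-like pruning.
import numpy as np

rng = np.random.default_rng(5)
true_W = rng.random((100, 3)); true_H = rng.random((3, 40))
V = true_W @ true_H + 0.01 * rng.random((100, 40))     # data with 3 true sources

K, lam, eps = 10, 0.1, 1e-9                            # start with too many sources
W = rng.random((V.shape[0], K)); H = rng.random((K, V.shape[1]))
for _ in range(500):
    # Multiplicative updates with an L1-style term that shrinks weak sources
    H *= (W.T @ V) / (W.T @ W @ H + lam + eps)
    W *= (V @ H.T) / (W @ H @ H.T + lam + eps)
    relevance = np.linalg.norm(W, axis=0) * np.linalg.norm(H, axis=1)
    keep = relevance > 1e-3 * relevance.max()          # ARD-like pruning
    W, H = W[:, keep], H[keep]

print("sources surviving pruning:", W.shape[1])
```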
DOI: 10.1016/j.xcrm.2022.100875
2022
New use for an old drug: Metformin and atrial fibrillation
Lal and colleagues [1] reported an integrative approach, combining transcriptomics, iPSCs, and epidemiological evidence, to identify and repurpose metformin, a first-line medication for the treatment of type 2 diabetes, as an effective risk reducer for atrial fibrillation.
DOI: 10.1109/ijcnn55064.2022.9892114
2022
Towards interpretable machine learning for clinical decision support
A major challenge in delivering reliable and trustworthy computational intelligence for practical applications in clinical medicine is interpretability. This aspect of machine learning is a major distinguishing factor compared with traditional statistical models for the stratification of patients, which typically use rules or a risk score identified by logistic regression. We show how functions of one and two variables can be extracted from pre-trained machine learning models using anchored Analysis of Variance (ANOVA) decompositions. This enables complex interaction terms to be filtered out by aggressive regularisation using the Least Absolute Shrinkage and Selection Operator (LASSO), resulting in a sparse model with comparable or even better performance than the original pre-trained black box. Besides being theoretically well founded, the decomposition of a black-box multivariate probabilistic binary classifier into a General Additive Model (GAM) comprising a linear combination of non-linear functions of one or two variables provides full interpretability. In effect, this extends logistic regression into non-linear modelling without the need for manual intervention by way of variable transformations, using the pre-trained model as a seed. The application of the proposed methodology to existing machine learning models is demonstrated using the Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), Random Forests (RF) and Gradient Boosting Machines (GBM), to model a data frame from a well-known benchmark dataset available from PhysioNet, the Medical Information Mart for Intensive Care (MIMIC-III). Both the classification performance and the plausibility of the clinical interpretation compare favourably with other state-of-the-art sparse models, namely Sparse Additive Models (SAM) and the Explainable Boosting Machine (EBM).
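The pipeline can be caricatured as follows: evaluate univariate partial responses of a pre-trained black box on the data, then let an L1-penalised logistic regression keep only the useful terms. The sketch below is a minimal, hypothetical version (an SVM stands in for the black box; the anchored two-variable ANOVA terms are omitted).

```python
# Toy sketch: univariate partial responses of a black box + logistic LASSO.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 5))
y = (np.sin(X[:, 0]) + X[:, 2] + rng.normal(scale=0.3, size=400) > 0).astype(int)

black_box = SVC(probability=True, random_state=0).fit(X, y)

def pr_feature(model, X, j):
    """Logit of the black box with only feature j varying (others at means)."""
    base = np.tile(X.mean(axis=0), (len(X), 1))
    base[:, j] = X[:, j]
    p = np.clip(model.predict_proba(base)[:, 1], 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

F = np.column_stack([pr_feature(black_box, X, j) for j in range(X.shape[1])])
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(F, y)
print("terms kept:", np.flatnonzero(lasso.coef_[0]))   # the sparse additive terms
```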
2010
Spectral Prototype Extraction for dimensionality reduction in brain tumour diagnosis
Diagnosis in neuro-oncology can be assisted by non-invasive data acquisition techniques such as Magnetic Resonance Spectroscopy (MRS). From the viewpoint of computer-based brain tumour classification, the high dimensionality of MRS poses a difficulty, and the use of dimensionality reduction (DR) techniques is advisable. Despite some important limitations, Principal Component Analysis (PCA) is commonly used for DR in MRS data analysis. Here, we define a novel DR technique, namely Spectral Prototype Extraction, based on a manifold-constrained Hidden Markov Model (HMM). Its formulation within a variational Bayesian framework imbues it with regularization properties that minimize the negative effect of the presence of noise in the data. Its use for MRS pre-processing is illustrated in a difficult brain tumour classification problem.
DOI: 10.4018/978-1-60566-766-9.ch008
2010
Clustering and Visualization of Multivariate Time Series
The exploratory investigation of multivariate time series (MTS) may become extremely difficult, if not impossible, for high-dimensional datasets. Paradoxically, to date, little research has been conducted on the exploration of MTS through unsupervised clustering and visualization. In this chapter, the authors describe generative topographic mapping through time (GTM-TT), a model with foundations in probability theory that performs such tasks. The standard version of this model has several shortcomings that restrict its applicability. Here, the authors reformulate it within a Bayesian approach using variational techniques. The resulting variational Bayesian GTM-TT, described in some detail, is shown to behave very robustly in the presence of noise in the MTS, helping to avert the problem of data overfitting.
DOI: 10.1007/978-3-030-19642-4_29
2019
A Voting Ensemble Method to Assist the Diagnosis of Prostate Cancer Using Multiparametric MRI
Prostate cancer is the second most commonly occurring cancer in men. Diagnosis through Magnetic Resonance Imaging (MRI) is limited, as current practice has relatively low specificity. This paper extends a previous SPIE ProstateX challenge study in three ways: (1) by including healthy tissue analysis, creating a solution suitable for clinical practice, which has been requested and validated by collaborating clinicians; (2) by using a voting ensemble method to assist prostate cancer diagnosis through a supervised SVM approach; and (3) by using the unsupervised GTM to provide interpretability of the supervised SVM classification results. Pairwise classifiers of clinically significant lesions, non-significant lesions, and healthy tissue were developed. Results showed that when combining multiparametric MRI and patient-level metadata, classification of significant lesions against healthy tissue attained an AUC of 0.869 (10-fold cross-validation).
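A minimal sketch of a soft-voting ensemble for one pairwise problem is shown below; the features and labels are synthetic stand-ins for the multiparametric MRI data, and the estimator choices are assumptions rather than the paper's configuration.

```python
# Toy sketch: soft-voting ensemble of SVMs for a pairwise classification task.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 12))            # stand-in for mpMRI + metadata features
y = (X[:, 0] - X[:, 3] > 0).astype(int)   # lesion vs healthy tissue (toy labels)

ensemble = VotingClassifier(
    estimators=[("rbf", SVC(kernel="rbf", probability=True)),
                ("lin", SVC(kernel="linear", probability=True)),
                ("pol", SVC(kernel="poly", degree=2, probability=True))],
    voting="soft")                        # average the predicted probabilities

print("AUC:", cross_val_score(ensemble, X, y, cv=10, scoring="roc_auc").mean())
```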
DOI: 10.1007/978-3-030-33617-2_13
2019
Comparative Analysis for Computer-Based Decision Support: Case Study of Knee Osteoarthritis
This case study benchmarks a range of statistical and machine learning methods relevant to computer-based decision support in clinical medicine, focusing on the diagnosis of knee osteoarthritis at first presentation. The methods, comprising logistic regression, the Multilayer Perceptron (MLP), the Chi-square Automatic Interaction Detector (CHAID) and Classification and Regression Trees (CART), are applied to a public domain database, the Osteoarthritis Initiative (OAI), a 10-year longitudinal study starting in 2002 (n = 4,796). In this real-world application, it is shown that logistic regression is comparable with the neural networks and decision trees for discrimination of positive diagnoses on this data set, likely because of weak non-linearities amid high levels of noise. After comparing the explanations provided by the different methods, it is concluded that the interpretability of the risk score index provided by logistic regression is expressed in a form that most naturally integrates with clinical reasoning, because it gives a statistical assessment of the weight of evidence for making the diagnosis. This provides a direction for future research to improve the explanation of generic non-linear models.
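The benchmarking set-up can be reproduced in miniature as below, with toy data standing in for the OAI cohort and scikit-learn's decision tree standing in for CART/CHAID; everything here is illustrative.

```python
# Toy sketch: the same cross-validated AUC estimate for each candidate method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(600, 10))
# Weak non-linearity buried in noise, the regime described in the abstract
y = (X[:, 0] + 0.3 * X[:, 1] ** 2 + rng.normal(size=600) > 0).astype(int)

for name, model in [("LR  ", LogisticRegression(max_iter=1000)),
                    ("MLP ", MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000)),
                    ("Tree", DecisionTreeClassifier(max_depth=4))]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(name, round(auc.mean(), 3))
```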
DOI: 10.1007/978-3-030-65965-3_29
2020
Efficient Estimation of General Additive Neural Networks: A Case Study for CTG Data
This paper discusses the concepts of interpretability and explainability and outlines desiderata for robust interpretability. It then describes a neural network model that meets all criteria, with the addition of global faithfulness. This is achieved by efficient estimation of a General Additive Neural Network, seeded by a conventional Multilayer Perceptron (MLP), by distilling the dependence on individual variables and pairwise interactions so that their effects can be represented within the structure of a General Additive Model. This makes the logic of the model clear and transparent to users across the complete input space; the model is self-explaining. The modelling approach used in this paper derives the partial responses from the MLP, resulting in the Partial Response Network (PRN). Its application is illustrated in a medical context using the CTU-UHB intrapartum cardiotocography database (n = 552) to infer the features associated with caesarean deliveries. This is the first application of the PRN to this data set, and it is shown that the self-explaining model achieves discrimination performance comparable to that of the Random Forests previously applied to the same data set. The classes are highly imbalanced, with a prevalence of caesarean sections of 8.33%. The resulting model uses 4 of 8 possible features and has an AUROC of 0.69 [CI 0.60, 0.77] estimated by 4-fold cross-validation. Its performance and features are also compared with those of a Sparse Additive Model (SAM), which has an AUROC of 0.72 [CI 0.64, 0.80]; this is not significantly different and requires all features. For clinical utility by risk stratification, the odds ratio for caesarean section vs. not, at the prevalence threshold, is 3.97 for the PRN, better than the 3.14 for the SAM. Compared for consistency, parsimony, stability and scalability, the models have complementary properties.
DOI: 10.1007/978-3-540-77226-2_9
2007
Variational GTM
Generative Topographic Mapping (GTM) is a non-linear latent variable model that provides simultaneous visualization and clustering of high-dimensional data. It was originally formulated as a constrained mixture of distributions, for which the adaptive parameters were determined by Maximum Likelihood (ML), using the Expectation-Maximization (EM) algorithm. In this paper, we define an alternative variational formulation of GTM that provides a full Bayesian treatment to a Gaussian Process (GP)-based variation of GTM. The performance of the proposed Variational GTM is assessed in several experiments with artificial datasets. These experiments highlight the capability of Variational GTM to avoid data overfitting through active regularization.
DOI: 10.48550/arxiv.1908.05978
2019
The Partial Response Network: a neural network nomogram
Among interpretable machine learning methods, the class of Generalised Additive Neural Networks (GANNs) is referred to as Self-Explaining Neural Networks (SENN) because of the linear dependence on explicit functions of the inputs. In binary classification, this shows the precise weight that each input contributes towards the logit. The nomogram is a graphical representation of these weights. We show that functions of individual and pairs of variables can be derived from a functional Analysis of Variance (ANOVA) representation, enabling an efficient feature selection to be carried out by application of the logistic Lasso. This process infers the structure of GANNs, which otherwise needs to be predefined. As this method is particularly suited to tabular data, it starts by fitting a generic flexible model, in this case a Multi-layer Perceptron (MLP), to which the ANOVA decomposition is applied. This has the further advantage that the resulting GANN can be replicated as a SENN, enabling further refinement of the univariate and bivariate component functions. The component functions are partial responses; hence the SENN is a partial response network. The Partial Response Network (PRN) is as transparent as a traditional logistic regression model, but capable of non-linear classification with comparable or superior performance to the original MLP. In other words, the PRN is a fully interpretable representation of the MLP at the level of univariate and bivariate effects. The performance of the PRN is shown to be competitive on benchmark data against state-of-the-art machine learning methods including GBM, SVM and Random Forests. It is also compared with spline-based Sparse Additive Models (SAM), showing that a semi-parametric representation of the GAM as a neural network can be as effective as the SAM while being less constrained by the need to set spline knots.
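A bivariate partial response, one of the building blocks described above, can be sketched as follows (hypothetical toy code): sweep a pair of inputs over a grid with the remaining inputs held at their means, exposing an interaction that univariate terms would miss.

```python
# Toy sketch: bivariate partial response surface from a trained MLP.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 4))
y = (X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=500) > 0).astype(int)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=3000,
                    random_state=0).fit(X, y)

n = 30
gi = np.linspace(X[:, 0].min(), X[:, 0].max(), n)
gj = np.linspace(X[:, 1].min(), X[:, 1].max(), n)
base = np.tile(X.mean(axis=0), (n * n, 1))
base[:, 0] = np.repeat(gi, n)        # grid over feature 0
base[:, 1] = np.tile(gj, n)          # grid over feature 1, others at means
surface = mlp.predict_proba(base)[:, 1].reshape(n, n)
# 'surface' can be plotted as a heat map to read off the x0*x1 interaction
```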
2014
Severe mental illness in the UK (2000-2012): rising disparities in comorbidities and resource use in primary care
2015
Primary care consultation rates among people with and without severe mental illness: a UK cohort study using the Clinical Practice Research Datalink
2015
Meta-QSAR: learning how to learn QSARs
Quantitative structure activity relationships (QSARs) are functions that predict bioactivity from compound structure. Although almost every form of statistical and machine learning method has been applied to learning QSARs, there is no single best way of learning them. Therefore, the QSAR scientist currently has little to guide them on which QSAR approach to choose for a specific problem. The aim of this work is to introduce Meta-QSAR, a meta-learning approach aimed at learning which QSAR method is most appropriate for a particular problem. For the preliminary results presented here, we used ChEMBL, a publicly available chemoinformatics database, to systematically run extensive comparative QSAR experiments. We further apply meta-learning in order to generalise these results.
2014
Can analyses of electronic patient records be independently and externally validated? The effect of statins on the mortality of patients with ischaemic heart disease: a cohort study with nested case–control analysis
DOI: 10.6084/m9.figshare.1008899
2014
list of 374 UK Primary Care Database Studies metadata and scores for transparency of clinical coding
A large component of total EMR research is made up of primary care database (PCD) studies, and UK PCDs are among the most researched in the world. As one of the largest and most important resources for EMR-based research, it seems reasonable to expect reporting of code lists in UK PCD-based studies to be at least as comprehensive as in other EMR studies. To evaluate levels of transparency in the reporting of clinical code lists, we took a representative sample of UK PCD studies and assessed each study on the extent of its reporting of the clinical codes used. We took a sample of 450 papers from the original 1359 identified from a PubMed search. Of these, 374 (83%) both had the full text accessible to the University of Manchester library and were examples of primary PCD research. Only 5% (19 of 392) of studies published the entire set of clinical codes needed to reproduce the study (usually in an online appendix), while only an additional 9% (32 of 392) stated explicitly that the clinical codes were available upon request. In a subset of articles published since 2008, 6.9% (16 of 231) published the entire set of codes and 10.4% (24 of 231) stated that clinical codes were available upon request. See https://github.com/rOpenHealth/ClinicalCodes/tree/master/paper
DOI: 10.6084/m9.figshare.1008900
2014
Clinicalcodes.org example JSON research object
Example JSON research object output from www.clinicalcodes.org containing the clinical codes for a research article. See https://github.com/rOpenHealth/ClinicalCodes/tree/master/paper
DOI: 10.1016/j.ijpsycho.2012.06.150
2012
A Bayesian model of EEG/MEG source dynamics and effective connectivity
2013
Modelling conditions in the CPRD: an application to severe mental illness.
DOI: 10.4018/978-1-4666-1803-9.ch013
2012
Kernel Generative Topographic Mapping of Protein Sequences
The world of pharmacology is becoming increasingly dependent on the advances in the fields of genomics and proteomics. The –omics sciences bring about the challenge of how to deal with the large amounts of complex data they generate from an intelligent data analysis perspective. In this chapter, the authors focus on the analysis of a specific type of proteins, the G protein-coupled receptors, which are the target for over 15% of current drugs. They describe a kernel method of the manifold learning family for the analysis of protein amino acid symbolic sequences. This method sheds light on the structure of protein subfamilies, while providing an intuitive visualization of such structure.
DOI: 10.4018/978-1-4666-3604-0.ch044
2013
Kernel Generative Topographic Mapping of Protein Sequences
The world of pharmacology is becoming increasingly dependent on the advances in the fields of genomics and proteomics. The –omics sciences bring about the challenge of how to deal with the large amounts of complex data they generate from an intelligent data analysis perspective. In this chapter, the authors focus on the analysis of a specific type of proteins, the G protein-coupled receptors, which are the target for over 15% of current drugs. They describe a kernel method of the manifold learning family for the analysis of protein amino acid symbolic sequences. This method sheds light on the structure of protein subfamilies, while providing an intuitive visualization of such structure.
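As a generic stand-in for kernel methods on symbolic protein sequences (not the chapter's Kernel GTM), the sketch below builds a simple spectrum (k-mer count) representation and projects it with kernel PCA for 2-D visualization; the sequences are toy examples.

```python
# Hedged sketch: k-mer spectrum representation + kernel PCA visualization.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import KernelPCA

seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GAVLIMCFYW", "GAVLIMCFWY"]  # toy sequences

# 3-mer spectrum representation of each amino-acid string
vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
kmer_counts = vec.fit_transform(seqs).toarray()

# Linear kernel on k-mer counts, projected to 2-D for visualization
coords = KernelPCA(n_components=2, kernel="linear").fit_transform(kmer_counts)
print(coords)   # similar sequences land close together
```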
DOI: 10.48550/arxiv.1709.03854
2017
Meta-QSAR: a large-scale application of meta-learning to drug design and discovery
We investigate the learning of quantitative structure activity relationships (QSARs) as a case study of meta-learning. This application area is of the highest societal importance, as it is a key step in the development of new medicines. The standard QSAR learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (e.g. inhibition of the target), learn a predictive mapping from molecular representation to activity. Although almost every type of machine learning method has been applied to QSAR learning, there is no agreed single best way of learning QSARs, and therefore the problem area is well suited to meta-learning. We first carried out the most comprehensive comparison yet of machine learning methods for QSAR learning: 18 regression methods and 6 molecular representations, applied to more than 2,700 QSAR problems. (These results have been made publicly available on OpenML and represent a valuable resource for testing novel meta-learning methods.) We then investigated the utility of algorithm selection for QSAR problems. We found that this meta-learning approach outperformed the best individual QSAR learning method (random forests using a molecular fingerprint representation) by up to 13%, on average. We conclude that meta-learning outperforms base-learning methods for QSAR learning, and as this investigation is one of the most extensive comparisons of base- and meta-learning methods ever made, it provides evidence for the general effectiveness of meta-learning over base-learning.
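Algorithm selection of this kind can be sketched with synthetic meta-data, as below: a meta-learner predicts the best method for each dataset from dataset-level meta-features, and is compared with always choosing a single method. All names and numbers are hypothetical.

```python
# Toy sketch of meta-learning as algorithm selection over many datasets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(10)
n_datasets, n_methods = 300, 4
meta = rng.normal(size=(n_datasets, 6))     # e.g. size, dimensionality, sparsity
perf = rng.random((n_datasets, n_methods))  # per-method performance per dataset
perf[:, 0] += 0.2 * (meta[:, 0] > 0)        # method 0 is better on 'large' sets
best = perf.argmax(axis=1)                  # label: which method won

selector = RandomForestClassifier(random_state=0)
chosen = cross_val_predict(selector, meta, best, cv=5)

print("always method 0:", perf[:, 0].mean())
print("meta-selected  :", perf[np.arange(n_datasets), chosen].mean())
```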
DOI: 10.21203/rs.3.rs-1529645/v1
2022
Enhanced survival prediction using explainable artificial intelligence in heart transplantation
Abstract The most limiting factor in heart transplantation is the lack of donor organs. With enhanced prediction of outcome, it may be possible to increase the life-years gained from the organs that become available. Applications of machine learning to tabular data, typical of clinical decision support, pose the practical question of interpretation, which has technical and potentially ethical implications. In particular, there is an issue of principle about the predictability of complex data and whether this is inherent in the data or strongly dependent on the choice of machine learning model, leading to the so-called accuracy-interpretability trade-off. We model one-year mortality in heart transplantation data with a self-explaining neural network, which is benchmarked against a deep learning model on the same development data, in an external validation study with two data sets: 1) UNOS transplants in 2017-2018 (n = 4,750), for which the self-explaining and deep learning models are comparable in their AUROC, 0.628 [0.602, 0.654] vs. 0.635 [0.609, 0.662]; and 2) Scandinavian transplants during 1997-2018 (n = 2,293), showing good calibration with AUROCs of 0.626 [0.588, 0.665] and 0.634 [0.570, 0.698], respectively, with and without missing data (n = 982). This shows that, for tabular data, predictive models can be transparent and capture important non-linearities while retaining full predictive performance.
DOI: 10.18359/rcin.1410
1999
Una introducción a la robótica industrial
Industrial robotics is today one of the most important areas of research and technological development. This article gives a global overview of what industrial robotics is and of the internal structure of a typical robot. It first presents the background of robotics, its definition and its components, and then describes the characteristics of the different systems that make up a robot.
DOI: 10.48550/arxiv.1811.03392
2018
Transformative Machine Learning
The key to success in machine learning (ML) is the use of effective data representations. Traditionally, data representations were hand-crafted. Recently it has been demonstrated that, given sufficient data, deep neural networks can learn effective implicit representations from simple input representations. However, for most scientific problems, the use of deep learning is not appropriate, as the amount of available data is limited and/or the output models must be explainable. Nevertheless, many scientific problems do have significant amounts of data available on related tasks, which makes them amenable to multi-task learning, i.e. learning many related problems simultaneously. Here we propose a novel and general representation learning approach for multi-task learning that works successfully with small amounts of data. The fundamental new idea is to transform an input intrinsic data representation (i.e., hand-crafted features) into an extrinsic representation based on what a pre-trained set of models predicts about the examples. This transformation has the dual advantages of producing significantly more accurate predictions and providing explainable models. To demonstrate the utility of this transformative learning approach, we have applied it to three real-world scientific problems: drug design (quantitative structure activity relationship learning), predicting human gene expression (across different tissue types and drug treatments), and meta-learning for machine learning (predicting which machine learning methods work best for a given problem). In all three problems, transformative machine learning significantly outperforms the best intrinsic representation.
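The transformation can be sketched in a few lines of hypothetical code: models pre-trained on related tasks re-describe each example by their predictions, and the target task is learned from that extrinsic representation (toy data; not the paper's experiments).

```python
# Toy sketch: intrinsic features -> extrinsic representation via predictions
# of models pre-trained on related tasks.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
X = rng.normal(size=(400, 30))                       # intrinsic features
w = rng.normal(size=30)
y_tasks = [X @ w + rng.normal(scale=s, size=400) for s in (1.0, 1.5, 2.0)]
y_target = X @ w + rng.normal(size=400)              # small-data target task

# Pre-train one model per related task on the intrinsic representation
pretrained = [Ridge().fit(X, yt) for yt in y_tasks]

# Extrinsic representation: stacked predictions of the pre-trained models
Z = np.column_stack([m.predict(X) for m in pretrained])

Xtr, Xte, Ztr, Zte, ytr, yte = train_test_split(X, Z, y_target, train_size=50,
                                                random_state=0)
print("intrinsic R^2:", Ridge().fit(Xtr, ytr).score(Xte, yte))
print("extrinsic R^2:", Ridge().fit(Ztr, ytr).score(Zte, yte))
```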
2019
The Partial Response Network.
DOI: 10.5565/ddd.uab.cat/201551
2019
MRI-MRSI-GL261 mice