
A. Woodard


DOI: 10.1145/3307681.3325400
2019
Cited 149 times
Parsl: Pervasive Parallel Programming in Python
High-level programming languages such as Python are increasingly used to provide intuitive interfaces to libraries written in lower-level languages and for assembling applications from various components. This migration towards orchestration rather than implementation, coupled with the growing need for parallel computing (e.g., due to big data and the end of Moore's law), necessitates rethinking how parallelism is expressed in programs. Here, we present Parsl, a parallel scripting library that augments Python with simple, scalable, and flexible constructs for encoding parallelism. These constructs allow Parsl to construct a dynamic dependency graph of components that it can then execute efficiently on one or many processors. Parsl is designed for scalability, with an extensible set of executors tailored to different use cases, such as low-latency, high-throughput, or extreme-scale execution. We show, via experiments on the Blue Waters supercomputer, that Parsl executors can allow Python scripts to execute components with as little as 5 ms of overhead, scale to more than 250,000 workers across more than 8,000 nodes, and process upward of 1,200 tasks per second. Other Parsl features simplify the construction and execution of composite programs by supporting elastic provisioning and scaling of infrastructure, fault-tolerant execution, and integrated wide-area data management. We show that these capabilities satisfy the needs of many-task, interactive, online, and machine learning applications in fields such as biology, cosmology, and materials science.
DOI: 10.1145/3369583.3392683
2020
Cited 102 times
funcX: A Federated Function Serving Fabric for Science
Exploding data volumes and velocities, new computational methods and platforms, and ubiquitous connectivity demand new approaches to computation in the sciences. These new approaches must enable computation to be mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), be offloaded to specialized accelerators, or run remotely where resources are available. They also require new design approaches in which monolithic applications can be decomposed into smaller components that may in turn be executed separately and on the most suitable resources. To address these needs we present funcX---a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. funcX's endpoint software can transform existing clouds, clusters, and supercomputers into function serving systems, while funcX's cloud-hosted service provides transparent, secure, and reliable function execution across a federated ecosystem of endpoints. We motivate the need for funcX with several scientific case studies, present our prototype design and implementation, show optimizations that deliver throughput in excess of 1 million functions per second, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than 130,000 concurrent workers.
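The register-then-invoke pattern at the heart of a FaaS platform like funcX can be illustrated with a minimal in-process sketch. This is not funcX's actual API: the `registry` dict and the `register`/`invoke` helpers are hypothetical stand-ins, and the serialization step merely stands in for routing a request to a remote endpoint.

```python
import json

# Hypothetical in-process "function registry" capturing the FaaS pattern:
# register a function once, then invoke it by name with serialized arguments.
registry = {}

def register(name, fn):
    registry[name] = fn
    return name  # a real platform would return an opaque function identifier

def invoke(name, *args):
    # In a federated FaaS system this request would be serialized and routed
    # to a remote endpoint; here it is round-tripped through JSON locally.
    payload = json.dumps({"function": name, "args": args})
    request = json.loads(payload)
    return registry[request["function"]](*request["args"])

register("square", lambda x: x * x)
print(invoke("square", 7))  # prints 49
```

The separation between registration (done once, centrally) and invocation (done anywhere an endpoint runs) is what lets the platform route each call to whichever resource suits it best.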
DOI: 10.1038/s41523-023-00530-5
2023
Cited 11 times
Integration of clinical features and deep learning on pathology for the prediction of breast cancer recurrence assays and risk of recurrence
Gene expression-based recurrence assays are strongly recommended to guide the use of chemotherapy in hormone receptor-positive, HER2-negative breast cancer, but such testing is expensive, can contribute to delays in care, and may not be available in low-resource settings. Here, we describe the training and independent validation of a deep learning model that predicts recurrence assay result and risk of recurrence using both digital histology and clinical risk factors. We demonstrate that this approach outperforms an established clinical nomogram (area under the receiver operating characteristic curve of 0.83 versus 0.76 in an external validation cohort, p = 0.0005) and can identify a subset of patients with excellent prognoses who may not need further genomic testing.
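The AUC figures reported in abstracts like this one can be read as a ranking probability: the chance that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. A self-contained sketch of that computation via the Mann-Whitney relation (illustrative only, not the authors' evaluation code):

```python
def auc(scores, labels):
    # AUC via the Mann-Whitney U statistic: the probability that a randomly
    # chosen positive outranks a randomly chosen negative (ties count half).
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # prints 1.0 (perfect ranking)
```

An AUC of 0.83 versus 0.76, as reported above, thus means the deep learning model ranks a recurrent case above a non-recurrent one 83% of the time, versus 76% for the nomogram.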
DOI: 10.1109/ipdps.2019.00038
2019
Cited 52 times
DLHub: Model and Data Serving for Science
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the “learning systems” needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
DOI: 10.1038/s41467-021-27079-w
2021
Cited 22 times
Whole-genome analysis of Nigerian patients with breast cancer reveals ethnic-driven somatic evolution and distinct genomic subtypes
Black women across the African diaspora experience more aggressive breast cancer with higher mortality rates than white women of European ancestry. Although inter-ethnic germline variation is known, differential somatic evolution has not been investigated in detail. Analysis of deep whole genomes of 97 breast cancers, with RNA-seq in a subset, from women in Nigeria in comparison with The Cancer Genome Atlas (n = 76) reveal a higher rate of genomic instability and increased intra-tumoral heterogeneity as well as a unique genomic subtype defined by early clonal GATA3 mutations with a 10.5-year younger age at diagnosis. We also find non-coding mutations in bona fide drivers (ZNF217 and SYPL1) and a previously unreported INDEL signature strongly associated with African ancestry proportion, underscoring the need to expand inclusion of diverse populations in biomedical research. Finally, we demonstrate that characterizing tumors for homologous recombination deficiency has significant clinical relevance in stratifying patients for potentially life-saving therapies.
DOI: 10.1016/j.jpdc.2020.08.006
2021
Cited 18 times
DLHub: Simplifying publication, discovery, and use of machine learning models in science
Machine Learning (ML) has become a critical tool enabling new methods of analysis and driving deeper understanding of phenomena across scientific disciplines. There is a growing need for “learning systems” to support various phases in the ML lifecycle. While others have focused on supporting model development, training, and inference, few have focused on the unique challenges inherent in science, such as the need to publish and share models and to serve them on a range of available computing resources. In this paper, we present the Data and Learning Hub for science (DLHub), a learning system designed to support these use cases. Specifically, DLHub enables publication of models, with descriptive metadata, persistent identifiers, and flexible access control. It packages arbitrary models into portable servable containers, and enables low-latency, distributed serving of these models on heterogeneous compute resources. We show that DLHub supports low-latency model inference comparable to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, and improved performance, by up to 95%, with batching and memoization enabled. We also show that DLHub can scale to concurrently serve models on 500 containers. Finally, we describe five case studies that highlight the use of DLHub for scientific applications.
DOI: 10.1109/tpds.2022.3208767
2022
Cited 9 times
funcX: Federated Function as a Service for Science
funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX's endpoint software can be deployed, by users or administrators, on arbitrary laptops, clouds, clusters, and supercomputers, in effect turning them into function serving systems. funcX's cloud-hosted service provides a single location for registering, sharing, and managing both functions and endpoints. It allows for transparent, secure, and reliable function execution across the federated ecosystem of endpoints, enabling users to route functions to endpoints based on specific needs. funcX uses containers (e.g., Docker, Singularity, and Shifter) to provide common execution environments across endpoints. funcX implements various container management strategies to execute functions with high performance and efficiency on diverse funcX endpoints. funcX also integrates with an in-memory data store and Globus for managing data that may span endpoints.
We motivate the need for funcX, present our prototype design and implementation, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than 130,000 concurrent workers. We show that funcX's container warming-aware routing algorithm can reduce the completion time for 3,000 functions by up to 61% compared to a randomized algorithm, and that the in-memory data store can speed up data transfers by up to 3x compared to a shared file system.
DOI: 10.1148/ryai.220299
2023
Cited 3 times
External Evaluation of a Mammography-based Deep Learning Model for Predicting Breast Cancer in an Ethnically Diverse Population
“Just Accepted” papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. Purpose: To externally evaluate a mammography-based deep learning (DL) model (Mirai) in a high-risk, racially diverse population and compare its performance with other mammographic measures. Materials and Methods: The authors retrospectively evaluated 6,435 screening mammograms acquired from 2,096 female patients (median age, 56.4 years [SD: 11.2]) enrolled in a hospital-based case-control study between 2006 and 2020. Pathologically confirmed breast cancer was the primary outcome. Mirai scores were the primary predictors. Breast density and Breast Imaging Reporting and Data System (BI-RADS) assessment categories were comparative predictors. Performance was evaluated using area under the receiver operating characteristic curve (AUC) and concordance-index analyses. Results: Mirai achieved one- and five-year AUCs of 0.71 (95% CI: 0.68, 0.74) and 0.65 (0.64, 0.67), respectively. One-year AUCs in nondense versus dense breasts were 0.72 versus 0.58 (P = .10). There was no evidence of a difference in near-term discrimination performance between BI-RADS and Mirai (one-year AUCs, 0.73 versus 0.68, P = .34). For longer-term prediction (2–5 years), Mirai outperformed BI-RADS assessment (five-year AUC, 0.63 versus 0.54; P < .001). Using only images of the unaffected breast reduced the discriminatory performance of the DL model (P < .001 at all timepoints), suggesting that its predictions are likely dependent on the detection of ipsilateral premalignant patterns.
Conclusion A mammography DL model showed good performance in a high-risk external dataset enriched for African American patients, benign breast disease, and BRCA mutation carriers, and study findings suggest that its performance is likely driven by the detection of precancerous changes. ©RSNA, 2023
DOI: 10.1007/s10552-023-01837-1
2024
Benign breast disease and breast cancer risk in African women: a case–control study
DOI: 10.1103/physrevc.84.045802
2011
Cited 19 times
First measurement of the 33Cl(p,α)30S reaction
The 30S(α,p)33Cl reaction may have a significant impact on the final elemental abundances and energy output of type I X-ray bursts, as well as influencing observables such as double-peaked luminosity profiles, because it could bypass the 30S waiting point. This reaction has been studied experimentally for the first time in inverse kinematics via the time-inverse reaction 1H(33Cl,30S)α, with a 33Cl radioactive ion beam produced at the Argonne Tandem Linac Accelerator System facility by the "in-flight" technique. This reaction was studied at three different beam energies. The experimental method used and the resulting data are discussed.
DOI: 10.1145/3332186.3332231
2019
Cited 14 times
Scalable Parallel Programming in Python with Parsl
Python is increasingly the lingua franca of scientific computing. It is used as a higher level language to wrap lower-level libraries and to compose scripts from various independent components. However, scaling and moving Python programs from laptops to supercomputers remains a challenge. Here we present Parsl, a parallel scripting library for Python. Parsl makes it straightforward for developers to implement parallelism in Python by annotating functions that can be executed asynchronously and in parallel, and to scale analyses from a laptop to thousands of nodes on a supercomputer or distributed system. We examine how Parsl is implemented, focusing on syntax and usage. We describe two scientific use cases in which Parsl's intuitive and scalable parallelism is used.
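The annotate-and-parallelize pattern described here can be sketched with the standard library alone. The decorator below mimics the spirit of Parsl's app annotations (it is an illustrative stand-in built on `concurrent.futures`, not Parsl's implementation; a real Parsl app would scale across nodes, not just local threads):

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def app(fn):
    # Decorating a function makes each call submit work to the pool and
    # return a future immediately, instead of running synchronously.
    def submit(*args, **kwargs):
        return _pool.submit(fn, *args, **kwargs)
    return submit

@app
def double(x):
    return 2 * x

futures = [double(i) for i in range(5)]   # all five tasks launch at once
results = [f.result() for f in futures]   # .result() blocks until each finishes
print(results)  # prints [0, 2, 4, 6, 8]
```

Passing one app's future as another app's input is what lets a system like Parsl infer the dependency graph and schedule independent tasks concurrently.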
2018
Cited 13 times
Parsl: Scalable parallel scripting in Python
DOI: 10.1016/j.nuclphysa.2011.10.003
2012
Cited 12 times
Breakup coupling effects on near-barrier inelastic scattering of the weakly bound 6Li projectile on a 144Sm target
Angular distributions for the inelastic scattering of the weakly bound 6Li nucleus from a 144Sm target (comprising the combined contributions of the first 2+ and 3− excited states of 144Sm) were measured at bombarding energies close to the Coulomb barrier. The experimental data were compared with expected results based on continuum discretized coupled-channel (CDCC) calculations. The results confirm that it is essential to include continuum–continuum couplings to reproduce the experimental data. The analysis demonstrates that inelastic scattering data can be a critical tool in testing full CDCC calculations involving weakly bound nuclei.
DOI: 10.1109/cloudcom.2019.00045
2019
Cited 11 times
ParaOpt: Automated Application Parameterization and Optimization for the Cloud
The variety of instance types available on cloud platforms offers enormous flexibility to match the requirements of applications with available resources. However, selecting the most suitable instance type and configuring an application to optimally execute on that instance type can be complicated and time-consuming. For example, application parallelism flags must match available cores and problem sizes must be tuned to match available memory. As the search space of application configurations can be enormous, we propose an automated approach, called ParaOpt, to automatically explore and tune application configurations on arbitrary cloud instances. ParaOpt supports arbitrary applications, enables use of custom optimization methods, and can be configured with different optimization targets such as runtime and cost. We evaluate ParaOpt by optimizing genomics, molecular dynamics, and machine learning applications with four types of optimizers. We show with as few as 15 parameterized executions of an application, representing between 1.2%-26.7% of the search space, that ParaOpt is able to identify the optimal configuration in 32.7% of experiments and a near-optimal configuration in 83.2% of cases. As a result of using near-optimal configurations, ParaOpt reduces overall execution time by up to 85.8% when compared with using the default configuration.
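The kind of configuration search ParaOpt automates can be sketched in a few lines. The parameter names and the cost function below are hypothetical stand-ins: in practice the cost of each configuration would come from a measured (parameterized) execution of the application on a cloud instance, not from a formula.

```python
import itertools

def cost(threads, chunk):
    # Toy stand-in for a measured runtime-or-cost objective: runtime falls
    # as threads increase, and a penalty kicks in when chunks exceed memory.
    return 100 / threads + (5 if chunk > 64 else 0) + chunk * 0.1

# Enumerate a small configuration space and keep the cheapest point.
space = list(itertools.product([1, 2, 4, 8], [16, 32, 64, 128]))
best = min(space, key=lambda c: cost(*c))
print(best)  # prints (8, 16)
```

ParaOpt's contribution is doing this with far fewer evaluations than exhaustive enumeration (15 sampled executions covering as little as 1.2% of the space, per the abstract), by plugging in smarter optimizers than brute-force `min`.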
DOI: 10.48550/arxiv.1908.04907
2019
Cited 10 times
Serverless Supercomputing: High Performance Function as a Service for Science
Growing data volumes and velocities are driving exciting new methods across the sciences in which data analytics and machine learning are increasingly intertwined with research. These new methods require new approaches for scientific computing in which computation is mobile, so that, for example, it can occur near data, be triggered by events (e.g., arrival of new data), or be offloaded to specialized accelerators. They also require new design approaches in which monolithic applications can be decomposed into smaller components that may in turn be executed separately and on the most efficient resources. To address these needs we propose funcX---a high-performance function-as-a-service (FaaS) platform that enables intuitive, flexible, efficient, scalable, and performant remote function execution on existing infrastructure including clouds, clusters, and supercomputers. It allows users to register and then execute Python functions without regard for the physical resource location, scheduler architecture, or virtualization technology on which the function is executed---an approach we refer to as "serverless supercomputing." We motivate the need for funcX in science, describe our prototype implementation, and demonstrate, via experiments on two supercomputers, that funcX can process millions of functions across more than 65,000 concurrent workers. We also outline five scientific scenarios in which funcX has been deployed and highlight the benefits of funcX in these scenarios.
DOI: 10.1088/1742-6596/664/6/062038
2015
Cited 6 times
CMS distributed data analysis with CRAB3
The CMS Remote Analysis Builder (CRAB) is a distributed workflow management tool which facilitates analysis tasks by isolating users from the technical details of the Grid infrastructure. Throughout LHC Run 1, CRAB has been successfully employed by an average of 350 distinct users each week executing about 200,000 jobs per day.
DOI: 10.1145/3332186.3332246
2019
Cited 6 times
Publishing and Serving Machine Learning Models with DLHub
In this paper we introduce the Data and Learning Hub for Science (DLHub). DLHub serves as a nexus for publishing, sharing, discovering, and reusing machine learning models. It provides a flexible publication platform that enables researchers to describe and deposit models by associating publication and model-specific metadata and assigning a persistent identifier for subsequent citation. DLHub also supports scalable model inference, allowing researchers to execute inference tasks using a distributed execution engine, containerized models, and Kubernetes. Here we describe DLHub and present four scientific use cases that illustrate how DLHub can be used to reliably, efficiently, and scalably integrate ML into scientific processes.
DOI: 10.1101/2022.07.07.499039
2022
Cited 3 times
Multimodal Prediction of Breast Cancer Recurrence Assays and Risk of Recurrence
Gene expression-based recurrence assays are strongly recommended to guide the use of chemotherapy in hormone receptor-positive, HER2-negative breast cancer, but such testing is expensive, can contribute to delays in care, and may not be available in low-resource settings. Here, we describe the training and independent validation of a deep learning model that predicts recurrence assay result and risk of recurrence using both digital histology and clinical risk factors. We demonstrate that this approach outperforms an established clinical nomogram (area under the receiver operating characteristic curve of 0.833 versus 0.765 in an external validation cohort, p = 0.003), and can identify a subset of patients with excellent prognoses who may not need further genomic testing.
DOI: 10.1109/cluster.2015.53
2015
Cited 4 times
Scaling Data Intensive Physics Applications to 10k Cores on Non-dedicated Clusters with Lobster
The high energy physics (HEP) community relies upon a global network of computing and data centers to analyze data produced by multiple experiments at the Large Hadron Collider (LHC). However, this global network does not satisfy all research needs. Ambitious researchers often wish to harness computing resources that are not integrated into the global network, including private clusters, commercial clouds, and other production grids. To enable these use cases, we have constructed Lobster, a system for deploying data intensive high throughput applications on non-dedicated clusters. This requires solving multiple problems related to non-dedicated resources, including work decomposition, software delivery, concurrency management, data access, data merging, and performance troubleshooting. With these techniques, we demonstrate Lobster running effectively on 10k cores, producing throughput at a level comparable with some of the largest dedicated clusters in the LHC infrastructure.
DOI: 10.1088/1742-6596/664/3/032022
2015
Cited 4 times
A case study in preserving a high energy physics application with Parrot
The reproducibility of scientific results increasingly depends upon the preservation of computational artifacts. Although preserving a computation to be used later sounds easy, it is surprisingly difficult due to the complexity of existing software and systems. Implicit dependencies, networked resources, and shifting compatibility all conspire to break applications that appear to work well. To investigate these issues, we present a case study of a complex high energy physics application. We analyze the application and attempt several methods at extracting its dependencies for the purposes of preservation. We propose one fine-grained dependency management toolkit to preserve the application and demonstrate its correctness in three different environments - the original machine, one virtual machine from the Notre Dame Cloud Platform and one virtual machine from the Amazon EC2 Platform. We report on the completeness, performance, and efficiency of each technique, and offer some guidance for future work in application preservation.
DOI: 10.1109/escience.2016.7870889
2016
Cited 4 times
Conducting reproducible research with Umbrella: Tracking, creating, and preserving execution environments
Publishing scientific results without the detailed execution environments describing how the results were collected makes it difficult or even impossible for the reader to reproduce the work. However, the configurations of the execution environments are too complex to be described easily by authors. To solve this problem, we propose a framework facilitating the conduct of reproducible research by tracking, creating, and preserving the comprehensive execution environments with Umbrella. The framework includes a lightweight, persistent and deployable execution environment specification, an execution engine which creates the specified execution environments, and an archiver which archives an execution environment into persistent storage services like Amazon S3 and Open Science Framework (OSF). The execution engine utilizes sandbox techniques like virtual machines (VMs), Linux containers and user-space tracers, to create an execution environment, and allows common dependencies like base OS images to be shared by sandboxes for different applications. We evaluate our framework by utilizing it to reproduce three scientific applications from epidemiology, scene rendering, and high energy physics. We evaluate the time and space overhead of reproducing these applications, and the effectiveness of the chosen archive unit and mounting mechanism for allowing different applications to share dependencies. Our results show that these applications can be reproduced using different sandbox techniques successfully and efficiently, even though the overhead and performance vary slightly.
DOI: 10.1109/ccgrid.2014.34
2014
Cited 3 times
Opportunistic High Energy Physics Computing in User Space with Parrot
The computing needs of high energy physics experiments like the Compact Muon Solenoid experiment at the Large Hadron Collider currently exceed the available dedicated computational resources, hence motivating a push to leverage opportunistic resources. However, access to opportunistic resources faces many obstacles, not the least of which is making available the complex software stack typically associated with such computations. This paper describes a framework constructed using existing software packages to distribute the needed software to opportunistic resources without the need for the job to have root-level privileges. Preliminary tests with this framework have demonstrated the feasibility of the approach and identified bottlenecks as well as reliability issues which must be resolved in order to make this approach viable for broad use.
DOI: 10.1051/epjconf/201921403006
2019
Cited 3 times
Improving efficiency of analysis jobs in CMS
Hundreds of physicists analyze data collected by the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider using the CMS Remote Analysis Builder and the CMS global pool to exploit the resources of the Worldwide LHC Computing Grid. Efficient use of such an extensive and expensive resource is crucial. At the same time, the CMS collaboration is committed to minimizing time to insight for every scientist, pushing for the fewest possible access restrictions to the full data sample and supporting the free choice of applications to run on the computing resources. Supporting such a variety of workflows while preserving efficient resource usage poses special challenges. In this paper we report on three complementary approaches adopted in CMS to improve the scheduling efficiency of user analysis jobs: automatic job splitting, automated run time estimates, and automated site selection for jobs.
DOI: 10.1088/1742-6596/664/3/032035
2015
Exploiting volatile opportunistic computing resources with Lobster
Analysis of high energy physics experiments using the Compact Muon Solenoid (CMS) at the Large Hadron Collider (LHC) can be limited by availability of computing resources. As a joint effort involving computer scientists and CMS physicists at Notre Dame, we have developed an opportunistic workflow management tool, Lobster, to harvest available cycles from university campus computing pools. Lobster consists of a management server, file server, and worker processes which can be submitted to any available computing resource without requiring root access.
DOI: 10.1158/1538-7445.sabcs22-p2-11-08
2023
Abstract P2-11-08: Multimodal Prediction of Breast Cancer Recurrence Assays and Risk of Recurrence
Background: Hormone receptor positive breast cancer constitutes about 70% of newly diagnosed early-stage disease in the United States, and gene-expression based recurrence assays such as Oncotype DX (ODX) are strongly recommended by guidelines to aid in treatment decisions. However, recurrence assays are costly, time-consuming, underutilized in low resource settings, and unavailable in developing countries. Deep Learning (DL) using hematoxylin and eosin (H&E) stained digital pathology has been shown to approximate gene expression patterns for multiple cancer types, and may provide a cost-effective, fast, and scalable method to predict risk of recurrence in community settings. Methods: We first developed a model for ODX using only DL on pathology, comprised of two Xception-based modules, trained on 1,039 slides from The Cancer Genome Atlas (TCGA) tessellated into 10x magnification image tiles. The first module predicts tumor likelihood, and was trained using pathologist annotations for tumor regions versus normal stroma. The second module was trained to predict ODX score, estimated from gene expression data within TCGA. Patient-level predictions were calculated by weighting the predicted recurrence score by tumor likelihood for all tiles within a slide. Separately, ODX score was predicted from clinical variables using the University of Tennessee Nomogram, which incorporates grade, progesterone receptor, size, age, and histologic subtype. Finally, we developed a combined model by fitting a logistic regression to the DL pathologic model and the clinical nomogram predictions. Performance of the clinical nomogram, pathologic, and combined models were then compared in a single-institution external validation cohort of patients diagnosed with breast cancer between 2006 and 2020, all of whom had the commercial ODX assay run.
Results: We identified 428 cases for our diverse validation cohort (69% White, 24% Black, 6% Asian, and 3% Hispanic) with mean ODX score of 18. Chemotherapy was administered to 104 (24.3%) of patients; the remaining 323 (75.4%) received endocrine therapy alone. Area under the receiver operating characteristic curve (AUROC) for prediction of high ODX score (≥ 26) of the combined model was 0.83 (95% confidence interval [CI] 0.78 – 0.89) in the validation cohort, which was higher than either the DL pathology model (AUROC 0.80, 95% CI 0.75 – 0.85, p = 0.026) or the Tennessee nomogram (AUROC 0.77, 95% CI 0.70 – 0.83, p = 0.003). Performance was similar in Black (AUROC 0.86, 95% CI 0.78 – 0.94) and White (AUROC 0.81, 95% CI 0.74 – 0.88) subgroups. The combined model was more accurate in prediction of recurrence-free interval in patients receiving endocrine therapy (hazard ratio [HR] 2.02 per standard deviation [SD], 95% CI 1.16 – 3.52, p = 0.013, Concordance [C]-index 0.75) than the clinical nomogram (HR 1.75 per SD, 95% CI 1.09 – 2.81, p = 0.021, C-index 0.68). No model was prognostic in patients receiving chemotherapy. Pathologist review of heatmaps of DL model predictions identified lymphovascular invasion, necrosis, high grade, and infiltrative borders as features contributing to model prediction of high risk. Conclusions: DL can improve on existing clinical prediction of breast cancers with low recurrence risk. This approach could improve the speed at which treatment decisions are made due to the time-consuming nature of genomic testing and simultaneously reduce the cost of care. Given the equal performance in racial subgroups, this approach has promise for application in global health settings where genomic assays are not widely available or are prohibitively expensive. Citation Format: Frederick M. Howard, James M.
Dolezal, Sara Kochanny, Galina Khramtsova, Jasmine Vickery, Andrew Srisuwananukorn, Anna Woodard, Nan Chen, Rita Nanda, Charles Perou, Olufunmilayo I. Olopade, Dezheng Huo, Alexander Pearson. Multimodal Prediction of Breast Cancer Recurrence Assays and Risk of Recurrence [abstract]. In: Proceedings of the 2022 San Antonio Breast Cancer Symposium; 2022 Dec 6-10; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2023;83(5 Suppl):Abstract nr P2-11-08.
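The patient-level aggregation described in this abstract (tile recurrence scores weighted by tumor likelihood) and the logistic-regression combination of the DL and nomogram scores can be sketched as follows. The function names, toy values, and coefficients are hypothetical illustrations, not the paper's fitted model:

```python
import numpy as np

def slide_level_score(tile_scores, tumor_likelihoods):
    """Weight each tile's predicted recurrence score by its predicted tumor
    likelihood and average over the slide, as the abstract describes for
    patient-level predictions."""
    w = np.asarray(tumor_likelihoods, dtype=float)
    s = np.asarray(tile_scores, dtype=float)
    return float((w * s).sum() / w.sum())

def combined_risk(dl_score, nomogram_score, b0=-4.0, b1=5.0, b2=3.0):
    """Logistic-regression combination of the DL pathology score and the
    clinical nomogram score. The coefficients here are placeholders; the
    paper fits them on training data."""
    z = b0 + b1 * dl_score + b2 * nomogram_score
    return 1.0 / (1.0 + np.exp(-z))
```

With these placeholder coefficients, a patient scored high by both models (e.g., `combined_risk(0.9, 0.9)`) receives a much higher combined probability than one scored low by both.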
DOI: 10.48550/arxiv.2307.11060
2023
The Changing Role of RSEs over the Lifetime of Parsl
This position paper describes the Parsl open source research software project and its various phases over seven years. It defines four types of research software engineers (RSEs) who have been important to the project in those phases; we believe this is also applicable to other research software projects.
DOI: 10.21203/rs.3.rs-3301977/v1
2023
Benign breast disease and breast cancer risk in African women: A case-control study
To examine the association between benign breast disease (BBD) and breast cancer (BC) in a heterogeneous population of African women. BC cases and matched controls were enrolled in three sub-Saharan African countries, Nigeria, Cameroon, and Uganda, between 1998 and 2018. Multivariable logistic regression was used to test the association between BBD and BC. Risk factors dually associated with BBD and BC were selected. Using a parametric mediation analysis model, we assessed whether selected BC risk factors were mediated by BBD. Of 6418 participants, 3572 (55.7%) were breast cancer cases and 360 (5.7%) self-reported BBD. Fibroadenoma (46.8%) was the most commonly reported BBD. Women with a self-reported history of BBD had greater odds of developing BC than those without (adjusted odds ratio [aOR] = 1.47, 95% CI: 1.13-1.91). Biopsy-confirmed BBD was associated with BC (aOR = 3.11, 95% CI: 1.78-5.44). BBD did not significantly mediate the effects of any of the selected BC risk factors. In this study, BBD was associated with BC and did not significantly mediate the effects of selected BC risk factors.
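For readers unfamiliar with the headline statistic, the calculation underlying results like aOR = 1.47 can be sketched for the unadjusted 2x2 case. The paper itself reports adjusted ORs from a multivariable logistic regression; the counts below are made up:

```python
from math import sqrt, log, exp

def odds_ratio_ci(bbd_cases, bbd_controls, no_bbd_cases, no_bbd_controls):
    """Crude odds ratio for BC given a history of BBD, with a Wald 95% CI
    computed on the log-odds scale from a 2x2 table. This is the unadjusted
    version only; a multivariable model adjusts for covariates."""
    or_ = (bbd_cases * no_bbd_controls) / (bbd_controls * no_bbd_cases)
    se = sqrt(1 / bbd_cases + 1 / bbd_controls
              + 1 / no_bbd_cases + 1 / no_bbd_controls)
    lo = exp(log(or_) - 1.96 * se)
    hi = exp(log(or_) + 1.96 * se)
    return or_, (lo, hi)

# Hypothetical counts: 100 cases / 50 controls with BBD,
# 100 cases / 100 controls without.
or_, (lo, hi) = odds_ratio_ci(100, 50, 100, 100)
```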
DOI: 10.18154/rwth-2015-03808
2015
Search for disappearing tracks in proton-proton collisions at √s = 8 TeV
DOI: 10.1088/1742-6596/898/5/052036
2017
Opportunistic Computing with Lobster: Lessons Learned from Scaling up to 25k Non-Dedicated Cores
We previously described Lobster, a workflow management tool for exploiting volatile opportunistic computing resources for computation in HEP. We will discuss the various challenges that have been encountered while scaling up the simultaneous CPU core utilization and the software improvements required to overcome these challenges.
2018
DLHub: Model and Data Serving for Science
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
DOI: 10.1051/epjconf/202024507046
2020
Real-time HEP analysis with funcX, a high-performance platform for function as a service
We explore how the function as a service paradigm can be used to address the computing challenges in experimental high-energy physics at CERN. As a case study, we use funcX—a high-performance function as a service platform that enables intuitive, flexible, efficient, and scalable remote function execution on existing infrastructure—to parallelize an analysis operating on columnar data to aggregate histograms of analysis products of interest in real-time. We demonstrate efficient execution of such analyses on heterogeneous resources.
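The map-then-merge pattern the paper describes (parallelize an analysis over partitions of columnar data, then aggregate histograms of the results) looks roughly like this. A standard-library thread pool stands in for funcX's remote execution, and the data and binning are invented for illustration; the funcX client API itself is not reproduced here:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

BINS = np.linspace(0.0, 1.0, 11)  # 10 equal-width bins (hypothetical observable)

def partial_histogram(chunk):
    """Histogram one partition of a column. In the paper, each such call
    would be a function invocation dispatched to a funcX endpoint."""
    counts, _ = np.histogram(chunk, bins=BINS)
    return counts

# Hypothetical columnar data split into four chunks of 1000 events each.
chunks = [np.random.default_rng(seed).uniform(0.0, 1.0, 1000) for seed in range(4)]

with ThreadPoolExecutor() as pool:  # stands in for remote, parallel execution
    partials = list(pool.map(partial_histogram, chunks))

total = np.sum(partials, axis=0)  # merge per-chunk histograms into one result
```

Because histogram merging is a simple elementwise sum, the aggregation step is order-independent, which is what makes this analysis pattern amenable to a function-as-a-service model.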
2002
>100 °C 10 Gbit/s directly modulated laser incorporating a novel semi-insulating buried heterostructure
We describe a packaged 10 Gbit/s directly modulated laser operating at > 100 °C. The laser incorporates a novel semi-insulating buried heterostructure that minimises the internal capacitance while reducing the thermal impedance.
DOI: 10.1142/s021830131001528x
2010
THE HELIOS SPECTROMETER AND THE RADIOACTIVE BEAM PROGRAM AT ARGONNE
The near-term radioactive beam capabilities of ATLAS include radioactive beams produced in flight in a gas cell, or starting in the fall of 2009, re-accelerated beams of ²⁵²Cf fission fragments provided by the new CARIBU injector. The availability of such exotic beams will allow for detailed studies of the single-particle aspects of nuclear structure in neutron-rich nuclei reaching out to the astrophysical r-process path by employing light-ion reactions in inverse kinematics. The HELIOS spectrometer is based on a new concept that is especially well suited for such studies. This concept was recently demonstrated using the reaction D(²⁸Si, p)²⁹Si with a (stable) 168 MeV ²⁸Si beam. Since then D(¹²B, p)¹³B, D(¹⁷O, p)¹⁸O, and D(¹⁵C, p)¹⁶C have been studied successfully. The combination of neutron-rich beams from CARIBU and the HELIOS spectrometer opens a fertile research area of precision studies of the single-particle strengths and collective excitations in exotic nuclei, and is likely to have applications in other reactions as well.
DOI: 10.1088/1742-6596/898/5/052035
2017
Use of DAGMan in CRAB3 to improve the splitting of CMS user jobs
CRAB3 is a workload management tool used by CMS physicists to analyze data acquired by the Compact Muon Solenoid (CMS) detector at the CERN Large Hadron Collider (LHC). Research in high energy physics often requires the analysis of large collections of files, referred to as datasets. The task is divided into jobs that are distributed among a large collection of worker nodes throughout the Worldwide LHC Computing Grid (WLCG).
DOI: 10.1088/1742-6596/898/8/082041
2017
Scaling up a CMS tier-3 site with campus resources and a 100 Gb/s network connection: what could go wrong?
The University of Notre Dame (ND) CMS group operates a modest-sized Tier-3 site suitable for local, final-stage analysis of CMS data. However, through the ND Center for Research Computing (CRC), Notre Dame researchers have opportunistic access to roughly 25k CPU cores of computing and a 100 Gb/s WAN network link. To understand the limits of what might be possible in this scenario, we undertook to use these resources for a wide range of CMS computing tasks, from user analysis through large-scale Monte Carlo production (including both detector simulation and data reconstruction). We will discuss the challenges inherent in effectively utilizing CRC resources for these tasks and the solutions deployed to overcome them.
DOI: 10.1158/1538-7445.am2022-5047
2022
Abstract 5047: Self-supervised deep learning to assess breast cancer risk
Abstract Background: Personalized breast cancer (BC) screening adjusts the imaging modality and frequency of exams according to a woman's risk of developing BC. This can lower cost and false positives by reducing unnecessary exams and has the potential to find more cancers at a curable stage. Deep learning (DL) is a class of artificial intelligence algorithms that progressively extracts higher-level representations from raw input. A critical challenge to applying DL for BC risk prediction is that images are needed from exams performed before a possible cancer diagnosis. Large longitudinal datasets with cancer labeling are relatively scarce. Recently, new self-supervised methods have been developed which do not require labeling. Instead, they learn to recognize higher-level features by comparing two augmented images and determining if they are derived from the same original image. Methods: We developed Self-supervised AI for CAncer Risk Assessment (SAICARA), a mammography-based DL model to predict BC risk. We trained SAICARA on mammograms from the Chicago Multiethnic Epidemiologic Cohort (ChiMEC). We used the momentum contrast method in pretraining to train an encoder that produces compact representations of input mammography views. We initialized the encoders with weights obtained from training on the ImageNet dataset. We continued pretraining with 223,415 chest radiographs from the CheXpert database. Finally, we used mammograms from ChiMEC without any requirements on the exam date. We used augmentations from two different mammography views to provide better positive pairs for self-supervised learning. For fine-tuning, we trained with exams from women who were known to be cancer-free with at least 100 days of follow-up, and patients diagnosed with BC at least 30 days following the exam. Optimization was performed using a negative-log likelihood loss function which was discretized by considering quantiles of the event-time distribution. 
Hyperparameters were tuned using a Bayesian optimization strategy implemented by Weights and Biases. We computed the concordance index and the area under the receiver-operating characteristic curve (AUC) at two years to evaluate the discriminating capacity of the predicted risk of BC. We evaluated our model using 10-fold cross-validation. Results: In the final phase of pretraining, we used 13,194 mammography exams from 2,835 women. For fine-tuning, we used 4,849 exams from 1,418 women who were known to be cancer-free at their last follow-up, and 1,760 exams from 744 women who had exams that were followed by a BC diagnosis. SAICARA achieved a mean concordance index of 0.62 (standard deviation, SD = 0.11) and a mean AUC of 0.61 (SD = 0.09). Conclusion: Self-supervised DL holds promise as a technique for improving the performance of image-based BC risk prediction models. Citation Format: Anna Woodard, Olasubomi J. Omoleye, Rachna Gupta, Fangyuan Zhao, Aarthi Koripelly, Ian Foster, Kyle Chard, Toshio F. Yoshimatsu, Yonglan Zheng, Dezheng Huo, Olufunmilayo I. Olopade. Self-supervised deep learning to assess breast cancer risk [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 5047.
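The discretized negative log-likelihood loss mentioned above (optimization over quantile bins of the event-time distribution) can be written out as follows. This is a generic discrete-time survival NLL, an assumption about the exact form used rather than the paper's implementation, with made-up inputs:

```python
import numpy as np

def discrete_survival_nll(hazards, bin_idx, event):
    """Mean negative log-likelihood for a discrete-time survival model.
    hazards: (n, K) array of per-interval event probabilities output by the
    model; bin_idx: interval index of each subject's event or censoring;
    event: 1 if BC was diagnosed, 0 if the subject was censored."""
    nll = 0.0
    for h, k, e in zip(hazards, bin_idx, event):
        surv = np.prod(1.0 - h[:k])               # survive intervals before k
        lik = surv * h[k] if e else surv * (1.0 - h[k])
        nll -= np.log(lik + 1e-12)                # small epsilon for stability
    return nll / len(event)
```

A model that assigns a flat hazard of 0.5 everywhere yields a per-subject loss of log 2 regardless of outcome, which is the natural baseline for this loss.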
DOI: 10.1158/1538-7445.am2022-1933
2022
Abstract 1933: Independent evaluation and validation of mammography-based breast cancer risk models in a diverse patient cohort
Abstract Imaging-based machine learning models are promising tools for breast cancer risk prediction. Validating these models across diverse cohorts is necessary to establish performance and spur clinical implementation. We conducted an independent, external validation study of Mirai, a mammography-based deep learning model, using the Chicago Multiethnic Epidemiologic Cohort (ChiMEC), comprising 1671 exams from 704 cases and 4947 exams from 1437 cancer-free controls. We preprocessed images by extracting metadata from mammograms and excluded non-screening exams. Only exams with the four standard mammographic views were included. Images were converted from DICOM to PNG format using the DCMTK library. We computed the area under the receiver-operating characteristic curve (AUC) to evaluate the model’s discriminating capacity for predicting breast cancer within 1-5 years. We analyzed the entire cohort and stratified by race and hormone-receptor (HR) status. Mirai performed well in our study, but the performance is lower than in the originally published validation of the Mirai model. The AUC for 1-year risk is 0.72 in our full cohort, higher than that for 5-year risk (0.65). The 1-year AUC is high in African Americans but decreases over time. In contrast, the model showed lower but time-consistent AUC values in White patients. Performance is slightly better for predicting HR+ compared to HR- cancers. Our results suggest that Mirai has better accuracy for predicting short-term breast cancer risk than traditional risk factor-based models, such as the Gail and Tyrer-Cuzick models. This initial evaluation revealed some performance differences by race and HR status and underscores the need for more independent validations in diverse datasets to elucidate the generalizability of image-based deep learning for breast cancer risk prediction. Table 1.
Evaluation of performance of Mirai in ChiMEC Cohort

| Subset | Case exams | Control exams | Harrell's C-index | 1-year AUC | 2-year AUC | 3-year AUC | 4-year AUC | 5-year AUC |
|---|---|---|---|---|---|---|---|---|
| Full cohort (MGH) | 588 | 25267 | .75 (.72, .78) | .84 (.80, .87) | .78 (.75, .82) | .77 (.74, .80) | .76 (.73, .79) | .76 (.73, .79) |
| Full cohort (ChiMEC) | 1656 | 4765 | .64 (.62, .66) | .72 (.68, .75) | .67 (.65, .69) | .65 (.63, .67) | .65 (.64, .67) | .65 (.64, .67) |
| African American | 829 | 2174 | .64 (.61, .67) | .78 (.74, .82) | .69 (.65, .72) | .66 (.63, .69) | .66 (.64, .69) | .66 (.63, .68) |
| White | 711 | 1808 | .62 (.59, .65) | .63 (.57, .68) | .65 (.61, .68) | .63 (.60, .66) | .63 (.61, .66) | .64 (.61, .67) |
| Hispanic | 20 | 164 | .65 (.45, .86) | .63 (.31, .96) | .74 (.51, .97) | .70 (.51, .89) | .67 (.50, .83) | .67 (.51, .83) |
| Asian and Native American | 80 | 178 | .59 (.49, .70) | .67 (.53, .81) | .62 (.52, .73) | .63 (.54, .72) | .62 (.52, .71) | .63 (.53, .72) |
| Hormone receptor positive | 1281 | 4765 | .65 (.62, .68) | .74 (.70, .78) | .68 (.66, .71) | .66 (.64, .68) | .66 (.64, .68) | .66 (.64, .68) |
| Hormone receptor negative | 300 | 4765 | .62 (.58, .67) | .68 (.61, .75) | .65 (.60, .70) | .63 (.59, .67) | .63 (.59, .67) | .64 (.60, .67) |
| HER2 positive | 139 | 4765 | .62 (.54, .71) | .74 (.61, .86) | .64 (.56, .72) | .63 (.56, .69) | .64 (.58, .69) | .64 (.58, .69) |
| HER2 negative | 1138 | 4765 | .65 (.62, .67) | .74 (.70, .78) | .68 (.65, .71) | .66 (.64, .68) | .66 (.64, .68) | .66 (.64, .68) |
| Triple negative | 207 | 4765 | .61 (.55, .67) | .64 (.54, .74) | .63 (.57, .69) | .62 (.57, .67) | .62 (.57, .66) | .62 (.58, .67) |

Citation Format: Olasubomi J. Omoleye, Anna Woodard, Fangyuan Zhao, Maksim Levental, Toshio F. Yoshimatsu, Yonglan Zheng, Olufunmilayo I. Olopade, Dezheng Huo. Independent evaluation and validation of mammography-based breast cancer risk models in a diverse patient cohort [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2022; 2022 Apr 8-13. Philadelphia (PA): AACR; Cancer Res 2022;82(12_Suppl):Abstract nr 1933.
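The horizon-based AUC evaluation reported throughout the table (1-year AUC through 5-year AUC) can be sketched like this. The censoring handling here (dropping exams censored before the horizon without a diagnosis) is a simplifying assumption, as the abstract does not spell it out:

```python
import numpy as np

def auc_within_horizon(risk, years_to_event, event, horizon):
    """AUC for the binary outcome 'cancer diagnosed within `horizon` years
    of the exam'. Uses the rank-based definition of AUC: the probability
    that a randomly chosen case exam outranks a randomly chosen control."""
    risk = np.asarray(risk, dtype=float)
    years = np.asarray(years_to_event, dtype=float)
    event = np.asarray(event, dtype=bool)
    keep = event | (years >= horizon)          # drop early-censored controls
    label = (event & (years <= horizon))[keep]  # case within the horizon?
    score = risk[keep]
    pos, neg = score[label], score[~label]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)
```

Running the same risk scores against successively longer horizons reproduces the table's pattern of a per-horizon AUC series from a single model output.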
DOI: 10.48550/arxiv.2209.11631
2022
funcX: Federated Function as a Service for Science
funcX is a distributed function as a service (FaaS) platform that enables flexible, scalable, and high performance remote function execution. Unlike centralized FaaS systems, funcX decouples the cloud-hosted management functionality from the edge-hosted execution functionality. funcX's endpoint software can be deployed, by users or administrators, on arbitrary laptops, clouds, clusters, and supercomputers, in effect turning them into function serving systems. funcX's cloud-hosted service provides a single location for registering, sharing, and managing both functions and endpoints. It allows for transparent, secure, and reliable function execution across the federated ecosystem of endpoints--enabling users to route functions to endpoints based on specific needs. funcX uses containers (e.g., Docker, Singularity, and Shifter) to provide common execution environments across endpoints. funcX implements various container management strategies to execute functions with high performance and efficiency on diverse funcX endpoints. funcX also integrates with an in-memory data store and Globus for managing data that may span endpoints. We motivate the need for funcX, present our prototype design and implementation, and demonstrate, via experiments on two supercomputers, that funcX can scale to more than 130 000 concurrent workers. We show that funcX's container warming-aware routing algorithm can reduce the completion time for 3000 functions by up to 61% compared to a randomized algorithm and the in-memory data store can speed up data transfers by up to 3x compared to a shared file system.
2018
Effective Field Theory Interpretation for Measurements of Top Quark Pair-Production in Association with a W or Z Boson
2019
Parsl: Scalable parallel scripting in python
DOI: 10.1158/1538-7445.am2020-5468
2020
Abstract 5468: Gene fusions in breast cancer in Nigerian women
Abstract Background: As a chromosomal rearrangement event, gene fusion plays a critical role in the pathogenesis of cancer by creating potentially oncogenic chimeric proteins. However, gene fusions have been understudied in breast cancer patients of African ancestry, who are often diagnosed at a younger age and with more aggressive molecular features of cancer than patients of other races/ethnicities. Methods: Ninety-six women diagnosed with invasive breast cancer were recruited from Ibadan, Nigeria with mean age at diagnosis of 51.6 ± 12.4 years. Primary tumors were collected, of which 62 (64.6%) were hormone receptor negative (HR-) by immunohistochemistry, and 31 (32.3%) were basal-like subtype by PAM50 classification. Paired-end reads from RNA-seq on these tumors were used for gene fusion detection by three programs, STAR-Fusion, STAR-SEQR, and Arriba. To increase specificity, we applied an ensembling method by selecting fusions identified by at least two of the three callers. Multiple filters were applied to fusion candidates to remove likely false positives, including fusions containing genes of mitochondrial origin, fusions consisting of pairs of paralogues or orthologs, fusions involving HLA genes, and fusions found in several databases of non-cancer tissues. To investigate potentially druggable fusion transcripts, we compared our call set with the OncoKB precision oncology knowledge base. Results: STAR-Fusion, STAR-SEQR, and Arriba identified 682, 4529, and 3056 unique fusion transcripts, respectively. Following application of the ensembling method and filtering to select final fusions, 709 unique fusions were identified. Comparison with 13 databases and published papers identified 62 of these fusions as previously reported. The mean fusion burden per sample was 7.7 ± 7.4. 
The fusion burden per sample was significantly smaller for tumors classified as Luminal A subtypes by PAM50 classification than for Luminal B, Basal, and HER2 subtypes (P < 0.03) as well as normal-like tumors (P < 0.01). Fusion burden was highest in HER2 tumors (NS), consistent with previous reports. The number of fusion transcripts per sample did not differ significantly according to the patient's age. Ninety-six percent of fusion transcripts were only identified in a single sample, including the ETV6-NTRK3 fusion, previously reported in secretory breast carcinoma, and BCR-ABL1, both of which are targeted by a drug identified in OncoKB. The most commonly observed fusion, EDDM13–ZNF71, appeared in 4.2% (4/96) of samples. Conclusion: The vast majority of breast cancer samples from Nigerian women demonstrate unique fusion transcripts in expressed RNA. Fusion burden per sample was related to PAM50 classification. Future work will incorporate functional studies of the recurrent gene fusion events identified in our cohort. We will additionally focus on validation and refinement of our approach to detecting gene fusions in this understudied population, in order to better identify patients who can benefit from new therapies. Citation Format: Anna Elizabeth Woodard, Toshio F. Yoshimatsu, WABCS Working Group, Jason J. Pitt, Yonglan Zheng, Olufunmilayo I. Olopade. Gene fusions in breast cancer in Nigerian women [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 5468.
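The ensembling step described in the Methods (keep fusions identified by at least two of the three callers) reduces to a simple vote over call sets; the fusion strings below are placeholders:

```python
from collections import Counter

def ensemble_fusions(star_fusion, star_seqr, arriba, min_callers=2):
    """Keep fusion transcripts reported by at least `min_callers` of the
    three callers. Fusions are represented here as 'GENE1--GENE2' strings;
    a real pipeline would also normalize breakpoints before voting."""
    votes = Counter()
    for callset in (set(star_fusion), set(star_seqr), set(arriba)):
        votes.update(callset)  # each caller contributes at most one vote
    return {fusion for fusion, n in votes.items() if n >= min_callers}
```

The downstream filters the abstract lists (mitochondrial genes, paralogue pairs, HLA genes, normal-tissue databases) would then be applied to this voted set.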
DOI: 10.1101/2020.10.28.359240
2020
Whole-genome analysis of Nigerian patients with breast cancer reveals ethnic-driven somatic evolution and distinct genomic subtypes
Abstract Black women of African ancestry experience more aggressive breast cancer with higher mortality rates than White women of European ancestry. Although inter-ethnic germline variation is known, differential somatic evolution has not been investigated in detail. Analysis of deep whole genomes of 97 breast tumors, with RNA-seq in a subset, from indigenous African patients in Nigeria in comparison to The Cancer Genome Atlas (n=76) revealed a higher rate of genomic instability and increased intra-tumoral heterogeneity as well as a unique genomic subtype defined by early clonal GATA3 mutations and a 10.5-year younger age at diagnosis. We also found evidence for non-coding mutations in two novel drivers (ZNF217 and SYPL1) and a novel INDEL signature strongly associated with African ancestry proportion. This comprehensive analysis of an understudied population underscores the need to incorporate diversity of genomes as a key parameter in fundamental research with potential to tailor clinical intervention and promote equity in precision oncology care.
DOI: 10.48550/arxiv.1811.11213
2018
DLHub: Model and Data Serving for Science
While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the "learning systems" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.
DOI: 10.1158/1538-7445.sabcs20-ps18-12
2021
Abstract PS18-12: Comparative analysis of differential gene expression by ancestry using primary breast cancers from Nigeria and the cancer genome atlas (TCGA)
Abstract Introduction: Breast cancers differ in genomic and transcriptomic features by ancestry within the TCGA, but current understanding of how gene expression differs across global ancestral populations is extremely limited. We hypothesized that differential expression performed by ancestry and geography may provide insight into population-specific, clinically relevant expression patterns. Objective: To compare differentially expressed protein-coding genes and pathways among primary breast tumors of Nigerian origin versus African- and European-American ancestry in TCGA. Methods: We analyzed an integrated dataset of RNA-seq from 93 women in Nigeria, 31 African-ancestry women (TCGA AA), and 39 European-ancestry women from TCGA (TCGA EA) with whole-genome data. Ancestry within TCGA was classified by principal component analysis, with African ancestry as >50% contribution and European ancestry as >90% contribution. RNA was obtained from tumors in Nigeria using Qiagen PAXgene kits. A STAR/HTSeq pipeline generated read counts. To mitigate assay-associated batch effects, we performed differential expression within each PAM50 subtype using limma-voom with quantile normalization. Significance was defined as a >1.5-fold change in gene expression (log2 scale) with a false-discovery-rate-adjusted p-value of 0.05. Pathway analysis was performed via Gene Ontology and the Web-Based Gene Set Analysis Toolkit. We also compared gene expression, claudin-low (30 genes) and VEGF (13 genes) signatures to an additional set of 189 primary breast cancers from Nigeria assayed on the NanoString nCounter System using a custom Nano110 probe set (PAM50 + claudin-low & VEGF genes). RNA for these cancers was isolated from paraffin-embedded tumor using the Roche High Pure paraffin kit. Results: Differential expression was performed pairwise across ancestry groups within PAM50 subtypes (see Table).
Fewer genes were differentially expressed, and fold changes across shared genes were smaller, in the Nigerian vs. TCGA AA comparison than in the Nigerian vs. TCGA EA comparison, supporting quantile normalization. The strongest gene ontology pathway associations, seen for all subtypes, were intracellular protein targeting and viral gene expression. The epigenetic regulation pathway was significantly associated with comparisons in Basal-like tumors (padj=1.54e-7 for TCGA EA, padj=0.001 for TCGA AA). The PI3K-Akt pathway was significantly associated with Nigerian vs. TCGA-EA within Luminal A (padj=0.006). The Nanostring cohort shared a similar distribution of PAM50 subtypes (see Table, χ² p=0.21). We found concordance in both Nigerian cohorts of relative claudin-low and VEGF expression signature patterns across subtypes. Of 17 genes with significant differential expression by ancestry in the Nanostring dataset, 9 (ADM, ACTB, BIRC5, CDC6, CENPF, MKI67, MPP1, RAD17, and VEGFA) showed significant differential expression by ancestry in the PAXgene dataset. Discussion: This is one of the first analyses of differential gene expression across tumors from a global population. We identified differential pathways in breast tumors between African and European ancestry populations to target for future work. We also validated several ancestry-specific genes across platforms with potential clinical relevance. Understanding how molecular features differ across global populations will improve precision oncology for all patients.

| PAM50 Classification | Nigerian: PAXgene (n=93) | TCGA AA (n=31) | TCGA EA (n=39) | Nigerian: Nanostring (n=189) | Nigerian (PAXgene) vs. TCGA EA comparison | Nigerian (PAXgene) vs. TCGA AA comparison |
|---|---|---|---|---|---|---|
| Basal-like | 41 (42.8%) | 23 (74.1%) | 17 (43.6%) | 78 (41.3%) | 4893 genes | 4687 genes |
| Her2-enriched | 27 (28.1%) | 0 | 5 (12.8%) | 31 (16.4%) | 961 genes | N/A |
| Luminal A | 14 (14.5%) | 4 (12.9%) | 8 (20.5%) | 39 (20.6%) | 2596 genes | 480 genes |
| Luminal B | 11 (11.4%) | 4 (12.9%) | 9 (23.1%) | 25 (13.2%) | 2112 genes | 222 genes |

Citation Format: Padma Sheila Rajagopal, Yi-Hsuan S Tsai, Ashley Hardeman, Ian Hurley, Aminah Sallam, Yonglan Zheng, Toshio Yoshimatsu, Anna Woodard, Dezheng Huo, Guimin Gao, Charles M Perou, Joel S Parker, Mengjie Chen, Olufunmilayo I Olopade. Comparative analysis of differential gene expression by ancestry using primary breast cancers from Nigeria and the cancer genome atlas (TCGA) [abstract]. In: Proceedings of the 2020 San Antonio Breast Cancer Virtual Symposium; 2020 Dec 8-11; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2021;81(4 Suppl):Abstract nr PS18-12.
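The significance rule stated in the Methods above (a >1.5-fold change on the log2 scale with an FDR-adjusted p-value of 0.05) can be reproduced generically. This is a plain Benjamini-Hochberg implementation of the thresholding step, not limma-voom, and the toy inputs are invented:

```python
import numpy as np

def significant_genes(log2fc, pvals, fc_cut=np.log2(1.5), alpha=0.05):
    """Benjamini-Hochberg FDR adjustment, then select genes with
    |log2 fold change| above log2(1.5) and adjusted p below 0.05."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)          # p * n / rank
    adj = np.empty(n)
    adj[order] = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    adj = np.minimum(adj, 1.0)
    return (np.abs(np.asarray(log2fc, dtype=float)) > fc_cut) & (adj < alpha)
```

Both conditions must hold, so a gene with a strong fold change but weak evidence (or vice versa) is excluded.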
DOI: 10.1158/1538-7445.sabcs20-pd7-06
2021
Abstract PD7-06: Unstable mutational profile and heterogeneity of residual breast tumor following neoadjuvant therapy from comprehensive genomic and transcriptomic sequencing
Abstract Background: In patients with early breast cancer, neoadjuvant therapy is widely performed as standard of care. While the molecular targeting strategy makes progress in HER2 positive breast cancer, the optimal regimens for ER+/HER2- breast cancer (BC) and triple negative breast cancer (TNBC) remain undecided. Furthermore, there are few strategies for patients who do not achieve complete pathological response (pCR) who have worse prognosis. To address these unmet clinical needs, there is urgent need to examine the genomic alterations and immune microenvironment of residual tumors after neoadjuvant therapy. Here, we conducted a comprehensive analysis integrating genomic, transcriptomic, and clinical data to investigate the difference between primary breast cancer and residual disease. Materials and Methods: Using the large prospectively ascertained ethnically diverse Chicago Multi-Ethnic (ChiMEC) cohort of 562 participants with integrated genomic data, we identified 176 patients with breast cancer who underwent sequencing with Tempus xT next-generation sequencing panel including DNA- and whole-transcriptome RNA-sequencing. These included 131 primary breast tumors and 45 residual tumors after neoadjuvant therapy. We compared mutation rates between primary tumor and residual tumor in ER+/HER2- BC and TNBC. We also investigated homologous recombination deficiency (HRD) scores, tumor mutational burden (TMB), degree of immune infiltration, and microsatellite instability (MSI), and their association with survival. Results: Out of the 176 patients, there were 72 ER+/HER2- BC, 42 TNBC, and 44 HER2 positive cases. 
Among patients with HR+/HER2- BC, residual tumors had higher mutation rates in PIK3CA (61% vs 33%), CDH1 (33% vs 17%), CCND1 (39% vs 7%), FGF3 (28% vs 0%), FGF4 (33% vs 2%), FGF19 (33% vs 2%), and GATA3 (44% vs 7%) than primary tumors, but lower rates in TP53 (28% vs 48%), MAP3K1 (6% vs 40%), KMT2D (0% vs 33%), MCL1 (0% vs 30%), SPEN (6% vs 20%), ZFHX3 (0% vs 20%), ARID1A (11% vs 16%), KMT2C (0% vs 19%), LZTR1 (0% vs 19%), BCORL1 (6% vs 17%), NOTCH3 (6% vs 17%), FAT1 (0% vs 17%), and LRP1B (0% vs 17%). Relative to primary TNBC tumors, residual TNBC tumors exhibited higher mutation rates in PTEN (10% vs 3%), CCNE1 (10% vs 3%), CIC (10% vs 0%), and KMT2D (10% vs 3%). Conversely, residual TNBC tumors had relatively lower rates of MCL1 (0% vs 21%), RB1 (0% vs 12%), CDH1 (0% vs 12%), KMT2C (11% vs 18%), PIK3CA (0% vs 15%), CDKN1B (0% vs 12%), and ETV6 (0% vs 12%). There was a significant trend of higher TMB (>5.0 mutations/megabase [m/MB]) associated with improved disease-free survival among the 176 patients (P=0.031). TMB was not significantly different between primary tumors and residual tumors, or across subtypes. Microsatellite instability (MSI) status was high in 3 patients (1.6%) and equivocal in 5 patients (2.7%). TMB in MSI-high or MSI-equivocal tumors was significantly higher than TMB in tumors with microsatellite stability (37.7 m/MB vs 5.3 m/MB, P<0.001). HRD scores were highest in TNBC, and lowest in ER+/HER2- breast cancers (P=0.022). There was no significant association between the HRD scores and survival. To date, tumors from 106 patients have undergone immune profiling, with estimation of immune cell infiltration, macrophages, B cells, CD4, CD8, and NK cells. Initial analysis showed no association between immune profiling and survival. Conclusion: These comprehensive analyses demonstrated that mutation status in HR+/HER2- breast cancers and TNBC differs between primary and residual tumors after neoadjuvant therapy.
We identified genomic alterations and pathways in residual tumors that could be further explored as potential targets in the adjuvant setting to improve long term outcomes for patients who do not achieve pCR. Citation Format: Minoru Miyashita, Masaya Hattori, Yonglan Zheng, Toshio Yoshimatsu, Joshua SK Bell, Padma Sheila Rajagopal, Anna Woodard, Jean Baptiste Reynier, Elisabeth Sveen, Galina Khramtsova, Fang Liu, Abiola Ibraheem, Gini Fleming, Nora Jaskowiak, Rita Nanda, Benjamin Leibowitz, Nike Beaubier, Kevin White, Dezheng Huo, Olufunmilayo I Olopade. Unstable mutational profile and heterogeneity of residual breast tumor following neoadjuvant therapy from comprehensive genomic and transcriptomic sequencing [abstract]. In: Proceedings of the 2020 San Antonio Breast Cancer Virtual Symposium; 2020 Dec 8-11; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2021;81(4 Suppl):Abstract nr PD7-06.
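The abstract above reports per-gene mutation-rate comparisons between primary and residual tumors but does not state which statistical test was used. A common choice for such 2x2 comparisons with small counts is Fisher's exact test; the sketch below is a minimal standard-library implementation for illustration only (the example counts are hypothetical, not taken from the study).

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]
    (e.g. mutated vs. wild-type counts in two tumor groups)."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with the table margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical example: a gene mutated in 11/18 residual vs. 15/45 primary tumors.
p = fisher_exact_two_sided(11, 7, 15, 30)
```

For the cohort sizes here (18 residual and 45 primary HR+/HER2- tumors, say), exact tests are preferable to chi-squared approximations because several cells have counts near zero.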
DOI: 10.1158/1538-7445.sabcs20-ss1-04
2021
Abstract SS1-04: Comprehensive genomic and transcriptomic profiling of molecular subtypes reveal ancestral differences in the activity of signaling pathways between patients with African and European ancestry
Abstract Background: Breast cancer demonstrates heterogeneity in biological features, and the therapeutic strategy depends on tumor subtype. African-Ancestry (AA) patients experience a disproportionately high rate of triple negative breast cancer (TNBC) and worse outcomes than European-Ancestry (EA) patients. However, the biological drivers of this disparity between ancestral populations are not deeply understood. To address this issue, we performed genomic and transcriptomic sequencing of breast tumors to compare AA and EA patients according to breast cancer molecular subtype. Materials and Methods: The study included 221 AA and 341 EA patients. Collected samples underwent the Tempus xT next-generation sequencing panel. Following DNA-panel and whole-transcriptome RNA-sequencing, we compared gene mutation rates, homologous recombination deficiency (HRD) scores, degree of immune infiltration, and tumor mutational burden (TMB) between ancestries and molecular subtypes. Additionally, differences in the activity of relevant signaling pathways were evaluated from RNA-sequencing data. Results: Relative to EA TNBC tumors, AA TNBC tumors exhibited higher mutation rates in TP53 (94% vs 86%), KMT2C (17% vs 9%), APOB (19% vs 10%), BRCA2 (11% vs 5%), EP300 (8% vs 2%), NOTCH1 (12% vs 4%), and EGFR (11% vs 4%). Conversely, AA TNBC tumors had relatively lower rates of PIK3CA (10% vs 18%), RB1 (8% vs 15%), and NF1 (5% vs 11%) mutations. Among patients with HR+/HER2- breast cancer, AA tumors had higher mutation rates in CCND1 (23% vs 10%) and FGF3 (16% vs 10%) than EA tumors, but lower rates in TP53 (32% vs 39%). HRD scores were higher in TNBC and HR-/HER2+ tumors compared with the other subtypes (P<0.001). However, there was no significant difference between the HRD scores of AA and EA tumors within the TNBC or HR-/HER2+ populations.
The highest percentage of immune infiltration was observed in HR-/HER2+ tumors (P=0.036), with no difference between the AA and EA groups. TMB did not differ across ancestries or subtypes. Although immune pathways were generally more active in TNBC compared to the other subtypes, there was no difference in pathway-specific immune activation between ancestries. The G2M and E2F pathways were significantly more active in TNBC (P<2e-16 for both), and in particular more active in AA than in EA tumors (G2M, P=0.035; E2F, P=0.037). On the other hand, the PI3K, ROS, and xenobiotic metabolism (XM) pathways were significantly less active in TNBC compared to the other subtypes (PI3K, P=2.4e-5; ROS, P=0.014; XM, P<2e-16). Furthermore, these pathways were significantly less active in AA than EA tumors across all subtypes (PI3K, P=0.026; ROS, P=0.00035; XM, P=0.00041) and within TNBC (PI3K, P=0.012; ROS, P=0.014; XM, P=0.00018). Conclusion: These data demonstrate significant differences in breast tumor heterogeneity and mutation spectrum in TNBC and HR+/HER2- breast cancers between AA and EA patients. Ancestral differences were also observed in the activity of signaling pathways relevant to TNBC. Overall, the results identify previously unexplored pathways and molecular phenotypes of aggressive disease, providing opportunities for the development of more effective, biomarker-informed treatment of breast cancer in diverse populations. Citation Format: Minoru Miyashita, Joshua SK Bell, Yonglan Zheng, Toshio Yoshimatsu, Padma Sheila Rajagopal, Anna Woodard, Jean Baptiste Reynier, Elisabeth Sveen, Galina Khramtsova, Fang Liu, Abiola Ibraheem, Gini Fleming, Nora Jaskowiak, Rita Nanda, Benjamin Leibowitz, Nike Beaubier, Kevin White, Dezheng Huo, Olufunmilayo I Olopade. Comprehensive genomic and transcriptomic profiling of molecular subtypes reveal ancestral differences in the activity of signaling pathways between patients with African and European ancestry [abstract].
In: Proceedings of the 2020 San Antonio Breast Cancer Virtual Symposium; 2020 Dec 8-11; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2021;81(4 Suppl):Abstract nr SS1-04.
DOI: 10.1145/3463478.3463486
2021
Extended Abstract
Parsl is a parallel programming library for Python that aims to make it easy to specify parallelism in programs and to realize that parallelism on arbitrary parallel and distributed computing systems. Parsl relies on developers annotating Python functions (which wrap either Python code or external applications) to indicate that these functions may be executed concurrently. Developers can then link functions together via the exchange of data. Parsl establishes a dynamic dependency graph and sends tasks for execution on connected resources as their dependencies are resolved. Parsl's runtime system enables different compute resources to be used, from laptops to supercomputers, without modification to the Parsl program.
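The futures-based dataflow model described above can be sketched with only the Python standard library. This analogy is not Parsl itself (real Parsl code uses its @python_app decorator and executor configurations); it only illustrates how independent tasks run concurrently while a dependent task consumes their results.

```python
from concurrent.futures import ThreadPoolExecutor

def double(x):
    return 2 * x

def add(x, y):
    return x + y

with ThreadPoolExecutor(max_workers=2) as pool:
    # Two independent tasks: no shared inputs, so they may run concurrently.
    f1 = pool.submit(double, 3)
    f2 = pool.submit(double, 4)
    # Here the main thread blocks on the inputs before submitting the
    # dependent task; Parsl instead accepts futures as arguments directly
    # and resolves the edges of its dependency graph on the user's behalf.
    f3 = pool.submit(add, f1.result(), f2.result())
    print(f3.result())  # 14
```

The key difference is that Parsl builds the dependency graph implicitly from data exchanged between annotated functions, so the same script can target a laptop thread pool or a supercomputer without restructuring.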
DOI: 10.5281/zenodo.4660697
2021
CoffeaTeam/coffea: Release v0.7.2