René Caspart

Here are all the papers by René Caspart that you can download and read on OA.mg.

DOI: 10.1088/1742-6596/2438/1/012039

Extending the distributed computing infrastructure of the CMS experiment with HPC resources

Particle accelerators are an important tool for studying the fundamental properties of elementary particles. Currently the highest-energy accelerator is the LHC at CERN in Geneva, Switzerland. Each of its four major detectors, such as the CMS detector, produces dozens of petabytes of data per year to be analyzed by a large international collaboration. The processing is carried out on the Worldwide LHC Computing Grid, which spans more than 170 compute centers around the world and is used by a number of particle physics experiments. Recently the LHC experiments were encouraged to make increasing use of HPC resources. While Grid resources are homogeneous with respect to the Grid middleware used, HPC installations can differ greatly in their setup. Integrating HPC resources into the highly automated processing setups of the CMS experiment requires addressing a number of challenges. Processing requires access to primary data and metadata as well as to the software. At Grid sites all of this is achieved via a number of services provided by each center. At HPC sites, however, many of these capabilities cannot easily be provided and have to be enabled in user space or by other means. HPC centers also often restrict network access to remote services, which is a further severe limitation. The paper discusses a number of solutions and recent experiences of the CMS experiment in including HPC resources in processing campaigns.

DOI: 10.48550/arxiv.2212.01698

Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads

With the rise of AI in recent years and the increase in complexity of the models, the growing demand for computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly potent accelerators and the use of large compute clusters. However, the gain in prediction accuracy from large models trained on distributed and accelerated systems comes at the price of a substantial increase in energy demand, and researchers have started questioning the environmental friendliness of such AI methods at scale. Consequently, energy efficiency plays an important role for AI model developers and infrastructure operators alike. The energy consumption of AI workloads depends on the model implementation and the utilized hardware. Therefore, accurate measurements of the power draw of AI workflows on different types of compute nodes are key to algorithmic improvements and the design of future compute clusters and hardware. To this end, we present measurements of the energy consumption of two typical applications of deep learning models on different types of compute nodes. Our results indicate that 1. deriving energy consumption directly from runtime is not accurate; instead, the consumption of the compute node needs to be considered with regard to its composition; 2. neglecting accelerator hardware on mixed nodes results in disproportionate inefficiency in energy consumption; 3. energy consumption of model training and inference should be considered separately: while training on GPUs outperforms all other node types in both runtime and energy consumption, inference on CPU nodes can be comparably efficient. One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer, enabling an easy transfer to other workloads alongside a rise in user awareness of energy consumption.

DOI: 10.1051/epjconf/201921404007

Advancing throughput of HEP analysis work-flows using caching concepts

High throughput and short turnaround cycles are core requirements for efficient processing of data-intensive end-user analyses in High Energy Physics (HEP). Together with the tremendously increasing amount of data to be processed, this leads to enormous challenges for HEP storage systems, networks and the data distribution to computing resources for end-user analyses. Bringing data close to the computing resource is a very promising approach to overcoming throughput limitations and improving the overall performance. However, achieving data locality by placing multiple conventional caches inside a distributed computing infrastructure leads to redundant data placement and inefficient usage of the limited cache volume. The solution is a coordinated placement of critical data on computing resources, which enables matching each process of an analysis workflow to its most suitable worker node in terms of data locality and thus reduces the overall processing time. This coordinated distributed caching concept was realized at KIT by developing the coordination service NaviX, which connects an XRootD cache proxy infrastructure with an HTCondor batch system. We give an overview of the coordinated distributed caching concept and of experiences collected on a prototype system based on NaviX.
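The idea of matching each process to its most suitable worker node in terms of data locality can be illustrated with a minimal sketch. This is not the NaviX implementation: the function `best_node`, the node names and the file names are hypothetical, and locality is assumed here to be simply the fraction of a job's input files that a node already holds in its cache.

```python
from typing import Dict, Set


def best_node(job_inputs: Set[str], node_caches: Dict[str, Set[str]]) -> str:
    """Return the node that already caches the largest share of the job's inputs.

    Hypothetical scoring: locality = cached input files / all input files.
    """
    def locality(node: str) -> float:
        if not job_inputs:
            return 0.0
        return len(job_inputs & node_caches[node]) / len(job_inputs)

    # sorted() makes ties deterministic: the alphabetically first node wins.
    return max(sorted(node_caches), key=locality)


# Hypothetical cache contents of three worker nodes.
node_caches = {
    "worker-a": {"f1.root"},
    "worker-b": {"f1.root", "f2.root"},
    "worker-c": set(),
}
job = {"f1.root", "f2.root", "f3.root"}
chosen = best_node(job, node_caches)
```

In this example `worker-b` is chosen because it holds two of the three input files. A real coordinator would presumably also weigh file sizes, free job slots and cache eviction, none of which is modeled here.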

DOI: 10.48550/arxiv.2207.03394

MuRiT: Efficient Computation of Pathwise Persistence Barcodes in Multi-Filtered Flag Complexes via Vietoris-Rips Transformations

Multi-parameter persistent homology naturally arises in applications of persistent topology to data that come with extra information depending on additional parameters, such as time series data. We introduce the concept of a Vietoris-Rips transformation, a method that reduces the computation of the one-parameter persistent homology of pathwise subcomplexes in multi-filtered flag complexes to the computation of the Vietoris-Rips persistent homology of certain semimetric spaces. The corresponding pathwise persistence barcodes track persistence features of the ambient multi-filtered complex and can in particular be used to recover the rank invariant in multi-parameter persistent homology. We present MuRiT, a scalable algorithm that computes the pathwise persistence barcodes of multi-filtered flag complexes by means of Vietoris-Rips transformations. Moreover, we provide an efficient software implementation of the MuRiT algorithm, which relies on Ripser for the actual computation of Vietoris-Rips persistence barcodes. To demonstrate the applicability of MuRiT to real-world datasets, we establish MuRiT as part of our CoVtRec pipeline for the surveillance of the convergent evolution of the coronavirus SARS-CoV-2 in the current COVID-19 pandemic.

DOI: 10.1007/978-3-031-23220-6_8

Precise Energy Consumption Measurements of Heterogeneous Artificial Intelligence Workloads

With the rise of artificial intelligence (AI) in recent years and the subsequent increase in complexity of the applied models, the growing demand for computational resources is starting to pose a significant challenge. The need for higher compute power is being met with increasingly potent accelerator hardware as well as the use of large and powerful compute clusters. However, the gain in prediction accuracy from large models trained on distributed and accelerated systems ultimately comes at the price of a substantial increase in energy demand, and researchers have started questioning the environmental friendliness of such AI methods at scale. Consequently, awareness of energy efficiency plays an important role for AI model developers and hardware infrastructure operators alike. The energy consumption of AI workloads depends both on the model implementation and the composition of the utilized hardware. Therefore, accurate measurements of the power draw of the respective AI workflows on different types of compute nodes are key to algorithmic improvements and the design of future compute clusters and hardware. To this end, we present measurements of the energy consumption of two typical applications of deep learning models on different types of heterogeneous compute nodes. Our results indicate that 1. contrary to common approaches, deriving energy consumption directly from runtime is not accurate; instead, the consumption of the compute node needs to be considered with regard to its composition; 2. neglecting accelerator hardware on mixed nodes results in disproportionate inefficiency in energy consumption; 3. energy consumption of model training and inference should be considered separately: while training on GPUs outperforms all other node types in both runtime and energy consumption, inference on CPU nodes can be comparably efficient. One advantage of our approach is that the information on energy consumption is available to all users of the supercomputer and not just those with administrator rights, enabling an easy transfer to other workloads alongside a rise in user awareness of energy consumption.
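The first finding, that energy cannot be derived from runtime alone, can be made concrete with a small sketch. It is an illustration only and not the measurement method of the paper: the power samples, the 300 W nominal figure and both function names are made-up assumptions. Energy is obtained by integrating sampled power over time with the trapezoidal rule and compared with a naive runtime-times-nominal-power estimate.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def runtime_estimate_joules(runtime_s: float, nominal_power_w: float) -> float:
    """Naive estimate that assumes a constant power draw for the whole run."""
    return runtime_s * nominal_power_w


# Hypothetical samples: the node ramps up, runs at load, then ramps down.
timestamps = [0.0, 10.0, 20.0, 30.0]
power = [100.0, 300.0, 300.0, 100.0]

measured = energy_joules(timestamps, power)      # 7000.0 J
estimated = runtime_estimate_joules(30.0, 300.0)  # 9000.0 J
```

With these invented numbers the runtime-based estimate overshoots the integrated value by about 29 percent, which shows why the actual power draw of the node matters.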

DOI: 10.1051/epjconf/202024507007

Setup and commissioning of a high-throughput analysis cluster

Current and future end-user analyses and workflows in High Energy Physics demand the processing of growing amounts of data. This plays a major role when looking at the demands in the context of the High-Luminosity LHC. To keep the processing time and turnaround cycles as low as possible, analysis clusters optimized for these demands can be used. Since hyper-converged servers offer a good combination of compute power and local storage, they form an ideal basis for these clusters. In this contribution we report on the setup and commissioning of a dedicated analysis cluster at Karlsruhe Institute of Technology. This cluster was designed for use cases demanding high data throughput. Based on hyper-converged servers, this cluster offers 500 job slots and 1 PB of local storage. Combined with the 100 Gb network connection between the servers and a 200 Gb uplink to the Tier-1 storage, the cluster can sustain a data throughput of 1 PB per day. In addition, the local storage provided by the hyper-converged worker nodes can be used as cache space. This allows caching approaches to be employed on the cluster, thereby enabling a more efficient usage of the disk space. In previous contributions this concept has been shown to lead to an expected speedup of a factor of 2 to 4 compared to conventional setups.
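As a rough plausibility check of the quoted figures, the following sketch converts between daily data volume and sustained link rate. It assumes that "Gb" means gigabit per second and that 1 PB is 10^15 bytes, and it ignores protocol overhead; the function names are chosen here for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_gbit_per_s(petabytes_per_day: float) -> float:
    """Average rate in Gbit/s needed to move the given volume in one day."""
    bits_per_day = petabytes_per_day * 1e15 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def daily_volume_pb(link_gbit_per_s: float) -> float:
    """Volume in PB that a fully used link of the given rate moves per day."""
    return link_gbit_per_s * 1e9 * SECONDS_PER_DAY / 8 / 1e15


required = sustained_rate_gbit_per_s(1.0)  # about 92.6 Gbit/s
uplink_capacity = daily_volume_pb(200.0)   # about 2.16 PB per day
```

Under these assumptions, 1 PB per day corresponds to an average of roughly 93 Gbit/s, so a 200 Gbit/s uplink would be a little under half utilized.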

DOI: 10.1051/epjconf/202125102059

Opportunistic transparent extension of a WLCG Tier 2 center using HPC resources

Computing resource needs are expected to increase drastically in the future. The HEP experiments ATLAS and CMS foresee an increase by a factor of 5 to 10 in the volume of recorded data in the upcoming years. The current infrastructure, namely the WLCG, is not sufficient to meet the demands in terms of computing and storage resources. The usage of non-HEP-specific resources is one way to reduce this shortage. However, using them comes at a cost. First, with multiple such resources at hand, access becomes increasingly difficult for the individual user, as each resource normally requires its own authentication and has its own access method. Second, as they are not specifically designed for HEP workflows, they might lack dedicated software or other necessary services. Allocating the resources at the different providers can be done by COBalD/TARDIS, developed at KIT. The resource manager integrates resources on demand into one overlay batch system, providing the user with a single point of entry. The software and services needed for the community's workflows are transparently served through containers. With this, an HPC cluster at RWTH Aachen University is dynamically and transparently integrated into a Tier 2 WLCG resource, virtually doubling its computing capacity.

DOI: 10.5445/ir/1000076416

Confining the Higgs sector via the investigation of di-tau final states with LHC Run II data

Extensions of the Higgs sector of the Standard Model, such as models with two Higgs doublets, for example the Minimal Supersymmetric Standard Model, predict additional Higgs bosons. In these extensions, the coupling of the additional Higgs bosons to down-type fermions, such as $\tau$ leptons, is enhanced in a large part of the phase space. Consequently, the decay of these Higgs bosons into pairs of $\tau$ leptons is one of the most promising channels for the search for new physics. In this thesis, this decay channel is investigated using data recorded with the CMS experiment in 2016 at a center-of-mass energy of $\sqrt{s}=13\,\text{TeV}$. The analysis methods, selection criteria and the resulting uncertainty model are presented. No evidence for additional Higgs bosons was found. Corresponding exclusion limits are set on the product of cross section and branching fraction for additional Higgs bosons. These exclusion limits constrain the phase space for possible deviations from the Higgs sector of the Standard Model. Exclusion limits for exemplary scenarios are determined, and possibilities for interpreting the results of this analysis in the context of further scenarios are discussed.

Extension of searches for additional MSSM Higgs boson with the CMS experiment towards the NMSSM

Next-to-leading order reweighting method for simulated processes of gluon fusion Higgs boson production

DOI: 10.1051/epjconf/201921403027

Modeling and Simulation of Load Balancing Strategies for Computing in High Energy Physics

The amount of data to be processed by experiments in high energy physics (HEP) will increase tremendously in the coming years. To cope with this increasing load, the most efficient usage of the resources is mandatory. Furthermore, the computing resources for user jobs in HEP will be increasingly distributed and heterogeneous, resulting in more difficult scheduling due to the increasing complexity of the system. We aim to create a simulation for the WLCG that helps the HEP community to solve both challenges: a more efficient utilization of the grid and coping with the rising complexity of the system. Currently no simulation exists that helps the operators of the grid to make the correct decisions while optimizing the load balancing strategy. This paper presents a proof of concept in which the computing jobs at the Tier 1 center GridKa are modeled and simulated. To model the computing jobs we extended the Palladio simulator with a mechanism to simulate load balancing strategies. Furthermore, we implemented an automated model parameter analysis and model creation. Finally, the simulation results are validated using real-world performance data. Our results suggest that simulating larger parts of the grid is feasible and can help to optimize the utilization of the grid.

Dynamic Computing Resource Extension Using COBalD/TARDIS

DOI: 10.1051/epjconf/202125102039

Transparent Integration of Opportunistic Resources into the WLCG Compute Infrastructure

The inclusion of opportunistic resources, for example from High Performance Computing (HPC) centers or cloud providers, is an important contribution to bridging the gap between existing resources and the future needs of the LHC collaborations, especially for the HL-LHC era. However, the integration of these resources poses new challenges and often needs to happen in a highly dynamic manner. To enable an effective and lightweight integration of these resources, the tools COBalD and TARDIS are being developed at KIT. In this contribution we report on the infrastructure we use to dynamically offer opportunistic resources to collaborations in the Worldwide LHC Computing Grid (WLCG). The core components are COBalD/TARDIS, HTCondor, CVMFS and modern virtualization technology. The challenging task of managing the opportunistic resources is performed by COBalD/TARDIS. We showcase the challenges, employed solutions and experiences gained with the provisioning of opportunistic resources from several resource providers, such as university clusters, HPC centers and cloud setups, in a multi-VO environment. This work can serve as a blueprint for approaching the provisioning of resources from other resource providers.