
F. Pantaleo

Papers by F. Pantaleo available on OA.mg are listed below.

DOI: 10.1016/j.nima.2022.167962
2023
Cited 4 times
A compact, light scintillating fiber tracker with SiPM readout
We present the concept of a novel compact and light tracker based on arrays of plastic scintillating fibers read out with Silicon Photomultipliers (SiPMs). The tracker will be composed of multiple planes, with the fibers in each plane oriented perpendicularly to those in the adjacent planes. Each plane will consist of two staggered layers of fibers with a round cross section of 500 μm diameter, arranged in a close-packed configuration. Scintillation photons produced in the fibers will be collected by SiPM arrays with a 250 μm strip pitch located at one end of the fibers. This configuration will ensure an accurate spatial resolution and a fast response while keeping the material budget low. Hence, this detector will be suitable for tracking low-energy particles and will be able to efficiently detect the Compton-scattered electrons produced by gamma rays with energies down to 100 keV. We built a reduced-scale tracker prototype using Hamamatsu 128-channel SiPM arrays read out by 32-channel PETIROC2A front-end ASICs. The latter are controlled by a custom data acquisition board with self-triggering capabilities. We tested this prototype with cosmic rays, radioactive sources and accelerated particle beams.
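
For a rough sense of scale, a binary strip readout with the 250 μm pitch quoted above gives a naive single-hit resolution of pitch/√12 ≈ 72 μm; this is the textbook uniform-distribution estimate, not a figure taken from the paper. A minimal Python check of that arithmetic:

import math

# Naive binary-readout resolution: hit positions are uniform across a strip,
# so the RMS error is pitch / sqrt(12). The pitch is the value quoted above;
# the estimate itself is a textbook approximation, not a result of the paper.
strip_pitch_um = 250.0
sigma_um = strip_pitch_um / math.sqrt(12.0)
print(f"naive single-hit resolution ~ {sigma_um:.0f} um")   # ~72 um
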
DOI: 10.1088/1748-0221/9/02/c02023
2014
Cited 22 times
NaNet: a flexible and configurable low-latency NIC for real-time trigger systems based on GPUs
NaNet is an FPGA-based PCIe x8 Gen2 NIC supporting 1/10 GbE links and the custom 34 Gbps APElink channel. The design has GPUDirect RDMA capabilities and features a network stack protocol offloading module, making it suitable for building low-latency, real-time GPU-based computing systems. We provide a detailed description of the NaNet modular hardware architecture. Benchmarks for latency and bandwidth over the GbE and APElink channels are presented, followed by a performance analysis of the case study of the GPU-based low-level trigger for the RICH detector in the NA62 CERN experiment, using either the NaNet GbE or APElink channel. Finally, we give an outline of future project activities.
DOI: 10.3389/fdata.2020.591315
2020
Cited 17 times
CLUE: A Fast Parallel Clustering Algorithm for High Granularity Calorimeters in High-Energy Physics
One of the challenges of high granularity calorimeters, such as that to be built to cover the endcap region in the CMS Phase-2 Upgrade for HL-LHC, is that the large number of channels causes a surge in the computing load when clustering numerous digitized energy deposits (hits) in the reconstruction stage. In this article, we propose a fast and fully parallelizable density-based clustering algorithm, optimized for high-occupancy scenarios, where the number of clusters is much larger than the average number of hits in a cluster. The algorithm uses a grid spatial index for fast querying of neighbors and its timing scales linearly with the number of hits within the range considered. We also show a comparison of the performance on CPU and GPU implementations, demonstrating the power of algorithmic parallelization in the coming era of heterogeneous computing in high-energy physics.
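
As a rough illustration of the density-based scheme described above (a grid spatial index for neighbour queries, a local density per hit, seeds chosen as density maxima, and the remaining hits attached to their nearest higher-density neighbour), here is a minimal NumPy sketch. The tile size, cut values and the simplified seed/outlier logic are illustrative assumptions, not the tuned CLUE parameters or its exact criteria.

import numpy as np
from collections import defaultdict

def clue_like(x, y, energy, dc=1.0, rho_c=2.0, delta_c=2.0):
    # Simplified CLUE-style clustering: grid spatial index, local density,
    # nearest-higher search, seed selection, follower assignment.
    n = len(x)
    tx, ty = np.floor(x / dc).astype(int), np.floor(y / dc).astype(int)
    tiles = defaultdict(list)                      # grid spatial index
    for i in range(n):
        tiles[(tx[i], ty[i])].append(i)

    def neighbours(i):
        # points in the 3x3 block of tiles around point i
        out = []
        for dxi in (-1, 0, 1):
            for dyi in (-1, 0, 1):
                out.extend(tiles.get((tx[i] + dxi, ty[i] + dyi), ()))
        return out

    rho = np.zeros(n)                              # local density within dc
    for i in range(n):
        for j in neighbours(i):
            if (x[i] - x[j]) ** 2 + (y[i] - y[j]) ** 2 <= dc * dc:
                rho[i] += energy[j]

    delta = np.full(n, np.inf)                     # distance to nearest higher-density point
    nearest_higher = np.full(n, -1)
    for i in range(n):
        for j in neighbours(i):
            if rho[j] > rho[i]:
                d = np.hypot(x[i] - x[j], y[i] - y[j])
                if d < delta[i]:
                    delta[i], nearest_higher[i] = d, j

    labels = np.full(n, -1)                        # -1 = unclustered / outlier
    next_label = 0
    for i in np.argsort(-rho):                     # descending density
        if rho[i] >= rho_c and delta[i] > delta_c:
            labels[i] = next_label                 # seed: starts a new cluster
            next_label += 1
        elif nearest_higher[i] >= 0:
            labels[i] = labels[nearest_higher[i]]  # follower: inherits its higher neighbour
    return labels

# toy usage: two dense blobs of "hits" with unit energy
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal((5.0, 5.0), 0.3, (50, 2))])
labels = clue_like(pts[:, 0], pts[:, 1], np.ones(len(pts)))
print(len(set(labels[labels >= 0])), "clusters found")
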
DOI: 10.3389/fdata.2020.601728
2020
Cited 17 times
Heterogeneous Reconstruction of Tracks and Primary Vertices With the CMS Pixel Tracker
The High-Luminosity upgrade of the Large Hadron Collider (LHC) will see the accelerator reach an instantaneous luminosity of 7 × 10³⁴ cm⁻²s⁻¹ with an average pileup of 200 proton-proton collisions. These conditions will pose an unprecedented challenge to the online and offline reconstruction software developed by the experiments. The computational complexity will exceed by far the expected increase in processing power for conventional CPUs, demanding an alternative approach. Industry and High-Performance Computing (HPC) centers are successfully using heterogeneous computing platforms to achieve higher throughput and better energy efficiency by matching each job to the most appropriate architecture. In this paper we will describe the results of a heterogeneous implementation of pixel tracks and vertices reconstruction chain on Graphics Processing Units (GPUs). The framework has been designed and developed to be integrated in the CMS reconstruction software, CMSSW. The speed up achieved by leveraging GPUs allows for more complex algorithms to be executed, obtaining better physics output and a higher throughput.
DOI: 10.1051/epjconf/202024505009
2020
Cited 14 times
Bringing heterogeneity to the CMS software framework
The advent of computing resources with co-processors, for example Graphics Processing Units (GPU) or Field-Programmable Gate Arrays (FPGA), for use cases like the CMS High-Level Trigger (HLT) or data processing at leadership-class supercomputers imposes challenges for the current data processing frameworks. These challenges include developing a model for algorithms to offload their computations onto the co-processors as well as keeping the traditional CPU busy doing other work. The CMS data processing framework, CMSSW, implements multithreading using the Intel Threading Building Blocks (TBB) library, which utilizes tasks as concurrent units of work. In this paper we will discuss a generic mechanism to interact effectively with non-CPU resources that has been implemented in CMSSW. In addition, configuring such a heterogeneous system is challenging. In CMSSW an application is configured with a configuration file written in the Python language. The algorithm types are part of the configuration. The challenge therefore is to unify the CPU and co-processor settings while allowing their implementations to be separate. We will explain how we solved these challenges while minimizing the necessary changes to the CMSSW framework. We will also discuss, with a concrete example, how algorithms can offload work to NVIDIA GPUs using the CUDA API directly.
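
Since CMSSW jobs are configured in Python, the idea of unifying CPU and co-processor settings while keeping the implementations separate can be pictured as a configuration object that carries one module per backend and resolves the best available one at run time. The sketch below is a standalone toy with invented class and parameter names; it is not the actual CMSSW configuration API.

# Standalone toy illustrating a configuration-level switch between a CPU and a
# GPU implementation of the same algorithm; class and parameter names are
# invented for illustration and are not the CMSSW API.
import ctypes

def cuda_is_available():
    # crude stand-in for a real device/runtime query
    try:
        ctypes.CDLL("libcudart.so")
        return True
    except OSError:
        return False

class Module:
    """A producer-like unit of work with a backend label and its parameters."""
    def __init__(self, backend, **params):
        self.backend, self.params = backend, params

class BackendSwitch:
    """Holds one module per backend; the framework resolves the best available one."""
    def __init__(self, **alternatives):
        self.alternatives = alternatives
    def resolve(self):
        if "gpu" in self.alternatives and cuda_is_available():
            return self.alternatives["gpu"]
        return self.alternatives["cpu"]

# Shared physics settings are written once; only backend-specific knobs differ.
common = dict(ptMin=0.9, maxDoublets=512 * 1024)
pixelTracks = BackendSwitch(
    cpu=Module("cpu", **common),
    gpu=Module("cuda", **common, threadsPerBlock=128),
)

chosen = pixelTracks.resolve()
print("pixelTracks will run on backend:", chosen.backend, "with", chosen.params)
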
DOI: 10.1016/j.nima.2022.167040
2022
Cited 7 times
A light tracker based on scintillating fibers with SiPM readout
We have developed a novel light tracker based on plastic scintillating fiber arrays read out with Silicon Photomultipliers (SiPMs). The tracker consists of multiple planes, with the fibers in each plane oriented perpendicularly to those in the adjacent plane, in order to allow 3D track reconstruction. The fibers in each plane have round cross sections, with a diameter of 500 μm, and are arranged in two staggered layers in a close-packed configuration. The fibers are read out by means of SiPM arrays with a 250 μm strip pitch placed at one of their ends. Scintillating fibers allow a reduced material budget while providing a good spatial resolution and a fast response. This design is therefore suitable for tracking low-energy particles, such as the lowest energy cosmic rays or the electrons produced in Compton scatterings of gamma rays with energies down to 100 keV. We have built a detector prototype, equipped with Hamamatsu 128-channel SiPM arrays, read out with 32-channel PETIROC2A front-end ASICs. These ASICs are controlled by a custom data acquisition system board equipped with a Xilinx Kintex-7 FPGA with self-triggering capabilities. The prototype has been tested with particle beams, cosmic rays and radioactive sources. The tracker design will be presented and the performance of the prototype will be discussed.
DOI: 10.1088/1742-6596/2438/1/012015
2023
Clustering in the Heterogeneous Reconstruction Chain of the CMS HGCAL Detector
We present an important milestone for the CMS High Granularity Calorimeter (HGCAL) event reconstruction: the deployment of the GPU clustering algorithm (CLUE) to the CMS software. The connection between GPU CLUE and the preceding GPU calibration step is thus made possible, further extending the heterogeneous chain of HGCAL’s reconstruction framework. In addition to improvements brought by CLUE’s deployment, new recursive device kernels are added to efficiently calculate the position and energy of CLUE clusters. Data conversions between GPU and CPU are included to facilitate the validation of the algorithms and increase the flexibility of the reconstruction. For the first time in HGCAL, conditions data are deployed to the GPU and made available on demand at any stage of the heterogeneous reconstruction. This is achieved via a new geometry ordering scheme in which physical and memory locations are connected. This scheme is successfully tested with the GPU CLUE version reported here, and is expected to have a broad range of applicability for future heterogeneous developments in CMS. Finally, the performance of the combined calibration and clustering algorithms on GPU is assessed and compared to its CPU counterpart.
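
The "geometry ordering scheme in which physical and memory locations are connected" can be pictured as a dense re-indexing of detector cells, so that per-cell conditions sit in one contiguous array addressed by the same arithmetic on the CPU and on the GPU. The (layer, module, cell) layout and sizes below are a toy assumption, not the HGCAL geometry or its conditions format.

import numpy as np

# Toy dense geometry ordering: a (layer, module, cell) identifier is mapped to a
# flat offset, so per-cell conditions can sit in one contiguous array that is
# copied once to the GPU and addressed with plain arithmetic in kernels.
# Dimensions are made up; this is not the HGCAL geometry.
N_LAYERS, N_MODULES, N_CELLS = 50, 100, 432

def to_dense(layer, module, cell):
    return (layer * N_MODULES + module) * N_CELLS + cell

def from_dense(i):
    layer, rest = divmod(i, N_MODULES * N_CELLS)
    module, cell = divmod(rest, N_CELLS)
    return layer, module, cell

conditions = np.zeros(N_LAYERS * N_MODULES * N_CELLS, dtype=np.float32)  # e.g. per-cell gains

i = to_dense(12, 41, 7)
assert from_dense(i) == (12, 41, 7)       # the mapping is invertible
print("cell (12, 41, 7) lives at offset", i, "of", conditions.size)
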
DOI: 10.1016/j.nima.2024.169100
2024
The k4Clue package: Empowering future collider experiments with the CLUE algorithm
High granularity calorimeters have become increasingly crucial in modern particle physics experiments, and their importance is set to grow even further in the future. The CLUstering of Energy (CLUE) algorithm has shown excellent performance in clustering calorimeter hits in the High Granularity Calorimeter (HGCAL) developed for the Phase-2 upgrade of the CMS experiment. In this paper, the suitability of the CLUE algorithm for future collider experiments has been investigated and its capabilities tested outside the HGCAL reconstruction software. To this end, a new package, k4Clue, was developed which is now fully integrated into the Gaudi software framework and supports the EDM4hep data format for inputs and outputs. The performance of CLUE was demonstrated in three detectors for future colliders: CLICdet for the CLIC accelerator, CLD for the FCC-ee collider and a second calorimeter based on Noble Liquid technology also proposed for FCC-ee. Excellent reconstruction performance was observed for single photon events, even in the presence of noise, and the results are compatible with the performance of the algorithms currently used as the baseline for shower reconstruction at future e+e− colliders. Moreover, CLUE demonstrates impressive timing capabilities, outperforming the current baseline algorithms, and this advantage remains consistent regardless of the number of input hits. This work highlights the adaptability and versatility of the CLUE algorithm for a wide range of experiments and detectors and the algorithm's potential for future high-energy physics experiments beyond CMS.
DOI: 10.1051/epjconf/202429511008
2024
Evaluating Performance Portability with the CMS Heterogeneous Pixel Reconstruction code
In the past years the landscape of tools for expressing parallel algorithms in a portable way across various compute accelerators has continued to evolve significantly. There are many technologies on the market that provide portability between CPU, GPUs from several vendors, and in some cases even FPGAs. These technologies include C++ libraries such as Alpaka and Kokkos, compiler directives such as OpenMP, the SYCL open specification that can be implemented as a library or in a compiler, and standard C++ where the compiler is solely responsible for the offloading. Given this developing landscape, users have to choose the technology that best fits their applications and constraints. For example, in the CMS experiment the experience so far in heterogeneous reconstruction algorithms suggests that the full application contains a large number of relatively short computational kernels and memory transfer operations. In this work we use a stand-alone version of the CMS heterogeneous pixel reconstruction code as a realistic use case of HEP reconstruction software that is capable of leveraging GPUs effectively. We summarize the experience of porting this code base from CUDA to Alpaka, Kokkos, SYCL, std::par, and OpenMP offloading. We compare the event processing throughput achieved by each version on NVIDIA and AMD GPUs as well as on a CPU, and compare those to what a native version of the code achieves on each platform.
DOI: 10.1051/epjconf/201921406025
2019
Cited 14 times
Large-Scale Distributed Training Applied to Generative Adversarial Networks for Calorimeter Simulation
In recent years, several studies have demonstrated the benefit of using deep learning to solve typical tasks related to high energy physics data taking and analysis. In particular, generative adversarial networks are a good candidate to supplement the simulation of the detector response in a collider environment. Training of neural network models has been made tractable with the improvement of optimization methods and the advent of GP-GPUs, well adapted to tackle the highly parallelizable task of training neural nets. Despite these advancements, training of large models over large data sets can take days to weeks. Even more so, finding the best model architecture and settings can take many expensive trials. To get the best out of this new technology, it is important to scale up the available network-training resources and, consequently, to provide tools for optimal large-scale distributed training. In this context, we describe the development of a new training workflow that scales on multi-node/multi-GPU architectures, with an eye to deployment on high-performance computing machines. We describe the integration of hyperparameter optimization with a distributed training framework using Message Passing Interface, for models defined in Keras [12] or PyTorch [13]. We present results on the speedup of training generative adversarial networks on a data set composed of the energy depositions from electrons, photons, and charged and neutral hadrons in a fine-grained digital calorimeter.
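
The distributed training framework mentioned above is based on the Message Passing Interface; its core ingredient, data-parallel workers that average their gradients at every step, can be sketched in a few lines with mpi4py and NumPy. The linear toy model and all names below are placeholders chosen to keep the gradient math self-contained; they are not the GAN or the workflow of the paper.

# Minimal data-parallel training step with MPI-averaged gradients, in the
# spirit of the multi-node workflow described above. Run with, for example:
#   mpirun -n 4 python train_sketch.py
# The linear "model" and synthetic data are placeholders, not the paper's GAN.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)   # each worker draws its own data shard
w = np.zeros(8)                          # model parameters, identical on every rank
lr = 0.1

for step in range(100):
    # local mini-batch: y = sum(x) + noise, so the true parameters are all 1
    x = rng.normal(size=(32, 8))
    y = x.sum(axis=1) + 0.01 * rng.normal(size=32)
    grad = 2.0 * x.T @ (x @ w - y) / len(y)     # gradient of the local MSE loss

    # sum gradients across workers in place, then average; every rank applies
    # the same update, so the model replicas stay synchronized
    comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
    w -= lr * grad / size

if rank == 0:
    print("final parameters (should be close to 1):", np.round(w, 2))
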
DOI: 10.1088/1742-6596/2438/1/012058
2023
Performance portability for the CMS Reconstruction with Alpaka
For CMS, heterogeneous computing is a powerful tool to face the computational challenges posed by the upgrades of the LHC, and will be used in production at the High Level Trigger during Run 3. In principle, offloading computational work to non-CPU resources while retaining their performance requires different implementations of the same code. This would introduce code duplication, which is not sustainable in terms of maintainability and testability of the software. Performance portability libraries allow code to be written once and run on different architectures with close-to-native performance. The CMS experiment is evaluating performance portability libraries for the near-term future.
DOI: 10.1088/1742-6596/331/3/032021
2011
Cited 6 times
Parallelization of maximum likelihood fits with OpenMP and CUDA
Data analyses based on maximum likelihood fits are commonly used in the high energy physics community for fitting statistical models to data samples. This technique requires the numerical minimization of the negative log-likelihood function. MINUIT is the most common package used for this purpose in the high energy physics community. The main algorithm in this package, MIGRAD, searches for the minimum by using the gradient information. The procedure requires several evaluations of the function, depending on the number of free parameters and their initial values. The whole procedure can be very CPU-time consuming in the case of complex functions, with several free parameters, many independent variables and large data samples. Therefore, it becomes particularly important to speed up the evaluation of the negative log-likelihood function. In this paper we present an algorithm and its implementation which benefits from data vectorization and parallelization (based on OpenMP) and which was also ported to Graphics Processing Units using CUDA.
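
The quantity being minimized is the negative log-likelihood, NLL(θ) = −Σᵢ ln f(xᵢ; θ), and the parallelism discussed above comes from splitting that sum over events. Below is a minimal NumPy sketch of the data-parallel evaluation for a Gaussian model; the OpenMP and CUDA versions described in the paper perform the same per-event computation and reduction on CPU threads or on the GPU, and the grid scan here is only a stand-in for MIGRAD's gradient-based search.

import numpy as np

def nll_gaussian(params, data):
    # Negative log-likelihood of a Gaussian, evaluated over all events at once.
    # The per-event terms are independent, which is what makes the sum easy to
    # split across OpenMP threads or CUDA blocks as described above.
    mu, sigma = params
    z = (data - mu) / sigma
    return np.sum(0.5 * z**2 + np.log(sigma) + 0.5 * np.log(2.0 * np.pi))

# toy data set and a crude grid scan standing in for the minimizer
rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=2.0, size=100_000)

best = min((nll_gaussian((mu, sigma), data), mu, sigma)
           for mu in np.linspace(0.5, 1.5, 11)
           for sigma in np.linspace(1.5, 2.5, 11))
print("best (NLL, mu, sigma):", best)
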
DOI: 10.1088/1742-6596/513/1/012018
2014
Cited 4 times
NaNet: a low-latency NIC enabling GPU-based, real-time low level trigger systems
We implemented the NaNet FPGA-based PCIe Gen2 GbE/APElink NIC, featuring GPUDirect RDMA capabilities and UDP protocol management offloading. NaNet is able to receive a UDP input data stream from its GbE interface and redirect it, without any intermediate buffering or CPU intervention, to the memory of a Fermi/Kepler GPU hosted on the same PCIe bus, provided that the two devices share the same upstream root complex. Synthetic benchmarks for latency and bandwidth are presented. We describe how NaNet can be employed in the prototype of the GPU-based RICH low-level trigger processor of the NA62 CERN experiment, to implement the data link between the TEL62 readout boards and the low level trigger processor. Results for the throughput and latency of the integrated system are presented and discussed.
DOI: 10.1109/ipdps.2011.296
2011
Cited 4 times
Evaluation of Likelihood Functions for Data Analysis on Graphics Processing Units
Data analysis techniques based on likelihood function calculation play a crucial role in many High Energy Physics measurements. Depending on the complexity of the models used in the analyses, with several free parameters, many independent variables, large data samples, and complex functions, the calculation of the likelihood functions can require a long CPU execution time. In the past, the continuous gain in performance for each single CPU core kept pace with the increase in the complexity of the analyses, keeping the execution time of the sequential software applications reasonable. Nowadays, the performance of single cores is not increasing as in the past, while the complexity of the analyses has grown significantly in the Large Hadron Collider era. In this context a breakthrough is represented by the increase of the number of computational cores per computational node. This makes it possible to speed up the execution of applications by redesigning them with parallelization paradigms. The likelihood function evaluation can be parallelized using data and task parallelism, which are suitable for CPUs and GPUs (Graphics Processing Units), respectively. In this paper we show how the likelihood function evaluation has been parallelized on GPUs. We describe the implemented algorithm and we give some performance results when running typical models used in High Energy Physics measurements. In our implementation we achieve a good scaling with respect to the number of events of the data samples.
2017
Cited 4 times
New Track Seeding Techniques for the CMS Experiment
DOI: 10.1088/1742-6596/1085/4/042040
2018
Cited 4 times
Convolutional Neural Network for Track Seed Filtering at the CMS High-Level Trigger
Starting with Run II, future development projects for the Large Hadron Collider will steadily increase the nominal luminosity, with the ultimate goal of reaching a peak luminosity of 5·10³⁴ cm⁻²s⁻¹ for the ATLAS and CMS experiments in the High Luminosity LHC (HL-LHC) upgrade. This rise in luminosity will directly result in an increased number of simultaneous proton collisions (pileup), up to 200, which will pose new challenges for the CMS detector and, specifically, for track reconstruction in the Silicon Pixel Tracker. One of the first steps of the track-finding workflow is the creation of track seeds, i.e. compatible pairs of hits from different detector layers, which are subsequently fed to higher-level pattern recognition steps. However, the set of compatible hit pairs is highly affected by combinatorial background, causing the next steps of the tracking algorithm to process a significant fraction of fake doublets. A possible way of reducing this effect is to take into account the shape of the hit pixel clusters when checking the compatibility between two hits. To each doublet is attached a pair of images built from the ADC levels of the pixels forming the hit clusters. The task of fake rejection can thus be seen as an image classification problem, for which Convolutional Neural Networks (CNNs) have been widely proven to provide reliable results. In this work we present our studies on the application of CNNs to the filtering of track pixel seeds. We show the results obtained for simulated events reconstructed in the CMS detector, focusing on the estimation of the efficiency and fake rejection performance of our CNN classifier.
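
A minimal Keras sketch of the kind of binary classifier described above, with the two cluster images of a doublet stacked as two channels of a single input tensor. The image size, layer sizes and the random stand-in data are illustrative guesses, not the architecture or the training set used in the paper.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

IMG = 16  # assumed cluster-image size in pixels (an illustrative choice)

# Two channels: the ADC images of the inner and outer hit of the doublet.
model = keras.Sequential([
    keras.Input(shape=(IMG, IMG, 2)),
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # P(doublet is genuine, not fake)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# random stand-in data, only to show the expected tensor shapes
x = np.random.rand(256, IMG, IMG, 2).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))
model.fit(x, y, epochs=1, batch_size=64, verbose=0)
print(model.predict(x[:4], verbose=0).ravel())
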
DOI: 10.1109/nssmic.2015.7581775
2015
Cited 3 times
Development of a phase-II track trigger based on GPUs for the CMS experiment
The High Luminosity LHC (HL-LHC) is a project to increase the luminosity of the Large Hadron Collider to 5 · 10³⁴ cm⁻²s⁻¹. The CMS experiment at CERN is planning a major upgrade in order to cope with an expected average number of overlapping collisions per bunch crossing of 140. A key element of this upgrade will be the introduction of tracker information at the very first stages of the trigger system, for which several possible hardware implementations are under study. In particular, the adoption of Graphics Processing Units in the first level of the trigger system is currently being investigated in several HEP experiments. Graphics Processing Units (GPUs) are massively parallel architectures that can be programmed using extensions to the standard C and C++ languages. In a synchronous system they have been proven to be highly reliable and to show a deterministic time response even in the presence of branch divergences. These two features make GPUs well suited to run pattern recognition algorithms on detector data in a trigger environment. Our discussion of an implementation of a track trigger system based on GPUs will include a description of the framework developed for moving data from and to multiple GPUs using GPUDirect and executing pattern recognition algorithms.
DOI: 10.1088/1742-6596/523/1/012007
2014
Cited 3 times
GPUs for real-time processing in HEP trigger systems
We describe a pilot project (GAP – GPU Application Project) for the use of GPUs (Graphics processing units) for online triggering applications in High Energy Physics experiments. Two major trends can be identified in the development of trigger and DAQ systems for particle physics experiments: the massive use of general-purpose commodity systems such as commercial multicore PC farms for data acquisition, and the reduction of trigger levels implemented in hardware, towards a fully software data selection system ("trigger-less"). The innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast computations in software, not only in high-level triggers but also in early trigger stages. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughputs, the use of such devices for real-time applications in high energy physics data acquisition and trigger systems is becoming relevant. We discuss in detail the use of online parallel computing on GPUs for synchronous low-level triggers with fixed latency. In particular we show preliminary results on a first test in the CERN NA62 experiment. The use of GPUs in high level triggers is also considered, the CERN ATLAS experiment being taken as a case study of possible applications.
DOI: 10.1088/1748-0221/15/06/c06023
2020
Cited 3 times
Reconstruction in an imaging calorimeter for HL-LHC
The CMS endcap calorimeter upgrade for the High Luminosity LHC in 2027 uses silicon sensors to achieve radiation tolerance, with the further benefit of a very high readout granularity. Small scintillator tiles with individual SiPM readout are used in regions permitted by the radiation levels. A reconstruction framework is being developed to fully exploit the granularity and other significant features of the detector like precision timing, especially in the high pileup environment of HL-LHC. An iterative clustering framework (TICL) has been put in place, and is being actively developed. The framework takes as input the clusters of energy deposited in individual calorimeter layers delivered by the CLUE algorithm, which has recently been revised and tuned. Mindful of the projected extreme pressure on computing capacity in the HL-LHC era, the algorithms are being designed with modern parallel architectures in mind. Important speedup has recently been obtained for the clustering algorithm by running it on GPUs. Machine learning techniques are being developed and integrated into the reconstruction framework. This paper will describe the approaches being considered and show first results.
DOI: 10.1109/cnna.2012.6331454
2012
Real-time use of GPUs in NA62 experiment
We describe a pilot project for the use of GPUs in a real-time triggering application in the early trigger stages at the CERN NA62 experiment, and the results of the first field tests together with a prototype data acquisition (DAQ) system. This pilot project within NA62 aims at integrating GPUs into the central L0 trigger processor, and also to use them as fast online processors for computing trigger primitives. Several TDC-equipped sub-detectors with sub-nanosecond time resolution will participate in the first-level NA62 trigger (L0), fully integrated with the data-acquisition system, to reduce the readout rate of all sub-detectors to 1 MHz, using multiplicity information asynchronously computed over time frames of a few ns, both for positive sub-detectors and for vetos. The online use of GPUs would allow the computation of more complex trigger primitives already at this first trigger level. We describe the architectures of the proposed systems, focusing on measuring the performance (both throughput and latency) of various approaches meant to solve these high energy physics problems. The challenges and the prospects of this promising idea are discussed.
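
The L0 primitive described above, a multiplicity computed asynchronously over time frames of a few nanoseconds, can be sketched as a histogramming problem over TDC hit times. The frame width and threshold below are illustrative values, not NA62 settings.

import numpy as np

def multiplicity_primitives(hit_times_ns, frame_ns=6.25, threshold=3):
    # Group TDC hit times into fixed-width time frames and report the frames
    # whose multiplicity reaches the threshold. A toy stand-in for the trigger
    # primitives described above; frame width and threshold are made up.
    frames = np.floor(np.asarray(hit_times_ns) / frame_ns).astype(np.int64)
    ids, counts = np.unique(frames, return_counts=True)
    fired = counts >= threshold
    return [(float(f) * frame_ns, int(c)) for f, c in zip(ids[fired], counts[fired])]

# toy burst: uniform background plus one genuine 5-hit coincidence near t = 1000 ns
rng = np.random.default_rng(1)
times = np.concatenate([rng.uniform(0.0, 2000.0, 40), 1000.0 + rng.uniform(0.0, 3.0, 5)])
print(multiplicity_primitives(times))   # [(frame start time in ns, multiplicity), ...]
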
DOI: 10.1088/1742-6596/513/1/012017
2014
GPUs for real-time processing in HEP trigger systems
We describe a pilot project for the use of Graphics Processing Units (GPUs) for online triggering applications in High Energy Physics (HEP) experiments. Two major trends can be identified in the development of trigger and DAQ systems for HEP experiments: the massive use of general-purpose commodity systems, such as commercial multicore PC farms, for data acquisition, and the reduction of trigger levels implemented in hardware, towards a pure software selection system (trigger-less). The very innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast computations in software at both low- and high-level trigger stages. General-purpose computing on GPUs is emerging as a new paradigm in several fields of science, although so far applications have been tailored to the specific strengths of such devices as accelerators in offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughputs, the use of such devices for real-time applications in high-energy physics data acquisition and trigger systems is becoming very attractive. We discuss in detail the use of online parallel computing on GPUs for synchronous low-level triggers with fixed latency. In particular, we show preliminary results from a first test in the NA62 experiment at CERN. The use of GPUs in high-level triggers is also considered, with the ATLAS experiment at CERN (and in particular its muon trigger) taken as a case study of possible applications.
DOI: 10.1088/1742-6596/2438/1/012096
2023
The Iterative Clustering framework for the CMS HGCAL Reconstruction
To sustain the harsher conditions of the high-luminosity LHC [1], the CMS Collaboration [2] is designing a novel endcap calorimeter system [3]. The new calorimeter will predominantly use silicon sensors to achieve sufficient radiation tolerance and will maintain highly granular information in the readout to help mitigate the effects of pileup. In regions characterized by lower radiation levels, small scintillator tiles with individual SiPM on-tile readout are employed. A unique reconstruction framework (TICL: The Iterative CLustering) is being developed within the CMS software, CMSSW, to fully exploit the granularity and other significant detector features, such as particle identification and precision timing, with a view to mitigating pileup in the very dense environment of the HL-LHC. The TICL framework has been conceived with heterogeneous computing in mind: the algorithms and their data structures are designed to be executed on GPUs. In addition, geometry-agnostic data structures have been designed to provide fast navigation and searching capabilities. Seeding capabilities (also exploiting information coming from other detectors), dynamic cluster masking, energy calibration, and particle identification are the main components of the framework. To allow for maximal flexibility, TICL allows the composition of different combinations of modules that can be chained together in an iterative fashion.
DOI: 10.2172/1973419
2023
Evaluating Performance Portability with the CMS Heterogeneous Pixel Reconstruction code
memory management can depend heavily on the application. In this paper we evaluate the performance impact of CUDA unified memory using the heterogeneous pixel reconstruction code from the CMS experiment as a realistic use case of a GPU-targeting HEP reconstruction software. We also compare the programming model using CUDA unified memory to the explicit management of separate CPU and GPU memory spaces.
DOI: 10.48550/arxiv.2311.03089
2023
The k4Clue package: Empowering Future Collider Experiments with the CLUE Algorithm
High granularity calorimeters have become increasingly crucial in modern particle physics experiments, and their importance is set to grow even further in the future. The CLUstering of Energy (CLUE) algorithm has shown excellent performance in clustering calorimeter hits in the High Granularity Calorimeter (HGCAL) developed for the Phase-2 upgrade of the CMS experiment. In this paper, we investigate the suitability of the CLUE algorithm for future collider experiments and test its capabilities outside the HGCAL reconstruction software. To this end, we developed a new package, k4Clue, which is now fully integrated into the Gaudi software framework and supports the EDM4hep data format for inputs and outputs. We demonstrate the performance of CLUE in three detectors for future colliders: CLICdet for the CLIC accelerator, CLD for the FCC-ee collider and a second calorimeter based on Noble Liquid technology also proposed for FCC-ee. We find excellent reconstruction performance for single gamma events, even in the presence of noise, with results also comparable to those of other baseline algorithms. Moreover, CLUE demonstrates impressive timing capabilities, outperforming the other algorithms independently of the number of input hits. This work highlights the adaptability and versatility of the CLUE algorithm for a wide range of experiments and detectors and the algorithm's potential for future high-energy physics experiments beyond CMS.
DOI: 10.1051/epjconf/202024505005
2020
GPU-based Clustering Algorithm for the CMS High Granularity Calorimeter
The future High Luminosity LHC (HL-LHC) is expected to deliver about 5 times higher instantaneous luminosity than the present LHC, resulting in pile-up of up to 200 interactions per bunch crossing (PU200). As part of the Phase-2 upgrade program, the CMS collaboration is developing a new endcap calorimeter system, the High Granularity Calorimeter (HGCAL), featuring highly-segmented hexagonal silicon sensors and scintillators with more than 6 million channels. For each event, the HGCAL clustering algorithm needs to group more than 10⁵ hits into clusters. As a consequence of both the high pile-up and the high granularity, the HGCAL clustering algorithm is confronted with an unprecedented computing load. CLUE (CLUsters of Energy) is a fast, fully parallelizable, density-based clustering algorithm, optimized for high pile-up scenarios in high granularity calorimeters. In this paper, we present both CPU and GPU implementations of CLUE in the application of HGCAL clustering in the CMS Software framework (CMSSW). Compared with the previous HGCAL clustering algorithm, CLUE on CPU (GPU) in CMSSW is 30x (180x) faster in processing PU200 events while outputting almost the same clustering results.
2020
Strategic R&D Programme on Technologies for Future Experiments - Annual Report 2020
DOI: 10.1109/rtc.2014.7097481
2014
The GAP project - GPU for real-time applications in high energy physics and medical imaging
The GAP project aims at the deployment of Graphics Processing Units (GPUs) in real-time applications, ranging from online event selection (trigger) in high-energy physics experiments to medical imaging reconstruction. The final goal of the project is to demonstrate that GPUs can have a positive impact in sectors that differ in rate, bandwidth, and computing intensity. The relevant aspects under study are the analysis of the total latency of the system, the optimization of the computational algorithms, and the integration with the data acquisition system. In this contribution we report on the application of GPUs for trigger selections in particle physics experiments, and for the reconstruction of medical images acquired by a nuclear magnetic resonance system. In particular, we discuss how specific trigger algorithms can be naturally parallelized and thus benefit from implementation on the GPU architecture, in terms of execution speed and complexity of the analyzed events. As benchmark applications we consider the trigger algorithms of two different particle physics experiments, NA62 and ATLAS, which represent two different combinations of event complexity and processing latency requirements. The fast and parallel execution of the trigger algorithms can improve the resolution of the calculated relevant quantities, which will improve the purity of the collected data sample. The stability of this solution with increasing complexity of the analyzed events is particularly relevant for its application in upcoming physics experiments. Most future accelerator upgrades will push the rate of data to be processed even further, hence GPUs can provide a feasible solution for maintaining sustainable trigger rates. A similar approach can be applied to medical imaging, with particular reference to NMR scan reconstruction with the kurtosis diffusion method. This recently developed technique is based on computationally very intense algorithms performed thousands of times to reconstruct image properties with good resolution. Implementing this processing on GPUs can significantly reduce the processing time, making it suitable for use in real-time diagnostics.
DOI: 10.2172/1570206
2019
CMS Patatrack Project [PowerPoint]
This talk presents the technical performance and lessons learned of the Patatrack demonstrator, where the CMS pixel local reconstruction and pixel-only track reconstruction have been ported to NVIDIA GPUs. The demonstrator is run within the CMS software framework (CMSSW), and the model of integrating CUDA algorithms into CMSSW is discussed as well.
DOI: 10.2172/1630717
2019
Bringing heterogeneity to the CMS software framework [Slides]
Co-processors or accelerators like GPUs and FPGAs are becoming more and more popular. CMS’ data processing framework (CMSSW) implements multi-threading using Intel TBB, which utilizes tasks as concurrent units of work. We have developed generic mechanisms within the CMSSW framework to interact effectively with non-CPU resources and configure CPU and non-CPU algorithms in a unified way. As a first step to gain experience, we have explored mechanisms for how algorithms could offload work to NVIDIA GPUs with CUDA.
DOI: 10.48550/arxiv.2008.13461
2020
Heterogeneous reconstruction of tracks and primary vertices with the CMS pixel tracker
The High-Luminosity upgrade of the LHC will see the accelerator reach an instantaneous luminosity of $7\times 10^{34} cm^{-2}s^{-1}$ with an average pileup of $200$ proton-proton collisions. These conditions will pose an unprecedented challenge to the online and offline reconstruction software developed by the experiments. The computational complexity will exceed by far the expected increase in processing power for conventional CPUs, demanding an alternative approach. Industry and High-Performance Computing (HPC) centres are successfully using heterogeneous computing platforms to achieve higher throughput and better energy efficiency by matching each job to the most appropriate architecture. In this paper we will describe the results of a heterogeneous implementation of pixel tracks and vertices reconstruction chain on Graphics Processing Units (GPUs). The framework has been designed and developed to be integrated in the CMS reconstruction software, CMSSW. The speed up achieved by leveraging GPUs allows for more complex algorithms to be executed, obtaining better physics output and a higher throughput.
DOI: 10.1051/epjconf/202125104017
2021
Heterogeneous techniques for rescaling energy deposits in the CMS Phase-2 endcap calorimeter
We present the porting to heterogeneous architectures of the algorithm used for applying linear transformations to raw energy deposits in the CMS High Granularity Calorimeter (HGCAL). This is the first heterogeneous algorithm to be fully integrated with HGCAL’s reconstruction chain. After introducing the latter and giving a brief description of the structural components of HGCAL relevant for this work, the role of the linear transformations in the calibration is reviewed. The many ways in which parallelization is achieved are described, and the successful validation of the heterogeneous algorithm is covered. Detailed performance measurements are presented, including throughput and execution time for both CPU and GPU algorithms, therefore establishing the corresponding speedup. We finally discuss the interplay between this work and the porting of other algorithms in the existing reconstruction chain, as well as the integration of algorithms that were ported previously but not yet integrated.
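
The linear transformations in question amount to one affine rescaling per hit with per-channel constants, which makes the step embarrassingly parallel over hits. A minimal NumPy sketch with invented constants (these are not HGCAL calibration values or the actual data layout):

import numpy as np

# Toy per-channel linear rescaling of raw energy deposits: every hit is
# transformed independently, which is why the step maps cleanly onto a GPU.
rng = np.random.default_rng(0)
n_hits, n_channels = 1_000_000, 4096

raw_adc = rng.integers(0, 1024, n_hits).astype(np.float32)   # raw deposits
channel = rng.integers(0, n_channels, n_hits)                # channel of each hit

pedestal = rng.normal(30.0, 2.0, n_channels).astype(np.float32)   # per-channel offset
gain = rng.normal(0.05, 0.002, n_channels).astype(np.float32)     # per-channel slope

energy = gain[channel] * (raw_adc - pedestal[channel])   # one affine transform per hit
print(energy[:5])
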
DOI: 10.3204/desy-proc-2014-05/42
2015
Fast algorithm for real-time rings reconstruction
DOI: 10.3204/desy-proc-2014-05/15
2015
GPUs for the realtime low-level trigger of the NA62 experiment at CERN
A pilot project for the use of GPUs (Graphics processing units) in online triggering applications for high energy physics experiments (HEP) is presented. GPUs offer a highly parallel architecture, with most of the chip resources devoted to computation. Moreover, they make it possible to achieve large computing power using a limited amount of space and power. The application of online parallel computing on GPUs is shown for the synchronous low-level trigger of the NA62 experiment at CERN. Direct GPU communication using an FPGA-based board has been exploited to reduce the data transmission latency, and results of a first field test at CERN are highlighted. This work is part of a wider project named GAP (GPU Application Project), intended to study the use of GPUs in real-time applications in both HEP and medical imaging.
2014
Development of a parallel trigger framework for rare decay searches
DOI: 10.1109/nssmic.2013.6829757
2013
The GAP project - GPU for realtime applications in high energy physics and medical imaging
We describe a pilot project for the use of GPUs (Graphics Processing Units) in online triggering applications for high energy physics experiments. Two major trends can be identified in the development of trigger and DAQ systems for particle physics experiments: the massive use of general-purpose commodity systems for data acquisition, such as commercial multicore PC farms, and the reduction of trigger levels implemented in hardware, aimed at a pure software selection system (trigger-less). The very innovative approach presented here aims at exploiting the parallel computing power of commercial GPUs to perform fast software-based computations both in early trigger stages and in high level triggers. General-purpose computing on GPUs is emerging as a new paradigm in several scientific fields. So far, however, GPU applications have only been tailored in order to accelerate offline computation. With the steady reduction of GPU latencies, and the increase in link and memory throughput, such devices have become mature for use in real-time applications in high energy physics data acquisition and trigger systems. We will discuss in detail the use of online parallel computing on GPUs for synchronous low level fixed-latency triggers. We will discuss the preliminary results of a first field test within the NA62 experiment at CERN. The use of GPUs in high level triggers will be also discussed. The ATLAS experiment at CERN, and in particular its muon trigger, will be taken as a case study for possible applications.
DOI: 10.5281/zenodo.1034110
2017
Connecting the dots
DOI: 10.2172/1623357
2019
Bringing heterogeneity to the CMS software framework [Slides]
Co-processors or accelerators like GPUs and FPGAs are becoming more and more popular. CMS' data processing framework (CMSSW) implements multi-threading using Intel TBB, which utilizes tasks as concurrent units of work. We have developed generic mechanisms within the CMSSW framework to interact effectively with non-CPU resources and configure CPU and non-CPU algorithms in a unified way. As a first step to gain experience, we have explored mechanisms for how algorithms could offload work to NVIDIA GPUs with CUDA.
DOI: 10.48550/arxiv.2001.09761
2020
CLUE: A Fast Parallel Clustering Algorithm for High Granularity Calorimeters in High Energy Physics
One of the challenges of high granularity calorimeters, such as that to be built to cover the endcap region in the CMS Phase-2 Upgrade for HL-LHC, is that the large number of channels causes a surge in the computing load when clustering numerous digitised energy deposits (hits) in the reconstruction stage. In this article, we propose a fast and fully-parallelizable density-based clustering algorithm, optimized for high occupancy scenarios, where the number of clusters is much larger than the average number of hits in a cluster. The algorithm uses a grid spatial index for fast querying of neighbours and its timing scales linearly with the number of hits within the range considered. We also show a comparison of the performance on CPU and GPU implementations, demonstrating the power of algorithmic parallelization in the coming era of heterogeneous computing in high energy physics.
DOI: 10.5281/zenodo.6618944
2018
TrickTrack: An experiment-independent, cellular-automaton-based track seeding library