
Joosep Pata

DOI: 10.1140/epjc/s10052-021-09158-w
2021
Cited 46 times
MLPF: efficient machine-learned particle-flow reconstruction using graph neural networks
In general-purpose particle detectors, the particle-flow algorithm may be used to reconstruct a comprehensive particle-level view of the event by combining information from the calorimeters and the trackers, significantly improving the detector resolution for jets and the missing transverse momentum. In view of the planned high-luminosity upgrade of the CERN Large Hadron Collider (LHC), it is necessary to revisit existing reconstruction algorithms and ensure that both the physics and computational performance are sufficient in an environment with many simultaneous proton-proton interactions (pileup). Machine learning may offer a prospect for computationally efficient event reconstruction that is well-suited to heterogeneous computing platforms, while significantly improving the reconstruction quality over rule-based algorithms for granular detectors. We introduce MLPF, a novel, end-to-end trainable, machine-learned particle-flow algorithm based on parallelizable, computationally efficient, and scalable graph neural networks optimized using a multi-task objective on simulated events. We report the physics and computational performance of the MLPF algorithm on a Monte Carlo dataset of top quark-antiquark pairs produced in proton-proton collisions in conditions similar to those expected for the high-luminosity LHC. The MLPF algorithm improves the physics response with respect to a rule-based benchmark algorithm and demonstrates computationally scalable particle-flow reconstruction in a high-pileup environment.
DOI: 10.1088/1475-7516/2014/03/053
2014
Cited 64 times
PPPC 4 DMν: a Poor Particle Physicist Cookbook for Neutrinos from Dark Matter annihilations in the Sun
We provide ingredients and recipes for computing neutrino signals of TeV-scale Dark Matter annihilations in the Sun. For each annihilation channel and DM mass we present the energy spectra of neutrinos at production, including: state-of-the-art energy losses of primary particles in solar matter, secondary neutrinos, electroweak radiation. We then present the spectra after propagation to the Earth, including (vacuum and matter) flavor oscillations and interactions in solar matter. We also provide a numerical computation of the capture rate of DM particles in the Sun. These results are available in numerical form.
DOI: 10.48550/arxiv.2003.11603
2020
Cited 39 times
Graph Neural Networks for Particle Reconstruction in High Energy Physics detectors
Pattern recognition problems in high energy physics are notably different from traditional machine learning applications in computer vision. Reconstruction algorithms identify and measure the kinematic properties of particles produced in high energy collisions and recorded with complex detector systems. Two critical applications are the reconstruction of charged particle trajectories in tracking detectors and the reconstruction of particle showers in calorimeters. These two problems have unique challenges and characteristics, but both have high dimensionality, high degree of sparsity, and complex geometric layouts. Graph Neural Networks (GNNs) are a relatively new class of deep learning architectures which can deal with such data effectively, allowing scientists to incorporate domain knowledge in a graph structure and learn powerful representations leveraging that structure to identify patterns of interest. In this work we demonstrate the applicability of GNNs to these two diverse particle reconstruction problems.
DOI: 10.5281/zenodo.10567397
2024
MLPF results on the simulated CLIC dataset
Updates over the previous version:
- updated validation outputs for the cluster-based model
- fixed a bug with how the PF candidates were stored
- added single particle gun samples to validation
- added new timing runs for the baseline algo, including memory information
- ran the GNN model with up to ~10k inputs
- added hypertuning summary tables

Trained models and evaluation results for the upcoming paper "Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors", https://doi.org/10.48550/arXiv.2309.06782. The archive contains the following subfolders:
- clusters_best_tuned_gnn_clic_v130: MLPF GNN model configs and weight files after hypertuning; the inputs are reconstructed tracks and Pandora clusters, the outputs are reconstructed PF candidates; trained on tt and qq v1.3.0 (1M events each)
- hits: MLPF GNN model configs and weight files; inputs are reconstructed tracks and calorimeter hits, outputs are reconstructed PF candidates; trained on tt, qq and gun samples (K0L, gamma, pi+-, pi0, neutron, ele, mu) v1.2.0; training was restarted several times from previous checkpoints
- hypertuning: GNN and transformer models before and after hypertuning; summary tables of the hypertuning runs
- timing: scaling study of baseline PF with the number of gun particles on CPU; scaling study of the GNN model with the number of input elements on GPU
- gpu_scaling: the scaling study of model training on multiple accelerator cards

The training dataset is available at: Pata, Joosep, Wulff, Eric, Duarte, Javier, Mokhtar, Farouk, Zhang, Mengke, Girone, Maria, & Southwick, David. (2023). Simulated datasets for detector and particle flow reconstruction: CLIC detector (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8260741
DOI: 10.1016/j.nuclphysb.2011.08.003
2011
Cited 67 times
Implications of Xenon100 and LHC results for Dark Matter models
We perform a fit to the recent Xenon100 data and study its implications for Dark Matter scenarios. We find that Inelastic Dark Matter is disfavored as an explanation to the DAMA/LIBRA annual modulation signal. Concerning the scalar singlet DM model, we find that the Xenon100 data disfavors its constrained limit. We study the CMSSM as well as the low scale phenomenological MSSM taking into account latest Tevatron and LHC data (1.1/fb) about sparticles and Bs→μμ. After the EPS 2011 conference, LHC excludes the “Higgs-resonance” region of DM freeze-out and Xenon100 disfavors the “well-tempered” bino/higgsino, realized in the “focus-point” region of the CMSSM parameter space. The preferred region shifts to heavier sparticles, higher fine-tuning, higher tanβ and the quality of the fit deteriorates.
DOI: 10.1088/1742-6596/2438/1/012100
2023
Cited 5 times
Machine Learning for Particle Flow Reconstruction at CMS
We provide details on the implementation of a machine-learning based particle flow algorithm for CMS. The standard particle flow algorithm reconstructs stable particles based on calorimeter clusters and tracks to provide a global event reconstruction that exploits the combined information of multiple detector subsystems, leading to strong improvements for quantities such as jets and missing transverse energy. We have studied a possible evolution of particle flow towards heterogeneous computing platforms such as GPUs using a graph neural network. The machine-learned PF model reconstructs particle candidates based on the full list of tracks and calorimeter clusters in the event. For validation, we determine the physics performance directly in the CMS software framework when the proposed algorithm is interfaced with the offline reconstruction of jets and missing transverse energy. We also report the computational performance of the algorithm, which scales approximately linearly in runtime and memory usage with the input size.
DOI: 10.1088/1742-6596/2438/1/012092
2023
Hyperparameter optimization of data-driven AI models on HPC systems
Abstract In the European Center of Excellence in Exascale Computing "Research on AI- and Simulation-Based Engineering at Exascale" (CoE RAISE), researchers develop novel, scalable AI technologies towards Exascale. This work exercises High Performance Computing resources to perform large-scale hyperparameter optimization using distributed training on multiple compute nodes. This is part of RAISE's work on data-driven use cases, which leverages AI- and HPC cross-methods developed within the project. In response to the demand for parallelizable and resource-efficient hyperparameter optimization methods, advanced hyperparameter search algorithms are benchmarked and compared. The evaluated algorithms, including Random Search, Hyperband and ASHA, are tested and compared in terms of both accuracy and accuracy per compute resources spent. As an example use case, a graph neural network model known as MLPF, developed for machine-learned particle-flow reconstruction, acts as the base model for optimization. Results show that hyperparameter optimization significantly increased the performance of MLPF and that this would not have been possible without access to large-scale High Performance Computing resources. It is also shown that, in the case of MLPF, the ASHA algorithm in combination with Bayesian optimization gives the largest performance increase per compute resources spent out of the investigated algorithms.
DOI: 10.1103/physrevd.108.036023
2023
Dynamics of false vacuum bubbles with trapped particles
We study the impact of the ambient fluid on the evolution of collapsing false vacuum bubbles by simulating the dynamics of a coupled bubble-particle system. A significant increase in the mass of the particles across the bubble wall leads to a buildup of those particles inside the false vacuum bubble. We show that the backreaction of the particles on the bubble slows or even reverses the collapse. Consequently, if the particles in the true vacuum become heavier than in the false vacuum, the particle-wall interactions always decrease the compactness that the false vacuum bubbles can reach, making their collapse to black holes less likely.
DOI: 10.1016/j.cpc.2024.109095
2024
Tau lepton identification and reconstruction: a new frontier for jet-tagging ML algorithms
Identifying and reconstructing hadronic τ decays (τh) is an important task at current and future high-energy physics experiments, as τh represent an important tool to analyze the production of Higgs and electroweak bosons as well as to search for physics beyond the Standard Model. The identification of τh can be viewed as a generalization and extension of jet-flavour tagging, which has in recent years undergone significant progress due to the use of deep learning. Based on a granular simulation with realistic detector effects and a particle flow-based event reconstruction, we show in this paper that deep learning-based jet-flavour-tagging algorithms are powerful τh identifiers. Specifically, we show that jet-flavour-tagging algorithms such as LorentzNet and ParticleTransformer can be adapted in an end-to-end fashion for discriminating τh from quark and gluon jets. We find that the end-to-end transformer-based approach significantly outperforms contemporary state-of-the-art τh reconstruction and identification algorithms currently in use at the Large Hadron Collider.
DOI: 10.1038/s42005-024-01599-5
2024
Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors
Abstract Efficient and accurate algorithms are necessary to reconstruct particles in the highly granular detectors anticipated at the High-Luminosity Large Hadron Collider and the Future Circular Collider. We study scalable machine learning models for event reconstruction in electron-positron collisions based on a full detector simulation. Particle-flow reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters. We compare a graph neural network and kernel-based transformer and demonstrate that we can avoid quadratic operations while achieving realistic reconstruction. We show that hyperparameter tuning significantly improves the performance of the models. The best graph neural network model shows improvement in the jet transverse momentum resolution by up to 50% compared to the rule-based algorithm. The resulting model is portable across Nvidia, AMD and Habana hardware. Accurate and fast machine-learning based reconstruction can significantly improve future measurements at colliders.
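The kernel-based transformer mentioned above avoids materializing the quadratic attention matrix by exploiting the associativity of matrix products. A minimal NumPy sketch of generic linear (kernel) attention; the elu(x)+1 feature map follows Katharopoulos et al. and is an illustrative assumption, not necessarily the kernel or architecture used in the paper:

```python
import numpy as np

def feature_map(x):
    # Positive feature map phi(x) = elu(x) + 1 (an illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def attention_quadratic(Q, K, V):
    """Kernel attention computed naively: materializes the (N, N) matrix."""
    A = feature_map(Q) @ feature_map(K).T            # (N, N)
    return (A @ V) / A.sum(axis=1, keepdims=True)

def attention_linear(Q, K, V):
    """Same result via associativity: phi(Q) @ (phi(K)^T V).

    Cost is O(N * d * d_v) instead of O(N^2 * d); the (N, N) matrix
    is never built, which matters when N is tens of thousands of
    tracks and clusters per event.
    """
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V                                 # (d, d_v)
    z = phi_q @ phi_k.sum(axis=0)                    # per-row normalization
    return (phi_q @ kv) / z[:, None]
```

Both functions return identical output up to floating-point rounding; only the memory and compute scaling differ.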
DOI: 10.3929/ethz-b-000271889
2018
Cited 16 times
Observation of ttH Production
The observation of Higgs boson production in association with a top quark-antiquark pair is reported, based on a combined analysis of proton-proton collision data at center-of-mass energies of √s = 7, 8, and 13 TeV, corresponding to integrated luminosities of up to 5.1, 19.7, and 35.9  fb^(-1), respectively. The data were collected with the CMS detector at the CERN LHC. The results of statistically independent searches for Higgs bosons produced in conjunction with a top quark-antiquark pair and decaying to pairs of W bosons, Z bosons, photons, τ leptons, or bottom quark jets are combined to maximize sensitivity. An excess of events is observed, with a significance of 5.2 standard deviations, over the expectation from the background-only hypothesis. The corresponding expected significance from the standard model for a Higgs boson mass of 125.09 GeV is 4.2 standard deviations. The combined best fit signal strength normalized to the standard model prediction is 1.26^(+0.31)_(−0.26).
DOI: 10.48550/arxiv.2303.17657
2023
Progress towards an improved particle flow algorithm at CMS with machine learning
The particle-flow (PF) algorithm, which infers particles based on tracks and calorimeter clusters, is of central importance to event reconstruction in the CMS experiment at the CERN LHC, and has been a focus of development in light of planned Phase-2 running conditions with increased pileup and detector granularity. In recent years, the machine-learned particle-flow (MLPF) algorithm, a graph neural network that performs PF reconstruction, has been explored in CMS, with the possible advantages of directly optimizing for the physical quantities of interest, being highly reconfigurable to new conditions, and being a natural fit for deployment to heterogeneous accelerators. We discuss progress in CMS towards an improved implementation of the MLPF reconstruction, now optimized for the first time using generator/simulation-level particle information as the target. This paves the way to potentially improving the detector response in terms of physical quantities of interest. We describe the simulation-based training target, progress and studies on event-based loss terms, details of the model hyperparameter tuning, and physics validation with respect to the current PF algorithm in terms of high-level physical quantities such as the jet and missing transverse momentum resolutions. We find that the MLPF algorithm, trained on generator/simulation-level particle information, achieves particle and jet reconstruction performance broadly compatible with the baseline PF, setting the stage for improving the physics performance with additional training statistics and model tuning.
DOI: 10.1051/0004-6361/202346474
2023
A Bayesian estimation of the Milky Way's circular velocity curve using Gaia DR3
Aims. Our goal is to calculate the circular velocity curve of the Milky Way, along with corresponding uncertainties that quantify various sources of systematic uncertainty in a self-consistent manner. Methods. The observed rotational velocities are described as circular velocities minus the asymmetric drift. The latter is described by the radial axisymmetric Jeans equation. We thus reconstruct the circular velocity curve between Galactocentric distances of 5 kpc and 14 kpc using a Bayesian inference approach. The estimated error bars quantify uncertainties in the Sun's Galactocentric distance and the spatial-kinematic morphology of the tracer stars. As tracers, we used a sample of roughly 0.6 million red giant branch stars with six-dimensional phase-space coordinates from Gaia Data Release 3 (DR3). More than 99% of the sample is confined to a quarter of the stellar disc, with mean radial, rotational, and vertical velocity dispersions of (35 ± 18) km s−1, (25 ± 13) km s−1, and (19 ± 9) km s−1, respectively. Results. We find a circular velocity curve with a slope of 0.4 ± 0.6 km s−1 kpc−1, consistent with a flat curve within the uncertainties. We further estimate a circular velocity at the Sun's position of vc(R0) = 233 ± 7 km s−1, and find that a region in the Sun's vicinity, characterised by a physical length scale of ∼1 kpc, moves with a bulk motion of VLSR = 7 ± 7 km s−1. Finally, we estimate the dark matter (DM) mass within 14 kpc to be log10 MDM(R < 14 kpc)/M⊙ = 11.2 (+2.0 −2.3) and the local spherically averaged DM density to be ρDM(R0) = (0.41 +0.10 −0.09) GeV cm−3 = (0.011 +0.003 −0.002) M⊙ pc−3. In addition, the effect of biased distance estimates on our results is assessed.
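The asymmetric-drift correction mentioned above follows from the radial axisymmetric Jeans equation; in its textbook form (Binney & Tremaine; the exact parametrization used in the paper may differ), assuming a steady state and zero mean radial motion:

```latex
v_c^2(R) - \bar{v}_\phi^2(R) =
\sigma_R^2 \left[ \frac{\sigma_\phi^2}{\sigma_R^2} - 1
- \frac{\partial \ln\!\left( \nu \sigma_R^2 \right)}{\partial \ln R} \right]
- \frac{R}{\nu} \frac{\partial\!\left( \nu\, \overline{v_R v_z} \right)}{\partial z}
```

where ν is the tracer number density, σ_R and σ_φ are the radial and azimuthal velocity dispersions, and the last term encodes the tilt of the velocity ellipsoid.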
DOI: 10.5281/zenodo.8260741
2023
Simulated datasets for detector and particle flow reconstruction: CLIC detector
Data description
Datasets generated using Key4HEP and the CLIC detector model, suitable for particle flow reconstruction studies. The datasets contain generator particles, reconstructed tracks and calorimeter hits, reconstructed Pandora PF particles, and their respective links in the EDM4HEP format.

The following processes have been simulated with Pythia 8:
- p8_ee_tt_ecm380: ee -> ttbar, center of mass energy at 380 GeV
- p8_ee_qq_ecm380: ee -> Z* -> qqbar, center of mass energy at 380 GeV
- p8_ee_ZH_Htautau: ee -> ZH with the Higgs decaying to tau leptons, center of mass energy at 380 GeV
- p8_ee_WW_fullhad: ee -> WW with the W bosons decaying hadronically, center of mass energy at 380 GeV
- p8_ee_tt_ecm380_PU10: ee -> ttbar with on average 10 Poisson-distributed ee -> gg events overlaid, center of mass energy at 380 GeV

The following single-particle gun samples have been generated with ddsim:
- e+/e-: single electron with energy between 1 and 100 GeV
- mu+/mu-: single muon with energy between 1 and 100 GeV
- kaon0L: single K0L with energy between 1 and 100 GeV
- neutron: single neutron with energy between 1 and 100 GeV
- pi+/pi-: single charged pion with energy between 1 and 100 GeV
- pi0: single neutral pion with energy between 1 and 100 GeV
- gamma: single photon with energy between 1 and 100 GeV

The detector simulation has been done with Geant4, and the reconstruction with Marlin interfaced via Key4HEP, which includes PF reconstruction with Pandora, all using publicly available models and code.

Contents
This record includes the following files:
- *_10files.tar: small archives of 10 files for each data sample, suitable for testing
- dataset_full.txt: the full list of files, hosted at the Julich HPC courtesy of the RAISE CoE project, ~2.5 TB total
- *.cmd: the Pythia8 cards
- pythia.py: the Pythia8 steering code for Key4HEP
- run_sim.sh: the steering script for generating, simulating and reconstructing a single file of 100 events from the p8_ee_tt_ecm380, p8_ee_qq_ecm380, p8_ee_ZH_Htautau, p8_ee_WW_fullhad datasets
- run_sim_pu.sh: the steering script for generating, simulating and reconstructing a single file of 100 events from the p8_ee_tt_ecm380_PU10 dataset
- run_sim_gun.sh: the steering script for generating the single-particle gun samples
- run_sim_gun_np.sh: the steering script for generating multi-particle gun samples (extensive datasets have not yet been generated)
- check_files.py: the main driver script that configures the full statistics and creates submission scripts for all the simulations
- PandoraSettings.zip: the settings used for Pandora PF reconstruction
- main19.cc: the Pythia8+HepMC driver code for generating the events with PU overlay
- clicRec_e4h_input.py: the steering configuration of the reconstruction modules in Key4HEP
- clic_steer.py: the steering configuration of the Geant4 simulation modules in Key4HEP
- clic-visualize.ipynb: an example notebook demonstrating how the dataset can be loaded and events visualized in Python
- visualization.mp4: an example visualization of the hits and generator particles of a single ttbar event from the dataset

Dataset semantics
Each file consists of event records. Each event contains structured branches of the relevant physics data. The branches relevant to particle flow reconstruction include:
- MCParticles: the ground-truth generator particles
- ECALBarrel, ECALEndcap, ECALOther, HCALBarrel, HCALEndcap, HCALOther, MUON: reconstructed hits in the various calorimeter subsystems
- SiTracks_Refitted: the reconstructed tracks
- PandoraClusters: the calorimeter hits, clustered by Pandora into calorimeter clusters
- MergedRecoParticles: the reconstructed particles from the Pandora particle-flow algorithm
- CalohitMCTruthLink: the links between MC particles and reconstructed calorimeter hits
- SiTracksMCTruthLink: the links between MC particles and reconstructed tracks

The full details of the EDM4HEP format are available here.

Dataset characteristics
The full dataset in dataset_full.txt consists of 43 tar files of up to 100 GB each. The tar files contain in total 58068 files, 2.5 TB in the ROOT EDM4HEP format. The subset in *_10files.tar consists of 150 files, 26 GB in the ROOT EDM4HEP format.

How can you use these data?
The ROOT files can be directly loaded with the uproot Python library.

Disclaimer
These are simulated samples suitable for conceptual machine learning R&D and software performance studies. They have not been calibrated with respect to real data, and should not be used to derive physics projections about the detectors. Neither CLIC nor CERN endorse any works, scientific or otherwise, produced using these data. All releases will have a unique DOI that you are requested to cite in any applications or publications.
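A minimal sketch of reading one of the EDM4HEP ROOT files with uproot. The filename is hypothetical (any file from the *_10files.tar archives should work), and the `events` tree and `MCParticles.generatorStatus` branch names follow the collection list above and the EDM4HEP convention, but should be checked against an actual file:

```python
import os
import numpy as np

try:
    import uproot  # pip install uproot awkward
except ImportError:
    uproot = None

def n_stable_per_event(status_per_event):
    """Count generator-stable particles (generatorStatus == 1) in each event."""
    return np.array([int((np.asarray(s) == 1).sum()) for s in status_per_event])

# Hypothetical filename from one of the *_10files.tar archives.
FILENAME = "reco_p8_ee_tt_ecm380_1.root"

if uproot is not None and os.path.exists(FILENAME):
    with uproot.open(FILENAME) as f:
        events = f["events"]  # EDM4HEP files store a single 'events' tree
        status = events["MCParticles.generatorStatus"].array()
        print(n_stable_per_event(status)[:5])
```

The same pattern works for the track and calorimeter-hit collections listed above; the truth-link branches connect them back to MCParticles.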
DOI: 10.48550/arxiv.2305.07702
2023
Dynamics of false vacuum bubbles with trapped particles
We study the impact of the ambient fluid on the evolution of collapsing false vacuum bubbles by simulating the dynamics of a coupled bubble-particle system. A significant increase in the mass of the particles across the bubble wall leads to a buildup of those particles inside the false vacuum bubble. We show that the backreaction of the particles on the bubble slows or even reverses the collapse. Consequently, if the particles in the true vacuum become heavier than in the false vacuum, the particle-wall interactions always decrease the compactness that the false vacuum bubbles can reach, making their collapse to black holes less likely.
DOI: 10.48550/arxiv.2307.07747
2023
Tau lepton identification and reconstruction: a new frontier for jet-tagging ML algorithms
Identifying and reconstructing hadronic $\tau$ decays ($\tau_{\textrm{h}}$) is an important task at current and future high-energy physics experiments, as $\tau_{\textrm{h}}$ represent an important tool to analyze the production of Higgs and electroweak bosons as well as to search for physics beyond the Standard Model. The identification of $\tau_{\textrm{h}}$ can be viewed as a generalization and extension of jet-flavour tagging, which has in recent years undergone significant progress due to the use of deep learning. Based on a granular simulation with realistic detector effects and a particle flow-based event reconstruction, we show in this paper that deep learning-based jet-flavour-tagging algorithms are powerful $\tau_{\textrm{h}}$ identifiers. Specifically, we show that jet-flavour-tagging algorithms such as LorentzNet and ParticleTransformer can be adapted in an end-to-end fashion for discriminating $\tau_{\textrm{h}}$ from quark and gluon jets. We find that the end-to-end transformer-based approach significantly outperforms contemporary state-of-the-art $\tau_{\textrm{h}}$ reconstruction and identification algorithms currently in use at the Large Hadron Collider.
DOI: 10.48550/arxiv.2309.06782
2023
Improved particle-flow event reconstruction with scalable neural networks for current and future particle detectors
Efficient and accurate algorithms are necessary to reconstruct particles in the highly granular detectors anticipated at the High-Luminosity Large Hadron Collider and the Future Circular Collider. We study scalable machine learning models for event reconstruction in electron-positron collisions based on a full detector simulation. Particle-flow reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters. We compare a graph neural network and kernel-based transformer and demonstrate that we can avoid quadratic operations while achieving realistic reconstruction. We show that hyperparameter tuning significantly improves the performance of the models. The best graph neural network model shows improvement in the jet transverse momentum resolution by up to 50% compared to the rule-based algorithm. The resulting model is portable across Nvidia, AMD and Habana hardware. Accurate and fast machine-learning based reconstruction can significantly improve future measurements at colliders.
DOI: 10.21203/rs.3.rs-3466159/v1
2023
Scalable neural network models and terascale datasets for particle-flow reconstruction
Abstract We study scalable machine learning models for full event reconstruction in high-energy electron-positron collisions based on a highly granular detector simulation. Particle-flow (PF) reconstruction can be formulated as a supervised learning task using tracks and calorimeter clusters or hits. We compare a graph neural network and kernel-based transformer and demonstrate that both avoid quadratic memory allocation and computational cost while achieving realistic PF reconstruction. We show that hyperparameter tuning on a supercomputer significantly enhances the physics performance of the models, improving the jet transverse momentum resolution by up to 50% compared to the baseline. The resulting model is highly portable across hardware processors, supporting Nvidia, AMD, and Intel Habana cards. Finally, we demonstrate that the model can be trained on highly granular inputs consisting of tracks and calorimeter hits, resulting in a competitive physics performance with the baseline. Datasets and software to reproduce the studies are published following the findable, accessible, interoperable, and reusable (FAIR) principles.
DOI: 10.5281/zenodo.8328682
2023
MLPF results on the simulated CLIC dataset
Updates over the previous version:
- updated plots for the cluster-based model
- added Gaudi2 timing
- split the big zip into smaller splits; all .zip and .zXX files must be downloaded and unzipped together

Trained models and evaluation results for the upcoming paper "Scalable neural network models and terascale datasets for particle-flow reconstruction". The archive contains the following subfolders:
- clusters_best_tuned_gnn_clic_v130: MLPF GNN model after hypertuning; the inputs are reconstructed tracks and Pandora clusters, the outputs are reconstructed PF candidates; trained on tt and qq v1.3.0 (1M events each)
- hits: MLPF GNN model; inputs are reconstructed tracks and calorimeter hits, outputs are reconstructed PF candidates; trained on tt, qq and gun samples (K0L, gamma, pi+-, pi0, neutron, ele, mu) v1.2.0; training was restarted several times from previous checkpoints
- hypertuning: GNN and transformer models before and after hypertuning
- timing: scaling study of baseline PF with the number of gun particles on CPU; scaling study of the GNN model with the number of input elements on GPU
- gpu_scaling: the scaling study of model training on multiple accelerator cards

The training dataset is available at: Pata, Joosep, Wulff, Eric, Duarte, Javier, Mokhtar, Farouk, Zhang, Mengke, Girone, Maria, & Southwick, David. (2023). Simulated datasets for detector and particle flow reconstruction: CLIC detector (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8260741
DOI: 10.5281/zenodo.8085321
2023
Simulated datasets for detector and particle flow reconstruction: CLIC detector
Datasets generated using Key4HEP and the CLIC detector, suitable for particle flow reconstruction studies. The datasets contain generator particles, reconstructed tracks and calorimeter hits, reconstructed Pandora PF particles, and their respective links in the EDM4HEP format.

The following processes have been simulated:
- tt: ee -> ttbar at 380 GeV
- qq: ee -> Z* -> qqbar at 380 GeV
- e+/e-: single electron with momentum between 1 and 100 GeV
- mu+/mu-: single muon with momentum between 1 and 100 GeV
- kaon0L: single K0L with momentum between 1 and 100 GeV
- neutron: single neutron with momentum between 1 and 100 GeV
- pi+/pi-: single charged pion with momentum between 1 and 100 GeV
- pi0: single neutral pion with momentum between 1 and 100 GeV
- gamma: single photon with momentum between 1 and 100 GeV

The hard interaction has been generated with Pythia 8 (pythia.py, *.cmd), the detector simulation has been done with Geant4 (clic_steer.py), and the reconstruction with Marlin interfaced via Key4HEP (clicRec_e4h_input.py), which includes PF reconstruction with Pandora (PandoraSettings.zip). The main steering scripts for generating the simulations are run_sim.sh (qq and tt) and run_sim_gun.sh (particle gun), which also contain the exact versions of the software and the detector.
DOI: 10.5281/zenodo.8409591
2023
Simulated datasets for detector and particle flow reconstruction: CLIC detector, machine learning format
Derived from https://zenodo.org/record/8260741, prepared in a machine-learning friendly TFDS format, ready to be used with https://zenodo.org/record/8397954.

- clic_edm_ttbar_pf.tar: ee -> ttbar, center of mass energy at 380 GeV
- clic_edm_qq_pf.tar: ee -> Z* -> qqbar, center of mass energy at 380 GeV
- clic_edm_ww_fullhad_pf.tar: ee -> WW -> W decaying hadronically, center of mass energy at 380 GeV
- clic_edm_zh_tautau_pf.tar: ee -> ZH -> Higgs decaying to tau leptons, center of mass energy at 380 GeV

Contents
Each .tar file contains the dataset in the tensorflow-datasets (minimum version v4.9.1) array_record format.

Dataset semantics
Each dataset consists of events that can be iterated over using the tensorflow-datasets library in either tensorflow or pytorch. Each event has the following information available:
- X: the reconstruction input features, i.e. tracks and clusters
- ygen: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle
- ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v1.6/mlpf/heptfds/clic_pf_edm4hep/utils_edm.py.
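Given the documented feature order of the ygen/ycand particle arrays, Cartesian momenta follow from the standard collider convention pz = pT sinh(η). A small NumPy sketch (the array layout is as documented above; the helper function itself is illustrative):

```python
import numpy as np

# Feature order of each particle row in ygen/ycand, as documented above.
FEATURES = ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"]

def to_cartesian(particles):
    """Convert an (n, 8) array in the FEATURES order to an (n, 3) array (px, py, pz)."""
    pt = particles[:, FEATURES.index("pt")]
    eta = particles[:, FEATURES.index("eta")]
    sin_phi = particles[:, FEATURES.index("sin_phi")]
    cos_phi = particles[:, FEATURES.index("cos_phi")]
    # px = pt*cos(phi), py = pt*sin(phi), pz = pt*sinh(eta)
    return np.stack([pt * cos_phi, pt * sin_phi, pt * np.sinh(eta)], axis=1)
```

Storing sin φ and cos φ instead of φ itself avoids the 2π discontinuity when the features are fed to a neural network.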
DOI: 10.5281/zenodo.8414224
2023
Simulated datasets for detector and particle flow reconstruction: CLIC detector, hit-based data, machine learning format
Derived from https://zenodo.org/record/8260741, prepared in a machine-learning friendly TFDS format, ready to be used with https://zenodo.org/record/8397954.

- clic_edm_ttbar_hits_pf10k.tar: ee -> ttbar, center of mass energy at 380 GeV, 10k events
- clic_edm_qq_hits_pf10k.tar: ee -> Z* -> qqbar, center of mass energy at 380 GeV, 10k events

Contents
Each .tar file contains the dataset in the tensorflow-datasets (minimum version v4.9.1) array_record format.

Dataset semantics
Each dataset consists of events that can be iterated over using the tensorflow-datasets library in either tensorflow or pytorch. Each event has the following information available:
- X: the reconstruction input features, i.e. tracks and calorimeter hits
- ygen: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle
- ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v1.6/mlpf/heptfds/clic_pf_edm4hep_hits/utils_edm.py.
DOI: 10.5281/zenodo.8328683
2023
MLPF results on the simulated CLIC dataset
Trained models and evaluation results for the upcoming paper "Scalable neural network models and terascale datasets for particle-flow reconstruction". The archive contains the following subfolders:
clusters_best_tuned_gnn_clic_v130: the MLPF GNN model after hypertuning; the inputs are reconstructed tracks and Pandora clusters, the outputs are reconstructed PF candidates; trained on tt and qq v1.3.0 (1M events each)
hits: the MLPF GNN model whose inputs are reconstructed tracks and calorimeter hits and whose outputs are reconstructed PF candidates; trained on tt, qq and gun samples (K0L, gamma, pi+-, pi0, neutron, ele, mu) v1.2.0; training was restarted several times from previous checkpoints
hypertuning: the GNN and transformer models before and after hypertuning
timing: scaling study of baseline PF with the number of gun particles on CPU; scaling study of the GNN model with the number of input elements on GPU
gpu_scaling: the scaling study of model training on multiple accelerator cards
The training dataset is available at: Pata, Joosep, Wulff, Eric, Duarte, Javier, Mokhtar, Farouk, Zhang, Mengke, Girone, Maria, & Southwick, David. (2023). Simulated datasets for detector and particle flow reconstruction: CLIC detector (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8260741
DOI: 10.5281/zenodo.7892204
2023
Collapse of false vacuum bubbles with trapped particles - simulation data
This repository contains data of simulations for looking into false vacuum bubble collapses. In this scenario we assume feebly interacting particles which get trapped inside the false vacuum bubbles.
DOI: 10.5281/zenodo.8414225
2023
Simulated datasets for detector and particle flow reconstruction: CLIC detector, hit-based data, machine learning format
Derived from https://zenodo.org/record/8260741, prepared in a machine-learning friendly TFDS format, ready to be used with https://zenodo.org/record/8397954.
clic_edm_ttbar_hits_pf10k.tar: ee -&gt; ttbar, center of mass energy at 380 GeV, 10k events
clic_edm_qq_hits_pf10k.tar: ee -&gt; Z* -&gt; qqbar, center of mass energy at 380 GeV, 10k events
<strong>Contents</strong>
Each .tar file contains the dataset in the tensorflow-datasets (minimum version v4.9.1) array_record format.
<strong>Dataset semantics</strong>
Each dataset consists of events that can be iterated over using the tensorflow-datasets library in either tensorflow or pytorch. Each event has the following information available:
X: the reconstruction input features, i.e. tracks and calorimeter hits
ygen: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle
ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle
The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v1.6/mlpf/heptfds/clic_pf_edm4hep_hits/utils_edm.py.
DOI: 10.5281/zenodo.8085320
2023
Simulated datasets for detector and particle flow reconstruction: CLIC detector
<strong>Data description</strong>
Datasets generated using Key4HEP and the CLIC detector model, suitable for particle flow reconstruction studies. The datasets contain generator particles, reconstructed tracks and calorimeter hits, reconstructed Pandora PF particles and their respective links in the EDM4HEP format.
The following processes have been simulated with Pythia 8:
p8_ee_tt_ecm380: ee -&gt; ttbar, center of mass energy at 380 GeV
p8_ee_qq_ecm380: ee -&gt; Z* -&gt; qqbar, center of mass energy at 380 GeV
p8_ee_ZH_Htautau: ee -&gt; ZH -&gt; Higgs decaying to tau leptons, center of mass energy at 380 GeV
p8_ee_WW_fullhad: ee -&gt; WW -&gt; W decaying hadronically, center of mass energy at 380 GeV
p8_ee_tt_ecm380_PU10: ee -&gt; ttbar with on average 10 Poisson-distributed events from ee -&gt; gg overlaid, center of mass energy at 380 GeV
The following single particle gun samples have been generated with ddsim:
e+/e-: single electron with energy between 1 and 100 GeV
mu+/mu-: single muon with energy between 1 and 100 GeV
kaon0L: single K0L with energy between 1 and 100 GeV
neutron: single neutron with energy between 1 and 100 GeV
pi+/pi-: single charged pion with energy between 1 and 100 GeV
pi0: single neutral pion with energy between 1 and 100 GeV
gamma: single photon with energy between 1 and 100 GeV
The detector simulation has been done with Geant4, and the reconstruction with Marlin interfaced via Key4HEP, which includes PF reconstruction with Pandora, all using publicly available models and code.
<strong>Contents</strong>
This record includes the following files:
*_10files.tar: small archives of 10 files for each data sample, suitable for testing
dataset_full.txt: the full list of files, hosted at the Julich HPC courtesy of the Raise CoE project, ~2.5TB total
*.cmd: the Pythia8 cards
pythia.py: the pythia steering code for Key4HEP
run_sim.sh: the steering script for generating, simulating and reconstructing a single file of 100 events from the p8_ee_tt_ecm380, p8_ee_qq_ecm380, p8_ee_ZH_Htautau, p8_ee_WW_fullhad datasets
run_sim_pu.sh: the steering script for generating, simulating and reconstructing a single file of 100 events from the p8_ee_tt_ecm380_PU10 dataset
run_sim_gun.sh: the steering script for generating the single-particle gun samples
run_sim_gun_np.sh: the steering script for generating multi-particle gun samples (extensive datasets have not yet been generated)
check_files.py: the main driver script that configures the full statistics and creates submission scripts for all the simulations
PandoraSettings.zip: the settings used for Pandora PF reconstruction
main19.cc: the Pythia8+HepMC driver code for generating the events with PU overlay
clicRec_e4h_input.py: the steering configuration of the reconstruction modules in Key4HEP
clic_steer.py: the steering configuration of the Geant4 simulation modules in Key4HEP
clic-visualize.ipynb: an example notebook demonstrating how the dataset can be loaded and events visualized in Python
visualization.mp4: an example visualization of the hits and generator particles of a single ttbar event from the dataset
<strong>Dataset semantics</strong>
Each file consists of event records. Each event contains structured branches of the relevant physics data.
The branches relevant to particle flow reconstruction include:
MCParticles: the ground truth generator particles
ECALBarrel, ECALEndcap, ECALOther, HCALBarrel, HCALEndcap, HCALOther, MUON: reconstructed hits in the various calorimeter subsystems
SiTracks_Refitted: the reconstructed tracks
PandoraClusters: the calorimeter hits, clustered by Pandora into calorimeter clusters
MergedRecoParticles: the reconstructed particles from the Pandora particle flow algorithm
CalohitMCTruthLink: the links between MC particles and reconstructed calorimeter hits
SiTracksMCTruthLink: the links between MC particles and reconstructed tracks
The full details of the EDM4HEP format are available in the EDM4HEP documentation.
<strong>Dataset characteristics</strong>
The full dataset in dataset_full.txt consists of 43 tar files of up to 100GB each. The tar files contain in total 58068 files, 2.5TB in the ROOT EDM4HEP format. The subset in *_10files.tar consists of 150 files, 26GB in the ROOT EDM4HEP format.
<strong>How can you use these data?</strong>
The ROOT files can be directly loaded with the uproot Python library.
<strong>Disclaimer</strong>
These are simulated samples suitable for conceptual machine learning R&amp;D and software performance studies. They have not been calibrated with respect to real data, and should not be used to derive physics projections about the detectors. Neither CLIC nor CERN endorses any works, scientific or otherwise, produced using these data. All releases will have a unique DOI that you are requested to cite in any applications or publications.
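As a starting point, the branch names listed above can be collected in one place for programmatic use; reading them with uproot is sketched in the comments. The uproot calls are a hedged, untested sketch: the file name is hypothetical, and the tree name "events" is assumed from the EDM4HEP convention.

```python
# Branch names relevant to particle-flow reconstruction, per the description above.
PF_BRANCHES = [
    "MCParticles",
    "ECALBarrel", "ECALEndcap", "ECALOther",
    "HCALBarrel", "HCALEndcap", "HCALOther", "MUON",
    "SiTracks_Refitted",
    "PandoraClusters",
    "MergedRecoParticles",
    "CalohitMCTruthLink",
    "SiTracksMCTruthLink",
]

# The truth-link branches connect reconstructed objects back to MC particles:
link_branches = [b for b in PF_BRANCHES if b.endswith("Link")]
print(link_branches)  # ['CalohitMCTruthLink', 'SiTracksMCTruthLink']

# Hedged sketch for opening one EDM4HEP ROOT file with uproot (untested;
# the file name below is hypothetical):
#   import uproot
#   events = uproot.open("reco_p8_ee_tt_ecm380_1.root")["events"]
#   print(events.keys(filter_name="MCParticles*"))
```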
DOI: 10.5281/zenodo.7892203
2023
Collapse of false vacuum bubbles with trapped particles - simulation data
This repository contains data of simulations for looking into false vacuum bubble collapses. In this scenario we assume feebly interacting particles which get trapped inside the false vacuum bubbles.
DOI: 10.5281/zenodo.8409592
2023
Simulated datasets for detector and particle flow reconstruction: CLIC detector, machine learning format
Derived from https://zenodo.org/record/8260741, prepared in a machine-learning friendly TFDS format, ready to be used with https://zenodo.org/record/8397954.
clic_edm_ttbar_pf.tar: ee -&gt; ttbar, center of mass energy at 380 GeV
clic_edm_qq_pf.tar: ee -&gt; Z* -&gt; qqbar, center of mass energy at 380 GeV
clic_edm_ww_fullhad_pf.tar: ee -&gt; WW -&gt; W decaying hadronically, center of mass energy at 380 GeV
clic_edm_zh_tautau_pf.tar: ee -&gt; ZH -&gt; Higgs decaying to tau leptons, center of mass energy at 380 GeV
<strong>Contents</strong>
Each .tar file contains the dataset in the tensorflow-datasets (minimum version v4.9.1) array_record format.
<strong>Dataset semantics</strong>
Each dataset consists of events that can be iterated over using the tensorflow-datasets library in either tensorflow or pytorch. Each event has the following information available:
X: the reconstruction input features, i.e. tracks and clusters
ygen: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle
ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle
The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v1.6/mlpf/heptfds/clic_pf_edm4hep/utils_edm.py.
DOI: 10.5281/zenodo.10006037
2023
MLPF results on the simulated CLIC dataset
Updates over the previous version: updated plots for the cluster-based model; added Gaudi2 timing; split the big zip into smaller splits (all .zip and .zXX files must be downloaded and unzipped together).
Trained models and evaluation results for the upcoming paper "Scalable neural network models and terascale datasets for particle-flow reconstruction". The archive contains the following subfolders:
clusters_best_tuned_gnn_clic_v130: the MLPF GNN model after hypertuning; the inputs are reconstructed tracks and Pandora clusters, the outputs are reconstructed PF candidates; trained on tt and qq v1.3.0 (1M events each)
hits: the MLPF GNN model whose inputs are reconstructed tracks and calorimeter hits and whose outputs are reconstructed PF candidates; trained on tt, qq and gun samples (K0L, gamma, pi+-, pi0, neutron, ele, mu) v1.2.0; training was restarted several times from previous checkpoints
hypertuning: the GNN and transformer models before and after hypertuning
timing: scaling study of baseline PF with the number of gun particles on CPU; scaling study of the GNN model with the number of input elements on GPU
gpu_scaling: the scaling study of model training on multiple accelerator cards
The training dataset is available at: Pata, Joosep, Wulff, Eric, Duarte, Javier, Mokhtar, Farouk, Zhang, Mengke, Girone, Maria, & Southwick, David. (2023). Simulated datasets for detector and particle flow reconstruction: CLIC detector (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.8260741
DOI: 10.48550/arxiv.2111.12840
2021
Explaining machine-learned particle-flow reconstruction
The particle-flow (PF) algorithm is used in general-purpose particle detectors to reconstruct a comprehensive particle-level view of the collision by combining information from different subdetectors. A graph neural network (GNN) model, known as the machine-learned particle-flow (MLPF) algorithm, has been developed to substitute the rule-based PF algorithm. However, understanding the model's decision making is not straightforward, especially given the complexity of the set-to-set prediction task, dynamic graph building, and message-passing steps. In this paper, we adapt the layerwise-relevance propagation technique for GNNs and apply it to the MLPF algorithm to gauge the relevant nodes and features for its predictions. Through this process, we gain insight into the model's decision-making.
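The layerwise-relevance propagation technique the abstract refers to can be illustrated on a single dense layer. The sketch below is a generic LRP-epsilon rule for a bias-free linear layer, not the actual MLPF explanation code; all names and values are illustrative.

```python
import numpy as np

def lrp_epsilon_dense(x, W, relevance_out, eps=1e-9):
    """Redistribute output relevance to inputs for a linear layer z = W @ x,
    using the LRP-epsilon rule: R_i = x_i * sum_j W_ji * R_j / (z_j + eps*sign(z_j))."""
    z = W @ x
    s = relevance_out / (z + eps * np.sign(z))
    return x * (W.T @ s)

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # toy input features
W = rng.normal(size=(3, 4))     # toy layer weights
r_out = np.abs(W @ x)           # toy output relevance
r_in = lrp_epsilon_dense(x, W, r_out)

# For a bias-free linear layer with tiny eps, LRP conserves total relevance:
print(np.isclose(r_in.sum(), r_out.sum()))  # True
```

The conservation property (input relevances sum to the output relevance) is what makes the per-node scores interpretable as a decomposition of the model output.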
DOI: 10.48550/arxiv.2203.00330
2022
Machine Learning for Particle Flow Reconstruction at CMS
We provide details on the implementation of a machine-learning based particle flow algorithm for CMS. The standard particle flow algorithm reconstructs stable particles based on calorimeter clusters and tracks to provide a global event reconstruction that exploits the combined information of multiple detector subsystems, leading to strong improvements for quantities such as jets and missing transverse energy. We have studied a possible evolution of particle flow towards heterogeneous computing platforms such as GPUs using a graph neural network. The machine-learned PF model reconstructs particle candidates based on the full list of tracks and calorimeter clusters in the event. For validation, we determine the physics performance directly in the CMS software framework when the proposed algorithm is interfaced with the offline reconstruction of jets and missing transverse energy. We also report the computational performance of the algorithm, which scales approximately linearly in runtime and memory usage with the input size.
DOI: 10.48550/arxiv.1906.06242
2019
Processing Columnar Collider Data with GPU-Accelerated Kernels
At high energy physics experiments, processing billions of records of structured numerical data from collider events to a few statistical summaries is a common task. The data processing is typically more complex than standard query languages allow, such that custom numerical codes are used. At present, these codes mostly operate on individual event records and are parallelized in multi-step data reduction workflows using batch jobs across CPU farms. Based on a simplified top quark pair analysis with CMS Open Data, we demonstrate that it is possible to carry out significant parts of a collider analysis at a rate of around a million events per second on a single multicore server with optional GPU acceleration. This is achieved by representing HEP event data as memory-mappable sparse arrays of columns, and by expressing common analysis operations as kernels that can be used to process the event data in parallel. We find that only a small number of relatively simple functional kernels are needed for a generic HEP analysis. The approach based on columnar processing of data could speed up and simplify the cycle for delivering physics results at HEP experiments. We release the hepaccelerate prototype library as a demonstrator of such methods.
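The columnar idea described above — jagged event data stored as a flat content array plus per-event offsets, processed by simple kernels — can be sketched in a few lines of NumPy. This is a toy illustration of the approach, not code from the hepaccelerate library; all names and values are illustrative.

```python
import numpy as np

# Columnar (structure-of-arrays) layout of a jagged collection: all jet pts
# of all events in one flat array, plus per-event offsets into it.
jet_pt = np.array([50.0, 30.0, 20.0, 80.0, 10.0, 60.0, 40.0])
offsets = np.array([0, 3, 5, 7])  # event 0: jets 0..2, event 1: 3..4, event 2: 5..6

def max_pt_per_event(content, offsets):
    """A simple 'kernel': reduce each event's slice of the flat array."""
    out = np.zeros(len(offsets) - 1)
    for i in range(len(out)):  # in hepaccelerate-style code this loop is JIT-compiled or a GPU kernel
        out[i] = content[offsets[i]:offsets[i + 1]].max()
    return out

print(max_pt_per_event(jet_pt, offsets))  # [50. 80. 60.]
```

Because the data is a flat array plus offsets, the same kernel structure maps naturally onto multithreaded CPU or GPU execution.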
DOI: 10.5281/zenodo.4559324
2021
Simulated particle-level events of ttbar and QCD with PU200 using Pythia8+Delphes3 for machine learned particle flow (MLPF)
Dataset of 50,000 top quark-antiquark (ttbar) and 5,000 QCD events produced in proton-proton collisions at 14 TeV, overlaid with minimum bias events corresponding to a pileup of 200 on average. The dataset consists of detector hits as the input, generator particles as the ground truth and reconstructed particles from DELPHES for additional validation. The DELPHES model corresponds to a CMS-like detector with a multi-layered charged particle tracker, an electromagnetic and hadron calorimeter. Pythia8 and Delphes3 were used for the simulation. An explanation of the dataset and how to load it can be found in the included jupyter notebook delphes_dataset.ipynb. The simulated events are stored in Bzip2-compressed python pickle files in tev14_pythia8_{sample}_{seed}_{idx}.pkl.bz2, where {sample} is ttbar or qcd, {seed} is the random seed, {idx} is the file index. Each file contains 100 events. The Pythia8 configs are found in tev14_pythia8_ttbar.py and tev14_pythia8_qcd.py, the Delphes config in delphes_card_CMS_PileUp.tcl.
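Loading one of these files follows the standard bz2+pickle pattern. Below is a self-contained sketch that round-trips a synthetic event with the documented structure through bzip2-compressed pickle bytes; the commented file path follows the naming pattern above but is hypothetical.

```python
import bz2
import pickle
import numpy as np

# Synthetic stand-in for one file: lists of per-event arrays X, ygen, ycand
# (shapes follow the documented layout; values are placeholders).
fake_file = {
    "X": [np.zeros((5, 12))],     # detector elements, 12 features each
    "ygen": [np.zeros((5, 7))],   # generator particles, 0-padded to len(X)
    "ycand": [np.zeros((5, 7))],  # Delphes PF particles, 0-padded to len(X)
}
blob = bz2.compress(pickle.dumps(fake_file))
data = pickle.loads(bz2.decompress(blob))
print(sorted(data.keys()))  # ['X', 'ycand', 'ygen']

# Reading a real file from the dataset (untested sketch, path hypothetical):
#   with bz2.BZ2File("tev14_pythia8_ttbar_0_0.pkl.bz2", "rb") as f:
#       data = pickle.load(f)
```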
2014
Measurement of top quark polarisation in t-channel single top production with the CMS detector
DOI: 10.1115/icone18-30173
2010
Isotope-Based Analysis of Nuclear Waste Repository Performance
It is necessary to consider the complexities of both natural and engineered components of a nuclear waste repository since fission products and minor actinides remain harmful to the environment for tens of thousands of years. In safety and performance assessments often used in decision-making about repository designs, the effect of uncertain initial guesses on the models’ output must be understood. As the necessary safe times and hence the simulated times are often in the order of magnitude of hundreds of thousands of years, uncertain initial values become increasingly important. To minimize the danger from high-level radioactive waste and to make informed decisions over designs, sensitivity analysis of the models used should be performed. The Simplified Total System Performance Assessment (STSPA) model developed by Golder Associates Inc., Booz-Allen Hamilton, Stone and Webster and the University of Nevada Reno and used in the Yucca Mountain nuclear waste repository performance assessment is analyzed for sensitivity by varying the activities of technetium-99 and iodine-129 by several orders of magnitude. The resultant dose to a maximally-exposed individual over time periods of 100,000 and 1,000,000 years is compared to the relevant regulatory limits. Incorrect estimates can be seen to have large effects on the behavior of the model while the method used allows conclusions to be drawn about the robustness of the model.
DOI: 10.48550/arxiv.2203.01112
2022
Hyperparameter optimization of data-driven AI models on HPC systems
In the European Center of Excellence in Exascale computing "Research on AI- and Simulation-Based Engineering at Exascale" (CoE RAISE), researchers develop novel, scalable AI technologies towards Exascale. This work exercises High Performance Computing resources to perform large-scale hyperparameter optimization using distributed training on multiple compute nodes. This is part of RAISE's work on data-driven use cases which leverages AI- and HPC cross-methods developed within the project. In response to the demand for parallelizable and resource efficient hyperparameter optimization methods, advanced hyperparameter search algorithms are benchmarked and compared. The evaluated algorithms, including Random Search, Hyperband and ASHA, are tested and compared in terms of both accuracy and accuracy per compute resources spent. As an example use case, a graph neural network model known as MLPF, developed for the task of Machine-Learned Particle-Flow reconstruction in High Energy Physics, acts as the base model for optimization. Results show that hyperparameter optimization significantly increased the performance of MLPF and that this would not have been possible without access to large-scale High Performance Computing resources. It is also shown that, in the case of MLPF, the ASHA algorithm in combination with Bayesian optimization gives the largest performance increase per compute resources spent out of the investigated algorithms.
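The successive-halving principle behind Hyperband and ASHA, mentioned above, can be sketched in a few lines: evaluate many configurations at a small training budget, keep the best fraction, and repeat with a larger budget. This is a toy synchronous version for illustration, not the asynchronous ASHA used in the paper; all names and the objective are illustrative.

```python
def successive_halving(configs, evaluate, budgets=(1, 3, 9), keep_frac=1/3):
    """Toy synchronous successive halving. evaluate(config, budget) returns a
    score (higher is better) for a config trained at the given budget."""
    survivors = list(configs)
    for budget in budgets:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[:max(1, int(len(scored) * keep_frac))]
    return survivors[0]

# Toy objective: score improves with budget; the best learning rate is 0.1.
def evaluate(lr, budget):
    return budget - (lr - 0.1) ** 2

best = successive_halving(
    [0.001, 0.01, 0.1, 0.3, 1.0, 3.0, 10.0, 0.05, 0.2], evaluate
)
print(best)  # 0.1
```

ASHA improves on this synchronous loop by promoting configurations asynchronously, so fast workers never wait for stragglers — which is what makes it attractive on large HPC allocations.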
DOI: 10.48550/arxiv.2203.08161
2022
Sensitivity Estimation for Dark Matter Subhalos in Synthetic Gaia DR2 using Deep Learning
The abundance of dark matter (DM) subhalos orbiting a host galaxy is a generic prediction of the cosmological framework, and is a promising way to constrain the nature of DM. In this paper, we investigate the use of machine learning-based tools to quantify the magnitude of phase-space perturbations caused by the passage of DM subhalos. A simple binary classifier and an anomaly detection model are proposed to estimate if stars or star particles close to DM subhalos are statistically detectable in simulations. The simulated datasets are three Milky Way-like galaxies and nine synthetic Gaia DR2 surveys derived from these. Firstly, we find that the anomaly detection algorithm, trained on a simulated galaxy with full 6D kinematic observables and applied on another galaxy, is nontrivially sensitive to the DM subhalo population. On the other hand, the classification-based approach is not sufficiently sensitive due to the extremely low statistics of signal stars for supervised training. Finally, the sensitivity of both algorithms in the Gaia-like surveys is negligible. The enormous size of the Gaia dataset motivates the further development of scalable and accurate data analysis methods that could be used to select potential regions of interest for DM searches to ultimately constrain the Milky Way's subhalo mass function, as well as simulations where to study the sensitivity of such methods under different signal hypotheses.
DOI: 10.13182/physor22-37372
2022
Predicting the Asymptotic State of Reactor Transients Using Supervised Learning
DOI: 10.3929/ethz-b-000276848
2018
Search for the production of the Higgs boson in association with a top quark pair with CMS at √s=13 TeV
DOI: 10.3929/ethz-b-000235748
2018
Search for resonant and nonresonant Higgs boson pair production in the bbℓνℓν final state in proton-proton collisions at √s=13 TeV
DOI: 10.5506/aphyspolbsupp.11.249
2018
Search for $t\bar {t}H$, $H \rightarrow b\bar {b}$ at the CMS Experiment in 2016 Using 12.9 fb$^{-1}$ of $pp$ Collision Data
DOI: 10.3929/ethz-b-000345484
2018
Search for new long-lived particles at √s=13 TeV
2019
hepaccelerate: Fast Analysis of Columnar Collider Data
At HEP experiments, processing terabytes of structured numerical event data to a few statistical summaries is a common task. This step involves selecting events and objects within the event, reconstructing high-level variables, evaluating multivariate classifiers with up to hundreds of variations and creating thousands of low-dimensional histograms. Currently, this is done using multi-step workflows and batch jobs. Based on the CMS search for H(μμ), we demonstrate that it is possible to carry out significant parts of a real collider analysis at a rate of up to a million events per second on a single multicore server with optional GPU acceleration. This is achieved by representing HEP event data as memory-mappable sparse arrays, and by expressing common analysis operations as kernels that can be parallelized across the data using multithreading. We find that only a small number of relatively simple kernels are needed to implement significant parts of this Higgs analysis. Therefore, analysis of real collider datasets of billions of events could be done within minutes to a few hours using simple multithreaded codes, reducing the need for managing distributed workflows in the exploratory phase. This approach could speed up the cycle for delivering physics results at HEP experiments. We release the hepaccelerate prototype library as a demonstrator of such accelerated computational kernels. We look forward to discussion, further development and use of efficient and easy-to-use software for terabyte-scale high-level data analysis in the physical sciences.
DOI: 10.3929/ethz-b-000304146
2018
Performance of reconstruction and identification of τ leptons decaying to hadrons and ν_τ in pp collisions at √s=13 TeV
DOI: 10.3929/ethz-b-000242166
2018
Search for Higgsino pair production in pp collisions at √s=13 TeV in final states with large missing transverse momentum and two Higgs bosons decaying via H→bb̄
DOI: 10.22323/1.390.0908
2021
Data Analysis with GPU-Accelerated Kernels
At HEP experiments, processing billions of records of structured numerical data can be a bottleneck in the analysis pipeline. This step is typically more complex than current query languages allow, such that numerical codes are used. As highly parallel computing architectures are increasingly important in the computing ecosystem, it may be useful to consider how accelerators such as GPUs can be used for data analysis. Using CMS and ATLAS Open Data, we implement a benchmark physics analysis with GPU acceleration directly in Python based on efficient computational kernels using Numba/LLVM, resulting in an order of magnitude throughput increase over a pure CPU-based approach. We discuss the implementation and performance benchmarks of the physics kernels on CPU and GPU targets. We demonstrate how these kernels are combined into a modern ML-intensive workflow to enable efficient data analysis on high-performance servers and remark on possible operational considerations.
DOI: 10.1145/3457388.3458659
2021
Diolkos
In large networked systems, a sudden increase in traffic could slow down the network significantly, impacting network quality for multiple users. We present Diolkos, a system that leverages smart switches to dynamically reroute data flows in response to drops in performance. In contrast to other techniques, our tool predicts the future throughput at each port in a switch if a data flow were to be sent through it, and updates which port should be taken to maximize throughput. We use several techniques to predict network switch performance on a software defined network (SDN) mimicking topologies commonly found in datacenters. Experimentally, we demonstrate the effectiveness of choosing a port to send flows through based on predicted performance. We found that using a distributed predictive technique achieves a 24% improvement over using a traditional heuristic technique.
DOI: 10.5281/zenodo.4660697
2021
CoffeaTeam/coffea: Release v0.7.2
DOI: 10.5281/zenodo.4452282
2021
Simulated particle-level events of ttbar and QCD with PU200 using Pythia8+Delphes3 for machine learned particle flow (MLPF)
Dataset of 50,000 top quark-antiquark (ttbar) and 5,000 QCD events produced in proton-proton collisions at 14 TeV, overlaid with minimum bias events corresponding to a pileup of 200 on average. The dataset consists of detector hits as the input, generator particles as the ground truth and reconstructed particles from DELPHES for additional validation. The DELPHES model corresponds to a CMS-like detector with a multi-layered charged particle tracker, an electromagnetic and hadron calorimeter. Pythia8 and Delphes3 were used for the simulation. An explanation of the dataset and how to load it can be found in the included jupyter notebook delphes_dataset.ipynb. The simulated events are stored in Bzip2-compressed python pickle files in tev14_pythia8_{sample}_{seed}_{idx}.pkl.bz2, where {sample} is ttbar or qcd, {seed} is the random seed, {idx} is the file index. Each file contains 100 events. The Pythia8 configs are found in tev14_pythia8_ttbar.py and tev14_pythia8_qcd.py, the Delphes config in delphes_card_CMS_PileUp.tcl.
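Since ygen and ycand are 0-padded to the same length as X, a common first step is stripping the padding rows (pid == 0). A minimal, self-contained sketch with a synthetic event following the documented 7-feature layout; the row values are illustrative, not taken from the dataset.

```python
import numpy as np

def strip_padding(y):
    """Drop 0-padded rows from a ygen/ycand array; column 0 is the pid."""
    return y[y[:, 0] != 0]

# Synthetic event: 2 real particles + 2 padding rows.
# Feature order per the description: [pid, charge, pt, eta, sin phi, cos phi, E]
ygen = np.array([
    [1.0, 1.0, 10.0,  0.5, 0.0, 1.0, 15.0],  # pid==1: charged hadron
    [3.0, 0.0,  5.0, -1.2, 0.7, 0.7,  8.0],  # pid==3: photon
    [0.0, 0.0,  0.0,  0.0, 0.0, 0.0,  0.0],  # pid==0: padding
    [0.0, 0.0,  0.0,  0.0, 0.0, 0.0,  0.0],  # pid==0: padding
])
print(strip_padding(ygen).shape)  # (2, 7)
```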
DOI: 10.5281/zenodo.4452283
2021
Simulated particle-level dataset of ttbar with PU200 using Pythia8+Delphes3 for machine learned particle flow (MLPF)
Dataset of 50,000 top quark-antiquark (ttbar) events produced in proton-proton collisions at 14 TeV, overlaid with minimum bias events corresponding to a pileup of 200 on average. The dataset consists of detector hits as the input, generator particles as the ground truth and reconstructed particles from DELPHES for additional validation. The DELPHES model corresponds to a CMS-like detector with a multi-layered charged particle tracker, an electromagnetic and hadron calorimeter. Pythia8 and Delphes3 were used for the simulation. Each file contains a bzip2-compressed python pickle with the following contents: <pre><code class="language-python">&gt; data = pickle.load(bz2.BZ2File("out/pythia8_ttbar/tev14_pythia8_ttbar_0_0.pkl.bz2", "rb"))

# Each file contains lists of arrays X (detector elements), ygen (generator particles) and ycand (rule-based PF particles from Delphes) for 100 events
&gt; len(data["ycand"]), len(data["ygen"]), len(data["X"])
(100, 100, 100)

# Each element in the list corresponds to an event. The first event in the file contains 5992 detector elements; ygen and ycand are 0-padded to the same length as X
&gt; data["X"][0].shape, data["ygen"][0].shape, data["ycand"][0].shape
((5992, 12), (5992, 7), (5992, 7))

# The X rows are detector elements: calorimeter towers and tracks with the following 12 features (0-padded)
# tower: [type==1, Et (GeV), eta, sin phi, cos phi, E (GeV), Eem (GeV), Ehad (GeV), 0, 0, 0, 0]
# track: [type==2, pt (GeV), eta, sin phi, cos phi, P (GeV), eta_outer, sin phi_outer, cos phi_outer, charge, is_gen_muon, is_gen_electron]

# The ygen (ycand) rows are generator-level truth particles (rule-based PF particles from Delphes) with the following features:
# [pid, charge, pt (GeV), eta, sin phi, cos phi, E (GeV)]
# pid==0: placeholder/padding entry
# pid==1: charged hadrons
# pid==2: neutral hadrons
# pid==3: photons
# pid==4: electrons
# pid==5: muons</code></pre>