S. Summers

DOI: 10.1038/s42256-021-00356-5
2021
Cited 98 times
Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors
Although the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference and therefore reduction in model size, latency and energy consumption. One technique to limit model size is quantization, which implies using fewer bits to represent weights and biases. Such an approach usually results in a decline in performance. Here, we introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip. With a per-layer, per-parameter type automatic quantization procedure, sampling from a wide range of quantizers, model energy consumption and size are minimized while high accuracy is maintained. This is crucial for the event selection procedure in proton–proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of ${\mathcal O}(1)~\mu$s is required. Nanosecond inference and a resource consumption reduced by a factor of 50 when implemented on field-programmable gate array hardware are achieved.

With edge computing on custom hardware, real-time inference with deep neural networks can reach the nanosecond timescale. An important application in this regime is event processing at particle collision detectors like those at the Large Hadron Collider (LHC). To ensure high performance as well as reduced resource consumption, a method is developed, and made available as an extension of the Keras library, to automatically design optimal quantization of the different layers in a deep neural network.
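The quantization approach described here is exposed through the QKeras library. As a minimal sketch, a heterogeneously quantized model can be built by swapping Keras layers for their QKeras drop-in replacements; the layer sizes and per-layer bit widths below are illustrative choices, not the paper's tuned configuration:

```python
# Minimal sketch of a heterogeneously quantized model with QKeras.
# Bit widths differ per layer, mimicking the paper's per-layer quantization;
# the specific widths here are hand-picked for illustration only.
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from qkeras import QDense, QActivation, quantized_bits, quantized_relu

inputs = Input(shape=(16,))
# 6-bit weights and biases on the first layer...
x = QDense(64,
           kernel_quantizer=quantized_bits(6, 0, alpha=1),
           bias_quantizer=quantized_bits(6, 0, alpha=1))(inputs)
x = QActivation(quantized_relu(6))(x)
# ...and a much narrower 2-bit quantizer on the next.
x = QDense(32,
           kernel_quantizer=quantized_bits(2, 1, alpha=1),
           bias_quantizer=quantized_bits(2, 1, alpha=1))(x)
x = QActivation(quantized_relu(4))(x)
outputs = QDense(5,
                 kernel_quantizer=quantized_bits(6, 0, alpha=1),
                 bias_quantizer=quantized_bits(6, 0, alpha=1))(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```

In the paper the per-layer widths are not fixed by hand as above but chosen automatically, by optimizing accuracy against an energy or size estimate.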
DOI: 10.1088/2632-2153/aba042
2020
Cited 60 times
Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml
We present the implementation of binary and ternary neural networks in the hls4ml library, designed to automatically convert deep neural network models to digital circuits with field-programmable gate arrays (FPGA) firmware. Starting from benchmark models trained with floating point precision, we investigate different strategies to reduce the network's resource consumption by reducing the numerical precision of the network parameters to binary or ternary. We discuss the trade-off between model accuracy and resource consumption. In addition, we show how to balance between latency and accuracy by retaining full precision on a selected subset of network components. As an example, we consider two multiclass classification tasks: handwritten digit recognition with the MNIST data set and jet identification with simulated proton-proton collisions at the CERN Large Hadron Collider. The binary and ternary implementation has similar performance to the higher precision implementation while using drastically fewer FPGA resources.
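To illustrate the idea, here is a minimal sketch of a ternary-weight layer using QKeras quantizers, which pair with hls4ml for firmware generation; the particular mix of ternary and full-precision layers is an illustrative assumption, not the paper's benchmark architecture:

```python
# Minimal sketch of mixed ternary/full-precision layers with QKeras.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from qkeras import QDense, QActivation, ternary, binary

inputs = Input(shape=(16,))
# Ternary weights {-1, 0, +1}: multiplications reduce to add/subtract/skip.
x = QDense(32, kernel_quantizer=ternary(), bias_quantizer=ternary())(inputs)
x = QActivation(binary())(x)  # binary activations keep the logic cheap
# Retaining full precision on a selected subset of components, as the paper
# discusses, recovers most of the accuracy at modest resource cost.
outputs = Dense(5, activation="softmax")(x)
model = Model(inputs, outputs)
```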
DOI: 10.1088/2632-2153/ac0ea1
2021
Cited 53 times
Fast convolutional neural networks on FPGAs with hls4ml
We introduce an automated tool for deploying ultra low-latency, low-power deep neural networks with convolutional layers on field-programmable gate arrays (FPGAs). By extending the hls4ml library, we demonstrate an inference latency of 5 µs using convolutional architectures, targeting microsecond latency applications like those at the CERN Large Hadron Collider. Considering benchmark models trained on the Street View House Numbers Dataset, we demonstrate various methods for model compression in order to fit the computational constraints of a typical FPGA device used in trigger and data acquisition systems of particle detectors. In particular, we discuss pruning and quantization-aware training, and demonstrate how resource utilization can be significantly reduced with little to no loss in model accuracy. We show that the FPGA critical resource consumption can be reduced by 97% with zero loss in model accuracy, and by 99% when tolerating a 6% accuracy degradation.
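For orientation, a minimal sketch of the hls4ml conversion flow this paper extends is below; it assumes a trained Keras model `model` and input array `X` already exist, and the FPGA part number and reuse factor are placeholder choices:

```python
# Minimal sketch of the hls4ml Keras-to-firmware flow, assuming a trained
# Keras `model` and an input array `X` are already defined.
import hls4ml

config = hls4ml.utils.config_from_keras_model(model, granularity="name")
config["Model"]["ReuseFactor"] = 1  # fully parallel; raise to save resources

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="my_cnn_prj",
    part="xcvu9p-flga2104-2-e",  # example FPGA part, not the paper's target
)
hls_model.compile()           # builds a bit-accurate C emulation library
y_hls = hls_model.predict(X)  # compare against model.predict(X)
# hls_model.build()           # runs HLS synthesis to produce firmware
```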
DOI: 10.3389/fdata.2020.598927
2021
Cited 41 times
Distance-Weighted Graph Neural Networks on FPGAs for Real-Time Particle Reconstruction in High Energy Physics
Graph neural networks have been shown to achieve excellent performance for several crucial tasks in particle physics, such as charged particle tracking, jet tagging, and clustering. An important domain for the application of these networks is the FPGA-based first layer of real-time data filtering at the CERN Large Hadron Collider, which has strict latency and resource constraints. We discuss how to design distance-weighted graph networks that can be executed with a latency of less than $1\,\mu\mathrm{s}$ on an FPGA. To do so, we consider a representative task associated with particle reconstruction and identification in a next-generation calorimeter operating at a particle collider. We use a graph network architecture developed for such purposes, and apply additional simplifications to match the computing constraints of Level-1 trigger systems, including weight quantization. Using the $\mathtt{hls4ml}$ library, we convert the compressed models into firmware to be implemented on an FPGA. Performance of the synthesized models is presented both in terms of inference accuracy and resource usage.
DOI: 10.1088/1748-0221/15/05/p05026
2020
Cited 41 times
Fast inference of Boosted Decision Trees in FPGAs for particle physics
We describe the implementation of Boosted Decision Trees in the hls4ml library, which allows the translation of a trained model into FPGA firmware through an automated conversion process. Thanks to its fully on-chip implementation, hls4ml performs inference of Boosted Decision Tree models with extremely low latency. With a typical latency of less than 100 ns, this solution is suitable for FPGA-based real-time processing, such as in the Level-1 Trigger system of a collider experiment. These developments open up prospects for physicists to deploy BDTs in FPGAs for identifying the origin of jets, better reconstructing the energies of muons, and enabling better selection of rare signal processes.
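The BDT-to-firmware translation described here has since been maintained as the companion conifer package; the sketch below assumes conifer's scikit-learn converter and its default Xilinx HLS backend configuration:

```python
# Sketch of translating a trained BDT to FPGA firmware via conifer, the
# package that now carries the BDT support described in this paper.
# The conifer API below is an assumption based on its documented usage.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
import conifer

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
clf = GradientBoostingClassifier(n_estimators=20, max_depth=3).fit(X, y)

cfg = conifer.backends.xilinxhls.auto_config()  # default HLS backend config
cfg["OutputDir"] = "my_bdt_prj"
model = conifer.converters.convert_from_sklearn(clf, cfg)
model.compile()                     # C simulation of the generated firmware
y_hls = model.decision_function(X)  # compare with clf.decision_function(X)
```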
DOI: 10.3389/fdata.2022.787421
2022
Cited 23 times
Applications and Techniques for Fast Machine Learning in Science
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science-the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.
DOI: 10.1038/s42256-022-00441-3
2022
Cited 22 times
Autoencoders on field-programmable gate arrays for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider
To study the physics of fundamental particles and their interactions, the Large Hadron Collider was constructed at CERN, where protons collide to create new particles measured by detectors. Collisions occur at a frequency of 40 MHz, and with an event size of roughly 1 MB it is impossible to read out and store the generated amount of data from the detector; a multi-tiered, real-time filtering system is therefore required. In this paper, we show how to adapt and deploy deep-learning-based autoencoders for the unsupervised detection of new physics signatures in the challenging environment of a real-time event selection system at the Large Hadron Collider. The first-stage filter, implemented on custom electronics, decides within a few microseconds whether an event should be kept or discarded. At this stage, the rate is reduced from 40 MHz to about 100 kHz. We demonstrate the deployment of an unsupervised selection algorithm on this custom electronics, running in as little as 80 ns and enhancing the signal-over-background ratio by three orders of magnitude. This work enables the practical deployment of these networks during the next data-taking campaign of the Large Hadron Collider.

The Large Hadron Collider produces 40 million collision events per second, most of which need to be discarded by a real-time filtering system. Unsupervised deep learning algorithms are developed and deployed on custom electronics to search for rare events indicating new physics, rather than for specific events led by theory.
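The core of the approach is an autoencoder trained only on standard collision data, with the reconstruction error used as an anomaly score. A minimal Keras sketch of that idea follows; the input size, architecture, and random stand-in data are illustrative, not the deployed model:

```python
# Minimal sketch of autoencoder-based anomaly detection: train on
# "standard" events only, then flag events with large reconstruction error.
import numpy as np
from tensorflow.keras import layers, models

inp = layers.Input(shape=(56,))          # illustrative input size
h = layers.Dense(32, activation="relu")(inp)
z = layers.Dense(3, activation="relu")(h)   # small latent bottleneck
h2 = layers.Dense(32, activation="relu")(z)
out = layers.Dense(56)(h2)
ae = models.Model(inp, out)
ae.compile(optimizer="adam", loss="mse")

X_bkg = np.random.rand(10000, 56).astype("float32")  # stand-in for SM-like events
ae.fit(X_bkg, X_bkg, epochs=5, batch_size=256, verbose=0)

def anomaly_score(x):
    # Per-event mean squared reconstruction error; large values flag anomalies.
    return np.mean((x - ae.predict(x, verbose=0)) ** 2, axis=1)
```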
DOI: 10.1109/tns.2021.3087100
2021
Cited 27 times
A Reconfigurable Neural Network ASIC for Detector Front-End Data Compression at the HL-LHC
Despite advances in the programmable logic capabilities of modern trigger systems, a significant bottleneck remains in the amount of data to be transported from the detector to off-detector logic where trigger decisions are made. We demonstrate that a neural network autoencoder model can be implemented in a radiation tolerant ASIC to perform lossy data compression alleviating the data transmission problem while preserving critical information of the detector energy profile. For our application, we consider the high-granularity calorimeter from the CMS experiment at the CERN Large Hadron Collider. The advantage of the machine learning approach is in the flexibility and configurability of the algorithm. By changing the neural network weights, a unique data compression algorithm can be deployed for each sensor in different detector regions, and for changing detector or collider conditions. To meet area, performance, and power constraints, we perform a quantization-aware training to create an optimized neural network hardware implementation. The design is achieved through the use of high-level synthesis tools and the hls4ml framework, and was processed through synthesis and physical layout flows based on an LP CMOS 65 nm technology node. The flow anticipates 200 Mrad of ionizing radiation on selected gates, and reports a total area of 3.6 mm² and a power consumption of 95 mW. The simulated energy consumption per inference is 2.4 nJ. This is the first radiation tolerant on-detector ASIC implementation of a neural network that has been designed for particle physics applications.
DOI: 10.1088/1748-0221/12/12/p12019
2017
Cited 29 times
An FPGA based track finder for the L1 trigger of the CMS experiment at the High Luminosity LHC
A new tracking detector is under development for use by the CMS experiment at the High-Luminosity LHC (HL-LHC). A crucial requirement of this upgrade is to provide the ability to reconstruct all charged particle tracks with transverse momentum above 2–3 GeV within 4 μs so they can be used in the Level-1 trigger decision. A concept for an FPGA-based track finder using a fully time-multiplexed architecture is presented, where track candidates are reconstructed using a projective binning algorithm based on the Hough Transform, followed by a combinatorial Kalman Filter. A hardware demonstrator using MP7 processing boards has been assembled to prove the entire system functionality, from the output of the tracker readout boards to the reconstruction of tracks with fitted helix parameters. It successfully operates on one eighth of the tracker solid angle acceptance at a time, processing events taken at 40 MHz, each with an average of up to 200 superimposed proton-proton interactions, whilst satisfying the latency requirement. The demonstrated track-reconstruction system, the chosen architecture, the achievements to date and future options for such a system will be discussed.
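To make the projective binning concrete, here is a toy numpy sketch of the Hough transform step: each stub votes for all (phi0, q/pT) cells consistent with a helix through it, and heavily populated cells become track candidates passed on to the Kalman Filter fit. The field constant, binning, and threshold are illustrative, not the demonstrator's parameters:

```python
# Toy sketch of Hough-transform track finding in (phi0, q/pT) space.
import numpy as np

K = 0.0015  # ~0.5 * 0.003 * B[T] in GeV/c per cm; illustrative constant

def hough_fill(stubs, n_phi=32, n_qpt=32, qpt_max=1 / 2.0):
    """stubs: iterable of (r_cm, phi_rad). Returns the accumulator array."""
    acc = np.zeros((n_phi, n_qpt), dtype=int)
    qpt_bins = np.linspace(-qpt_max, qpt_max, n_qpt)
    for r, phi in stubs:
        # For each candidate q/pT, the compatible phi0 lies on a straight
        # line in Hough space: phi0 = phi + K * r * (q/pT).
        phi0 = phi + K * r * qpt_bins
        cols = np.arange(n_qpt)
        rows = ((phi0 % (2 * np.pi)) / (2 * np.pi) * n_phi).astype(int) % n_phi
        acc[rows, cols] += 1
    return acc

# Track candidates are cells whose stub count reaches a layer threshold,
# e.g. np.argwhere(hough_fill(stubs) >= 5); each is then fitted downstream.
```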
2020
Cited 18 times
Automatic deep heterogeneous quantization of Deep Neural Networks for ultra low-area, low-latency inference on the edge at particle colliders
While the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference, i.e. a reduction in model size, latency and energy consumption. A technique to limit model size is quantization, i.e. using fewer bits to represent weights and biases. Such an approach usually results in a decline in performance. Here, we introduce a novel method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip. With a per-layer, per-parameter type automatic quantization procedure, sampling from a wide range of quantizers, model energy consumption and size are minimized while high accuracy is maintained. This is crucial for the event selection procedure in proton-proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of ${\mathcal O}(1)~\mu$s is required. Nanosecond inference and a resource consumption reduced by a factor of $50$ when implemented on FPGA hardware is achieved.
2020
Cited 16 times
Ultra Low-latency, Low-area Inference Accelerators using Heterogeneous Deep Quantization with QKeras and hls4ml
In this paper, we introduce the QKeras library, an extension of the Keras library allowing for the creation of heterogeneously quantized versions of deep neural network models, through drop-in replacement of Keras layers. These models are trained quantization-aware, where the user can trade off model area or energy consumption against accuracy. We demonstrate how the reduction of numerical precision, through quantization-aware training, significantly reduces resource consumption while retaining high accuracy when implemented on FPGA hardware. Together with the hls4ml library, this allows for a fully automated deployment of quantized Keras models on chip, crucial for ultra low-latency inference. As a benchmark problem, we consider a classification task for the triggering of events in proton-proton collisions at the CERN Large Hadron Collider, where a latency of ${\mathcal O}(1)~\mu$s is required.
DOI: 10.48550/arxiv.2012.01563
2020
Cited 15 times
Accelerated Charged Particle Tracking with Graph Neural Networks on FPGAs
We develop and study FPGA implementations of algorithms for charged particle tracking based on graph neural networks. The two complementary FPGA designs are based on OpenCL, a framework for writing programs that execute across heterogeneous platforms, and hls4ml, a high-level-synthesis-based compiler for neural network to firmware conversion. We evaluate and compare the resource usage, latency, and tracking performance of our implementations based on a benchmark dataset. We find a considerable speedup over CPU-based execution is possible, potentially enabling such algorithms to be used effectively in future computing workflows and the FPGA-based Level-1 trigger at the CERN Large Hadron Collider.
DOI: 10.1109/asap52443.2021.00025
2021
Cited 13 times
Accelerating Recurrent Neural Networks for Gravitational Wave Experiments
This paper presents novel reconfigurable architectures for reducing the latency of recurrent neural networks (RNNs) that are used for detecting gravitational waves. Gravitational interferometers such as the LIGO detectors capture cosmic events such as black hole mergers which happen at unknown times and of varying durations, producing time-series data. We have developed a new architecture capable of accelerating RNN inference for analyzing time-series data from LIGO detectors. This architecture is based on optimizing the initiation intervals (II) in a multi-layer LSTM (Long Short-Term Memory) network, by identifying appropriate reuse factors for each layer. A customizable template for this architecture has been designed, which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools. The proposed approach has been evaluated based on two LSTM models, targeting a ZYNQ 7045 FPGA and a U250 FPGA. Experimental results show that with balanced II, the number of DSPs can be reduced up to 42% while achieving the same IIs. When compared to other FPGA-based LSTM designs, our design can achieve about 4.92 to 12.4 times lower latency.
DOI: 10.1088/2632-2153/acc0d7
2023
Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml
Recurrent neural networks have been shown to be effective architectures for many tasks in high energy physics, and thus have been widely adopted. Their use in low-latency environments has, however, been limited as a result of the difficulties of implementing recurrent architectures on field-programmable gate arrays (FPGAs). In this paper we present an implementation of two types of recurrent neural network layers—long short-term memory and gated recurrent unit—within the hls4ml framework. We demonstrate that our implementation is capable of producing effective designs for both small and large models, and can be customized to meet specific design requirements for inference latencies and FPGA resources. We show the performance and synthesized designs for multiple neural networks, many of which are trained specifically for jet identification tasks at the CERN Large Hadron Collider.
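A minimal sketch of using this recurrent-layer support follows, via hls4ml's standard Keras conversion API; the model, part number, and reuse factor are illustrative, and the config key assumes Keras's default layer name "lstm":

```python
# Minimal sketch of converting a small Keras LSTM with hls4ml.
import hls4ml
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(20, 6)),  # 20 time steps, 6 features per step
    layers.LSTM(32),
    layers.Dense(5, activation="softmax"),
])

config = hls4ml.utils.config_from_keras_model(model, granularity="name")
# Larger reuse factors serialize multiplications, trading latency for DSPs;
# "lstm" is the default Keras layer name assumed here.
config["LayerName"]["lstm"]["ReuseFactor"] = 4

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, output_dir="my_rnn_prj",
    part="xcvu9p-flga2104-2-e")  # placeholder FPGA part
hls_model.compile()
```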
DOI: 10.48550/arxiv.2402.01876
2024
Sets are all you need: Ultrafast jet classification on FPGAs for HL-LHC
We study various machine learning based algorithms for performing accurate jet flavor classification on field-programmable gate arrays and demonstrate how latency and resource consumption scale with the input size and choice of algorithm. These architectures provide an initial design for models that could be used for tagging at the CERN LHC during its high-luminosity phase. The high-luminosity upgrade will lead to a five-fold increase in its instantaneous luminosity for proton-proton collisions and, in turn, higher data volume and complexity, such as the availability of jet constituents. Through quantization-aware training and efficient hardware implementations, we show that O(100) ns inference of complex architectures such as deep sets and interaction networks is feasible at a low computational resource cost.
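As a sketch of the permutation-invariant "deep sets" idea behind the constituent-based tagger (a shared per-particle network followed by a symmetric pooling), where the sizes and pooling choice are illustrative rather than the paper's architecture:

```python
# Minimal sketch of a deep sets classifier over jet constituents.
from tensorflow.keras import layers, models

n_const, n_feat = 16, 3  # constituents per jet, features per constituent
inp = layers.Input(shape=(n_const, n_feat))
# "phi" network: Dense acts on the last axis, i.e. identically on each
# constituent, so the jet is treated as an unordered set.
x = layers.Dense(32, activation="relu")(inp)
x = layers.Dense(32, activation="relu")(x)
# Symmetric aggregation over the set makes the output permutation invariant.
pooled = layers.GlobalAveragePooling1D()(x)
# "rho" network on the aggregated representation.
x = layers.Dense(16, activation="relu")(pooled)
out = layers.Dense(5, activation="softmax")(x)  # e.g. five flavour classes
model = models.Model(inp, out)
```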
DOI: 10.1051/epjconf/202429502024
2024
Reconstructing jets in the Phase-2 upgrade of the CMS Level-1 Trigger with a seeded cone algorithm
The Phase-2 Upgrade of the CMS Level-1 Trigger (L1T) will reconstruct particles using the Particle Flow algorithm, connecting information from the tracker, muon, and calorimeter detectors, and enabling fine-grained reconstruction of high level physics objects like jets. We have developed a jet reconstruction algorithm using a cone centred on an energetic seed from these Particle Flow candidates. The implementation is designed to find up to 16 jets in each Xilinx Ultrascale+ FPGA, with a latency of less than 1 µs, and event throughput of 6.7 MHz to fit within the L1T system constraints. Pipelined processing enables reconstruction of jet collections with different cone sizes for little additional resource cost. The design of the algorithm also provides a platform for additional computation using the jet constituents, such as jet tagging using neural networks. We will describe the implementation, its jet reconstruction performance, computational metrics, and the developments towards jet tagging.
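A pure-Python reference of the seeded-cone logic, useful for understanding the algorithm though nothing like the pipelined FPGA implementation; the cone size is a typical illustrative value and the jet limit mirrors the text:

```python
# Toy reference of seeded-cone jet clustering: repeatedly take the hardest
# unused candidate as a seed and sum everything within dR < R into a jet.
import numpy as np

def seeded_cone(pt, eta, phi, R=0.4, max_jets=16):
    pt, eta, phi = map(np.asarray, (pt, eta, phi))
    jets, used = [], np.zeros(len(pt), dtype=bool)
    for _ in range(max_jets):
        if used.all():
            break
        seed = np.argmax(np.where(used, -np.inf, pt))  # hardest unused candidate
        dphi = (phi - phi[seed] + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
        in_cone = (~used) & ((eta - eta[seed]) ** 2 + dphi ** 2 < R ** 2)
        jets.append((pt[in_cone].sum(), eta[seed], phi[seed]))
        used |= in_cone
    return jets  # list of (jet pT, seed eta, seed phi)
```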
DOI: 10.1109/rtc.2016.7543102
2016
Cited 9 times
An FPGA-based track finder for the L1 trigger of the CMS experiment at the high luminosity LHC
A new tracking system is under development for operation in the CMS experiment at the High Luminosity LHC. It includes an outer tracker which will construct stubs, built by correlating clusters in two closely spaced sensor layers for the rejection of hits from low transverse momentum tracks, and transmit them off-detector at 40 MHz. If tracker data is to contribute to keeping the Level-1 trigger rate at around 750 kHz under increased luminosity, a crucial component of the upgrade will be the ability to identify tracks with transverse momentum above 3 GeV/c by building tracks out of stubs. A concept for an FPGA-based track finder using a fully time-multiplexed architecture is presented, where track candidates are identified using a projective binning algorithm based on the Hough Transform. A hardware system based on the MP7 MicroTCA processing card has been assembled, demonstrating a realistic slice of the track finder in order to help gauge the performance and requirements for a full system. This paper outlines the system architecture and algorithms employed, highlighting some of the first results from the hardware demonstrator and discusses the prospects and performance of the completed track finder.
1991
Cited 20 times
The use of machine learning program LERS-LB 2.5 in knowledge acquisition for expert system development in nursing.
LERS-LB (Learning from Examples using Rough Sets Lower Boundaries) is a computer program based on rough set theory for knowledge acquisition, which extracts patterns from real-world data in generating production rules for expert system development. From LERS-LB evaluation of an SPSS-X data file containing data for recovery room patients, it was concluded that both statistical data files and existing databases can be converted to decision-table format needed by LERS-LB, but it is less desirable to work with statistical files than a well-developed database. It was also concluded that choosing a well-developed database and checking it thoroughly for accuracy and completeness should be done before running LERS-LB, or other learning programs, to avoid problems with data errors. Using rough set theory and a technique called 'dropping conditions' LERS-LB offers, at least in theory, a possible method for identifying which data items are critical to nursing practice. Further research and continued LERS-LB program enhancements still may help with identifying critical data items versus redundant data for nursing practice. LERS-LB, and other learning programs, offer techniques which will help reduce the knowledge acquisition bottleneck in nursing expert system development. It is doubtful, however, that learning programs will eliminate the need for involving domain experts in evaluating rules and expert systems for clinical decision support.
DOI: 10.48550/arxiv.2206.07527
2022
Cited 3 times
QONNX: Representing Arbitrary-Precision Quantized Neural Networks
We present extensions to the Open Neural Network Exchange (ONNX) intermediate representation format to represent arbitrary-precision quantized neural networks. We first introduce support for low precision quantization in existing ONNX-based quantization formats by leveraging integer clipping, resulting in two new backward-compatible variants: the quantized operator format with clipping and quantize-clip-dequantize (QCDQ) format. We then introduce a novel higher-level ONNX format called quantized ONNX (QONNX) that introduces three new operators -- Quant, BipolarQuant, and Trunc -- in order to represent uniform quantization. By keeping the QONNX IR high-level and flexible, we enable targeting a wider variety of platforms. We also present utilities for working with QONNX, as well as examples of its usage in the FINN and hls4ml toolchains. Finally, we introduce the QONNX model zoo to share low-precision quantized neural networks.
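For intuition, the uniform quantization that the QONNX Quant operator expresses can be written in a few lines of numpy; this is a generic quantize-dequantize reference, and the operator's actual rounding and signedness behavior are configurable attributes:

```python
# Reference implementation of uniform quantize-dequantize, the semantics
# the QONNX Quant operator encodes with scale, zero point, and bit width.
import numpy as np

def quant_dequant(x, scale, zero_point, bits, signed=True):
    qmin = -(2 ** (bits - 1)) if signed else 0
    qmax = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    # Round onto the integer grid, clip to the representable range...
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    # ...then dequantize, since QONNX graphs carry float values between ops.
    return (q - zero_point) * scale

x = np.array([-1.3, -0.2, 0.0, 0.7, 2.5])
print(quant_dequant(x, scale=0.5, zero_point=0.0, bits=4))
# 4-bit grid with step 0.5, e.g. -1.3 -> -1.5 and 0.7 -> 0.5
```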
DOI: 10.1088/2632-2153/ac9cb5
2022
Cited 3 times
Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml
In this paper, we investigate how field programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks relevant for autonomous driving. Considering compressed versions of the ENet convolutional neural network architecture, we demonstrate a fully-on-chip deployment with a latency of 4.9 ms per image, using less than 30% of the available resources on a Xilinx ZCU102 evaluation board. The latency is reduced to 3 ms per image when increasing the batch size to ten, corresponding to the use case where the autonomous vehicle receives inputs from multiple cameras simultaneously. We show, through aggressive filter reduction and heterogeneous quantization-aware training, and an optimized implementation of convolutional layers, that the power consumption and resource utilization can be significantly reduced while maintaining accuracy on the Cityscapes dataset.
DOI: 10.1088/1748-0221/12/02/c02015
2017
Cited 5 times
Using MaxCompiler for the high level synthesis of trigger algorithms
Firmware for FPGA trigger applications at the CMS experiment is conventionally written using hardware description languages such as Verilog and VHDL. MaxCompiler is an alternative, Java based, tool for developing FPGA applications which uses a higher level of abstraction from the hardware than a hardware description language. An implementation of the jet and energy sum algorithms for the CMS Level-1 calorimeter trigger has been written using MaxCompiler to benchmark against the VHDL implementation in terms of accuracy, latency, resource usage, and code size. A Kalman Filter track fitting algorithm has been developed using MaxCompiler for a proposed CMS Level-1 track trigger for the High-Luminosity LHC upgrade. The design achieves a low resource usage, and has a latency of 187.5 ns per iteration.
DOI: 10.1145/3289602.3293986
2019
Cited 4 times
Fast Inference of Deep Neural Networks for Real-time Particle Physics Applications
Machine learning methods are ubiquitous and have proven to be very powerful in LHC physics, and particle physics as a whole. However, exploration of such techniques in low-latency, low-power FPGA (Field Programmable Gate Array) hardware has only just begun. FPGA-based trigger and data acquisition systems have extremely low, sub-microsecond latency requirements that are unique to particle physics. We present a case study for neural network inference in FPGAs focusing on a classifier for jet substructure which would enable many new physics measurements. While we focus on a specific example, the lessons are far-reaching. A compiler package is developed based on High-Level Synthesis (HLS) called HLS4ML to build machine learning models in FPGAs. The use of HLS increases accessibility across a broad user community and allows for a drastic decrease in firmware development time. We map out FPGA resource usage and latency versus neural network hyperparameters to allow for directed resource tuning in the low latency environment and assess the impact on our benchmark physics performance scenario. For our example jet substructure model, we fit well within the available resources of modern FPGAs with latency on the scale of 100 ns.
DOI: 10.48550/arxiv.2008.13636
2020
Cited 4 times
HL-LHC Computing Review: Common Tools and Community Software
Common and community software packages, such as ROOT, Geant4 and event generators have been a key part of the LHC's success so far and continued development and optimisation will be critical in the future. The challenges are driven by an ambitious physics programme, notably the LHC accelerator upgrade to high-luminosity, HL-LHC, and the corresponding detector upgrades of ATLAS and CMS. In this document we address the issues for software that is used in multiple experiments (usually even more widely than ATLAS and CMS) and maintained by teams of developers who are either not linked to a particular experiment or who contribute to common software within the context of their experiment activity. We also give space to general considerations for future software and projects that tackle upcoming challenges, no matter who writes it, which is an area where community convergence on best practice is extremely useful.
DOI: 10.23919/fpl.2017.8056825
2017
Cited 3 times
A novel FPGA-based track reconstruction approach for the level-1 trigger of the CMS experiment at CERN
The Compact Muon Solenoid (CMS) experiment at CERN is scheduled for a major upgrade in the next decade in order to meet the demands of the new High Luminosity Large Hadron Collider. Amongst others, a new tracking system is under development including an outer tracker capable of rejecting low transverse momentum particles by looking at the coincidences of hits (stubs) in two closely spaced sensor layers in the same tracker module. Accepted stubs are transmitted off-detector for further processing at 40 MHz. In order to maintain the Level-1 trigger rate at 750 kHz under the increased luminosity, tracker data need to be included in the decision making process. For this purpose, a system architecture has to be developed that will be able to identify particles with transverse momentum above 3 GeV/c by building tracks out of stubs, while achieving a maximum overall processing latency of 4 µs. Targeting these requirements, the current paper presents an FPGA-based track finding architecture that identifies track candidates in real-time and bases its functionality on a fully time-multiplexed approach. As a proof of concept, a hardware system has been assembled targeting the MP7 MicroTCA processing card that features a Xilinx Virtex-7 FPGA, demonstrating a realistic slice of the track finder. The paper discusses the algorithms' implementation and the efficient utilisation of the available FPGA resources, outlines the system architecture, and presents some of the hardware demonstrator results.
DOI: 10.1051/epjconf/201921401003
2019
Cited 3 times
Kalman Filter track reconstruction on FPGAs for acceleration of the High Level Trigger of the CMS experiment at the HL-LHC
Track reconstruction at the CMS experiment uses the Combinatorial Kalman Filter. The algorithm computation time scales exponentially with pileup, which will pose a problem for the High Level Trigger at the High Luminosity LHC. FPGAs, which are already used extensively in hardware triggers, are becoming more widely used for compute acceleration. With a combination of high performance, energy efficiency, and predictable and low latency, FPGA accelerators are an interesting technology for high energy physics. Here, progress towards porting of the CMS track reconstruction to Maxeler Technologies’ Dataflow Engines is shown, programmed with their high level language MaxJ. The performance is compared to CPUs, and further steps to optimise for the architecture are presented.
2021
Cited 3 times
hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices
Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate machine learning algorithms for implementation with both FPGA and ASIC technologies. We expand on previous hls4ml work by extending capabilities and techniques towards low-power implementations and increased usability: new Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long pipeline kernels for low power, and new device backends include an ASIC workflow. Taken together, these and continued efforts in hls4ml will arm a new generation of domain scientists with accessible, efficient, and powerful tools for machine-learning-accelerated discovery.
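One of the compression ingredients mentioned here, magnitude pruning, can be sketched with the TensorFlow Model Optimization toolkit; the model and the 75% sparsity target are illustrative choices, and the paper's quantization-aware pruning additionally combines this kind of pruning with QKeras quantizers:

```python
# Minimal sketch of magnitude pruning with tensorflow_model_optimization,
# one ingredient of the compression flow hls4ml builds on.
import tensorflow_model_optimization as tfmot
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(16,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(5, activation="softmax"),
])

# Wrap the model so that, during training, the smallest-magnitude 75% of
# weights in each layer are progressively masked to zero.
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.75, begin_step=0),
)
pruned.compile(optimizer="adam", loss="categorical_crossentropy")
# Training needs the pruning callback to update the masks each step:
# pruned.fit(X, y, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```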
DOI: 10.1051/epjconf/202125104027
2021
Cited 3 times
Jet Single Shot Detection
We apply object detection techniques based on Convolutional Neural Networks to jet reconstruction and identification at the CERN Large Hadron Collider. In particular, we focus on CaloJet reconstruction, representing each event as an image composed of calorimeter cells and using a Single Shot Detection network, called Jet-SSD. The model performs simultaneous localization and classification and additional regression tasks to measure jet features. We investigate Ternary Weight Networks with weights constrained to {-1, 0, 1} times layer- and channel-dependent scaling factors. We show that the quantized version of the network closely matches the performance of its full-precision equivalent.
DOI: 10.1088/1742-6596/2438/1/012106
2023
Neural Network-Based Primary Vertex Reconstruction with FPGAs for the Upgrade of the CMS Level-1 Trigger System
The CMS experiment will be upgraded to maintain physics sensitivity and exploit the improved performance of the High Luminosity LHC. Part of this upgrade will see the first level (Level-1) trigger use charged particle tracks reconstructed within the full outer silicon tracker volume as an input for the first time and new algorithms are being designed to make use of these tracks. One such algorithm is primary vertex finding which is used to identify the hard scatter in an event and separate the primary interaction from additional simultaneous interactions. This work presents a novel approach to regress the primary vertex position and to reject tracks from additional soft interactions, which uses an end-to-end neural network. This neural network possesses simultaneous knowledge of all stages in the reconstruction chain, which allows for end-to-end optimisation. The improved performance of this network versus a baseline approach in the primary vertex regression and track-to-vertex classification is shown. A quantised and pruned version of the neural network is deployed on an FPGA to match the stringent timing and computing requirements of the Level-1 Trigger.
DOI: 10.1088/2632-2153/acc0d7/v2/response1
2023
Author response for "Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml"
DOI: 10.5281/zenodo.7969364
2023
Realizing Attosecond Core-Level X-ray Spectroscopy for the Investigation of Condensed Matter Systems
DOI: 10.48550/arxiv.2310.08062
2023
Reconstructing jets in the Phase-2 upgrade of the CMS Level-1 Trigger with a seeded cone algorithm
The Phase-2 Upgrade of the CMS Level-1 Trigger (L1T) will reconstruct particles using the Particle Flow algorithm, connecting information from the tracker, muon, and calorimeter detectors, and enabling fine-grained reconstruction of high level physics objects like jets. We have developed a jet reconstruction algorithm using a cone centred on an energetic seed from these Particle Flow candidates. The implementation is designed to find up to 16 jets in each Xilinx Ultrascale+ FPGA, with a latency of less than 1 µs, and event throughput of 6.7 MHz to fit within the L1T system constraints. Pipelined processing enables reconstruction of jet collections with different cone sizes for little additional resource cost. The design of the algorithm also provides a platform for additional computation using the jet constituents, such as jet tagging using neural networks. We will describe the implementation, its jet reconstruction performance, computational metrics, and the developments towards jet tagging.
DOI: 10.25560/66689
2018
Application of FPGAs to Triggering in High Energy Physics
The High Luminosity upgrade of the LHC will increase the instantaneous luminosity to 5 × 10^34 cm^-2 s^-1, resulting in an increase in the number of simultaneous proton-proton collisions per event (pileup) to the range 140-200. The CMS Level 1 Trigger system will be upgraded, and will reduce the 40 MHz event rate to 750 kHz. The system will perform a fast event reconstruction on FPGA devices, and select events for read out within a latency of 12.5 μs. The high level FPGA programming tool MaxCompiler is investigated for use in Level 1 Trigger applications. An existing trigger algorithm, originally developed with a Hardware Description Language, is reimplemented using MaxCompiler and compared to the original. Bitwise agreement between the outputs is observed, with half as many lines of code, at the expense of some extra FPGA resources. A hardware demonstration of a proposed Level 1 track reconstruction is presented, with a Kalman Filter track fit developed with MaxCompiler. The performance of the tracking is investigated, as well as the potential for developing advanced algorithms with low latency using the tool. A high tracking efficiency, and precise parameter resolutions, are achieved with a 3.7 μs latency in high pileup events. A boosted decision tree classifier, implemented with an inference latency of a few clock cycles, is presented as a means to reject fake tracks. After the Level 1 Trigger, events are further processed on commodity PCs in the High Level Trigger (HLT). The High Luminosity LHC will also challenge the HLT, which is projected to require twenty times the processing power used during LHC Run II. Part of the HLT tracking is ported to Maxeler Dataflow Engines (DFEs), a hardware acceleration technology. A faster rate of processing is achieved, but with an initial latency of the host-DFE communication that limits the performance. Steps which might yield acceleration are identified.
DOI: 10.1109/rtc.2016.7543110
2016
Emulation of a prototype FPGA track finder for the CMS Phase-2 upgrade with the CIDAF emulation framework
The CMS collaboration is preparing a major upgrade of its detector, so it can operate during the high luminosity run of the LHC from 2026. The upgraded tracker electronics will reconstruct the trajectories of charged particles within a latency of a few microseconds, so that they can be used by the level-1 trigger. An emulation framework, CIDAF, has been developed to provide a reference for a proposed FPGA-based implementation of this track finder, which employs a Time-Multiplexed (TM) technique for data processing.
DOI: 10.1088/2632-2153/ac7a02
2022
Lightweight jet reconstruction and identification as an object detection task
We apply object detection techniques based on deep convolutional blocks to end-to-end jet identification and reconstruction tasks encountered at the CERN Large Hadron Collider (LHC). Collision events produced at the LHC and represented as an image composed of calorimeter and tracker cells are given as an input to a Single Shot Detection network. The algorithm, named PFJet-SSD, performs simultaneous localization, classification and regression tasks to cluster jets and reconstruct their features. This all-in-one single feed-forward pass gives advantages in terms of execution time and an improved accuracy w.r.t. traditional rule-based methods. A further gain is obtained from network slimming, homogeneous quantization, and optimized runtime for meeting memory and latency constraints of a typical real-time processing environment. We experiment with 8-bit and ternary quantization, benchmarking their accuracy and inference latency against a single-precision floating-point baseline. We show that the ternary network closely matches the performance of its full-precision equivalent and outperforms the state-of-the-art rule-based algorithm. Finally, we report the inference latency on different hardware platforms and discuss future applications.
DOI: 10.48550/arxiv.2207.00559
2022
Ultra-low latency recurrent neural network inference on FPGAs for physics applications with hls4ml
Recurrent neural networks have been shown to be effective architectures for many tasks in high energy physics, and thus have been widely adopted. Their use in low-latency environments has, however, been limited as a result of the difficulties of implementing recurrent architectures on field-programmable gate arrays (FPGAs). In this paper we present an implementation of two types of recurrent neural network layers -- long short-term memory and gated recurrent unit -- within the hls4ml framework. We demonstrate that our implementation is capable of producing effective designs for both small and large models, and can be customized to meet specific design requirements for inference latencies and FPGA resources. We show the performance and synthesized designs for multiple neural networks, many of which are trained specifically for jet identification tasks at the CERN Large Hadron Collider.
DOI: 10.62915/2157-0396.2328
2020
Georgia Librarians Working from Home during the COVID-19 Pandemic
DOI: 10.1109/cleo.2001.947682
2001
A 5 gigabit/sec all-optical parallel analog-to-digital converter
Summary form only given. We present the design and demonstration of an all-optical A/D converter whose attractive features include (1) direct optical sampling of the input RF signal, (2) optical quantization scheme, (3) utilization of optical signal multiplexing to provide parallel processing of quantized signals, and (4) low-power consumption.
DOI: 10.48550/arxiv.2202.04499
2022
Lightweight Jet Reconstruction and Identification as an Object Detection Task
We apply object detection techniques based on deep convolutional blocks to end-to-end jet identification and reconstruction tasks encountered at the CERN Large Hadron Collider (LHC). Collision events produced at the LHC and represented as an image composed of calorimeter and tracker cells are given as an input to a Single Shot Detection network. The algorithm, named PFJet-SSD, performs simultaneous localization, classification and regression tasks to cluster jets and reconstruct their features. This all-in-one single feed-forward pass gives advantages in terms of execution time and an improved accuracy w.r.t. traditional rule-based methods. A further gain is obtained from network slimming, homogeneous quantization, and optimized runtime for meeting memory and latency constraints of a typical real-time processing environment. We experiment with 8-bit and ternary quantization, benchmarking their accuracy and inference latency against a single-precision floating-point baseline. We show that the ternary network closely matches the performance of its full-precision equivalent and outperforms the state-of-the-art rule-based algorithm. Finally, we report the inference latency on different hardware platforms and discuss future applications.
DOI: 10.1038/s42256-022-00486-4
2022
Author Correction: Autoencoders on field-programmable gate arrays for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider
2022
Lightweight Jet Reconstruction and Identification as an Object Detection Task
DOI: 10.48550/arxiv.2205.07690
2022
Real-time semantic segmentation on FPGAs for autonomous vehicles with hls4ml
In this paper, we investigate how field programmable gate arrays can serve as hardware accelerators for real-time semantic segmentation tasks relevant for autonomous driving. Considering compressed versions of the ENet convolutional neural network architecture, we demonstrate a fully-on-chip deployment with a latency of 4.9 ms per image, using less than 30% of the available resources on a Xilinx ZCU102 evaluation board. The latency is reduced to 3 ms per image when increasing the batch size to ten, corresponding to the use case where the autonomous vehicle receives inputs from multiple cameras simultaneously. We show, through aggressive filter reduction and heterogeneous quantization-aware training, and an optimized implementation of convolutional layers, that the power consumption and resource utilization can be significantly reduced while maintaining accuracy on the Cityscapes dataset.
DOI: 10.22323/1.313.0131
2018
An FPGA-based Track Finder for the L1 Trigger of the CMS Experiment at the HL-LHC
A new tracking detector is under development for use by the CMS experiment at the High-Luminosity LHC (HL-LHC). A crucial component of this upgrade will be the ability to reconstruct within a few microseconds all charged particle tracks with transverse momentum above 3 GeV, so they can be used in the Level-1 trigger decision. A concept for an FPGA-based track finder using a fully time-multiplexed architecture is presented, where track candidates are reconstructed using a projective binning algorithm based on the Hough Transform followed by a track fitting based on the linear regression technique. A hardware demonstrator using MP7 processing boards has been assembled to prove the entire system, from the output of the tracker readout boards to the reconstruction of tracks with fitted helix parameters. It successfully operates on one eighth of the tracker solid angle at a time, processing events taken at 40 MHz, each with up to 200 superimposed proton-proton interactions, whilst satisfying latency constraints. The demonstrated track-reconstruction system, the chosen architecture, the achievements to date and future options for such a system will be discussed.
DOI: 10.2172/1630707
2019
hls4ml: Deploying Deep Learning on FPGAs for L1 trigger and Data Acquisition
neural network to demonstrate state-of-the-art performance for top quark jet tagging at the LHC, and apply a ResNet-50 model with transfer learning for neutrino event classification. Using Project Brainwave by Microsoft to accelerate the ResNet-50 image classification model, we achieve average inference times of 60 (10) milliseconds with our experimental physics software framework using Brainwave as a cloud (edge or on-premises) service, representing an improvement by a factor of approximately 30 (175) in model inference latency over traditional CPU inference in current experimental hardware. A single FPGA service accessed by many CPUs achieves a throughput of 600--700 inferences per second using an image batch of one, comparable to large batch-size GPU throughput and significantly better than small batch-size GPU throughput. Deployed as an edge or cloud service for the particle physics computing model, coprocessor accelerators can have a higher duty cycle and are potentially much more cost-effective.
DOI: 10.5281/zenodo.3598989
2019
hls4ml: Deploying Deep Learning on FPGAs for L1 trigger and Data Acquisition [PowerPoint]
2021
hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices
Accessible machine learning algorithms, software, and diagnostic tools for energy-efficient devices and systems are extremely valuable across a broad range of application domains. In scientific domains, real-time near-sensor processing can drastically improve experimental design and accelerate scientific discoveries. To support domain scientists, we have developed hls4ml, an open-source software-hardware codesign workflow to interpret and translate machine learning algorithms for implementation with both FPGA and ASIC technologies. We expand on previous hls4ml work by extending capabilities and techniques towards low-power implementations and increased usability: new Python APIs, quantization-aware pruning, end-to-end FPGA workflows, long pipeline kernels for low power, and new device backends include an ASIC workflow. Taken together, these and continued efforts in hls4ml will arm a new generation of domain scientists with accessible, efficient, and powerful tools for machine-learning-accelerated discovery.
DOI: 10.5281/zenodo.4883651
2021
Jet Single Shot Detection
Data for training Jet Single Shot Detection network for simultaneous localization and classification of jets.
2021
arXiv : Jet Single Shot Detection
2021
Fast RNN Inference on an FPGA
2021
Autoencoders on FPGAs for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider
In this paper, we show how to adapt and deploy anomaly detection algorithms based on deep autoencoders, for the unsupervised detection of new physics signatures in the extremely challenging environment of a real-time event selection system at the Large Hadron Collider (LHC). We demonstrate that new physics signatures can be enhanced by three orders of magnitude, while staying within the strict latency and resource constraints of a typical LHC event filtering system. This would allow for collecting high-statistics, potentially new physics enhanced datasets for novel physics searches. Through per-layer, highly parallel implementations of network layers, support for autoencoder-specific losses on FPGAs and latent space based inference, we demonstrate that anomaly detection can be performed as fast as $80\,$ns using less than 3% of FPGA resources.
2021
Autoencoders on FPGAs for real-time, unsupervised new physics detection at 40 MHz at the Large Hadron Collider
DOI: 10.48550/arxiv.2110.13041
2021
Applications and Techniques for Fast Machine Learning in Science
In this community review report, we discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating powerful ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlapping challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs.