
M. Dobson

Here are all the papers by M. Dobson that you can download and read on OA.mg.

DOI: 10.1016/j.nima.2009.04.009
2009
Cited 107 times
Testbeam studies of production modules of the ATLAS Tile Calorimeter
We report test beam studies of 11% of the production ATLAS Tile Calorimeter modules. The modules were equipped with production front-end electronics and all the calibration systems planned for the final detector. The studies used muon, electron and hadron beams ranging in energy from 3 to 350 GeV. Two independent studies showed that the light yield of the calorimeter was ∼70 pe/GeV, exceeding the design goal by 40%. Electron beams provided a calibration of the modules at the electromagnetic energy scale. Over 200 calorimeter cells, the variation of the response was 2.4%. The linearity with energy was also measured. Muon beams provided an intercalibration of the response of all calorimeter cells. The response to muons entering in the ATLAS projective geometry showed an RMS variation of 2.5% for 91 measurements over a range of rapidities and modules. The mean response to hadrons of fixed energy had an RMS variation of 1.4% for the modules and projective angles studied. The response to hadrons normalized to incident beam energy showed an 8% increase between 10 and 350 GeV, fully consistent with expectations for a noncompensating calorimeter. The measured energy resolution for hadrons of σ/E = 52.9%/√E ⊕ 5.7% was also consistent with expectations. Other auxiliary studies were made of saturation recovery of the readout system, the time resolution of the calorimeter and the performance of the trigger signals from the calorimeter.
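As a quick numerical reading of the quoted resolution, the sketch below evaluates σ/E at a few beam energies, assuming the conventional form in which the stochastic term scales as 1/√E (E in GeV) and using only the two terms quoted in the abstract.

```python
import math

def tile_hadron_resolution(e_gev, stoch=0.529, const=0.057):
    """Fractional resolution sigma/E = stoch/sqrt(E) (+) const, the two
    terms added in quadrature, with the values quoted in the abstract."""
    return math.hypot(stoch / math.sqrt(e_gev), const)

for e in (10, 50, 100, 350):
    print(f"E = {e:3d} GeV  ->  sigma/E = {tile_hadron_resolution(e):.1%}")
```

At 10 GeV the stochastic term dominates (about 17%), while at 350 GeV the resolution approaches the constant term (about 6%).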
DOI: 10.1016/j.nima.2010.01.037
2010
Cited 40 times
Measurement of pion and proton response and longitudinal shower profiles up to 20 nuclear interaction lengths with the ATLAS Tile calorimeter
The response of the ATLAS iron–scintillator Tile hadron calorimeter to pions and protons in the energy range 20–180 GeV, produced at CERN's SPS H8 test-beam line, has been measured. The test-beam configuration allowed the measurement of the longitudinal shower development for pions and protons up to 20 nuclear interaction lengths. It was found that pions penetrate deeper into the calorimeter than protons. However, protons induce showers that are laterally wider with respect to the direction of the impinging particle. Together with the measured total energy response, the pion-to-proton energy ratio and the resolution, all observations are consistent with a higher electromagnetic energy fraction in pion-induced showers. The data are compared with GEANT4 simulations using several hadronic physics lists. The measured longitudinal shower profiles are described by an analytical shower parametrization within an accuracy of 5–10%. The amount of energy leaking out behind the calorimeter is determined and parametrized as a function of the beam energy and the calorimeter depth. This allows for a leakage correction of test-beam results in the standard projective geometry.
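The analytical parametrization itself is not given in the abstract. Purely as an illustration of the idea, the sketch below uses a generic Gamma-function longitudinal profile with invented shape parameters; it is not the parametrization fitted in the paper.

```python
import math

E0 = 100.0   # total deposited energy in GeV (arbitrary for the illustration)

def dE_dt(t, a=2.5, b=0.6):
    """Gamma-function longitudinal profile dE/dt, with t the depth in
    nuclear interaction lengths. The shape parameters a and b are invented
    for illustration, not fitted to the Tile calorimeter data."""
    return E0 * b * (b * t) ** (a - 1) * math.exp(-b * t) / math.gamma(a)

# Fraction of the shower energy deposited within the first 10 interaction
# lengths, estimated with a simple trapezoidal integration.
steps, depth = 1000, 10.0
dt = depth / steps
deposited = sum((dE_dt(i * dt) + dE_dt((i + 1) * dt)) * dt / 2 for i in range(steps))
print(f"contained within 10 lambda: {deposited / E0:.1%}")
```

Integrating such a profile up to a given depth gives the contained fraction, which is the quantity a leakage correction of the kind described above needs.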
DOI: 10.1016/j.nima.2010.04.054
2010
Cited 39 times
Study of energy response and resolution of the ATLAS barrel calorimeter to hadrons of energies from 20 to 350 GeV
A fully instrumented slice of the ATLAS detector was exposed to test beams from the SPS (Super Proton Synchrotron) at CERN in 2004. In this paper, the results of the measurements of the response of the barrel calorimeter to hadrons with energies in the range 20–350 GeV and beam impact points and angles corresponding to pseudo-rapidity values in the range 0.2–0.65 are reported. The results are compared to the predictions of a simulation program using the Geant 4 toolkit.
DOI: 10.1016/j.nuclphysbps.2007.08.004
2007
Cited 29 times
The ATLAS Data Acquisition and Trigger: concept, design and status
This article presents the base-line design and implementation of the ATLAS Trigger and Data Acquisition system, in particular the Data Flow and High Level Trigger components. The status of the installation and commissioning of the system is also presented.
DOI: 10.1051/epjconf/202429502031
2024
Towards a container-based architecture for CMS data acquisition
The CMS data acquisition (DAQ) is implemented as a service-oriented architecture where DAQ applications, as well as general applications such as monitoring and error reporting, are run as self-contained services. The task of deployment and operation of services is achieved by using several heterogeneous facilities, custom configuration data and scripts in several languages. In this work, we restructure the existing system into a homogeneous, scalable cloud architecture adopting a uniform paradigm, where all applications are orchestrated in a uniform environment with standardized facilities. In this new paradigm, DAQ applications are organized as groups of containers and the required software is packaged into container images. Automation of all aspects of coordinating and managing containers is provided by the Kubernetes environment, where a set of physical and virtual machines is unified in a single pool of compute resources. We demonstrate that a container-based cloud architecture provides an across-the-board solution that can be applied for DAQ in CMS. We show the strengths and advantages of running DAQ applications in a container infrastructure as compared to a traditional application model.
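A minimal sketch of the orchestration pattern described here, using the Kubernetes Python client to declare a containerised application as a Deployment; the image name, namespace and replica count are placeholders invented for the example, not taken from the CMS system.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (cluster details are assumed).
config.load_kube_config()

# Hypothetical DAQ application packaged as a container image.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "daq-readout"},
    "spec": {
        "replicas": 4,
        "selector": {"matchLabels": {"app": "daq-readout"}},
        "template": {
            "metadata": {"labels": {"app": "daq-readout"}},
            "spec": {
                "containers": [{
                    "name": "readout",
                    "image": "registry.example.org/daq/readout:1.0",  # placeholder image
                }]
            },
        },
    },
}

client.AppsV1Api().create_namespaced_deployment(namespace="daq", body=deployment)
```

Kubernetes then keeps the requested number of replicas running across the pool of machines, which is the kind of automation the abstract refers to.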
DOI: 10.1051/epjconf/202429502013
2024
First year of experience with the new operational monitoring tool for data taking in CMS during Run 3
The Online Monitoring System (OMS) at the Compact Muon Solenoid experiment (CMS) at CERN aggregates and integrates different sources of information into a central place and allows users to view, compare and correlate information. It displays real-time and historical information. The tool is heavily used by run coordinators, trigger experts and shift crews, to ensure the quality and efficiency of data taking. It provides aggregated information for many use cases including data certification. OMS is the successor of Web Based Monitoring (WBM), which was in use during Run 1 and Run 2 of the LHC. WBM started as a small tool and grew substantially over the years so that maintenance became challenging. OMS was developed from scratch following several design ideas: to strictly separate the presentation layer from the data aggregation layer, to use a well-defined standard for the communication between presentation layer and aggregation layer, and to employ widely used frameworks from outside the HEP community. A report on the experience from the operation of OMS for the first year of data taking of Run 3 in 2022 is presented.
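As an illustration of the strict separation between aggregation and presentation layers, the sketch below queries a hypothetical REST endpoint of the aggregation layer; the base URL, resource path and field names are invented and do not describe the real OMS API.

```python
import requests

# Hypothetical aggregation-layer endpoint; the real OMS paths and
# authentication are not described in the abstract.
BASE = "https://oms.example.cern.ch/api/v1"

resp = requests.get(f"{BASE}/runs", params={"fill": 9000, "page[limit]": 10}, timeout=10)
resp.raise_for_status()

for run in resp.json().get("data", []):
    print(run.get("id"), run.get("attributes", {}).get("duration"))
```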
DOI: 10.1051/epjconf/202429502020
2024
MiniDAQ-3: Providing concurrent independent subdetector data-taking on CMS production DAQ resources
The data acquisition (DAQ) of the Compact Muon Solenoid (CMS) experiment at CERN collects data for events accepted by the Level-1 Trigger from the different detector systems and assembles them in an event builder prior to making them available for further selection in the High Level Trigger, before finally storing the selected events for offline analysis. In addition to the central DAQ providing global acquisition functionality, several separate, so-called “MiniDAQ” setups allow operating independent data acquisition runs using an arbitrary subset of the CMS subdetectors. During Run 2 of the LHC, MiniDAQ setups were running their event builder and High Level Trigger applications on dedicated resources, separate from those used for the central DAQ. This cleanly separated MiniDAQ setups from the central DAQ system, but also meant limited throughput and a fixed number of possible MiniDAQ setups. In Run 3, MiniDAQ-3 setups share production resources with the new central DAQ system, allowing each setup to operate at the maximum Level-1 rate thanks to the reuse of the resources and network bandwidth. Configuration management tools had to be significantly extended to support the synchronization of the DAQ configurations needed for the various setups. We report on the new configuration management features and on the first year of operational experience with the new MiniDAQ-3 system.
DOI: 10.1051/epjconf/202429502011
2024
The CMS Orbit Builder for the HL-LHC at CERN
The Compact Muon Solenoid (CMS) experiment at CERN incorporates one of the highest-throughput data acquisition systems in the world and is expected to increase its throughput by more than a factor of ten for the High-Luminosity phase of the Large Hadron Collider (HL-LHC). To achieve this goal, the system will be upgraded in most of its components. Among them, the event builder software, in charge of assembling all the data read out from the different sub-detectors, is planned to be modified from a single event builder to an orbit builder that assembles multiple events at the same time. The throughput of the event builder will be increased from the current 1.6 Tb/s to 51 Tb/s for the HL-LHC orbit builder. This paper presents preliminary network transfer studies in preparation for the upgrade. The key conceptual characteristics are discussed, concerning differences between the CMS event builder in Run 3 and the CMS Orbit Builder for the HL-LHC. For the feasibility studies, a pipestream benchmark, mimicking event-builder-like traffic, has been developed. Preliminary performance tests and results are discussed.
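The quoted 51 Tb/s target is consistent with the Phase-2 DAQ parameters cited elsewhere in this list (a 750 kHz Level-1 accept rate and an average event size of roughly 8.5 MB); a back-of-the-envelope check:

```python
# Rough consistency check of the quoted HL-LHC event-builder throughput.
l1_rate_hz = 750e3          # Level-1 accept rate foreseen for Phase-2
event_size_bytes = 8.5e6    # within the expected 8-10 MB average event size

throughput_bps = l1_rate_hz * event_size_bytes * 8
print(f"{throughput_bps / 1e12:.1f} Tb/s")   # ~51 Tb/s
```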
DOI: 10.1088/1748-0221/17/05/c05003
2022
Cited 6 times
CMS phase-2 DAQ and timing hub prototyping results and perspectives
Abstract This paper describes recent progress on the design of the DAQ and Timing Hub, or DTH, an ATCA (Advanced Telecommunications Computing Architecture) hub board intended for the phase-2 upgrade of the CMS experiment. Prototyping was originally divided into multiple feature lines, spanning all different aspects of the DTH functionality. The second DTH prototype merges all R&D and prototyping lines into a single board, which is intended to be the production candidate. Emphasis is on the process and experience in going from the first to the second DTH prototype, which included a change of the chosen FPGA as well as the integration of a commercial networking solution.
DOI: 10.1016/j.nima.2009.05.158
2009
Cited 16 times
Study of the response of the ATLAS central calorimeter to pions of energies from 3 to 9 GeV
A fully instrumented slice of the ATLAS central detector was exposed to test beams from the SPS (Super Proton Synchrotron) at CERN in 2004. In this paper, the response of the central calorimeters to pions with energies in the range between 3 and 9 GeV is presented. The linearity and the resolution of the combined calorimetry (electromagnetic and hadronic calorimeters) was measured and compared to the prediction of a detector simulation program using the toolkit Geant 4.
DOI: 10.1088/1748-0221/8/12/c12039
2013
Cited 12 times
10 Gbps TCP/IP streams from the FPGA for the CMS DAQ eventbuilder network
For the upgrade of the DAQ of the CMS experiment in 2013/2014, an interface between the custom detector Front End Drivers (FEDs) and the new DAQ eventbuilder network has to be designed. For a loss-less data collection from more than 600 FEDs, a new FPGA-based card implementing the TCP/IP protocol suite over 10 Gbps Ethernet has been developed. We present the hardware challenges and protocol modifications made to TCP in order to simplify its FPGA implementation, together with a set of performance measurements which were carried out with the current prototype.
DOI: 10.1109/tns.2015.2426216
2015
Cited 12 times
The New CMS DAQ System for Run-2 of the LHC
The data acquisition (DAQ) system of the CMS experiment at the CERN Large Hadron Collider assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of 100 GB/s to the high level trigger (HLT) farm. The HLT farm selects interesting events for storage and offline analysis at a rate of around 1 kHz. The DAQ system has been redesigned during the accelerator shutdown in 2013/14. The motivation is twofold: Firstly, the current compute nodes, networking, and storage infrastructure will have reached the end of their lifetime by the time the LHC restarts. Secondly, in order to handle higher LHC luminosities and event pileup, a number of sub-detectors will be upgraded, increasing the number of readout channels and replacing the off-detector readout electronics with a μTCA implementation. The new DAQ architecture will take advantage of the latest developments in the computing industry. For data concentration, 10/40 Gb/s Ethernet technologies will be used, as well as an implementation of a reduced TCP/IP in FPGA for a reliable transport between custom electronics and commercial computing hardware. A Clos network based on 56 Gb/s FDR Infiniband has been chosen for the event builder with a throughput of ~4 Tb/s. The HLT processing is entirely file based. This allows the DAQ and HLT systems to be independent, and to use the HLT software in the same way as for the offline processing. The fully built events are sent to the HLT with 1/10/40 Gb/s Ethernet via network file systems. Hierarchical collection of HLT accepted events and monitoring meta-data are stored into a global file system. This paper presents the requirements, technical choices, and performance of the new system.
DOI: 10.1088/1742-6596/513/1/012042
2014
Cited 11 times
10 Gbps TCP/IP streams from the FPGA for High Energy Physics
The DAQ system of the CMS experiment at CERN collects data from more than 600 custom detector Front-End Drivers (FEDs). During 2013 and 2014 the CMS DAQ system will undergo a major upgrade to address the obsolescence of current hardware and the requirements posed by the upgrade of the LHC accelerator and various detector components. For a loss-less data collection from the FEDs, a new FPGA-based card implementing the TCP/IP protocol suite over 10 Gbps Ethernet has been developed. To limit the TCP hardware implementation complexity, the DAQ group developed a simplified and unidirectional but RFC 793 compliant version of the TCP protocol. This allows a PC with the standard Linux TCP/IP stack to be used as a receiver. We present the challenges and protocol modifications made to TCP in order to simplify its FPGA implementation. We also describe the interaction between the simplified TCP and the Linux TCP/IP stack, including the performance measurements.
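Because the FPGA side stays RFC 793 compliant, the receiving end can be an ordinary Linux socket application. A minimal sketch of such a receiver (the port number and read size are arbitrary choices for the example):

```python
import socket

# Minimal receiver relying on the standard Linux TCP/IP stack, as the
# abstract describes; port and buffer size are arbitrary for the sketch.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 10000))
    srv.listen(1)
    conn, addr = srv.accept()
    with conn:
        total = 0
        while True:
            chunk = conn.recv(65536)
            if not chunk:          # sender closed the unidirectional stream
                break
            total += len(chunk)    # a real receiver would unpack event fragments here
        print(f"received {total} bytes from {addr}")
```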
DOI: 10.1016/j.nima.2022.167805
2023
A 40 MHz Level-1 trigger scouting system for the CMS Phase-2 upgrade
The CMS Phase-2 upgrade for the HL-LHC aims at preserving and expanding the current physics capability of the experiment under extreme pileup conditions. A new tracking system incorporates a track finder processor, providing tracks to the Level-1 (L1) trigger. A new high-granularity calorimeter provides fine-grained energy deposition information in the endcap region. New front-end and back-end electronics feed the L1 trigger with high-resolution information from the barrel calorimeter and the muon systems. The upgraded L1 will be based primarily on the Xilinx Ultrascale Plus series of FPGAs, capable of sophisticated feature searches with resolution often similar to the offline reconstruction. The L1 Data Scouting system (L1DS) will capture L1 intermediate data produced by the trigger processors at the beam-crossing rate of 40 MHz, and carry out online analyses based on these limited-resolution data. The L1DS will provide fast and virtually unlimited statistics for detector diagnostics, alternative luminosity measurements, and, in some cases, calibrations. It also has the potential to enable the study of otherwise inaccessible signatures, either too common to fit in the L1 trigger accept budget or with requirements that are orthogonal to “mainstream” physics. The requirements and architecture of the L1DS system are presented, as well as some of the potential physics opportunities under study. The first results from the assembly and commissioning of a demonstrator currently being installed for LHC Run-3 are also presented. The demonstrator collects data from the Global Muon Trigger, the Layer-2 Calorimeter Trigger, the Barrel Muon Track Finder, and the Global Trigger systems of the current CMS L1. This demonstrator, as a data acquisition (DAQ) system operating at the LHC bunch-crossing rate, faces many of the challenges of the Phase-2 system, albeit with scaled-down connectivity, reduced data throughput and physics capabilities, providing a testing ground for new techniques of online data reduction and processing.
DOI: 10.1109/tns.2004.828793
2004
Cited 16 times
Online software for the ATLAS test beam data acquisition system
The Online Software is the global system software of the ATLAS data acquisition (DAQ) system, responsible for the configuration, control and information sharing of the ATLAS DAQ System. A test beam facility offers the ATLAS detectors the possibility to study important performance aspects as well as to proceed on the way to the final ATLAS DAQ system. Last year, three subdetectors of ATLAS, separately and combined, were successfully using the Online Software for the control of their data-taking. In this paper, we describe the different components of the Online Software together with their usage at the ATLAS test beam.
2010
Cited 11 times
Response and shower topology of 2 to 180 GeV pions measured with the ATLAS barrel calorimeter at the CERN test-beam and comparison to Monte Carlo simulations
The response of the ATLAS barrel calorimeter to pions with momenta from 2 to 180 GeV is studied in a test beam at the CERN H8 beam line. The mean energy, the energy resolution, the longitudinal and radial shower profiles, and various observables characterising the shower topology in the calorimeter are measured. The data are compared to Monte Carlo simulations based on a detailed description of the experimental set-up and on various models describing the interaction of particles with matter based on Geant4.
DOI: 10.1109/nssmic.2015.7581984
2015
Cited 8 times
The CMS Timing and Control Distribution System
The Compact Muon Solenoid (CMS) experiment operating at the CERN (European Laboratory for Nuclear Physics) Large Hadron Collider (LHC) is in the process of upgrading several of its detector systems. Adding more individual detector components brings the need to test and commission those components separately from existing ones so as not to compromise physics data-taking. The CMS Trigger, Timing and Control (TTC) system had reached its limits in terms of the number of separate elements (partitions) that could be supported. A new Timing and Control Distribution System (TCDS) has been designed, built and commissioned in order to overcome this limit. It also brings additional functionality to facilitate parallel commissioning of new detector elements. The new TCDS system and its components will be described and results from the first operational experience with the TCDS in CMS will be shown.
DOI: 10.48550/arxiv.hep-ex/0305096
2003
Cited 12 times
Online Monitoring software framework in the ATLAS experiment
A fast, efficient and comprehensive monitoring system is a vital part of any HEP experiment. This paper describes the software framework that will be used during ATLAS data taking to monitor the state of the data acquisition and the quality of physics data in the experiment. The framework has been implemented by the Online Software group of the ATLAS Trigger & Data Acquisition (TDAQ) project and has already been used for several years in the ATLAS test beams at CERN. The inter-process communication in the framework is implemented via CORBA, which provides portability between different operating systems and programming languages. This paper will describe the design and the most important aspects of the online monitoring framework implementation. It will also show some test results, which indicate the performance and scalability of the current implementation.
DOI: 10.1109/tns.2007.912071
2008
Cited 9 times
Access Control Design and Implementations in the ATLAS Experiment
The ATLAS experiment operates with a significant number of hardware and software resources. Their protection against misuse is an essential task to ensure a safe and optimal operation. To achieve this goal, the Role Based Access Control (RBAC) model has been chosen for its scalability, flexibility, ease of administration and usability from the lowest operating system level to the highest software application level. This paper presents the overall design of RBAC implementation in the ATLAS experiment and the enforcement solutions in different areas such as the system administration, control room desktops and the data acquisition software. The users and the roles are centrally managed using a directory service based on Lightweight Directory Access Protocol which is kept in synchronization with the human resources and IT databases.
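Users and roles are kept in an LDAP directory, so a client can resolve a user's roles with a standard directory search. The sketch below is generic and uses an invented server, base DN and schema; it does not reflect the actual ATLAS directory layout.

```python
from ldap3 import Server, Connection, ALL

# Hypothetical directory and schema, for illustration only -- not the
# actual ATLAS LDAP layout.
server = Server("ldap://ldap.example.cern.ch", get_info=ALL)
conn = Connection(server, auto_bind=True)  # anonymous bind, for the sketch

# Find the roles that list a given user as a member.
conn.search(
    search_base="ou=roles,dc=example,dc=ch",
    search_filter="(member=uid=jdoe,ou=users,dc=example,dc=ch)",
    attributes=["cn"],
)
for entry in conn.entries:
    print(entry.cn)
```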
DOI: 10.1088/1742-6596/119/2/022001
2008
Cited 8 times
Integration of the trigger and data acquisition systems in ATLAS
During 2006 and the first half of 2007, the installation, integration and commissioning of trigger and data acquisition (TDAQ) equipment in the ATLAS experimental area have progressed. There have been a series of technical runs using the final components of the system already installed in the experimental area. Various tests have been run including ones where level 1 preselected simulated proton-proton events have been processed in a loop mode through the trigger and dataflow chains. The system included the readout buffers containing the events, event building, level 2 and event filter trigger algorithms. The scalability of the system with respect to the number of event building nodes used has been studied and quantities critical for the final system, such as trigger rates and event processing times, have been measured using different trigger algorithms as well as different TDAQ components. This paper presents the TDAQ architecture, the current status of the installation and commissioning and highlights the main test results that validate the system.
DOI: 10.1088/1742-6596/119/2/022004
2008
Cited 8 times
The ATLAS DAQ system online configurations database service challenge
This paper describes challenging requirements on the configuration service for the ATLAS experiment at CERN. It presents the status of the implementation and testing one year before the start of data taking, providing details of:
DOI: 10.1088/1742-6596/396/1/012008
2012
Cited 7 times
The CMS High Level Trigger System: Experience and Future Development
The CMS experiment at the LHC features a two-level trigger system. Events accepted by the first level trigger, at a maximum rate of 100 kHz, are read out by the Data Acquisition system (DAQ), and subsequently assembled in memory in a farm of computers running a software high-level trigger (HLT), which selects interesting events for offline storage and analysis at a rate of the order of a few hundred Hz. The HLT algorithms consist of sequences of offline-style reconstruction and filtering modules, executed on a farm of O(10000) CPU cores built from commodity hardware. Experience from the operation of the HLT system in the 2010/2011 collider run is reported. The current architecture of the CMS HLT, its integration with the CMS reconstruction framework and the CMS DAQ, are discussed in the light of future development. The possible short- and medium-term evolution of the HLT software infrastructure to support extensions of the HLT computing power, and to address remaining performance and maintenance issues, is discussed.
DOI: 10.1109/tns.2013.2282340
2013
Cited 6 times
A Comprehensive Zero-Copy Architecture for High Performance Distributed Data Acquisition Over Advanced Network Technologies for the CMS Experiment
This paper outlines a software architecture where zero-copy operations are used comprehensively at every processing point from the Application layer to the Physical layer. The proposed architecture is being used during feasibility studies on advanced networking technologies for the CMS experiment at CERN. The design relies on a homogeneous peer-to-peer message passing system, which is built around memory pool caches allowing efficient and deterministic latency handling of messages of any size through the different software layers. In this scheme portable distributed applications can be programmed to process input to output operations by mere pointer arithmetic and DMA operations only. The approach combined with the open fabric protocol stack (OFED) allows one to attain near wire-speed message transfer at application level. The architecture supports full portability of user applications by encapsulating the protocol details and network into modular peer transport services whereas a transparent replacement of the underlying protocol facilitates deployment of several network technologies like Gigabit Ethernet, Myrinet, Infiniband, etc. Therefore, this solution provides a protocol-independent communication framework and prevents having to deal with potentially difficult couplings when the underlying communication infrastructure is changed. We demonstrate the feasibility of this approach by giving efficiency and performance measurements of the software in the context of the CMS distributed event building studies.
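The core idea is that processing layers exchange references into pre-allocated memory pools rather than copying payloads. The Python sketch below only illustrates that principle with memoryview slices; it is a toy, unrelated to the C++ framework described in the paper.

```python
# Illustration of the zero-copy principle: hand out views into a
# pre-allocated memory pool instead of copying message payloads.
pool = bytearray(1 << 20)          # pre-allocated buffer pool (1 MiB)
view = memoryview(pool)

def allocate(offset, size):
    """Return a writable, zero-copy slice of the pool."""
    return view[offset:offset + size]

fragment = allocate(0, 4096)       # the "message" is just a view, no copy made
fragment[:4] = b"EVNT"             # writing through the view updates the pool

# Passing 'fragment' between processing layers moves only a reference;
# the underlying bytes are never duplicated.
assert pool[:4] == b"EVNT"
print(len(fragment), "bytes referenced without copying")
```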
DOI: 10.22323/1.213.0190
2015
Cited 6 times
Boosting Event Building Performance using Infiniband FDR for the CMS Upgrade
As part of the CMS upgrade during CERN's shutdown period (LS1), the CMS data acquisition system is incorporating Infiniband FDR technology to boost event-building performance for operation from 2015 onwards. Infiniband promises to provide a substantial increase in data transmission speeds compared to the older 1GE network used during the 2009-2013 LHC run. Several options exist for end-user developers when choosing a foundation for software upgrades, including the uDAPL (DAT Collaborative) and Infiniband verbs libraries (OFED). Due to advances in technology, the CMS data acquisition system will be able to achieve the required throughput of 100 kHz with increased event sizes while downsizing the number of nodes by using a combination of 10GE, 40GE and 56 Gb Infiniband FDR. This paper presents the analysis and results of a comparison between GE and Infiniband solutions as well as a look at how they integrate into an event building architecture, while preserving the scalability, efficiency and deterministic latency expected in a high-end data acquisition network.
DOI: 10.22323/1.370.0111
2020
Cited 6 times
First measurements with the CMS DAQ and Timing Hub prototype-1
The DAQ and Timing Hub is an ATCA hub board designed for the Phase-2 upgrade of the CMS experiment. In addition to providing high-speed Ethernet connectivity to all back-end boards, it forms the bridge between the sub-detector electronics and the central DAQ, timing, and trigger control systems. One important requirement is the distribution of several high-precision, phase-stable, and LHC-synchronous clock signals for use by the timing detectors. The current paper presents first measurements performed on the initial prototype, with a focus on clock quality. It is demonstrated that the current design provides adequate clock quality to satisfy the requirements of the Phase-2 CMS timing detectors.
DOI: 10.1051/epjconf/202125104023
2021
Cited 5 times
The Phase-2 Upgrade of the CMS Data Acquisition
The High Luminosity LHC (HL-LHC) will start operating in 2027 after the third Long Shutdown (LS3), and is designed to provide an ultimate instantaneous luminosity of 7.5 × 10^34 cm^-2 s^-1, at the price of extreme pileup of up to 200 interactions per crossing. The number of overlapping interactions in HL-LHC collisions, their density, and the resulting intense radiation environment, warrant an almost complete upgrade of the CMS detector. The upgraded CMS detector will be read out by approximately fifty thousand high-speed front-end optical links at an unprecedented data rate of up to 80 Tb/s, for an average expected total event size of approximately 8–10 MB. Following the present established design, the CMS trigger and data acquisition system will continue to feature two trigger levels, with only one synchronous hardware-based Level-1 Trigger (L1), consisting of custom electronic boards and operating on dedicated data streams, and a second level, the High Level Trigger (HLT), using software algorithms running asynchronously on standard processors and making use of the full detector data to select events for offline storage and analysis. The upgraded CMS data acquisition system will collect data fragments for Level-1 accepted events from the detector back-end modules at a rate up to 750 kHz, aggregate fragments corresponding to individual Level-1 accepts into events, and distribute them to the HLT processors where they will be filtered further. Events accepted by the HLT will be stored permanently at a rate of up to 7.5 kHz. This paper describes the baseline design of the DAQ and HLT systems for the Phase-2 of CMS.
DOI: 10.1109/tns.2008.2006050
2008
Cited 7 times
The ATLAS Event Builder
Event data from proton-proton collisions at the LHC will be selected by the ATLAS experiment in a three-level trigger system, which, at its first two trigger levels (LVL1+LVL2), reduces the initial bunch crossing rate of 40 MHz to ~ 3 kHz. At this rate, the Event Builder collects the data from the readout system PCs (ROSs) and provides fully assembled events to the Event Filter (EF). The EF is the third trigger level and its aim is to achieve a further rate reduction to ~ 200 Hz on the permanent storage. The Event Builder is based on a farm of O(100) PCs, interconnected via a gigabit Ethernet to O(150) ROSs. These PCs run Linux and multi-threaded software applications implemented in C++. All the ROSs, and substantial fractions of the Event Builder and EF PCs have been installed and commissioned. We report on performance tests on this initial system, which is capable of going beyond the required data rates and bandwidths for event building for the ATLAS experiment.
DOI: 10.1109/tns.2008.2002438
2008
Cited 7 times
The Data-Logging System of the Trigger and Data Acquisition for the ATLAS Experiment at CERN
The ATLAS experiment is getting ready to observe collisions between protons at a centre of mass energy of 14 TeV. These will be the highest energy collisions in a controlled environment to date, to be provided by the Large Hadron Collider at CERN by mid 2008. The ATLAS Trigger and Data Acquisition (TDAQ) system selects events online in a three level trigger system in order to keep those events promising to unveil new physics at a budgeted rate of ~200 Hz for an event size of ~1.5 MB. This paper focuses on the data-logging system on the TDAQ side, the so-called "Sub-Farm Output" (SFO) system. It takes data from the third level trigger, and it streams and indexes the events into different files, according to each event's trigger path. The data files are moved to CASTOR, the central mass storage facility at CERN. The final TDAQ data-logging system has been installed using 6 Linux PCs, holding in total 144 disks of 500 GB each, managed by three RAID controllers on each PC. The data-writing is managed in a controlled round-robin way among three independent filesystems, each associated with a distinct set of disks managed by a distinct RAID controller. This novel design allows fast I/O, which, together with a high-speed network, makes it possible to minimize the number of SFO nodes. We report here on the functionality and performance requirements on the system, our experience with commissioning it and on the performance achieved.
DOI: 10.1109/tns.2007.914030
2008
Cited 7 times
Integration of the Trigger and Data Acquisition Systems in ATLAS
During 2006 and spring 2007, integration and commissioning of trigger and data acquisition (TDAQ) equipment in the ATLAS experimental area has progressed. Much of the work has focused on a final prototype setup consisting of around eighty computers representing a subset of the full TDAQ system. There have been a series of technical runs using this setup. Various tests have been run including those where around 6k Level-1 preselected simulated proton–proton events have been processed in a loop mode through the trigger and dataflow chains. The system included the readout buffers containing the events, event building, second level and third level trigger processors. Aspects critical for the final system, such as event processing times, have been studied using different trigger algorithms as well as the different dataflow components.
DOI: 10.1088/1742-6596/331/2/022042
2011
Cited 5 times
Role Based Access Control system in the ATLAS experiment
The complexity of the ATLAS experiment motivated the deployment of an integrated Access Control System in order to guarantee safe and optimal access for a large number of users to the various software and hardware resources. Such an integrated system was foreseen since the design of the infrastructure and is now central to the operations model. In order to cope with the ever growing needs of restricting access to all resources used within the experiment, the Roles Based Access Control (RBAC) previously developed has been extended and improved. The paper starts with a short presentation of the RBAC design, implementation and the changes made to the system to allow the management and usage of roles to control access to the vast and diverse set of resources. The RBAC implementation uses a directory service based on Lightweight Directory Access Protocol to store the users (∼3000), roles (∼320), groups (∼80) and access policies. The information is kept in sync with various other databases and directory services: human resources, central CERN IT, CERN Active Directory and the Access Control Database used by DCS. The paper concludes with a detailed description of the integration across all areas of the system.
DOI: 10.1109/tns.2006.873311
2006
Cited 8 times
ATLAS DataFlow: the read-out subsystem, results from trigger and data-acquisition system testbed studies and from modeling
In the ATLAS experiment at the LHC, the output of read-out hardware specific to each subdetector will be transmitted to buffers, located on custom made PCI cards ("ROBINs"). The data consist of fragments of events accepted by the first-level trigger at a maximum rate of 100 kHz. Groups of four ROBINs will be hosted in about 150 Read-Out Subsystem (ROS) PCs. Event data are forwarded on request via Gigabit Ethernet links and switches to the second-level trigger or to the Event builder. In this paper a discussion of the functionality and real-time properties of the ROS is combined with a presentation of measurement and modelling results for a testbed with a size of about 20% of the final DAQ system. Experimental results on strategies for optimizing the system performance, such as utilization of different network architectures and network transfer protocols, are presented for the testbed, together with extrapolations to the full system.
DOI: 10.1109/tns.2007.910868
2008
Cited 6 times
Performance of the Final Event Builder for the ATLAS Experiment
Event data from proton-proton collisions at the LHC will be selected by the ATLAS experiment by a three level trigger system, which reduces the initial bunch crossing rate of 40 MHz at its first two trigger levels (LVL1+LVL2) to ~3 kHz. At this rate the Event-Builder collects the data from all Read-Out system PCs (ROSs) and provides fully assembled events to the Event-Filter (EF), which is the third level trigger, to achieve a further rate reduction to ~200 Hz for permanent storage. The Event-Builder is based on a farm of O(100) PCs, interconnected via Gigabit Ethernet to O(150) ROSs. These PCs run Linux and multi-threaded software applications implemented in C++. All the ROSs and one third of the Event-Builder PCs are already installed and commissioned. Performance measurements have been exercised on this initial system, which show promising results that the required final data rates and bandwidth for the ATLAS event builder are within reach.
DOI: 10.1088/1742-6596/219/2/022048
2010
Cited 5 times
System administration of ATLAS TDAQ computing environment
This contribution gives a thorough overview of the ATLAS TDAQ SysAdmin group's activities, which deal with the administration of the TDAQ computing environment supporting the High Level Trigger, Event Filter and other subsystems of the ATLAS detector operating at the LHC collider at CERN. The current installation consists of approximately 1500 netbooted nodes managed by more than 60 dedicated servers, about 40 multi-screen user interface machines installed in the control rooms and various hardware and service monitoring machines as well. In the final configuration, the online computer farm will be capable of hosting tens of thousands of applications running simultaneously. The software distribution requirements are matched by a two-level NFS-based solution. Hardware and network monitoring systems of ATLAS TDAQ are based on NAGIOS, with a MySQL cluster behind it for accounting and storing the collected monitoring data, IPMI tools, CERN LANDB and the dedicated tools developed by the group, e.g. ConfdbUI. The user management schema deployed in the TDAQ environment is founded on the authentication and role management system based on LDAP. External access to the ATLAS online computing facilities is provided by means of gateways supplied with an accounting system as well. Current activities of the group include deployment of the centralized storage system, testing and validating hardware solutions for future use within the ATLAS TDAQ environment including new multi-core blade servers, developing GUI tools for user authentication and roles management, testing and validating a 64-bit OS, and upgrading the existing TDAQ hardware components, authentication servers and the gateways.
DOI: 10.1088/1742-6596/513/1/012025
2014
Cited 4 times
Prototype of a File-Based High-Level Trigger in CMS
The DAQ system of the CMS experiment at the LHC is being upgraded during the accelerator shutdown in 2013/14. To reduce the interdependency of the DAQ system and the high-level trigger (HLT), we investigate the feasibility of using a file-system-based HLT. Events of ~1 MB size are built at the level-1 trigger rate of 100 kHz. The events are assembled by ~50 builder units (BUs). Each BU writes the raw events at ~2 GB/s to a local file system shared with O(10) filter-unit machines (FUs) running the HLT code. The FUs read the raw data from the file system, select O(1%) of the events, and write the selected events together with monitoring meta-data back to disk. This data is then aggregated over several steps and made available for offline reconstruction and online monitoring. We present the challenges, technical choices, and performance figures from the prototyping phase. In addition, the steps to the final system implementation will be discussed.
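A toy sketch of the file-based hand-off described here: a filter-unit loop that picks up raw files written by a builder unit, keeps a small fraction of them, and writes JSON bookkeeping documents. The paths, file extensions and the random 1% stand-in for the HLT decision are all invented for the illustration.

```python
import json, os, random, shutil

RAMDISK = "/tmp/bu_output"   # where a builder unit would write raw files (placeholder path)
OUTPUT  = "/tmp/fu_output"   # where selected events and metadata go (placeholder path)
os.makedirs(RAMDISK, exist_ok=True)
os.makedirs(OUTPUT, exist_ok=True)

for name in sorted(os.listdir(RAMDISK)):
    if not name.endswith(".raw"):
        continue
    src = os.path.join(RAMDISK, name)
    accepted = random.random() < 0.01          # stand-in for the real HLT decision (~1%)
    if accepted:
        shutil.move(src, os.path.join(OUTPUT, name))
    else:
        os.remove(src)
    # Small JSON document with bookkeeping metadata, as described in the abstract.
    meta = {"file": name, "accepted": accepted}
    with open(os.path.join(OUTPUT, name + ".jsn"), "w") as f:
        json.dump(meta, f)
```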
DOI: 10.1088/1742-6596/664/8/082036
2015
Cited 4 times
A scalable monitoring for the CMS Filter Farm based on elasticsearch
A flexible monitoring system has been designed for the CMS File-based Filter Farm making use of modern data mining and analytics components. All the metadata and monitoring information concerning data flow and execution of the HLT are generated locally in the form of small documents using the JSON encoding. These documents are indexed into a hierarchy of elasticsearch (es) clusters along with process and system log information. Elasticsearch is a search server based on Apache Lucene. It provides a distributed, multitenant-capable search and aggregation engine. Since es is schema-free, any new information can be added seamlessly and the unstructured information can be queried in non-predetermined ways. The leaf es clusters consist of the very same nodes that form the Filter Farm, thus providing natural horizontal scaling. A separate "central" es cluster is used to collect and index aggregated information. The fine-grained information, all the way to individual processes, remains available in the leaf clusters. The central es cluster provides quasi-real-time high-level monitoring information to any kind of client. Historical data can be retrieved to analyse past problems or correlate them with external information. We discuss the design and performance of this system in the context of the CMS DAQ commissioning for LHC Run 2.
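For illustration, a small JSON monitoring document could be indexed with the elasticsearch Python client as below; the cluster address, index name and fields are invented, and the call shown matches recent versions of the elasticsearch-py client.

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

# Hypothetical leaf cluster and index name, for illustration only.
es = Elasticsearch("http://localhost:9200")

doc = {
    "host": "fu-example-01",             # invented host name
    "stream": "PhysicsA",
    "events_processed": 12345,
    "timestamp": datetime.now(timezone.utc).isoformat(),
}
es.index(index="hlt-monitoring", document=doc)
```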
2006
Cited 7 times
Studies with the ATLAS Trigger and Data Acquisition "Pre-Series'' Setup
The pre-series test bed is used to validate the technology and implementation choices by comparing the final ATLAS readout requirements to the results of performance, functionality and stability studies. We show that all the components which are not running reconstruction algorithms match the final ATLAS requirements. For the others, we calculate the amount of time per event that could be allocated to run these not-yet-finalized algorithms. We also report on the experience gained during these studies while interfacing with a sub-detector for the first time at the experimental area.
DOI: 10.1109/tns.2006.878290
2006
Cited 7 times
Deployment and Use of the ATLAS DAQ in the Combined Test Beam
The ATLAS collaboration at CERN operated a combined test beam (CTB) from May until November 2004. The prototype of the ATLAS data acquisition (DAQ) system was used to integrate other subsystems into a common CTB setup. Data were collected synchronously from all the ATLAS detectors, which represented nine different detector technologies. Electronics and software of the first level trigger were used to trigger the setup. Event selection algorithms of the high level trigger were integrated with the system and were tested with real detector data. The possibility to operate a remote Event Filter farm synchronized with the ATLAS Trigger and Data Acquisition System (TDAQ) was also tested. Event data, as well as detector conditions data, were made available for offline analysis.
DOI: 10.1109/rtc.2007.4382747
2007
Cited 6 times
Performance of the final Event Builder for the ATLAS Experiment
Event data from proton-proton collisions at the LHC will be selected by the ATLAS experiment in a three level trigger system, which reduces the initial bunch crossing rate of 40 MHz at its first two trigger levels (LVL1+LVL2) to ~3 kHz. At this rate the Event-Builder collects the data from all read-out system PCs (ROSs) and provides fully assembled events to the event-filter (EF), which is the third level trigger, to achieve a further rate reduction to ~200 Hz for permanent storage. The event-builder is based on a farm of O(100) PCs, interconnected via gigabit Ethernet to O(150) ROSs. These PCs run Linux and multi-threaded software applications implemented in C++. All the ROSs and one third of the event-builder PCs are already installed and commissioned. We report on performance tests on this initial system, which show promising results towards reaching the final data throughput required for the ATLAS experiment.
DOI: 10.1088/1742-6596/396/1/012023
2012
Cited 4 times
Status of the CMS Detector Control System
The Compact Muon Solenoid (CMS) is a CERN multi-purpose experiment that exploits the physics of the Large Hadron Collider (LHC). The Detector Control System (DCS) is responsible for ensuring the safe, correct and efficient operation of the experiment, and has contributed to the recording of high quality physics data. The DCS is programmed to automatically react to the LHC operational mode. CMS sub-detectors' bias voltages are set depending on the machine mode and particle beam conditions. An operator provided with a small set of screens supervises the system status summarized from the approximately 6M monitored parameters. Using the experience of nearly two years of operation with beam the DCS automation software has been enhanced to increase the system efficiency by minimizing the time required by sub-detectors to prepare for physics data taking. From the infrastructure point of view the DCS will be subject to extensive modifications in 2012. The current rack mounted control PCs will be replaced by a redundant pair of DELL Blade systems. These blade servers are a high-density modular solution that incorporates servers and networking into a single chassis that provides shared power, cooling and management. This infrastructure modification associated with the migration to blade servers will challenge the DCS software and hardware factorization capabilities. The on-going studies for this migration together with the latest modifications are discussed in the paper.
DOI: 10.1051/epjconf/202024501032
2020
Cited 4 times
40 MHz Level-1 Trigger Scouting for CMS
The CMS experiment will be upgraded for operation at the High-Luminosity LHC to maintain and extend its physics performance under extreme pileup conditions. Upgrades will include an entirely new tracking system, supplemented by a track finder processor providing tracks at Level-1, as well as a high-granularity calorimeter in the endcap region. New front-end and back-end electronics will also provide the Level-1 trigger with high-resolution information from the barrel calorimeter and the muon systems. The upgraded Level-1 processors, based on powerful FPGAs, will be able to carry out sophisticated feature searches with resolutions often similar to the offline ones, while keeping pileup effects under control. In this paper, we discuss the feasibility of a system capturing Level-1 intermediate data at the beam-crossing rate of 40 MHz and carrying out online analyses based on these limited-resolution data. This 40 MHz scouting system would provide fast and virtually unlimited statistics for detector diagnostics, alternative luminosity measurements and, in some cases, calibrations. It has the potential to enable the study of otherwise inaccessible signatures, either too common to fit in the Level-1 accept budget, or with requirements which are orthogonal to “mainstream” physics, such as long-lived particles. We discuss the requirements and possible architecture of a 40 MHz scouting system, as well as some of the physics potential, and results from a demonstrator operated at the end of Run-2 using the Global Muon Trigger data from CMS. Plans for further demonstrators envisaged for Run-3 are also discussed.
DOI: 10.1109/tns.2015.2409898
2015
Cited 3 times
Achieving High Performance With TCP Over 40 GbE on NUMA Architectures for CMS Data Acquisition
TCP and the socket abstraction have barely changed over the last two decades, but at the network layer there has been a giant leap from a few megabits to 100 gigabits in bandwidth. At the same time, CPU architectures have evolved into the multi-core era and applications are expected to make full use of all available resources. Applications in the data acquisition domain based on the standard socket library running in a Non-Uniform Memory Access (NUMA) architecture are unable to reach full efficiency and scalability without the software being adequately aware about the IRQ (Interrupt Request), CPU and memory affinities. During the first long shutdown of LHC, the CMS DAQ system is going to be upgraded for operation from 2015 onwards and a new software component has been designed and developed in the CMS online framework for transferring data with sockets. This software attempts to wrap the low-level socket library to ease higher-level programming with an API based on an asynchronous event driven model similar to the DAT uDAPL API. It is an event-based application with NUMA optimizations, that allows for a high throughput of data across a large distributed system. This paper describes the architecture, the technologies involved and the performance measurements of the software in the context of the CMS distributed event building.
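On a NUMA host, a first practical step of the affinity tuning described here is to pin the receiving process to the cores of the socket that owns the network interface. A minimal Linux sketch (the core numbering is machine-specific and assumed here):

```python
import os

# Pin this process to the cores assumed to sit on NUMA node 0.
# The core list is machine-specific; 0-7 is only an example.
node0_cores = set(range(8))
os.sched_setaffinity(0, node0_cores)      # 0 = the calling process

print("running on cores:", sorted(os.sched_getaffinity(0)))
# Memory and IRQ affinity need further steps (e.g. numactl, /proc/irq/*/smp_affinity),
# which are outside this sketch.
```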
DOI: 10.1088/1742-6596/664/8/082009
2015
Cited 3 times
Online data handling and storage at the CMS experiment
During the LHC Long Shutdown 1, the CMS Data Acquisition (DAQ) system underwent a partial redesign to replace obsolete network equipment, use more homogeneous switching technologies, and support new detector back-end electronics. The software and hardware infrastructure to provide input, execute the High Level Trigger (HLT) algorithms and deal with output data transport and storage has also been redesigned to be completely file-based. All the metadata needed for bookkeeping are stored in files as well, in the form of small documents using the JSON encoding. The Storage and Transfer System (STS) is responsible for aggregating these files produced by the HLT, storing them temporarily and transferring them to the T0 facility at CERN for subsequent offline processing. The STS merger service aggregates the output files from the HLT from ∼62 sources, produced at an aggregate rate of ∼2 GB/s. An estimated bandwidth of 7 GB/s in concurrent read/write mode is needed. Furthermore, the STS has to be able to store several days of continuous running, so an estimated 250 TB of total usable disk space is required. In this article we present the various technological and implementation choices of the three components of the STS: the distributed file system, the merger service and the transfer system.
DOI: 10.1088/1742-6596/664/2/022012
2015
Cited 3 times
The Diverse use of Clouds by CMS
The resources CMS is using are increasingly being offered as clouds. In Run 2 of the LHC the majority of CMS CERN resources, both in Meyrin and at the Wigner Computing Centre, will be presented as cloud resources on which CMS will have to build its own infrastructure. This infrastructure will need to run all of the CMS workflows including: Tier 0, production and user analysis. In addition, the CMS High Level Trigger will provide a compute resource comparable in scale to the total offered by the CMS Tier 1 sites, when it is not running as part of the trigger system. During these periods a cloud infrastructure will be overlaid on this resource, making it accessible for general CMS use. Finally, CMS is starting to utilise cloud resources being offered by individual institutes and is gaining experience to facilitate the use of opportunistically available cloud resources.
DOI: 10.1109/rtc.2016.7543164
2016
Cited 3 times
Performance of the new DAQ system of the CMS experiment for run-2
The data acquisition system (DAQ) of the CMS experiment at the CERN Large Hadron Collider (LHC) assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of more than 100 GB/s to the High-Level Trigger (HLT) farm. The HLT farm selects and classifies interesting events for storage and offline analysis at an output rate of around 1 kHz. The DAQ system has been redesigned during the accelerator shutdown in 2013-2014. The motivation for this upgrade was twofold. Firstly, the compute nodes, networking and storage infrastructure were reaching the end of their lifetimes. Secondly, in order to maintain physics performance with higher LHC luminosities and increasing event pileup, a number of sub-detectors are being upgraded, increasing the number of readout channels as well as the required throughput, and replacing the off-detector readout electronics with a MicroTCA-based DAQ interface. The new DAQ architecture takes advantage of the latest developments in the computing industry. For data concentration, 10/40 Gbit/s Ethernet technologies are used, and a 56 Gbit/s Infiniband FDR CLOS network (total throughput ≈ 4 Tbit/s) has been chosen for the event builder. The upgraded DAQ - HLT interface is entirely file-based, essentially decoupling the DAQ and HLT systems. The fully-built events are transported to the HLT over 10/40 Gbit/s Ethernet via a network file system. The collection of events accepted by the HLT and the corresponding metadata are buffered on a global file system before being transferred off-site. The monitoring of the HLT farm and the data-taking performance is based on the Elasticsearch analytics tool. This paper presents the requirements, implementation, and performance of the system. Experience is reported on the first year of operation with LHC proton-proton runs as well as with the heavy ion lead-lead runs in 2015.
DOI: 10.5170/cern-2005-002.159
2004
Cited 6 times
CONTROL IN THE ATLAS TDAQ SYSTEM
The unprecedented size and complexity of the ATLAS TDAQ system requires a comprehensive and flexible control system. Its role ranges from the so-called run-control, e.g. starting and stopping the data taking, to error handling and fault tolerance. It also includes initialization and verification of the overall system. Following the traditional approach, a hierarchical system of customizable controllers has been proposed. For the final system, all functionality will therefore be available in a distributed manner, with the possibility of local customization. After a technology survey, the open source expert system CLIPS has been chosen as a basis for the implementation of the supervision and the verification system. The CLIPS interpreter has been extended to provide a general control framework. Other ATLAS Online software components have been integrated as plug-ins and provide the mechanism for configuration and communication. Several components have been implemented sharing this technology. The dynamic behavior of the individual component is fully described by the rules, while the framework is based on a common implementation. During this year these components have been the subject of scalability tests up to the full system size. Encouraging results are presented and validate the technology choice.
DOI: 10.1109/tns.2006.878449
2006
Cited 6 times
Deployment of the ATLAS High-Level Trigger
The ATLAS combined test beam in the second half of 2004 saw the first deployment of the ATLAS High-Level Trigger (HLT). The next steps are deployment on the pre-series farms in the experimental area during 2005, commissioning and cosmics tests with the full detector in 2006 and collisions in 2007. This paper reviews the experience gained in the test beam, describes the current status and discusses the further enhancements to be made. We address issues related to the dataflow, integration of selection algorithms, testing, software distribution, installation and improvements.
DOI: 10.1109/tns.2007.910507
2008
Cited 4 times
The Process Manager in the ATLAS DAQ System
This paper describes the process manager in the ATLAS DAQ system. The purpose of the process manager is to perform basic process control on behalf of the software components of the DAQ system. It is able to create, destroy and monitor the basic status (e.g., running, exited, killed) of software components on the DAQ workstations and front-end processors. Section I gives a brief overview of the process manager functionalities. Section II focuses on the requirements the process manager system has to fulfil to be fully integrated in the DAQ system. Section III shows how the requirements are met by the current implementation. The communication schema between the different parts of the process manager system, the procedure to launch a process and the possible states in which a process can be are described in Sections IV, V and VI. Section VII deals with some considerations of the process manager performance, while some conclusions are given in Section VIII.
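The basic operations named here (create, monitor, destroy) map naturally onto operating-system process control. A toy sketch in Python, not the actual ATLAS implementation:

```python
import subprocess, time

# Create a process (a stand-in for a DAQ software component), monitor its
# basic state, then destroy it -- mirroring the manager's three operations.
proc = subprocess.Popen(["sleep", "30"])
print("created, pid =", proc.pid)

time.sleep(1)
print("status:", "running" if proc.poll() is None else "exited")

proc.terminate()                 # destroy: ask the process to stop (SIGTERM)
proc.wait(timeout=5)
state = "exited" if proc.returncode >= 0 else f"killed by signal {-proc.returncode}"
print("final status:", state)
```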
DOI: 10.1088/1742-6596/513/1/012014
2014
Cited 3 times
The new CMS DAQ system for LHC operation after 2014 (DAQ2)
The Data Acquisition system of the Compact Muon Solenoid experiment at CERN assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of 100 GByte/s. We present the design of the 2nd generation DAQ system, including studies of the event builder based on advanced networking technologies such as 10 and 40 Gbit/s Ethernet and 56 Gbit/s FDR Infiniband, and the exploitation of multicore CPU architectures. By the time the LHC restarts after the 2013/14 shutdown, the current compute nodes, networking, and storage infrastructure will have reached the end of their lifetime. In order to handle higher LHC luminosities and event pileup, a number of sub-detectors will be upgraded, increasing the number of readout channels and replacing the off-detector readout electronics with a μTCA implementation. The second generation DAQ system, foreseen for 2014, will need to accommodate the readout of both existing and new off-detector electronics and provide an increased throughput capacity. Advances in storage technology could make it feasible to write the output of the event builder to (RAM or SSD) disks and implement the HLT processing entirely file based.
DOI: 10.1088/1742-6596/396/1/012007
2012
Cited 3 times
Operational experience with the CMS Data Acquisition System
The data-acquisition (DAQ) system of the CMS experiment at the LHC performs the read-out and assembly of events accepted by the first level hardware trigger. Assembled events are made available to the high-level trigger (HLT), which selects interesting events for offline storage and analysis. The system is designed to handle a maximum input rate of 100 kHz and an aggregated throughput of 100 GB/s originating from approximately 500 sources and 10^8 electronic channels. An overview of the architecture and design of the hardware and software of the DAQ system is given. We report on the performance and operational experience of the DAQ and its Run Control System in the first two years of collider runs of the LHC, both in proton-proton and Pb-Pb collisions. We present an analysis of the current performance, its limitations, and the most common failure modes and discuss the ongoing evolution of the HLT capability needed to match the luminosity ramp-up of the LHC.
DOI: 10.1088/1742-6596/513/1/012031
2014
Cited 3 times
Automating the CMS DAQ
We present the automation mechanisms that have been added to the Data Acquisition and Run Control systems of the Compact Muon Solenoid (CMS) experiment during Run 1 of the LHC, ranging from the automation of routine tasks to automatic error recovery and context-sensitive guidance to the operator. These mechanisms helped CMS to maintain a data taking efficiency above 90% and to even improve it to 95% towards the end of Run 1, despite an increase in the occurrence of single-event upsets in sub-detector electronics at high LHC luminosity.
2006
Cited 5 times
The Architecture and Administration of the ATLAS Online Computing System
The needs of the ATLAS experiment at the upcoming LHC accelerator at CERN, in terms of data transmission rates and processing power, require a large cluster of computers (of the order of thousands) administered and exploited in a coherent and optimal manner. Requirements such as stability, robustness and fast recovery in case of failure impose a server-client system architecture with servers distributed in a tree-like structure and clients booted from the network. For security reasons, the system should be accessible only through an application gateway and, to ensure the autonomy of the system, the network services should be provided internally by dedicated machines in synchronization with the central services of the CERN IT department. The paper describes a small-scale implementation of the system architecture that fits the given requirements and constraints. Emphasis is put on the mechanisms and tools used to net-boot the clients via the “Boot With Me” project and to synchronize information within the cluster via the “Nile” tool.
DOI: 10.1109/tns.2006.873308
2006
Cited 5 times
Access management in the ATLAS TDAQ
In the Trigger and Data AcQuisition (TDAQ) system for the ATLAS project, authorization of users will be an important task. The main goal of the authorization will be to reduce the chance of potentially dangerous actions being taken by mistake. An Access Management (AM) component is being developed within the TDAQ to handle these issues. This paper presents the design and implementation of the component. It also describes the authorization model used and how authorization data are stored and administered for the system.
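A minimal sketch of the kind of role-based check such a component performs; the roles and actions below are invented for illustration and are not the actual ATLAS AM policy.

```python
# Illustrative role-based authorization check; roles and actions are invented.
ROLE_PERMISSIONS = {
    "shifter":  {"start_run", "stop_run"},
    "expert":   {"start_run", "stop_run", "reconfigure", "expert_command"},
    "observer": set(),
}

def is_authorized(role: str, action: str) -> bool:
    """Allow an action only if the user's role explicitly grants it."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_authorized("shifter", "reconfigure"))  # False: limits risky mistakes
print(is_authorized("expert", "reconfigure"))   # True
```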
DOI: 10.5170/cern-2005-002.105
2005
Cited 5 times
Experience with CORBA communication middleware in the ATLAS DAQ.
DOI: 10.1088/1742-6596/898/3/032019
2017
Cited 3 times
The CMS Data Acquisition - Architectures for the Phase-2 Upgrade
The upgraded High Luminosity LHC, after the third Long Shutdown (LS3), will provide an instantaneous luminosity of 7.5 × 10³⁴ cm⁻²s⁻¹ (levelled), at the price of extreme pileup of up to 200 interactions per crossing. In LS3, the CMS Detector will also undergo a major upgrade to prepare for Phase-2 of the LHC physics program, starting around 2025. The upgraded detector will be read out at an unprecedented data rate of up to 50 Tb/s and an event rate of 750 kHz. Complete events will be analysed by software algorithms running on standard processing nodes, and selected events will be stored permanently at a rate of up to 10 kHz for offline processing and analysis.
DOI: 10.1051/epjconf/201921407017
2019
Cited 3 times
Experience with dynamic resource provisioning of the CMS online cluster using a cloud overlay
The primary goal of the online cluster of the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) is to build event data from the detector and to select interesting collisions in the High Level Trigger (HLT) farm for offline storage. With more than 1500 nodes and a capacity of about 850 kHEPSpecInt06, the HLT machines represent a computing capacity similar to that of all the CMS Tier1 Grid sites together. Moreover, the cluster is currently connected to the CERN IT datacenter via a dedicated 160 Gbps network connection and can therefore access the remote EOS-based storage with high bandwidth. In the last few years, a cloud overlay based on OpenStack has been commissioned to use these resources for the WLCG when they are not needed for data taking. This online cloud facility was designed for parasitic use of the HLT, which must never interfere with its primary function as part of the DAQ system. It also abstracts away the different types of machines and their underlying segmented networks. During the LHC technical stop periods, the HLT cloud is set to its static mode of operation, where it acts like other grid facilities. The online cloud was also extended to make dynamic use of resources during periods between LHC fills. These periods are a priori unscheduled and of undetermined length, typically several hours, once or more per day. For that, the cloud dynamically follows the LHC beam states and hibernates Virtual Machines (VMs) accordingly. Finally, this work presents the design and implementation of a mechanism to dynamically ramp up VMs when the DAQ load on the HLT decreases towards the end of a fill.
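A simplified sketch of the dynamic mode of operation described above, in which the cloud follows the LHC beam state; the beam-mode query and the VM calls are placeholders for the real accelerator monitoring and OpenStack interfaces.

```python
# Sketch of beam-state-driven VM management: hibernate opportunistic VMs when
# beams are present, resume them between fills. All interfaces are placeholders.
import time

def get_lhc_beam_mode() -> str:
    """Placeholder: would query the accelerator status (e.g. 'STABLE BEAMS')."""
    return "STABLE BEAMS"

def hibernate_vms():
    print("beam back: hibernating opportunistic VMs")   # placeholder action

def resume_vms():
    print("no beam: resuming opportunistic VMs")        # placeholder action

def follow_beam_state(poll_seconds: int = 60):
    previous = None
    while True:
        mode = get_lhc_beam_mode()
        if mode != previous:
            if mode == "STABLE BEAMS":
                hibernate_vms()   # HLT needed for data taking
            else:
                resume_vms()      # inter-fill period: lend resources to the grid
            previous = mode
        time.sleep(poll_seconds)
```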
DOI: 10.1088/1742-6596/119/2/022022
2008
Cited 3 times
Event reconstruction algorithms for the ATLAS trigger
The ATLAS experiment under construction at CERN is due to begin operation at the end of 2007. The detector will record the results of proton-proton collisions at a center-of-mass energy of 14 TeV. The trigger is a three-tier system designed to identify in real-time potentially interesting events that are then saved for detailed offline analysis. The trigger system will select approximately 200 Hz of potentially interesting events out of the 40 MHz bunch-crossing rate (with 10⁹ interactions per second at the nominal luminosity).
DOI: 10.1109/rtc.2014.7097437
2014
The new CMS DAQ system for run-2 of the LHC
The data acquisition system (DAQ) of the CMS experiment at the CERN Large Hadron Collider assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of 100 GB/s to the high level trigger (HLT) farm. The HLT farm selects interesting events for storage and offline analysis at a rate of around 1 kHz. The DAQ system has been redesigned during the accelerator shutdown in 2013/14. The motivation is twofold: firstly, the current compute nodes, networking, and storage infrastructure will have reached the end of their lifetime by the time the LHC restarts; secondly, in order to handle higher LHC luminosities and event pileup, a number of sub-detectors will be upgraded, increasing the number of readout channels and replacing the off-detector readout electronics with a μTCA implementation. The new DAQ architecture will take advantage of the latest developments in the computing industry. For data concentration, 10/40 Gb/s Ethernet technologies will be used, as well as an implementation of a reduced TCP/IP in FPGA for a reliable transport between custom electronics and commercial computing hardware. A 56 Gb/s Infiniband FDR Clos network has been chosen for the event builder with a throughput of ~4 Tb/s. The HLT processing is entirely file based. This allows the DAQ and HLT systems to be independent, and to use the HLT software in the same way as for the offline processing. The fully built events are sent to the HLT with 1/10/40 Gb/s Ethernet via network file systems. Hierarchical collection of HLT-accepted events and monitoring metadata is stored in a global file system. This paper presents the requirements, technical choices, and performance of the new system.
2013
10 Gbps TCP/IP streams from the FPGA for the CMS DAQ eventbuilder network
DOI: 10.1109/tns.2023.3244696
2023
Progress in Design and Testing of the DAQ and Data-Flow Control for the Phase-2 Upgrade of the CMS Experiment
The CMS detector will undergo a major upgrade for Phase-2 of the LHC program, the High-Luminosity LHC. The upgraded CMS detector will be read out at an unprecedented data rate exceeding 50 Tb/s, with a Level-1 trigger selecting events at a rate of 750 kHz, and an average event size reaching 8.5 MB. The Phase-2 CMS back-end electronics will be based on the ATCA standard, with node boards receiving the detector data from the front-ends via custom, radiation-tolerant, optical links. The CMS Phase-2 data acquisition (DAQ) design tightens the integration between trigger control and data flow, extending the synchronous regime of the DAQ system. At the core of the design is the DAQ and Timing Hub (DTH), a custom ATCA hub card forming the bridge between the different, detector-specific, control and readout electronics and the common timing, trigger, and control systems. The overall synchronisation and data flow of the experiment is handled by the Trigger and Timing Control and Distribution System (TCDS). For increased flexibility during commissioning and calibration runs, the Phase-2 architecture breaks with the traditional distribution tree, in favour of a configurable network connecting multiple independent control units to all off-detector endpoints. This paper describes the overall Phase-2 TCDS architecture, and briefly compares it to previous CMS implementations. It then discusses the design and prototyping experience of the DTH, and concludes with the convergence of this prototyping process into the (pre)production phase, starting in early 2023.
DOI: 10.1088/1742-6596/396/1/012041
2012
High availability through full redundancy of the CMS detector controls system
The CMS detector control system (DCS) is responsible for controlling and monitoring the detector status and for the operation of all CMS sub-detectors and infrastructure. This is required to ensure safe and efficient data taking so that high-quality physics data can be recorded. The current system architecture is composed of more than 100 servers in order to provide the required processing resources. An optimization of the system software and hardware architecture is under development to ensure redundancy of all the controlled subsystems and to reduce any downtime due to hardware or software failures. The new optimized structure is based mainly on powerful and highly reliable blade servers and makes use of a fully redundant approach, guaranteeing high availability and reliability. The analysis of the requirements, the challenges, the improvements and the optimized system architecture as well as its specific hardware and software solutions are presented.
DOI: 10.1088/1742-6596/898/3/032020
2017
Performance of the CMS Event Builder
DOI: 10.1051/epjconf/201921401015
2019
Operational experience with the new CMS DAQ-Expert
The data acquisition (DAQ) system of the Compact Muon Solenoid (CMS) at CERN reads out the detector at the level-1 trigger accept rate of 100 kHz, assembles events with a bandwidth of 200 GB/s, provides these events to the high-level trigger running on a farm of about 30k cores and records the accepted events. Comprising custom-built and cutting-edge commercial hardware and several thousand instances of software applications, the DAQ system is complex in itself and failures cannot be completely excluded. Moreover, problems in the readout of the detectors, in the first level trigger system or in the high level trigger may provoke anomalous behaviour of the DAQ system which sometimes cannot easily be differentiated from a problem in the DAQ system itself. In order to achieve high data taking efficiency with operators from the entire collaboration and without relying too heavily on the on-call experts, an expert system, the DAQ-Expert, has been developed that can pinpoint the source of most failures and give advice to the shift crew on how to recover in the quickest way. The DAQ-Expert constantly analyzes monitoring data from the DAQ system and the high level trigger by making use of logic modules written in Java that encapsulate the expert knowledge about potential operational problems. The results of the reasoning are presented to the operator in a web-based dashboard, may trigger sound alerts in the control room and are archived for post-mortem analysis, presented in a web-based timeline browser. We present the design of the DAQ-Expert and report on the operational experience since 2017, when it was first put into production.
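The actual logic modules are written in Java; the Python sketch below only illustrates the idea of encapsulating one piece of expert knowledge per module, with invented monitoring fields and advice text.

```python
# Illustrative "logic module": one class per known operational problem.
# The condition, monitoring field names and advice are invented examples.
class LogicModule:
    name = "base"
    def matches(self, snapshot: dict) -> bool:
        raise NotImplementedError
    def advice(self) -> str:
        raise NotImplementedError

class BackpressureFromHLT(LogicModule):
    name = "HLT backpressure"
    def matches(self, snapshot):
        # Hypothetical symptoms: trigger rate collapsed while the HLT queue is full.
        return snapshot.get("l1_rate_khz", 0) < 10 and snapshot.get("hlt_queue_full", False)
    def advice(self):
        return "HLT farm saturated: check for a stuck filter unit, then resync."

def analyse(snapshot, modules):
    """Return (module name, advice) for every module matching the snapshot."""
    return [(m.name, m.advice()) for m in modules if m.matches(snapshot)]

print(analyse({"l1_rate_khz": 2, "hlt_queue_full": True}, [BackpressureFromHLT()]))
```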
2016
Opportunistic usage of the CMS online cluster using a cloud overlay
DOI: 10.1109/nssmic.2000.949420
2002
Cited 4 times
IEEE 802.3 Ethernet, current status and future prospects at the LHC
The status of the IEEE 802.3 standard is reviewed and prospects for the future, including the new 10 Gigabit version of Ethernet, are discussed. The relevance of Ethernet for experiments at the CERN Large Hadron Collider is considered with emphasis on on-line applications and areas which are technically challenging.
DOI: 10.1109/rtc.2007.4382770
2007
The ATLAS DAQ System Online Configurations Database Service Challenge
This paper describes challenging requirements on the configuration service for the ATLAS experiment at CERN. It presents the status of the implementation and testing one year before the start of data taking, providing details of: 1. the capabilities of the underlying OKS object manager to store and archive configuration descriptions, and its user and programming interfaces; 2. the organization of configuration descriptions for different types of data taking runs and combinations of participating sub-detectors; 3. the scalable architecture to support simultaneous access to the service by thousands of processes during the online configuration stage of ATLAS; 4. the experience with the use of the configuration service during large scale tests, test beam, commissioning and technical runs. The paper also presents the pros and cons of the chosen object-oriented implementation compared with solutions based on pure relational database technologies, and explains why, after several years of usage, we continue with our approach.
DOI: 10.1109/rtc.2007.4382744
2007
The Process Manager in the ATLAS DAQ System
This paper describes the process manager in the ATLAS DAQ system. The purpose of the process manager is to perform basic process control on behalf of the software components of the DAQ system. It is able to create, destroy and monitor the basic status (e.g., running, exited, killed) of software components on the DAQ workstations and front-end processors. Section I gives a brief overview of the process manager functionalities. Section II focuses on the requirements the process manager system has to fulfil to be fully integrated in the DAQ system. Section III shows how the requirements are met by the current implementation. The communication schema between the different parts of the process manager system, the procedure to launch a process and the possible states a process can be in are described in Sections IV, V and VI. Section VII discusses the performance of the process manager, while conclusions are given in Section VIII.
DOI: 10.1109/nssmic.2007.4436300
2007
The ATLAS event builder
Event data from proton-proton collisions at the LHC will be selected by the ATLAS experiment in a three-level trigger system, which, at its first two trigger levels (LVL1+LVL2), reduces the initial bunch crossing rate of 40 MHz to ~3 kHz. At this rate, the Event Builder collects the data from the readout system PCs (ROSs) and provides fully assembled events to the Event Filter (EF). The EF is the third trigger level and its aim is to achieve a further rate reduction to ~200 Hz on the permanent storage. The Event Builder is based on a farm of O(100) PCs, interconnected via Gigabit Ethernet to O(150) ROSs. These PCs run Linux and multi-threaded software applications implemented in C++. All the ROSs, and substantial fractions of the Event Builder and Event Filter PCs, have been installed and commissioned. We report on performance tests on this initial system, which is capable of going beyond the required data rates and bandwidths for Event Building for the ATLAS experiment.
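As a toy illustration of the event-building step (collecting fragments until every readout source has contributed), with invented class and source names:

```python
# Toy event builder: fragments keyed by (event_id, source) are collected until
# one fragment from every source has arrived; then the event is complete.
class EventBuilder:
    def __init__(self, sources):
        self.sources = set(sources)
        self.pending = {}          # event_id -> {source: fragment}

    def add_fragment(self, event_id, source, fragment):
        frags = self.pending.setdefault(event_id, {})
        frags[source] = fragment
        if set(frags) == self.sources:          # fully assembled event
            return self.pending.pop(event_id)
        return None

eb = EventBuilder(["ROS1", "ROS2"])            # two example readout sources
eb.add_fragment(1, "ROS1", b"\x01")
print(eb.add_fragment(1, "ROS2", b"\x02"))     # complete event returned here
```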
DOI: 10.1109/rtc.2007.4382844
2007
Integration of the Trigger and Data Acquisition systems in ATLAS
During 2006 and spring 2007, integration and commissioning of trigger and data acquisition (TDAQ) equipment in the ATLAS experimental area has progressed. Much of the work has focused on a final prototype setup consisting of around eighty computers representing a subset of the full TDAQ system. There have been a series of technical runs using this setup. Various tests have been run including ones where around 6k Level-1 pre-selected simulated proton-proton events have been processed in a loop mode through the trigger and dataflow chains. The system included the readout buffers containing the events, event building, second level and third level trigger algorithms. Quantities critical for the final system, such as event processing times, have been studied using different trigger algorithms as well as different dataflow components.
DOI: 10.1088/1742-6596/664/8/082035
2015
A New Event Builder for CMS Run II
The data acquisition system (DAQ) of the CMS experiment at the CERN Large Hadron Collider (LHC) assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of 100 GB/s to the high-level trigger (HLT) farm. The DAQ system has been redesigned during the LHC shutdown in 2013/14. The new DAQ architecture is based on state-of-the-art network technologies for the event building. For the data concentration, 10/40 Gbps Ethernet technologies are used together with a reduced TCP/IP protocol implemented in FPGA for a reliable transport between custom electronics and commercial computing hardware. A 56 Gbps Infiniband FDR CLOS network has been chosen for the event builder. This paper discusses the software design, protocols, and optimizations for exploiting the hardware capabilities. We present performance measurements from small-scale prototypes and from the full-scale production system.
DOI: 10.1088/1742-6596/664/8/082033
2015
File-based data flow in the CMS Filter Farm
During the LHC Long Shutdown 1, the CMS Data Acquisition system underwent a partial redesign to replace obsolete network equipment, use more homogeneous switching technologies, and prepare the ground for future upgrades of the detector front-ends. The software and hardware infrastructure to provide input, execute the High Level Trigger (HLT) algorithms and deal with output data transport and storage has also been redesigned to be completely file-based. This approach provides additional decoupling between the HLT algorithms and the input and output data flow. All the metadata needed for bookkeeping of the data flow and the HLT process lifetimes are also generated in the form of small "documents" using the JSON encoding, by either services in the flow of the HLT execution (for rates etc.) or watchdog processes. These "files" can remain memory-resident or be written to disk if they are to be used in another part of the system (e.g. for aggregation of output data). We discuss how this redesign improves the robustness and flexibility of the CMS DAQ and the performance of the system currently being commissioned for the LHC Run 2.
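A minimal sketch of the file-based bookkeeping idea, in which a service emits one small JSON document per unit of work; the field names, values and file naming below are invented for the example.

```python
# Sketch: write a small JSON bookkeeping document atomically, so that a
# concurrent aggregator never reads a partially written file. Fields are examples.
import json, os, tempfile

def write_metadata_document(directory, run, lumisection, events_in, events_accepted):
    doc = {
        "run": run,
        "ls": lumisection,
        "events_in": events_in,
        "events_accepted": events_accepted,
    }
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(doc, f)
    final = os.path.join(directory, f"run{run}_ls{lumisection}.jsn")
    os.rename(tmp, final)      # atomic move into place
    return final

print(write_metadata_document("/tmp", 300000, 42, 11500, 230))  # example values
```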
DOI: 10.18429/jacow-icalepcs2015-wepgf013
2015
Increasing Availability by Implementing Software Redundancy in the CMS Detector Control System
DOI: 10.1109/rtc.2005.1547459
2005
Deployment of the ATLAS high level trigger
The ATLAS combined test beam in the second half of 2004 saw the first deployment of the ATLAS high-level triggers (HLT). The next steps are deployment on the pre-series farms in the experimental area during 2005, commissioning and cosmics tests in 2006 and collisions in 2007. This paper reviews the experience gained in the test beam, describes the current status and discusses the further enhancements to be made. We address issues related to the dataflow, selection algorithms, testing, software distribution, installation and improvements.
DOI: 10.1109/rtc.2005.1547446
2005
ATLAS DataFlow: the read-out subsystem, results from trigger and data-acquisition system testbed studies and from modeling
In the ATLAS experiment at the LHC, the output of readout hardware specific to each subdetector will be transmitted to buffers, located on custom-made PCI cards ("ROBINs"). The data consist of fragments of events accepted by the first-level trigger at a maximum rate of 100 kHz. Groups of four ROBINs will be hosted in about 150 read-out subsystem (ROS) PCs. Event data are forwarded on request via Gigabit Ethernet links and switches to the second-level trigger or to the event builder. In this paper a discussion of the functionality and real-time properties of the ROS is combined with a presentation of measurement and modeling results for a testbed with a size of about 20% of the final DAQ system. Experimental results on strategies for optimizing the system performance, such as utilization of different network architectures and network transfer protocols, are presented for the testbed, together with extrapolations to the full system.
DOI: 10.1088/1742-6596/396/1/012038
2012
Distributed error and alarm processing in the CMS data acquisition system
The error and alarm system for the data acquisition of the Compact Muon Solenoid (CMS) at CERN was successfully used for the physics runs at the Large Hadron Collider (LHC) during the first three years of activity. Error and alarm processing entails the notification, collection, storage and visualization of all exceptional conditions occurring in the highly distributed CMS online system, using a uniform scheme. Alerts and reports are shown online by web application facilities that map them to graphical models of the system as defined by the user. A persistency service keeps a history of all exceptions that occurred, allowing subsequent retrieval of user-defined time windows of events for later playback or analysis. This paper describes the architecture and the technologies used and deals with operational aspects during the first years of LHC operation. In particular we focus on performance, stability, and integration with the CMS sub-detectors.
DOI: 10.1109/rtc.2012.6418362
2012
Recent experience and future evolution of the CMS High Level Trigger System
The CMS experiment at the LHC uses a two-stage trigger system, with events flowing from the first level trigger at a rate of 100 kHz. These events are read out by the Data Acquisition system (DAQ), assembled in memory in a farm of computers, and finally fed into the high-level trigger (HLT) software running on the farm. The HLT software selects interesting events for offline storage and analysis at a rate of a few hundred Hz. The HLT algorithms consist of sequences of offline-style reconstruction and filtering modules, executed on a farm of O(10000) CPU cores built from commodity hardware. Experience from the 2010–2011 collider run is detailed, as well as the current architecture of the CMS HLT, and its integration with the CMS reconstruction framework and CMS DAQ. The short- and medium-term evolution of the HLT software infrastructure is discussed, with future improvements aimed at supporting extensions of the HLT computing power, and addressing remaining performance and maintenance issues.
DOI: 10.1088/1742-6596/396/1/012039
2012
Upgrade of the CMS Event Builder
The Data Acquisition system of the Compact Muon Solenoid experiment at CERN assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of 100 GB/s. By the time the LHC restarts after the 2013/14 shutdown, the current computing and networking infrastructure will have reached the end of its lifetime. This paper presents design studies for an upgrade of the CMS event builder based on advanced networking technologies such as 10/40 Gb/s Ethernet and Infiniband. The results of performance measurements with small-scale test setups are shown.
DOI: 10.22323/1.270.0022
2017
Opportunistic usage of the CMS online cluster using a cloud overlay
After two years of maintenance and upgrade, the Large Hadron Collider (LHC), the largest and most powerful particle accelerator in the world, has started its second three year run. Around 1500 computers make up the CMS (Compact Muon Solenoid) Online cluster. This cluster is used for Data Acquisition of the CMS experiment at CERN, selecting and sending to storage around 20 TBytes of data per day that are then analysed by the Worldwide LHC Computing Grid (WLCG) infrastructure that links hundreds of data centres worldwide. 3000 CMS physicists can access and process data, and are always seeking more computing power and data. The backbone of the CMS Online cluster is composed of 16000 cores which provide as much computing power as all CMS WLCG Tier1 sites (352K HEP-SPEC-06 score in the CMS cluster versus 300K across CMS Tier1 sites). The computing power available in the CMS cluster can significantly speed up the processing of data, so an effort has been made to allocate the resources of the CMS Online cluster to the grid when it isn’t used to its full capacity for data acquisition. This occurs during the maintenance periods when the LHC is non-operational, which corresponded to 117 days in 2015. During 2016, the aim is to increase the availability of the CMS Online cluster for data processing by making the cluster accessible during the time between two physics collisions while the LHC and beams are being prepared. This is usually the case for a few hours every day, which would vastly increase the computing power available for data processing. Work has already been undertaken to provide this functionality, as an OpenStack cloud layer has been deployed as a minimal overlay that leaves the primary role of the cluster untouched. This overlay also abstracts the different hardware and networks that the cluster is composed of. The operation of the cloud (starting and stopping the virtual machines) is another challenge that has been overcome as the cluster has only a few hours spare during the aforementioned beam preparation. By improving the virtual image deployment and integrating the OpenStack services with the core services of the Data Acquisition on the CMS Online cluster it is now possible to start a thousand virtual machines within 10 minutes and to turn them off within seconds. This document will explain the architectural choices that were made to reach a fully redundant and scalable cloud, with a minimal impact on the running cluster configuration while giving a maximal segregation between the services. It will also present how to cold start 1000 virtual machines 25 times faster, using tools commonly utilised in all data centres.
DOI: 10.48550/arxiv.hep-ex/0305106
2003
Verification and Diagnostics Framework in ATLAS Trigger/DAQ
Trigger and data acquisition (TDAQ) systems for modern HEP experiments are composed of thousands of hardware and software components depending on each other in a very complex manner. Typically, such systems are operated by non-expert shift operators, who are not aware of system functionality details. It is therefore necessary to help the operator to control the system and to minimize system down-time by providing knowledge-based facilities for automatic testing and verification of system components and also for error diagnostics and recovery. For this purpose, a verification and diagnostic framework was developed in the scope of ATLAS TDAQ. The verification functionality of the framework allows developers to configure simple low-level tests for any component in a TDAQ configuration. The test can be configured as one or more processes running on different hosts. The framework organizes tests in sequences, using knowledge about component hierarchy and dependencies, and allowing the operator to verify the functionality of any subset of the system. The diagnostics functionality includes the possibility to analyze the test results and diagnose detected errors, e.g. by starting additional tests and understanding the reasons for failures. A conclusion about system functionality, an error diagnosis and recovery advice are presented to the operator in a GUI. The current implementation uses the CLIPS expert system shell for knowledge representation and reasoning.
DOI: 10.1109/tns.2007.913489
2008
Management of Online Processing Farms in the ATLAS Experiment
The ATLAS experiment will use of the order of three thousand nodes for its online processing farms. The administration of such a large cluster is a challenge. The ability to quickly turn machines on or off, especially after a power cut, and the ability to remotely monitor the hardware health whether a machine is on or off are some of the major issues. To solve these problems ATLAS has decided, wherever possible, to use Intelligent Platform Management Interfaces (IPMI) for its nodes. This paper presents the mechanisms which were developed to allow the distribution of management and monitoring commands to many machines. These commands were run simultaneously on the prototype farm, taking into account the specificities of the different IPMI versions and implementations, and the network topology. Results from timing measurements for the distribution of commands to many nodes, for booting and for shutting down the nodes, are shown with an extrapolation to the final cluster size.
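A simple sketch of distributing an IPMI power command to many nodes in parallel through the standard ipmitool client; host names and credentials are placeholders, and the real system additionally has to account for IPMI version differences and the network topology.

```python
# Sketch: run "ipmitool ... chassis power <action>" against many BMCs in parallel.
# Host names and credentials are placeholders for the example.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:03d}-ipmi.example.org" for i in range(1, 11)]  # placeholder hosts

def power_command(host, action="status"):
    cmd = ["ipmitool", "-H", host, "-U", "admin", "-P", "secret",
           "chassis", "power", action]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return host, result.returncode, result.stdout.strip()

with ThreadPoolExecutor(max_workers=32) as pool:      # fan out across the farm
    for host, rc, out in pool.map(power_command, NODES):
        print(host, rc, out)
```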
DOI: 10.22323/1.313.0075
2018
The FEROL40, a microTCA card interfacing custom point-to-point links and standard TCP/IP
In order to accommodate the new back-end electronics of upgraded CMS sub-detectors, a new FEROL40 card in the microTCA standard has been developed. The main function of the FEROL40 is to acquire event data over multiple point-to-point serial optical links, provide buffering, perform protocol conversion, and transmit multiple TCP/IP streams (4 × 10 Gbps) to the Ethernet network of the aggregation layer of the CMS DAQ (data acquisition) event builder. This contribution discusses the design of the FEROL40 and experience from operation.
DOI: 10.22323/1.343.0129
2019
Design and development of the DAQ and Timing Hub for CMS Phase-2
The CMS detector will undergo a major upgrade for Phase-2 of the LHC program, starting around 2026. The upgraded Level-1 hardware trigger will select events at a rate of 750 kHz. At an expected event size of 7.4 MB this corresponds to a data rate of up to 50 Tbit/s. Optical links will carry the signals from on-detector front-end electronics to back-end electronics in ATCA crates in the service cavern. A DAQ and Timing Hub board aggregates data streams from back-end boards over point-to-point links, provides buffering and transmits the data to the commercial data-to-surface network for processing and storage. This hub board is also responsible for the distribution of timing, control and trigger signals to the back-ends. This paper presents the current development towards the DAQ and Timing Hub and the design of the first prototype, to be used for validation and integration with the first back-end prototypes in 2019-2020.
DOI: 10.1051/epjconf/201921401044
2019
Presentation layer of CMS Online Monitoring System
The Compact Muon Solenoid (CMS) is one of the experiments at the CERN Large Hadron Collider (LHC). The CMS Online Monitoring system (OMS) is an upgrade and successor to the CMS Web-Based Monitoring (WBM) system, which is an essential tool for shift crew members, detector subsystem experts, operations coordinators, and those performing physics analyses. The CMS OMS is divided into aggregation and presentation layers. Communication between layers uses RESTful JSON:API compliant requests. The aggregation layer is responsible for collecting data from heterogeneous sources, storage of transformed and pre-calculated (aggregated) values and exposure of data via the RESTful API. The presentation layer displays detector information via a modern, user-friendly and customizable web interface. The CMS OMS user interface is composed of a set of cutting-edge software frameworks and tools to display non-event data to any authenticated CMS user worldwide. The web interface tree-like component structure comprises (top-down): workspaces, folders, pages, controllers and portlets. A clear hierarchy gives the required flexibility and control for content organization. Each bottom element instantiates a portlet and is a reusable component that displays a single aspect of data, like a table, a plot, an article, etc. Pages consist of multiple different portlets and can be customized at runtime by using a drag-and-drop technique. This is how a single page can easily include information from multiple online sources. Different pages give access to a summary of the current status of the experiment, as well as convenient access to historical data. This paper describes the CMS OMS architecture, core concepts and technologies of the presentation layer.
DOI: 10.1051/epjconf/201921401048
2019
A Scalable Online Monitoring System Based on Elasticsearch for Distributed Data Acquisition in Cms
The part of the CMS Data Acquisition (DAQ) system responsible for data readout and event building is a complex network of interdependent distributed applications. To ensure successful data taking, these programs have to be constantly monitored in order to facilitate the timeliness of necessary corrections in case of any deviation from specified behaviour. A large number of diverse monitoring data samples are periodically collected from multiple sources across the network. Monitoring data are kept in memory for online operations and optionally stored on disk for post-mortem analysis. We present a generic, reusable solution based on an open source NoSQL database, Elasticsearch, which is fully compatible and non-intrusive with respect to the existing system. The motivation is to benefit from off-the-shelf software to facilitate the development, maintenance and support efforts. Elasticsearch provides failover and data redundancy capabilities as well as a programming language independent JSON-over-HTTP interface. The possibility of horizontal scaling matches the requirements of a DAQ monitoring system. The data load from all sources is balanced by redistribution over an Elasticsearch cluster that can be hosted on a computer cloud. In order to achieve the necessary robustness and to validate the scalability of the approach, the above monitoring solution currently runs in parallel with an existing in-house developed DAQ monitoring system.
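A minimal sketch of pushing one monitoring sample through Elasticsearch's JSON-over-HTTP interface; the index name, document fields and URL are examples only, and the endpoint shown corresponds to recent Elasticsearch versions.

```python
# Sketch: index one monitoring document into Elasticsearch over plain HTTP.
# Index name, fields and the localhost URL are illustrative assumptions.
import json
import urllib.request

def index_sample(es_url, index, doc):
    req = urllib.request.Request(
        f"{es_url}/{index}/_doc",                      # document-indexing endpoint
        data=json.dumps(doc).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)                         # Elasticsearch acknowledgement

sample = {"host": "ru-example-01", "event_rate_hz": 101250, "backpressure": False}
# index_sample("http://localhost:9200", "daq-monitoring", sample)  # needs a running cluster
```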
DOI: 10.1051/epjconf/202024501028
2020
DAQExpert the service to increase CMS data-taking efficiency
The Data Acquisition (DAQ) system of the Compact Muon Solenoid (CMS) experiment at the LHC is a complex system responsible for the data readout, event building and recording of accepted events. Its proper functioning plays a critical role in the data-taking efficiency of the CMS experiment. In order to ensure high availability and to recover promptly in the event of hardware or software failure of the subsystems, an expert system, the DAQ Expert, has been developed. It aims at improving the data-taking efficiency, reducing human error in the operations and minimising the on-call expert demand. Introduced at the beginning of 2017, it assists the shift crew and the system experts in recovering from operational faults, streamlining the post-mortem analysis and, at the end of Run 2, triggering fully automatic recovery without human intervention. DAQ Expert analyses the real-time monitoring data originating from the DAQ components and the high-level trigger, updated every few seconds. It pinpoints data-flow problems and recovers from them automatically or after operator approval. We analyse the CMS downtime in the 2018 run, focusing on what was improved with the introduction of automated recovery, and present the challenges and the design of encoding the expert knowledge into automated recovery jobs. Furthermore, we demonstrate the web-based ReactJS interfaces that ensure an effective cooperation between the human operators in the control room and the automated recovery system. We report on the operational experience with automated recovery.
DOI: 10.1016/b978-155860869-6/50091-3
2002
OBK — An Online High Energy Physics' Meta-Data Repository
This chapter explains the role of the Online Book-Keeper (OBK) as one of the main collectors and managers of metadata produced online, how those data are stored, and the interfaces that are provided to access them—merging the physics data with the collected metadata will play an essential role in the future analysis and interpretation of the physics events observed at ATLAS. The OBK is a component of ATLAS' Online Software—the system that provides configuration, control and monitoring services to the DAQ (Data AcQuisition) system. ATLAS is one of the four detectors being built for the Large Hadron Collider (LHC) particle accelerator at CERN, to be completed in 2006. After being accelerated, particles will collide inside the detector. From observing the results of those collisions, physicists expect to be able to expand the knowledge of the basic constituents of matter.
2000
The Use of Commodity Products in the ATLAS Level-2 Trigger
The ATLAS level-2 trigger has to offer an event rate reduction of approximately 1 in 100, from an input rate of up to 100 kHz. Studies indicate that using geometrical guidance from the level-1 trigger and a sequential selection strategy, this can be achieved using largely commodity products, both for the processors and the communication networks. This paper will present the results of recent studies, indicating where commodity items are now sufficiently powerful and flexible to be used for this demanding real-time task and where custom items — either software or hardware — may still be required.
DOI: 10.1109/tns.2002.803879
2002
Process Management inside ATLAS DAQ
The Process Management component of the online software of the future ATLAS experiment data acquisition system is presented. The purpose of the Process Manager is to perform basic job control of the software components of the data acquisition system. It is capable of starting, stopping and monitoring the status of those components on the data acquisition processors independently of the underlying operating system. Its architecture is designed on the basis of a server-client model using CORBA-based communication. The server part relies on C++ software agent objects acting as an interface between the local operating system and client applications. Some of the major design challenges of the software agents were to achieve the maximum degree of autonomy possible and to create processes that are aware of dynamic conditions in their environment and able to determine corresponding actions. Issues such as the performance of the agents in terms of the time needed for process creation and destruction, the scalability of the system with the final ATLAS configuration in mind, and minimizing the use of hardware resources were also of critical importance. Besides the details given on the architecture and the implementation, we also present scalability and performance test results of the Process Manager system.
DOI: 10.1109/rtc.2005.1547456
2005
Deployment and use of the ATLAS DAQ in the combined test beam
The ATLAS collaboration at CERN operated a combined test beam (CTB) from May until November 2004. The prototype of the ATLAS data acquisition system (DAQ) was used to integrate the other subsystems into a common CTB setup. Data were collected synchronously from all the ATLAS detectors, which represented nine different detector technologies. Electronics and software of the first level trigger were used to trigger the setup. Event selection algorithms of the high level trigger were integrated with the system and were tested with real detector data. The possibility of operating a remote event filter farm synchronized with the ATLAS TDAQ was also tested. Event data, as well as detector conditions data, were made available for offline analysis.
2014
10 Gbps TCP/IP streams from the FPGA for High Energy Physics
The DAQ system of the CMS experiment at CERN collects data from more than 600 custom detector Front-End Drivers (FEDs). During 2013 and 2014 the CMS DAQ system will undergo a major upgrade to address the obsolescence of current hardware and the requirements posed by the upgrade of the LHC accelerator and various detector components. For loss-less data collection from the FEDs, a new FPGA-based card implementing the TCP/IP protocol suite over 10 Gbps Ethernet has been developed. To limit the complexity of the TCP hardware implementation, the DAQ group developed a simplified, unidirectional, but RFC 793 compliant version of the TCP protocol. This allows a PC with the standard Linux TCP/IP stack to be used as a receiver. We present the challenges and the protocol modifications made to TCP in order to simplify its FPGA implementation. We also describe the interaction between the simplified TCP and the Linux TCP/IP stack, including performance measurements.
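Because the receiving side sees a standard TCP stream, a plain socket reader is sufficient in principle; below is a minimal sketch of such a receiver, where the port and buffer size are arbitrary choices for the example.

```python
# Sketch: accept one TCP connection and count the bytes streamed by the sender.
# Port and buffer size are arbitrary example values.
import socket

def receive(port=10000, bufsize=1 << 20):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(1)
    conn, peer = srv.accept()
    total = 0
    while True:
        chunk = conn.recv(bufsize)    # data pushed by the hardware sender
        if not chunk:
            break                     # sender closed the stream
        total += len(chunk)
    conn.close()
    srv.close()
    return total
```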
DOI: 10.18429/jacow-icalepcs2015-tua3o01
2015
Detector Controls Meets JEE on the Web
DOI: 10.18429/jacow-icalepcs2015-mopgf025
2015
Enhancing the Detector Control System of the CMS Experiment with Object Oriented Modelling
2015
File-based data flow in the CMS Filter Farm
2015
A scalable monitoring for the CMS Filter Farm based on elasticsearch
A flexible monitoring system has been designed for the CMS File-based Filter Farm, making use of modern data mining and analytics components. All the metadata and monitoring information concerning data flow and execution of the HLT are generated locally in the form of small documents using the JSON encoding. These documents are indexed into a hierarchy of elasticsearch (es) clusters along with process and system log information. Elasticsearch is a search server based on Apache Lucene. It provides a distributed, multitenant-capable search and aggregation engine. Since es is schema-free, any new information can be added seamlessly and the unstructured information can be queried in non-predetermined ways. The leaf es clusters consist of the very same nodes that form the Filter Farm, thus providing natural horizontal scaling. A separate central es cluster is used to collect and index aggregated information. The fine-grained information, all the way down to individual processes, remains available in the leaf clusters. The central es cluster provides quasi-real-time high-level monitoring information to any kind of client. Historical data can be retrieved to analyse past problems or correlate them with external information. We discuss the design and performance of this system in the context of the CMS DAQ commissioning for LHC Run 2.
2015
Online data handling and storage at the CMS experiment
2014
Automating the CMS DAQ
2014
Boosting Event Building Performance using Infiniband FDR for the CMS Upgrade
DOI: 10.1109/rtc.2014.7097439
2014
Achieving high performance with TCP over 40GbE on NUMA architectures for CMS data acquisition
TCP and the socket abstraction have barely changed over the last two decades, but at the network layer there has been a giant leap from a few megabits to 100 gigabits in bandwidth. At the same time, CPU architectures have evolved into the multicore era and applications are expected to make full use of all available resources. Applications in the data acquisition domain based on the standard socket library running on a Non-Uniform Memory Access (NUMA) architecture are unable to reach full efficiency and scalability without the software being adequately aware of the IRQ (Interrupt Request), CPU and memory affinities. During the first long shutdown of the LHC, the CMS DAQ system is going to be upgraded for operation from 2015 onwards, and a new software component has been designed and developed in the CMS online framework for transferring data with sockets. This software attempts to wrap the low-level socket library to ease higher-level programming, with an API based on an asynchronous event driven model similar to the DAT uDAPL API. It is an event-based application with NUMA optimizations that allows for a high throughput of data across a large distributed system. This paper describes the architecture, the technologies involved and the performance measurements of the software in the context of the CMS distributed event building.
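A small sketch of the affinity-aware setup such a wrapper has to take care of on Linux; the CPU set is an assumption for the example, since the real mapping comes from the hardware topology, and the socket option shown is just one of the low-level knobs involved.

```python
# Sketch (Linux-only): pin the process to CPUs assumed local to the NIC's NUMA
# node and enlarge the kernel receive buffer. The CPU set is illustrative.
import os
import socket

available = os.sched_getaffinity(0)
LOCAL_CPUS = ({0, 2, 4, 6} & available) or available   # fall back if cores absent

os.sched_setaffinity(0, LOCAL_CPUS)    # restrict this process to those cores

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024 * 1024)
print(os.sched_getaffinity(0), sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
```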
DOI: 10.1088/1742-6596/396/4/042049
2012
Health and performance monitoring of the online computer cluster of CMS
The CMS experiment at the LHC features over 2'500 devices that need constant monitoring in order to ensure proper data taking. The monitoring solution has been migrated from Nagios to Icinga, with several useful plugins. The motivations behind the migration and the selection of the plugins are discussed.
2013
A Scalable and Homogeneous Web-Based Solution for Presenting CMS Control System Data
The Control System of the CMS experiment ensures the monitoring and safe operation of about 3M parameters. The high demand for access to online and historical Control System Data calls for a scalable solution combining multiple data sources. The advantage of a Web solution is that data can be accessed from everywhere with no application specific software. Moreover, a large pool of freely available components can be reused to achieve a user-friendly and effective data presentation. Access to the online information is provided with minimal impact on the running control system by using a common cache in order to be independent of the number of users. Historical data archived by the SCADA software is accessed via an Oracle Database. The web interfaces provide mostly a read-only access to data but some commands are also allowed. Moreover, developers and experts use web interfaces to deploy the control software and administer the SCADA projects in production. By using an enterprise portal, we profit from single sign-on and role-based access control. Portlets maintained by different developers are centrally integrated into dynamic pages, resulting in a consistent user experience.