
Tommaso Boccali

DOI: 10.1007/s41781-018-0018-8
2019
Cited 114 times
A Roadmap for HEP Software and Computing R&D for the 2020s
Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the sheer amounts of data to be recorded. In planning for the HL-LHC in particular, it is critical that all of the collaborating stakeholders agree on the software goals and priorities, and that the efforts complement each other. In this spirit, this white paper describes the R&D activities required to prepare for this software upgrade.
DOI: 10.1063/5.0044445
2021
Cited 30 times
The novel Mechanical Ventilator Milano for the COVID-19 pandemic
Presented here is the design of the Mechanical Ventilator Milano (MVM), a novel mechanical ventilator designed for rapid mass production in response to the COVID-19 pandemic, to address the urgent shortage of intensive therapy ventilators in many countries and the growing difficulty in procuring these devices through normal supply chains across borders. This ventilator is an electro-mechanical equivalent of the old and reliable Manley Ventilator, and is able to operate in both pressure-controlled and pressure-supported ventilation modes. MVM is optimized for the COVID-19 emergency, thanks to collaboration with medical doctors on the front line. MVM is designed for large-scale production in a short amount of time and at a limited cost, as it relies on off-the-shelf components, readily available worldwide. Operation of the MVM requires only a source of compressed oxygen (or compressed medical air) and electrical power. Initial tests of a prototype device with a breathing simulator are also presented. Further tests and developments are underway. At this stage the MVM is not yet a certified medical device, but certification is in progress.
DOI: 10.1088/1742-6596/2438/1/012039
2023
Extending the distributed computing infrastructure of the CMS experiment with HPC resources
Particle accelerators are an important tool to study the fundamental properties of elementary particles. Currently the highest energy accelerator is the LHC at CERN, in Geneva, Switzerland. Each of its four major detectors, including the CMS detector, produces dozens of Petabytes of data per year to be analyzed by a large international collaboration. The processing is carried out on the Worldwide LHC Computing Grid, which spans more than 170 compute centers around the world and is used by a number of particle physics experiments. Recently the LHC experiments were encouraged to make increasing use of HPC resources. While Grid resources are homogeneous with respect to the Grid middleware used, HPC installations can be very different in their setup. In order to integrate HPC resources into the highly automated processing setups of the CMS experiment, a number of challenges need to be addressed. For processing, access to primary data and metadata, as well as access to the software, is required. At Grid sites all this is achieved via a number of services provided by each center. However, at HPC sites many of these capabilities cannot be easily provided and have to be enabled in user space or by other means. At HPC centers there are often restrictions regarding network access to remote services, which is again a severe limitation. The paper discusses a number of solutions and recent experiences by the CMS experiment to include HPC resources in processing campaigns.
DOI: 10.1051/epjconf/202429508013
2024
ML_INFN project: Status report and future perspectives
The ML_INFN initiative (“Machine Learning at INFN”) is an effort to foster Machine Learning (ML) activities at the Italian National Institute for Nuclear Physics (INFN). In recent years, artificial-intelligence-inspired activities have flourished bottom-up in many efforts in physics, both at the experimental and theoretical level. Many researchers have procured desktop-level devices with consumer-oriented GPUs, and have trained themselves in a variety of ways, from webinars to books and tutorials. ML_INFN aims to help and systematize such efforts in multiple ways: by offering state-of-the-art hardware for ML, leveraging the INFN Cloud provisioning solutions to share GPUs more efficiently and level access to such resources for all INFN researchers, and by organizing and curating knowledge bases with production-grade examples from successful activities already in production. Moreover, training events have been organized for beginners, based on existing INFN ML research and focused on flattening the learning curve. In this contribution, we update the status of the project, reporting in particular on the development of tools to take advantage of High-Performance Computing resources provisioned by the CNAF and ReCaS computing centers for interactive support of activities, and on the organization of the first in-person advanced-level training event, with a GPU-equipped cloud-based environment provided to each participant.
DOI: 10.1051/epjconf/202429510003
2024
ICSC: The Italian National Research Centre on HPC, Big Data and Quantum computing
ICSC (“Italian Center for SuperComputing”) is one of the five Italian National Centres created within the framework of the NextGenerationEU funding by the European Commission. The aim of ICSC, designed and approved through 2022 and eventually started in September 2022, is to create the national digital infrastructure for research and innovation, leveraging existing HPC, HTC and Big Data infrastructures and evolving towards a cloud data-lake model. It will be available to the scientific and industrial communities through flexible and uniform cloud web interfaces and will rely on a high-level support team; as such, it will form a globally attractive ecosystem based on strategic public-private partnerships to fully exploit top-level digital infrastructure for scientific and technical computing and promote the development of new computing technologies. The ICSC IT infrastructure is built upon existing scientific digital infrastructures as provided by the major national players: GARR, the Italian NREN, provides the network infrastructure, whose capacity will be upgraded to multiples of Tbps; CINECA hosts Leonardo, one of the world's largest HPC systems, with a performance of over 250 Pflops, to be further increased and complemented with a quantum computer; INFN contributes with its distributed Big Data cloud infrastructure, built over the last decades to respond to the needs of the HEP community. On top of the IT infrastructure, several thematic activities will be funded and will focus on the development of tools and applications in several research domains. Of particular relevance to this audience are the activities on "Fundamental Research and Space Economy" and "Astrophysics and Cosmos Observations", strictly aligned with the INFN and HEP core activities. Finally, two technological research activities will foster research on "Future HPC and Big Data" and "Quantum Computing".
DOI: 10.1051/epjconf/202429507030
2024
Migrating the INFN-CNAF datacenter to the Bologna Tecnopolo: A status update
The INFN Tier-1 data center is currently located in the premises of the Physics Department of the University of Bologna, where CNAF is also located. During 2023 it will be moved to the “Tecnopolo”, the new facility for research, innovation, and technological development in the same city area; the same location also hosts Leonardo, the pre-exascale supercomputing machine managed by CINECA, co-financed as part of the EuroHPC Joint Undertaking and ranked 4th in the November 2022 Top500 list. The construction of the new CNAF data center consists of two phases, corresponding to the computing requirements of the LHC: Phase 1 involves an IT power of 3 MW, and Phase 2, starting from 2025, involves an IT power of up to 10 MW. The new data center is designed to cope with the computing requirements of the data taking of the HL-LHC experiments in the period from 2026 to 2040, and will provide, at the same time, computing services for several other INFN experiments and projects, not only belonging to the HEP domain. The co-location with Leonardo opens wider possibilities to integrate HTC and HPC resources, and the new CNAF data center will be tightly coupled with it, allowing access from a single entry point to resources located at CNAF and provided by the supercomputer. Data access from both infrastructures will be transparent to users. In this presentation we describe the new data center design, providing a status update on the migration, and we focus on the Leonardo integration, showing the results of preliminary tests accessing it from the CNAF access points.
DOI: 10.1051/epjconf/202429507024
2024
HEPScore: A new CPU benchmark for the WLCG
HEPScore is a new CPU benchmark created to replace the HEPSPEC06 benchmark that is currently used by the WLCG for procurement, computing resource pledges, usage accounting and performance studies. The development of the new benchmark, based on HEP applications or workloads, has involved many contributions from software developers, data analysts, experts of the experiments, representatives of several WLCG computing centres and WLCG site managers. In this contribution, we review the selection of workloads and the validation of the new HEPScore benchmark.
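A composite benchmark of this kind must reduce many per-workload scores to one number; a geometric mean is the usual choice, as in HEPScore, since it keeps any single workload from dominating the result. A minimal sketch with hypothetical scores (not actual HEPScore workloads or numbers):

```python
import math

def combined_score(workload_scores):
    """Combine per-workload throughput scores into a single benchmark
    figure via the geometric mean."""
    n = len(workload_scores)
    return math.prod(workload_scores) ** (1.0 / n)

# Hypothetical normalized per-workload scores:
scores = [12.0, 8.0, 27.0]
print(round(combined_score(scores), 3))
```

With a geometric mean, doubling one workload's score raises the total by the same factor regardless of that workload's absolute scale, which an arithmetic mean would not guarantee.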
DOI: 10.1051/epjconf/202429511006
2024
Enabling INFN–T1 to support heterogeneous computing architectures
The INFN–CNAF Tier-1, located in Bologna (Italy), is a center of the WLCG e-Infrastructure providing computing power to the four major LHC collaborations; it also supports the computing needs of about fifty other groups, including from non-HEP research domains. The CNAF Tier-1 center has historically been very active in the integration of computing resources, proposing and prototyping solutions both for extension through Cloud resources, public and private, and with remotely owned sites, as well as developing an integrated HTC+HPC system with the PRACE CINECA supercomputer center located 8 km from the Tier-1. In order to meet the requirements for the new Tecnopolo center, where the CNAF Tier-1 will be hosted, the resource integration activities keep progressing. In particular, this contribution details the challenges that have recently been addressed in providing opportunistic access to non-standard CPU architectures, such as PowerPC, and hardware accelerators (GPUs). We explain the approach adopted both to transparently provision x86_64, ppc64le and NVIDIA V100 GPUs from the Marconi 100 HPC cluster managed by CINECA, and to access data from the Tier-1 storage system at CNAF. The solution adopted is general enough to enable seamless integration of other computing architectures from different providers at the same time, such as ARM CPUs from the TEXTAROSSA project, and we report on the integration of these within the computing model of the CMS experiment. Finally, we discuss the results of this early experience.
DOI: 10.1109/bdc.2015.33
2015
Cited 11 times
Any Data, Any Time, Anywhere: Global Data Access for Science
Data access is key to science driven by distributed high-throughput computing (DHTC), an essential technology for many major research projects such as High Energy Physics (HEP) experiments. However, achieving efficient data access becomes quite difficult when many independent storage sites are involved because users are burdened with learning the intricacies of accessing each system and keeping careful track of data location. We present an alternate approach: the Any Data, Any Time, Anywhere infrastructure. Combining several existing software products, AAA presents a global, unified view of storage systems - a "data federation," a global filesystem for software delivery, and a workflow management system. We present how one HEP experiment, the Compact Muon Solenoid (CMS), is utilizing the AAA infrastructure and some simple performance metrics.
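The "global, unified view" can be pictured as a redirector: given a logical file name, it answers with an access URL at whichever site holds a replica, so users never track data location themselves. The site names, catalogs, and function below are invented for illustration; the real federation is built on the XRootD protocol:

```python
# Toy model of a data federation redirector. Site names and file
# catalogs are hypothetical, for illustration only.
SITE_CATALOGS = {
    "T2_IT_Pisa": {"/store/data/run1.root", "/store/mc/sim1.root"},
    "T1_US_FNAL": {"/store/data/run1.root", "/store/data/run2.root"},
}

def redirect(lfn):
    """Return an access URL for the first site holding the file,
    or raise if no replica exists anywhere in the federation."""
    for site, catalog in SITE_CATALOGS.items():
        if lfn in catalog:
            return f"root://{site}/{lfn}"
    raise FileNotFoundError(lfn)

print(redirect("/store/data/run2.root"))
```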
DOI: 10.1155/2017/7206595
2017
Cited 10 times
Power-Efficient Computing: Experiences from the COSA Project
Energy consumption is today one of the most relevant issues in operating HPC systems for scientific applications. The use of unconventional computing systems is therefore of great interest for several scientific communities looking for a better tradeoff between time-to-solution and energy-to-solution. In this context, the performance assessment of processors with a high ratio of performance per watt is necessary to understand how to realize energy-efficient computing systems for scientific applications, using this class of processors. Computing On SOC Architecture (COSA) is a three-year project (2015–2017) funded by the Scientific Commission V of the Italian Institute for Nuclear Physics (INFN), which aims to investigate the performance and the total cost of ownership offered by computing systems based on commodity low-power Systems on Chip (SoCs) and high energy-efficient systems based on GP-GPUs. In this work, we present the results of the project analyzing the performance of several scientific applications on several GPU- and SoC-based systems. We also describe the methodology we have used to measure energy performance and the tools we have implemented to monitor the power drained by applications while running.
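The tradeoff between time-to-solution and energy-to-solution reduces to average power times runtime; a sketch with hypothetical numbers (not measurements from the COSA project):

```python
def energy_to_solution(avg_power_w, time_s):
    """Energy in joules consumed to complete a workload."""
    return avg_power_w * time_s

# Hypothetical figures: a low-power SoC vs a server CPU on the same job.
soc  = energy_to_solution(avg_power_w=15.0,  time_s=4000.0)  # slower, frugal
xeon = energy_to_solution(avg_power_w=150.0, time_s=600.0)   # faster, hungry
print(soc, xeon)
```

In this made-up case the SoC wins on energy-to-solution despite a much worse time-to-solution, which is exactly the tradeoff the project set out to quantify on real hardware.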
DOI: 10.1016/j.revip.2019.100034
2019
Cited 10 times
Computing models in high energy physics
High Energy Physics experiments (HEP experiments in the following) have been, for at least the last three decades, at the forefront of technology, in aspects like detector design and construction, number of collaborators, and complexity of data analyses. Unusually for earlier particle physics experiments, the computing and data handling aspects have not been marginal in their design and operations; the cost of the IT-related components, from software development to storage systems and distributed complex e-Infrastructures, has risen to a level which needs proper understanding and planning from the first moments in the lifetime of an experiment. In the following sections we first explore the computing and software solutions developed and operated in the most relevant past and present experiments, with a focus on the technologies deployed; a technology tracking section is presented in order to pave the way to possible solutions for next-decade experiments, and beyond. While the focus of this review is on offline computing models, the distinction is a blurry one, and some experiments have already experienced cross-contamination between trigger selection and offline workflows; this trend is anticipated to continue in the future.
DOI: 10.1016/j.ejmp.2021.10.005
2021
Cited 7 times
Enhancing the impact of Artificial Intelligence in Medicine: A joint AIFM-INFN Italian initiative for a dedicated cloud-based computing infrastructure
Artificial Intelligence (AI) techniques have been implemented in the field of Medical Imaging for more than forty years. Medical Physicists, Clinicians and Computer Scientists have been collaborating since the beginning to realize software solutions to enhance the informative content of medical images, including AI-based support systems for image interpretation. Despite the recent massive progress in this field due to the current emphasis on Radiomics, Machine Learning and Deep Learning, there are still some barriers to overcome before these tools are fully integrated into the clinical workflows to finally enable a precision medicine approach to patients' care. Nowadays, as Medical Imaging has entered the Big Data era, innovative solutions to efficiently deal with huge amounts of data and to exploit large and distributed computing resources are urgently needed. In the framework of a collaboration agreement between the Italian Association of Medical Physicists (AIFM) and the National Institute for Nuclear Physics (INFN), we propose a model of an intensive computing infrastructure, especially suited for training AI models, equipped with secure storage systems, compliant with data protection regulation, which will accelerate the development and extensive validation of AI-based solutions in the Medical Imaging field of research. This solution can be developed and made operational by Physicists and Computer Scientists working on complementary fields of research in Physics, such as High Energy Physics and Medical Physics, who have all the necessary skills to tailor the AI technology to the needs of the Medical Imaging community and to shorten the pathway towards the clinical applicability of AI-based decision support systems.
DOI: 10.1016/j.micpro.2022.104679
2022
Cited 4 times
Towards EXtreme scale technologies and accelerators for euROhpc hw/Sw supercomputing applications for exascale: The TEXTAROSSA approach
In the near future, Exascale systems will need to bridge three technology gaps to achieve high performance while remaining under tight power constraints: energy efficiency and thermal control; extreme computation efficiency via HW acceleration and new arithmetic; methods and tools for seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA addresses these gaps through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of HW and SW IPs, programming models, and tools derived from European research.
DOI: 10.1051/epjconf/201921407027
2019
Cited 8 times
Exploiting private and commercial clouds to generate on-demand CMS computing facilities with DODAS
Minimising time and cost is key to exploiting private or commercial clouds. This can be achieved by increasing setup and operational efficiencies. Success and sustainability are thus obtained by reducing the learning curve, as well as the operational cost of managing community-specific services running on distributed environments. The greatest beneficiaries of this approach are communities willing to exploit opportunistic cloud resources. DODAS builds on several EOSC-hub services developed by the INDIGO-DataCloud project and allows one to instantiate on-demand container-based clusters. These execute software applications on potentially “any cloud provider”, generating sites on demand with almost zero effort. DODAS provides ready-to-use solutions to implement a “Batch System as a Service” as well as a Big Data platform for “Machine Learning as a Service”, offering a high level of customization to integrate specific scenarios. A description of the DODAS architecture will be given, including the CMS integration strategy adopted to connect it with the experiment’s HTCondor Global Pool. Performance and scalability results of DODAS-generated tiers processing real CMS analysis jobs will be presented. The Instituto de Física de Cantabria and Imperial College London use cases will be sketched. Finally, a high-level strategy overview for optimizing data ingestion in DODAS will be described.
DOI: 10.1088/1742-6596/1085/3/032055
2018
Cited 8 times
Exploiting Apache Spark platform for CMS computing analytics
CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage, with different open-source, user-level tools available for analytical purposes. The CMS experiment has been collecting a large set of computing metadata, e.g. dataset and file-access logs, since 2015. These records represent a valuable, yet scarcely investigated, set of information that needs to be cleaned, categorized and analyzed. CMS can use this information to discover useful patterns and enhance the overall efficiency of the distributed data, improving CPU and site utilization as well as task completion time. Here we present an evaluation of the Apache Spark platform for CMS needs. We discuss two main use cases, CMS analytics and ML studies, where efficiently processing billions of records stored on HDFS plays an important role. We demonstrate that both the Scala and Python (PySpark) APIs can be successfully used to execute extremely I/O-intensive queries and provide valuable insight from the collected metadata.
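The kind of aggregation such queries perform, e.g. counting accesses per dataset, the equivalent of a Spark `df.groupBy("dataset").count()`, can be sketched in plain Python (the records and names are invented for illustration):

```python
from collections import Counter

# File-access log records of the kind described above (fields
# and dataset names are hypothetical):
accesses = [
    {"dataset": "/ZMM/Run2015", "site": "T2_IT_Pisa"},
    {"dataset": "/ZMM/Run2015", "site": "T1_US_FNAL"},
    {"dataset": "/TTJets/Sim",  "site": "T2_IT_Pisa"},
    {"dataset": "/ZMM/Run2015", "site": "T2_IT_Pisa"},
]

# Group by dataset and count accesses, then rank by popularity:
popularity = Counter(rec["dataset"] for rec in accesses)
print(popularity.most_common(1))
```

At CMS scale the same logic runs distributed over HDFS, which is where Spark's I/O parallelism matters.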
DOI: 10.1088/1742-6596/513/5/052008
2014
Cited 7 times
Explorations of the viability of ARM and Xeon Phi for physics processing
We report on our investigations into the viability of the ARM processor and the Intel Xeon Phi co-processor for scientific computing. We describe our experience porting software to these processors and running benchmarks using real physics applications to explore the potential of these processors for production physics processing.
DOI: 10.1088/1742-6596/664/3/032003
2015
Cited 6 times
Exploiting CMS data popularity to model the evolution of data management for Run-2 and beyond
During the LHC Run-1 data taking, all experiments collected large data volumes from proton-proton and heavy-ion collisions. The collision data, together with massive volumes of simulated data, were replicated in multiple copies, transferred among various Tier levels, and transformed/slimmed in format/content. These data were then accessed (both locally and remotely) by large groups of distributed analysis communities exploiting the Worldwide LHC Computing Grid infrastructure and services. While efficient data placement strategies - together with optimal data redistribution and deletions on demand - have become the core of static versus dynamic data management projects, little effort has so far been invested in understanding the detailed data-access patterns which surfaced in Run-1. These patterns, if understood, can be used as input to simulation of computing models at the LHC, to optimise existing systems by tuning their behaviour, and to explore next-generation CPU/storage/network co-scheduling solutions. This is of great importance, given that the scale of the computing problem will increase far faster than the resources available to the experiments, for Run-2 and beyond. Studying data-access patterns involves the validation of the quality of the monitoring data collected on the "popularity" of each dataset; the analysis of the frequency and pattern of accesses to different datasets by analysis end-users; the exploration of different views of the popularity data (by physics activity, by region, by data type); the study of the evolution of Run-1 data exploitation over time; and the evaluation of the impact of different data placement and distribution choices on the available network and storage resources, and their impact on the computing operations. This work presents some insights from studies on the popularity data from the CMS experiment. We present the properties of a range of physics analysis activities as seen by the data popularity, and make recommendations for how to tune the initial distribution of data in anticipation of how it will be used in Run-2 and beyond.
DOI: 10.1051/epjconf/201921408002
2019
Cited 5 times
INFN Tier–1: a distributed site
The INFN Tier-1 center at CNAF was extended in 2016 and 2017 in order to include a small amount of resources (∼ 22 kHS06, corresponding to ∼ 10% of the CNAF pledges for LHC in 2017) physically located at the Bari-ReCaS site (∼ 600 km from CNAF). In 2018, a significant fraction of the CPU power (∼ 170 kHS06, equivalent to ∼ 50% of the total CNAF pledges) is going to be provided via a collaboration with the PRACE Tier-0 CINECA center (a few km from CNAF), thus building a truly geographically distributed (WAN) center. The two sites are going to be interconnected via a high-bandwidth link (400-1200 Gb/s), in order to ensure transparent access to data residing on CNAF storage; the latency between the centers is small enough not to require particular caching strategies. In this contribution we describe the issues and the results of the production configuration, focusing both on the management aspects and on the performance provided to end-users.
DOI: 10.1109/nssmic.2004.1462661
2005
Cited 7 times
An object-oriented simulation program for CMS
The CMS detector simulation package, OSCAR, is based on the Geant4 simulation toolkit and the CMS object-oriented framework for simulation and reconstruction. Geant4 provides a rich set of physics processes describing in detail electromagnetic and hadronic interactions. It also provides the tools for the implementation of the full CMS detector geometry and the interfaces required for recovering information from the particle tracking in the detectors. This functionality is interfaced to the CMS framework, which, via its "action on demand" mechanisms, allows the user to selectively load desired modules and to configure and tune the final application. The complete CMS detector is rather complex with more than 1 million geometrical volumes. OSCAR has been validated by comparing its results with test beam data and with results from simulation with a GEANT3-based program. It has been successfully deployed in the 2004 data challenge for CMS, where more than 35 million events for various LHC physics channels were simulated and analysed.
2018
Cited 4 times
arXiv: HEP Community White Paper on Software trigger and event reconstruction
Realizing the physics programs of the planned and upgraded high-energy physics (HEP) experiments over the next 10 years will require the HEP community to address a number of challenges in the area of software and computing. For this reason, the HEP software community has engaged in a planning process over the past two years, with the objective of identifying and prioritizing the research and development required to enable the next generation of HEP detectors to fulfill their full physics potential. The aim is to produce a Community White Paper which will describe the community strategy and a roadmap for software and computing research and development in HEP for the 2020s. The topics of event reconstruction and software triggers were considered by a joint working group and are summarized together in this document.
DOI: 10.48550/arxiv.2008.13636
2020
Cited 4 times
HL-LHC Computing Review: Common Tools and Community Software
Common and community software packages, such as ROOT, Geant4 and event generators have been a key part of the LHC's success so far and continued development and optimisation will be critical in the future. The challenges are driven by an ambitious physics programme, notably the LHC accelerator upgrade to high-luminosity, HL-LHC, and the corresponding detector upgrades of ATLAS and CMS. In this document we address the issues for software that is used in multiple experiments (usually even more widely than ATLAS and CMS) and maintained by teams of developers who are either not linked to a particular experiment or who contribute to common software within the context of their experiment activity. We also give space to general considerations for future software and projects that tackle upcoming challenges, no matter who writes it, which is an area where community convergence on best practice is extremely useful.
DOI: 10.1051/epjconf/202024504024
2020
Cited 4 times
Smart Caching at CMS: applying AI to XCache edge services
The projected storage and compute needs for the HL-LHC will be a factor of up to 10 above what can be achieved by the evolution of current technology within a flat budget. The WLCG community is studying possible technical solutions to evolve the current computing in order to cope with the requirements; one of the main focuses is resource optimization, with the ultimate aim of improving performance and efficiency, as well as simplifying and reducing operation costs. As of today, storage consolidation based on a Data Lake model is considered a good candidate for addressing HL-LHC data access challenges. The Data Lake model under evaluation can be seen as a logical system that hosts a distributed working set of analysis data. Compute power can be “close” to the lake, but also remote and thus completely external. In this context we expect data caching to play a central role as a technical solution to reduce the impact of latency and to reduce network load. A geographically distributed caching layer will be functional to many satellite computing centers that might appear and disappear dynamically. In this talk we propose a system of caches, distributed at the national level, describing both the deployment and the results of studies made to measure the impact on CPU efficiency. In this contribution, we also present early results on a novel caching strategy beyond the standard XRootD approach, whose results will be a baseline for an AI-based smart caching system.
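The benefit of such a caching layer can be estimated by replaying an access stream through a cache model and measuring the hit rate. A minimal LRU simulation (the access pattern below is invented; real studies replay actual access logs):

```python
from collections import OrderedDict

def simulate_cache(requests, capacity):
    """Replay a stream of file requests through an LRU cache of the
    given capacity and return the fraction served locally (hit rate)."""
    cache, hits = OrderedDict(), 0
    for f in requests:
        if f in cache:
            hits += 1
            cache.move_to_end(f)           # mark as recently used
        else:
            cache[f] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)

# A skewed access pattern: popular files benefit most from caching.
stream = ["a", "b", "a", "c", "a", "b", "d", "a"]
print(simulate_cache(stream, capacity=3))  # 0.5 for this stream
```

Replacing the LRU policy with a learned one is the kind of "smart caching" step the AI-based strategy aims at.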
DOI: 10.1051/epjconf/202024509009
2020
Cited 4 times
Extension of the INFN Tier-1 on a HPC system
The INFN Tier-1 located at CNAF in Bologna (Italy) is a center of the WLCG e-Infrastructure, supporting the four major LHC collaborations and more than 30 other INFN-related experiments. After multiple tests towards elastic expansion of CNAF compute power via Cloud resources (provided by Azure, Aruba and in the framework of the HNSciCloud project), and building on the experience gained with the production-quality extension of the Tier-1 farm on remote owned sites, the CNAF team, in collaboration with experts from the ALICE, ATLAS, CMS, and LHCb experiments, has been working to put in production a solution for an integrated HTC+HPC system with the PRACE CINECA center, located near Bologna. This extension will be implemented on the Marconi A2 partition, equipped with Intel Knights Landing (KNL) processors. A number of technical challenges were faced and solved in order to successfully run on low-RAM nodes, as well as to overcome the closed environment (network, access, software distribution, ...) that HPC systems deploy with respect to standard GRID sites. We show preliminary results from a large-scale integration effort, using resources secured via the successful PRACE grant N. 2018194658, for 30 million KNL core hours.
DOI: 10.48550/arxiv.2201.09260
2022
Materials and devices for fundamental quantum science and quantum technologies
Technologies operating on the basis of quantum mechanical laws and resources such as phase coherence and entanglement are expected to revolutionize our future. Quantum technologies are often divided into four main pillars: computing, simulation, communication, and sensing & metrology. Moreover, a great deal of interest is currently also nucleating around energy-related quantum technologies. In this Perspective, we focus on advanced superconducting materials, van der Waals materials, and moiré quantum matter, summarizing recent exciting developments and highlighting a wealth of potential applications, ranging from high-energy experimental and theoretical physics to quantum materials science and energy storage.
DOI: 10.1016/s0168-9002(03)00544-8
2003
Cited 7 times
Simulation framework and XML detector description for the CMS experiment
Currently, CMS event simulation is based on GEANT3, while the detector description is built from different sources for simulation and reconstruction. A new simulation framework based on GEANT4 is under development. A full description of the detector is available; the tuning of GEANT4 performance and the verification that its physics processes describe the detector response are ongoing. Its integration into the CMS mass production system and GRID is also currently under development. The Detector Description Database project aims at providing a common source of information for Simulation, Reconstruction, Analysis, and Visualisation, while allowing for different representations as well as application-specific information. A functional prototype, based on XML, has already been released. Examples of the integration of DDD into the GEANT4 simulation and into the reconstruction applications are also provided.
DOI: 10.1088/1742-6596/898/7/072027
2017
Cited 3 times
XRootD popularity on hadoop clusters
Performance data and metadata of computing operations at the CMS experiment are collected through a distributed monitoring infrastructure, currently relying on a traditional Oracle database system. This paper shows how to harness Big Data architectures in order to improve the throughput and efficiency of such monitoring. A large set of operational data - user activities, job submissions, resources, file transfers, site efficiencies, software releases, network traffic, machine logs - is injected into a readily available Hadoop cluster via several data streamers. The collected metadata is then organized for fast arbitrary queries; this offers the ability to test several MapReduce-based frameworks and to measure the system speed-up with respect to the original database infrastructure. By leveraging a quality Hadoop data store and enabling an analytics framework on top, it is possible to design a mining platform to predict dataset popularity and to discover patterns and correlations.
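The core of such a popularity-mining job can be sketched as a toy MapReduce-style aggregation in plain Python; the record fields and dataset names below are invented for illustration and do not reflect the actual schema of the CMS monitoring streams.

```python
from collections import Counter

# Hypothetical access-log records; field names are illustrative only.
records = [
    {"dataset": "/ZMM/Run2012/AOD", "user": "a", "read_bytes": 10},
    {"dataset": "/ZMM/Run2012/AOD", "user": "b", "read_bytes": 20},
    {"dataset": "/TTbar/Run2012/AOD", "user": "a", "read_bytes": 5},
]

# Map phase: emit (dataset, 1) for every recorded access.
mapped = ((r["dataset"], 1) for r in records)

# Reduce phase: sum the counts per dataset, as a MapReduce job would.
popularity = Counter()
for dataset, count in mapped:
    popularity[dataset] += count

print(popularity.most_common(1))  # most-requested dataset first
```

A real deployment would shard the map phase across the Hadoop cluster, but the aggregation logic is the same.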
DOI: 10.22323/1.327.0024
2018
Cited 3 times
DODAS: How to effectively exploit heterogeneous clouds for scientific computations
Dynamic On Demand Analysis Service (DODAS) is a Platform as a Service tool built by combining several solutions and products developed by the INDIGO-DataCloud H2020 project. DODAS allows on-demand, container-based clusters to be instantiated. Both an HTCondor batch system and a platform for Big Data analysis based on Spark, Hadoop, etc., can be deployed on any cloud-based infrastructure with almost zero effort. DODAS acts as a cloud enabler designed for scientists seeking to easily exploit distributed and heterogeneous clouds to process data. Aiming to reduce the learning curve as well as the operational cost of managing community-specific services running on distributed clouds, DODAS completely automates the process of provisioning, creating, managing and accessing a pool of heterogeneous computing and storage resources. DODAS was selected as one of the Thematic Services providing multidisciplinary solutions in the EOSC-hub project, an integration and management system of the European Open Science Cloud started in January 2018. The main goals of this contribution are to provide a comprehensive overview of the overall technical implementation of DODAS and to illustrate two distinct real examples of usage: the integration within the CMS Workload Management System and the extension of the AMS computing model.
DOI: 10.1088/1742-6596/1525/1/012057
2020
Cited 3 times
Using DODAS as deployment manager for smart caching of CMS data management system
Abstract DODAS stands for Dynamic On Demand Analysis Service and is a Platform as a Service toolkit built around several EOSC-hub services, designed to instantiate and configure on-demand container-based clusters over public or private Cloud resources. It automates the whole workflow, from service provisioning to the configuration and setup of software applications; such a solution therefore allows “any cloud provider” to be used with almost zero effort. In this paper, we demonstrate how DODAS can be adopted as a deployment manager to set up and manage the compute resources and services required to develop an AI solution for smart data caching. A smart caching layer may reduce the operational cost and increase flexibility with respect to the regular, centrally managed storage of the current CMS computing model. The cache space should be dynamically populated with the most requested data; in addition, clustering such caching systems will allow them to be operated as a Content Delivery System between data providers and end users. Moreover, a geographically distributed caching layer will also be functional in a data-lake based model, where many satellite computing centers might appear and disappear dynamically. In this context, our strategy is to develop a flexible and automated AI environment for smart management of the content of such a clustered cache system. In this contribution, we describe the computational phases required for the AI environment implementation, as well as the related DODAS integration. We start with an overview of the architecture of the pre-processing step, based on Spark, whose role is to prepare data for a Machine Learning technique; a focus is given to the automation implemented through DODAS. We then show how to train an AI-based smart cache and how we implemented a training facility managed through DODAS. Finally, we provide an overview of the inference system, based on CMS TensorFlow as a Service and also deployed as a DODAS service.
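A minimal sketch of a popularity-driven cache, standing in for the AI policy described above: here the "model" is just a request counter, the simplest possible popularity estimator, and all class and key names are invented.

```python
from collections import Counter

class PopularityCache:
    """Toy cache that evicts the least-requested entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}
        self.hits = Counter()

    def get(self, key, fetch):
        self.hits[key] += 1
        if key in self.store:
            return self.store[key]
        if len(self.store) >= self.capacity:
            # Evict the cached key with the fewest recorded requests.
            coldest = min(self.store, key=lambda k: self.hits[k])
            del self.store[coldest]
        self.store[key] = fetch(key)
        return self.store[key]

cache = PopularityCache(capacity=2)
for name in ["A", "A", "B", "C"]:
    cache.get(name, fetch=lambda k: f"payload-{k}")

print(sorted(cache.store))  # "A" survives; the colder of B/C was evicted
```

The AI environment described in the paper would replace the raw hit counter with a trained popularity predictor, but the eviction plumbing is analogous.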
DOI: 10.1007/s41781-020-00052-w
2021
Cited 3 times
Dynamic Distribution of High-Rate Data Processing from CERN to Remote HPC Data Centers
Abstract The prompt reconstruction of the data recorded by the Large Hadron Collider (LHC) detectors has always been handled by dedicated resources at the CERN Tier-0. Such workloads come in spikes, due to the nature of the operation of the accelerator, and on occasions of exceptionally high load the experiments have commissioned methods to distribute (spill over) a fraction of the load to sites outside CERN. The present work demonstrates a new way of supporting the Tier-0 environment by elastically provisioning resources for such spilled-over workflows on the Piz Daint supercomputer at CSCS. This is implemented using containers, tuning the existing batch scheduler and reinforcing the scratch file system, while still using standard Grid middleware. ATLAS, CMS and CSCS have jointly run selected prompt data reconstruction on up to several thousand cores on Piz Daint in a shared environment, thereby probing the viability of the CSCS high performance computing site as an on-demand extension of the CERN Tier-0, which could play a role in addressing the future LHC computing challenges of the high-luminosity LHC.
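The spill-over idea can be illustrated with a deliberately naive routing sketch: fill the Tier-0 first, then send the excess to remote sites. All numbers and site names below are illustrative placeholders, not operational figures.

```python
def route(jobs_queued, tier0_capacity, spillover_sites):
    """Assign prompt-reconstruction jobs: Tier-0 first, then spill over."""
    at_cern = min(jobs_queued, tier0_capacity)
    excess = jobs_queued - at_cern
    plan = {"CERN_Tier0": at_cern}
    for site, capacity in spillover_sites.items():
        take = min(excess, capacity)
        plan[site] = take
        excess -= take
    plan["unscheduled"] = excess  # jobs that must wait for capacity
    return plan

# A load spike of 12k jobs against a 10k-core Tier-0 spills 2k jobs out.
plan = route(jobs_queued=12000, tier0_capacity=10000,
             spillover_sites={"CSCS_PizDaint": 3000})
print(plan)
```

The real system of course schedules through batch queues and Grid middleware rather than a single function, but the elastic-overflow decision has this shape.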
2015
Any Data, Any Time, Anywhere: Global Data Access for Science
Data access is key to science driven by distributed high-throughput computing (DHTC), an essential technology for many major research projects such as High Energy Physics (HEP) experiments. However, achieving efficient access becomes quite difficult when many independent storage sites are involved, because users are burdened with learning the intricacies of accessing each system and with keeping careful track of data location. We present an alternate approach: the Any Data, Any Time, Anywhere (AAA) infrastructure. Combining several existing software products, AAA presents a global, unified view of storage systems - a data federation, a global filesystem for software delivery, and a workflow management system. We present how one HEP experiment, the Compact Muon Solenoid (CMS), is utilizing the AAA infrastructure, along with some simple performance metrics.
DOI: 10.1088/1742-6596/664/2/022029
2015
Docker experience at INFN-Pisa Grid Data Center
Clouds and virtualization offer typical answers to the need of large-scale computing centers to satisfy diverse sets of user communities in terms of architecture, OS, etc. On the other hand, solutions like Docker seem to emerge as a way to rely on Linux kernel capabilities to package only the applications and the development environment needed by the users, thus solving several resource management issues of cloud-like solutions. In this paper, we present an exploratory (though well advanced) test done at a major Italian Tier-2, INFN-Pisa, where a considerable fraction of the resources and services has been moved to Docker. The results obtained are definitely encouraging, and Pisa is transitioning all of its Worker Nodes and services to Docker containers. Work is currently being expanded into the preparation of suitable images for a completely virtualized Tier-2, with no dependency on local configurations.
DOI: 10.1016/j.nima.2006.09.081
2007
Cited 3 times
First level trigger using pixel detector for the CMS experiment
A proposal for a pixel-based Level-1 trigger for the Super-LHC is presented. The trigger is based on fast track reconstruction using the full pixel granularity, exploiting a readout which connects different layers in specific trigger towers. The trigger will implement the current CMS High Level Trigger functionality in a novel concept of intelligent detector. A possible layout is discussed and the implications for the data links are evaluated.
DOI: 10.1088/1742-6596/2438/1/012031
2023
Enabling CMS Experiment to the utilization of multiple hardware architectures: a Power9 Testbed at CINECA
Abstract The CMS software stack (CMSSW) is built on a nightly basis for multiple hardware architectures and compilers, in order to benefit from diverse platforms. In practice, however, only x86_64 binaries are used in production and supported by the workload management tools in charge of delivering production and analysis jobs to the distributed computing infrastructure. Profiting from an INFN grant at CINECA, a PRACE Tier-0 Center, tests have been carried out using IBM Power9 nodes from the Marconi100 HPC system. A first study of the modifications needed to the standard CMS WMS systems is shown, and very positive proof-of-concept tests have been conducted on up to thousands of computing cores, also including an initial utilization of the GPUs hosted by the nodes. The current status of the tests, including plans to support multi-architecture workflows, is shown and discussed.
DOI: 10.1201/9781003023920-19
2023
Machine learning for Monte Carlo simulations
DOI: 10.22323/1.270.0031
2017
Elastic CNAF DataCenter extension via opportunistic resources
The computing facility CNAF, in Bologna (Italy), is the biggest WLCG Computing Center in Italy; it serves all WLCG experiments plus more than 20 non-WLCG Virtual Organizations, and currently deploys more than 200 kHS06 of computing power and more than 20 PB of disk and 40 PB of tape via a GPFS SAN. The Center has started a program to evaluate the possibility of extending its resources onto external entities, either commercial, opportunistic or simply remote, in order to be prepared for future upgrades or temporary bursts in experiment activity. The approach followed is meant to be completely transparent to users, with the additional external resources directly added to the CNAF LSF batch system; several variants are possible, such as the use of VPN tunnels to establish LSF communications between hosts, a multi-master LSF approach, or, in the longer term, the use of HTCondor. Concerning storage, the simplest approach is to use Xrootd fallback to CNAF storage, unfortunately viable only for some experiments; a more transparent approach involves the use of the GPFS/AFM module in order to cache files directly on the remote facilities. In this paper we focus on the technical aspects of the integration, and assess the difficulties of using the different remote virtualisation technologies made available at the different sites. A set of benchmarks is provided to allow an evaluation of the solution for CPU- and data-intensive workflows. The evaluation of Aruba as a resource provider for CNAF is under test with limited available resources; a ramp-up to a larger scale is being discussed. On a parallel path, this paper shows a similar attempt at extension using proprietary resources, at ReCaS-Bari; the chosen solution is simpler in its setup, but shares many commonalities.
DOI: 10.3233/978-1-61499-843-3-770
2017
The INFN COSA Project: Low-Power Computing and Storage
DOI: 10.1051/epjconf/201921403055
2019
CMS Computing Resources: Meeting the demands of the high-luminosity LHC physics program
The high-luminosity program has seen numerous extrapolations of its needed computing resources, each indicating the need for substantial changes if the desired HL-LHC physics program is to be supported within the current level of computing resource budgets. Drivers include large increases in event complexity (leading to increased processing time and analysis data size) and the needed trigger rates (5-10 fold increases) for the HL-LHC program. The CMS experiment has recently undertaken an effort to merge the ideas behind short-term and long-term resource models, in order to make extrapolations to future needs easier and more reliable. Near-term computing resource estimation depends on numerous parameters: LHC uptime and beam intensities; detector and online trigger performance; software performance; analysis data requirements; data access, management, and retention policies; site characteristics; and network performance. Longer-term modeling is affected by the same characteristics, but with much larger uncertainties that must be considered to understand the most interesting handles for increasing the "physics per computing dollar" of the HL-LHC. In this presentation, we discuss the current status of long-term modeling of the CMS computing resource needs for the HL-LHC, with emphasis on techniques for extrapolation, uncertainty quantification, and model results. We illustrate potential ways in which high-luminosity CMS could accomplish its desired physics program within today's computing budgets.
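As a hedged illustration of how such an extrapolation works, the toy model below scales a CPU estimate by event count and per-event processing time; every number and conversion factor here is a placeholder, not a CMS projection.

```python
def cpu_need(events, sec_per_event, hs06_per_core=10.0):
    """Very rough CPU requirement in HS06-years for one year of processing.

    All parameters are illustrative placeholders.
    """
    seconds_per_year = 365 * 24 * 3600
    core_years = events * sec_per_event / seconds_per_year
    return core_years * hs06_per_core

# Toy scenario: trigger rate up 5x, per-event time up 4x with pileup.
run2 = cpu_need(events=10e9, sec_per_event=30)
hllhc = cpu_need(events=5 * 10e9, sec_per_event=4 * 30)
print(f"naive scaling factor: {hllhc / run2:.0f}x")
```

The point of the exercise is the structure, not the numbers: each driver (rate, event complexity) enters multiplicatively, which is why uncertainty quantification on the individual parameters matters so much.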
DOI: 10.1051/epjconf/201921403019
2019
System Performance and Cost Modelling in LHC computing
The increase in the scale of LHC computing expected for Run 3, and even more so for Run 4 (HL-LHC), over the next ten years will certainly require radical changes to the computing models and the data processing of the LHC experiments. Translating the requirements of the physics programmes into computing resource needs is a complicated process, subject to significant uncertainties. For this reason, WLCG has established a working group to develop methodologies and tools intended to characterise the LHC workloads, better understand their interaction with the computing infrastructure, calculate their cost in terms of resources and expenditure, and assist experiments, sites and the WLCG project in the evaluation of their future choices. This working group started in November 2017 and has about 30 active participants representing experiments and sites. In this contribution we describe the activities, the results achieved and the future directions.
DOI: 10.22323/1.351.0014
2019
Integration of the Italian cache federation within the CMS computing model
The next decades at the HL-LHC will be characterized by a huge increase in both storage and computing requirements (between one and two orders of magnitude). Moreover, we foresee a shift in resource provisioning towards the exploitation of dynamic solutions (on private or public clouds and HPC facilities). In this scenario, the computing model of the CMS experiment is being pushed towards an evolution that optimizes the amount of space managed centrally and the CPU efficiency of jobs running on "storage-less" resources. In particular, the computing resources of the "Tier-2" site layer can, for the most part, be instrumented to read data from a geographically distributed cache storage based on unmanaged resources, reducing operational effort by a large fraction and generating additional flexibility. The objective of this contribution is to present the first implementation of an INFN federation of cache servers, developed also in collaboration with the eXtreme Data Cloud EU project. The CNAF Tier-1 plus the Bari and Legnaro Tier-2s provide unmanaged storage organized under a common namespace. This distributed cache federation has been seamlessly integrated in the CMS computing infrastructure; the technical implementation is based on XRootD, largely adopted in the CMS computing model under the "Any Data, Any Time, Anywhere" (AAA) project. Results in terms of CMS workflow performance will be shown. In addition, a complete simulation of the effects of the described model under several scenarios, including dynamic hybrid cloud resource provisioning, will be discussed. Finally, a plan for the upgrade of this prototype towards a stable INFN setup, seamlessly integrated with the production CMS computing infrastructure, will be discussed.
DOI: 10.1109/dsd53832.2021.00051
2021
TEXTAROSSA: Towards EXtreme scale Technologies and Accelerators for euROhpc hw/Sw Supercomputing Applications for exascale
To achieve high performance and high energy efficiency on near-future exascale computing systems, three key technology gaps need to be bridged: energy efficiency and thermal control; extreme computation efficiency via HW acceleration and new arithmetics; and methods and tools for the seamless integration of reconfigurable accelerators in heterogeneous HPC multi-node platforms. TEXTAROSSA aims at tackling these gaps through a co-design approach to heterogeneous HPC solutions, supported by the integration and extension of HW and SW IPs, programming models and tools derived from European research.
DOI: 10.22323/1.378.0002
2021
A possible solution for HEP processing on network secluded Computing Nodes
The computing needs of the LHC experiments in the next decades (the so-called High Luminosity LHC, HL-LHC) are expected to increase substantially, due to the concurrent increases in accelerator luminosity, selection rates and detector complexity. Many funding agencies are aiming at a consolidation of the national LHC computing infrastructures, via a merge with other large-scale computing facilities such as HPC and Cloud centers. The LHC experiments began tests and production activities on such centers long ago, with intermittent success. The biggest obstacle comes from their network policies, typically stricter than those of our standard centers, which do not allow an easy merge with the distributed LHC computing infrastructure. A possible solution for such centers is presented here, able to satisfy three main goals: be user-deployable, be a catch-all solution for all protocols and services, and be transparent to the experiment software stack. It is based on the integration of existing tools like tsocks, tunsocks, openconnect, cvmfsexec and singularity. We present results from an early experimentation, which positively show that the solution is indeed usable. Large-scale testing on thousands of nodes is the next step in our agenda.
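To make the transparent-tunnelling idea concrete, here is a network-free sketch of what an LD_PRELOAD shim like tsocks does: intercept connect() and reroute the destination through a gateway. The gateway name is a placeholder, and the SOCKS handshake is only indicated in a comment rather than performed.

```python
import socket

# Hypothetical gateway that would speak SOCKS on behalf of a secluded
# worker node; name and port are placeholders, not a real service.
GATEWAY = ("gateway.example.org", 1080)

intercepted = []

class TunneledSocket(socket.socket):
    """Sketch of the tsocks/tunsocks idea: transparently capture the
    destination of every connect() so it can be reached via a gateway.

    We only record the redirection instead of doing the real SOCKS
    handshake, so the sketch runs without any network access.
    """

    def connect(self, address):
        intercepted.append({"original": address, "via": GATEWAY})
        # A real shim would now connect to GATEWAY and negotiate SOCKS5,
        # asking the proxy to open `address` on our behalf.

s = TunneledSocket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("cvmfs-stratum-one.cern.ch", 80))
s.close()
print(intercepted[0]["original"])
```

The actual tools operate at the libc level so that unmodified experiment software is tunnelled without knowing it; the interception point is the same.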
DOI: 10.1088/1742-6596/513/4/042023
2014
Xrootd data access for LHC experiments at the INFN-CNAF Tier-1
The Mass Storage System installed at the INFN-CNAF Tier-1 is one of the biggest hierarchical storage facilities in Europe. It currently provides storage resources for about 12% of all LHC data, as well as for other experiments. The Grid Enabled Mass Storage System (GEMSS) is the current solution implemented at CNAF; it is based on a custom integration between a high-performance parallel file system (General Parallel File System, GPFS) and a tape management system for long-term storage on magnetic media (Tivoli Storage Manager, TSM). Data access for Grid users has been granted for several years by the Storage Resource Manager (StoRM), an implementation of the standard SRM interface widely adopted within the WLCG community. The evolving requirements of the LHC experiments and other users are leading to the adoption of more flexible methods for accessing the storage. These include the implementation of so-called storage federations, i.e. geographically distributed federations allowing direct file access between the federated storage at the sites. A specific integration between GEMSS and Xrootd has been developed at CNAF to match the requirements of the CMS experiment. This had already been implemented for the ALICE use case, using ad-hoc Xrootd modifications. The new developments for CMS have been validated and are already available in the official Xrootd builds. This integration is currently in production and appropriate large-scale tests have been made. In this paper we present the Xrootd solutions adopted for ALICE, CMS, ATLAS and LHCb to increase availability and optimize the overall performance.
DOI: 10.1016/j.nima.2005.11.244
2006
Edgeless silicon pad detectors
We report measurements in a high-energy pion beam of the sensitivity of the edge region in “edgeless” planar silicon pad diode detectors diced through their contact implants. A large surface current on such an edge prevents the normal reverse biasing of the device, but the current can be sufficiently reduced by the use of a suitable cutting method, followed by edge treatment, and by operating the detector at low temperature. The depth of the dead layer at the diced edge is measured to be (12.5 ± 8 (stat.) ± 6 (syst.)) μm.
DOI: 10.48550/arxiv.physics/0306014
2003
Vertex reconstruction framework and its implementation for CMS
The class framework developed for vertex reconstruction in CMS is described. We emphasize how we proceeded to develop a flexible, efficient and reliable piece of reconstruction software. We describe the decomposition of the algorithms into logical parts, the mathematical toolkit, and the way vertex reconstruction integrates into the CMS reconstruction project, ORCA. We discuss the tools that we have developed for algorithm evaluation and optimization and for code release.
2018
arXiv : HEP Community White Paper on Software trigger and event reconstruction: Executive Summary
Realizing the physics programs of the planned and upgraded high-energy physics (HEP) experiments over the next 10 years will require the HEP community to address a number of challenges in the area of software and computing. For this reason, the HEP software community has engaged in a planning process over the past two years, with the objective of identifying and prioritizing the research and development required to enable the next generation of HEP detectors to fulfill their full physics potential. The aim is to produce a Community White Paper which will describe the community strategy and a roadmap for software and computing research and development in HEP for the 2020s. The topics of event reconstruction and software triggers were considered by a joint working group and are summarized together in this document.
DOI: 10.1109/escience.2018.00082
2018
Distributed and On-demand Cache for CMS Experiment at LHC
In the CMS [1] computing model, the experiment owns dedicated resources around the world that, for the most part, are located in computing centers with a well defined Tier hierarchy. The geo-distributed storage is controlled centrally by CMS Computing Operations. In this architecture, data are distributed and replicated across the centers following a pre-placement model that is mostly human-controlled; analysis jobs are then mostly executed on computing resources close to the data location. This avoids wasting CPU on I/O latency, although it does not make optimal use of the available job slots.
2007
The 2003 tracker inner barrel beam test
DOI: 10.1051/epjconf/201921403001
2019
The INFN scientific computing infrastructure: present status and future evolution
The INFN scientific computing infrastructure is composed of more than 30 sites, ranging from CNAF (the Tier-1 for LHC and the main data center for nearly 30 other experiments) and nine LHC Tier-2s, to ∼ 20 smaller sites, including LHC Tier-3s and non-LHC experiment farms. A comprehensive review of the installed resources, together with plans for the near future, was collected during the second half of 2017, and provides a general view of the infrastructure, its costs and its potential for expansion; it also shows the general trends in software and hardware solutions utilized in a reality as complex as INFN. As of the end of 2017, the total installed CPU power exceeded 800 kHS06 (∼ 80,000 cores), while the total net storage capacity was over 57 PB on disk and 97 PB on tape; the vast majority of resources (95% of cores and 95% of storage) are concentrated in the 16 largest centers. Future evolutions are explored, pointing towards consolidation into big centers; this has required a rethinking of access policies and protocols in order to enable diverse scientific communities, beyond LHC, to fruitfully exploit the INFN resources. On top of that, such an infrastructure will be used beyond INFN experiments, and will be part of the Italian infrastructure comprising other research institutes, universities and HPC centers.
DOI: 10.22323/1.351.0020
2019
The BondMachine toolkit: Enabling Machine Learning on FPGA
The BondMachine (BM) is an innovative prototype software ecosystem aimed at creating facilities where both hardware and software are co-designed, guaranteeing full exploitation of fabric capabilities (both in terms of concurrency and heterogeneity) with the smallest possible power dissipation. In the present paper we provide a technical overview of the key aspects of the BondMachine toolkit, highlighting the advancements brought about by the porting of Go code to hardware. We then show a cloud-based BM-as-a-Service deployment. Finally, we focus on TensorFlow, and in this context we show how we plan to benchmark the system with ML track reconstruction from pp collisions at the LHC.
DOI: 10.1088/1742-6596/1525/1/012037
2020
CMS Software and Offline preparation for future runs
Abstract The next LHC Runs, nominally Run III and Run IV, pose problems for the offline and computing systems of CMS. Run IV in particular will need completely different solutions, given the current projections of LHC conditions and trigger rates. We report on the R&D process CMS has established in order to gain insight into the needs, and the possible solutions, for 2020+ CMS computing.
DOI: 10.1109/nssmic.2006.354216
2006
The CMS Simulation Software
In this paper we present the features and the expected performance of the re-designed CMS simulation software, as well as the experience from the migration process. Today, the CMS simulation suite is based on two principal components - the Geant4 detector simulation toolkit and the new CMS offline Framework and Event Data Model. The simulation chain includes event generation, detector simulation, and digitization steps. With Geant4, we employ the full set of electromagnetic and hadronic physics processes and detailed particle tracking in the 4 Tesla magnetic field. The Framework provides "action on demand" mechanisms, allowing users to dynamically load the desired modules and to configure and tune the final application at run time. The simulation suite is used to model the complete central CMS detector (over 1 million geometrical volumes) and the forward systems, such as the Castor calorimeter and Zero Degree Calorimeter, the Totem telescopes, Roman Pots, and the Luminosity Monitor. The design also previews the use of electromagnetic and hadronic shower parametrization, instead of full modelling of the passage of high-energy particles through a complex hierarchy of volumes and materials, allowing a significant gain in speed while tuning the simulation to test beam and collider data. The physics simulation has been extensively validated by comparison with test beam data and previous simulation results. The redesigned and upgraded simulation software was exercised in performance and robustness tests. It went into production in July 2006, running on the US and EU grids, and has since delivered about 60 million events.
DOI: 10.5281/zenodo.4769703
2021
Practical Guide to Sustainable Research Data
2004
Mantis: the Geant4-based simulation specialization of the CMS COBRA framework
DOI: 10.1016/j.nima.2005.05.065
2005
Edge sensitivity of “edgeless” silicon pad detectors measured in a high-energy beam
Abstract We report measurements in a high-energy beam of the sensitivity of the edge region in “edgeless” planar silicon pad diode detectors. The edgeless side of these rectangular diodes is formed by a cut and break through the contact implants. A large surface current on such an edge prevents the normal reverse biasing of this device above the full depletion voltage, but we have shown that the current can be sufficiently reduced by the use of a suitable cutting method, followed by edge treatment, and by operating the detector at a low temperature. A pair of these edgeless silicon diode pad sensors was exposed to the X5 high-energy pion beam at CERN, to determine the edge sensitivity. The signal of the detector pair triggered a reference telescope made of silicon microstrip detector modules. The gap width between the edgeless sensors, determined using the tracks measured by the reference telescope, was then compared with the results of precision metrology. It was concluded that the depth of the dead layer at the diced edge is compatible with zero within the statistical precision of ±8 μm and systematic error of ±6 μm.
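Assuming the statistical and systematic uncertainties quoted above are independent, they combine in quadrature into a single bound on the dead-layer depth:

```latex
d_{\text{dead}} = 0 \pm \sqrt{8^2 + 6^2}\;\mu\text{m} = 0 \pm 10\;\mu\text{m}
```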
DOI: 10.1016/s0168-9002(03)01781-9
2003
High level Trigger at CMS with the Tracker
The CMS Collaboration has chosen for the Level 2/3 Trigger a design which uses a farm of commodity PCs. A selection based on the CMS Tracker, not used at Level 1, can be performed on all of the data sent to Level 2/3. The performance is comparable to the performance achieved with offline reconstruction code.
DOI: 10.1088/1742-6596/513/3/032021
2014
Preserving access to ALEPH computing environment via virtual machines
The ALEPH Collaboration [1] took data at the LEP (CERN) electron-positron collider in the period 1989-2000, producing more than 300 scientific papers. While most Collaboration activities stopped in recent years, the data collected still has physics potential, with new theoretical models emerging that call for checks against data at the Z and WW production energies. An attempt to revive and preserve the ALEPH computing environment is presented; the aim is not only the preservation of the data files (usually called bit preservation), but of the full environment a physicist would need to perform brand-new analyses. Technically, a Virtual Machine approach has been chosen, using the VirtualBox platform. Concerning simulated events, the full chain from event generators to physics plots is possible, and reprocessing of data events is also functioning. Interactive tools like the DALI event display can be used on both data and simulated events. The Virtual Machine approach is suited both for interactive usage and for massive computing using Cloud-like approaches.
DOI: 10.1088/1742-6596/513/3/032079
2014
CMS users data management service integration and first experiences with its NoSQL data storage
The distributed data analysis workflow in CMS assumes that jobs run in a different location from where their results are finally stored. Typically, the user outputs must be transferred from one site to another by a dedicated CMS service, AsyncStageOut. This service was originally developed to address the inefficient use of CMS computing resources when analysis job outputs are transferred synchronously from the job execution node to the remote site as soon as they are produced.
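The asynchronous stage-out pattern can be sketched as a simple producer/consumer queue: jobs enqueue their outputs and finish immediately, while a separate consumer drains the queue. The file names and the simulated transfer below are placeholders for the real copies performed by AsyncStageOut.

```python
import asyncio

async def transfer(output, done):
    # Placeholder for the real grid transfer; simulate a short async copy.
    await asyncio.sleep(0.01)
    done.append(output)

async def stage_out(queue, done):
    # Consumer: drain the queue of produced outputs, transferring each
    # one without blocking the jobs that keep producing.
    while True:
        output = await queue.get()
        if output is None:  # sentinel: no more outputs
            return
        await transfer(output, done)

async def main():
    queue = asyncio.Queue()
    done = []
    consumer = asyncio.create_task(stage_out(queue, done))
    # Producer side: jobs finish and enqueue their outputs immediately.
    for name in ["out_1.root", "out_2.root", "out_3.root"]:
        await queue.put(name)
    await queue.put(None)
    await consumer
    return done

done = asyncio.run(main())
print(done)
```

Decoupling production from transfer is exactly what frees the job slot early: the execution node does not sit idle waiting for a remote copy to complete.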
DOI: 10.1088/1742-6596/513/4/042013
2014
An Xrootd Italian Federation
The Italian community in CMS has built a geographically distributed network in which all the data stored in the Italian region are available to all users for their everyday work. This activity involves, at different levels, all the CMS centers: the Tier1 at CNAF, all four Tier2s (Bari, Rome, Legnaro and Pisa), and a few Tier3s (Trieste, Perugia, Torino, Catania, Napoli, ...). The federation uses the new network connections provided by GARR, our NREN (National Research and Education Network), which delivers a minimum of 10 Gbit/s to all the sites via the GARR-X[2] project. The federation is currently based on Xrootd[1] technology and on a Redirector aimed at seamlessly connecting all the sites, giving the logical view of a single entity. A special configuration has been put in place for the Tier1, CNAF, where ad-hoc Xrootd changes have been implemented in order to protect the tape system from excessive stress, by disallowing WAN access to tape-only files on a file-by-file basis. In order to improve the overall read performance, both in terms of bandwidth and latency, a hierarchy of Xrootd redirectors has been implemented. The solution provides a dedicated Redirector with which all the INFN sites are registered, regardless of their status (T1, T2, or T3 sites). An interesting use case covered by the federation is disk-less Tier3s: the caching solution makes it possible to operate a local storage with minimal human intervention, since transfers are automatically done on a single-file basis and the cache is kept operational by automatic removal of old files.
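A redirector hierarchy of this kind is expressed with standard Xrootd role/manager directives. A minimal configuration sketch follows; the hostname is hypothetical and only the basic directives are shown:

```
# -- On the national redirector (hypothetical hostname) --
all.role manager
all.export /store
xrd.port 1094

# -- On each subscribing site's data server --
# all.role server
# all.manager xrootd-redirector.example.infn.it:1213
# all.export /store
```

Each data server subscribes to the redirector, which then steers every client open to whichever site actually holds the requested file.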
DOI: 10.1088/1742-6596/513/3/032064
2014
Experience in CMS with the common analysis framework project
ATLAS, CERN-IT, and CMS embarked on a project to develop a common system for analysis workflow management, resource provisioning and job scheduling. This distributed computing infrastructure was based on elements of PanDA and prior CMS workflow tools. After an extensive feasibility study and development of a proof-of-concept prototype, the project now has a basic infrastructure that supports the analysis use cases of both experiments via common services. In this paper we will discuss the state of the current solution and give an overview of all the components of the system.
DOI: 10.1088/1742-6596/513/6/062006
2014
Optimization of Italian CMS Computing Centers via MIUR funded Research Projects
In 2012, 14 Italian institutions participating in LHC Experiments (10 in CMS) won a grant from the Italian Ministry of Research (MIUR) to optimize analysis activities and, in general, the Tier2/Tier3 infrastructure. A wide range of activities is being carried out, covering data distribution over WAN, dynamic provisioning for both scheduled and interactive processing, design and development of tools for distributed data analysis, and tests on porting the CMS software stack to new high-performance / low-power architectures.
DOI: 10.1088/1742-6596/664/3/032006
2015
Improvements of LHC data analysis techniques at Italian WLCG sites. Case-study of the transfer of this technology to other research areas
In 2012, 14 Italian institutions participating in LHC Experiments won a grant from the Italian Ministry of Research (MIUR), with the aim of optimising analysis activities and, in general, the Tier2/Tier3 infrastructure. We report on the ongoing research activities and on the considerable improvement in the ease of access to resources by physicists, including those with no specific computing interests. We focused on items such as distributed storage federations, access to batch-like facilities, provisioning of user interfaces on demand, and cloud systems. R&D on next-generation databases, distributed analysis interfaces, and new computing architectures was also carried out. The project, ending in the first months of 2016, will produce a white paper with recommendations on best practices for data-analysis support by computing centers.
DOI: 10.1088/1742-6596/664/4/042009
2015
Architectures and methodologies for future deployment of multi-site Zettabyte-Exascale data handling platforms
Several scientific fields, including Astrophysics, Astroparticle Physics, Cosmology, Nuclear and Particle Physics, and Research with Photons, are estimating that by the 2020 decade they will require data handling systems with data volumes approaching the Zettabyte, distributed amongst as many as 10^18 individually addressable data objects (Zettabyte-Exascale systems). It may be convenient or necessary to deploy such systems using multiple physical sites. This paper describes the findings of a working group composed of experts from several
DOI: 10.22323/1.219.0022
2015
Computing challenges of the LHC high luminosity runs - Impact on resource needs and on computing models
PoS(IFD2014)022
DOI: 10.48550/arxiv.1508.01443
2015
Any Data, Any Time, Anywhere: Global Data Access for Science
Data access is key to science driven by distributed high-throughput computing (DHTC), an essential technology for many major research projects such as High Energy Physics (HEP) experiments. However, achieving efficient data access becomes quite difficult when many independent storage sites are involved because users are burdened with learning the intricacies of accessing each system and keeping careful track of data location. We present an alternate approach: the Any Data, Any Time, Anywhere infrastructure. Combining several existing software products, AAA presents a global, unified view of storage systems - a "data federation," a global filesystem for software delivery, and a workflow management system. We present how one HEP experiment, the Compact Muon Solenoid (CMS), is utilizing the AAA infrastructure and some simple performance metrics.
DOI: 10.1088/1742-6596/396/4/042003
2012
Optimization of HEP Analysis Activities Using a Tier2 Infrastructure
While the model for a Tier2 is well understood and implemented within the HEP Community, a refined design for analysis-specific sites has not been agreed upon as clearly. We describe the solutions adopted at INFN Pisa, the biggest Tier2 in the Italian HEP Community. A standard Tier2 infrastructure is optimized for Grid CPU and storage access, while a more interactive use of the resources is beneficial to the final data analysis step. In this step, POSIX file storage access is easier for the average physicist, and has to be provided in a real or emulated way. Modern analysis techniques use advanced statistical tools (like RooFit and RooStat), which can make use of multi-core systems; the infrastructure has to provide, or create on demand, computing nodes with many cores available, above the existing and less elastic Tier2 flat CPU infrastructure. Finally, users do not want to deal with data placement policies at the various sites, so transparent WAN file access, again with a POSIX layer, must be provided, making use of the soon-to-be-installed 10 Gbit/s regional lines. Even though standalone systems with such features exist, the implementation of an analysis site as a virtual layer over an existing Tier2 requires novel solutions; the ones used in Pisa are described here.
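The multi-core fitting workload mentioned here is embarrassingly parallel: many independent pseudo-experiments fanned out over workers. A plain-Python stand-in (not RooFit) illustrating that pattern, with a trivial "fit" that just estimates the sample mean:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def toy_fit(seed):
    # One pseudo-experiment: generate toy data, "fit" it by taking the mean
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    return sum(data) / len(data)

def run_toys(n_toys, n_workers=4):
    # Fan the independent fits out over a pool of workers
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(toy_fit, range(n_toys)))
```

In a real analysis each `toy_fit` would be a full likelihood fit, which is why many-core nodes pay off for this step.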
DOI: 10.1088/1742-6596/396/4/042041
2012
Monitoring techniques and alarm procedures for CMS Services and Sites in WLCG
The CMS offline computing system is composed of roughly 80 sites (including most experienced T3s) and a number of central services to distribute, process and analyze data worldwide. A high level of stability and reliability is required from the underlying infrastructure and services, partially covered by local or automated monitoring and alarming systems such as Lemon and SLS: the former collects metrics from sensors installed on computing nodes and triggers alarms when values are out of range; the latter measures the quality of service and warns managers when service is affected. CMS has established computing shift procedures with personnel operating worldwide from remote Computing Centers, under the supervision of the Computing Run Coordinator at CERN. This dedicated 24/7 computing shift personnel helps to detect and react in a timely manner to any unexpected error, and hence ensures that CMS workflows are carried out efficiently and in a sustained manner. Synergy among all the actors involved is exploited to ensure the 24/7 monitoring, alarming and troubleshooting of the CMS computing sites and services. We review the deployment of the monitoring and alarming procedures, and report on the experience gained throughout the first two years of LHC operation. We describe the efficiency of the communication tools employed, the coherent monitoring framework, the proactive alarming systems, and the proficient troubleshooting procedures that helped the CMS computing facilities and infrastructure operate at high reliability levels.
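The Lemon-style "alarm when a metric is out of range" logic can be sketched in a few lines; the metric names and thresholds below are hypothetical, not CMS's actual configuration:

```python
# Hypothetical acceptable (min, max) ranges for two node-level metrics
THRESHOLDS = {
    "cpu_load": (0.0, 0.90),
    "disk_usage": (0.0, 0.85),
}

def check_metrics(samples):
    """Return the (metric, value) pairs whose value falls outside its range."""
    alarms = []
    for metric, value in samples.items():
        low, high = THRESHOLDS[metric]
        if not low <= value <= high:
            alarms.append((metric, value))
    return alarms
```

A sensor loop would feed fresh samples into `check_metrics` and forward any non-empty result to the alarming channel.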
DOI: 10.1088/1742-6596/396/3/032009
2012
Building a Prototype of LHC Analysis Oriented Computing Centers
A Consortium between four LHC Computing Centers (Bari, Milano, Pisa and Trieste) was formed in 2010 to prototype analysis-oriented facilities for CMS data analysis, profiting from a grant from the Italian Ministry of Research. The Consortium aims to realize an ad-hoc infrastructure to ease analysis activities on the huge data set collected at the LHC Collider. While Tier2 Computing Centres, specialized in organized processing tasks like Monte Carlo simulation, are nowadays a well-established concept with years of running experience, sites specialized in chaotic end-user analysis activities do not yet have a de facto standard implementation. In our effort, we focus on all the aspects that can make analysis tasks easier for physicists who are not computing experts. On the storage side, we are experimenting with storage techniques allowing for remote data access and with storage optimization for the typical analysis access patterns. On the networking side, we are studying the differences between flat and tiered LAN architectures, also using virtual partitioning of the same physical network for the different use patterns. Finally, on the user side, we are developing tools and instruments to allow for exhaustive monitoring of their processes at the site, and for an efficient support system in case of problems. We report on the results of the tests executed on the different subsystems and give a description of the layout of the infrastructure in place at the sites participating in the consortium.
2017
Exploiting Apache Spark platform for CMS computing analytics
CERN IT provides a set of Hadoop clusters featuring more than 5 PBytes of raw storage, with different open-source, user-level tools available for analytical purposes. The CMS experiment started collecting a large set of computing meta-data, e.g. dataset and file access logs, in 2015. These records represent a valuable, yet scarcely investigated, set of information that needs to be cleaned, categorized and analyzed. CMS can use this information to discover useful patterns and enhance the overall efficiency of the distributed data, improving CPU and site utilization as well as task completion time. Here we present an evaluation of the Apache Spark platform for CMS needs. We discuss two main use cases, CMS analytics and ML studies, where efficient processing of billions of records stored on HDFS plays an important role. We demonstrate that both the Scala and Python (PySpark) APIs can be successfully used to execute extremely I/O-intensive queries and provide valuable insight from the collected meta-data.
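Queries of this kind reduce to aggregations over access-log records. A minimal plain-Python sketch of one such aggregation, counting accesses per dataset, using made-up sample records rather than real CMS meta-data; on Spark the same logic would be a `groupBy`/`count` over HDFS-resident data:

```python
from collections import Counter

# Made-up (dataset, site) access records standing in for real CMS logs
records = [
    ("/SingleMu/Run2012A", "T2_IT_Pisa"),
    ("/SingleMu/Run2012A", "T1_IT_CNAF"),
    ("/DoubleEle/Run2012B", "T2_IT_Bari"),
]

def accesses_per_dataset(recs):
    # Count how many times each dataset was accessed, most popular first
    return Counter(ds for ds, _site in recs).most_common()
```

The popularity ranking this produces is exactly the kind of pattern used to improve data placement and site utilization.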
DOI: 10.1088/1742-6596/898/8/082018
2017
Extending the farm on external sites: the INFN Tier-1 experience
The Tier-1 at CNAF is the main INFN computing facility, offering computing and storage resources to more than 30 different scientific collaborations, including the 4 experiments at the LHC. A huge increase in computing needs is also foreseen in the coming years, mainly driven by the experiments at the LHC (especially starting with Run 3 in 2021) but also by other upcoming experiments such as CTA[1]. While we are considering the upgrade of the infrastructure of our data center, we are also evaluating the possibility of using CPU resources available in other data centres or even leased from commercial cloud providers. Hence, at INFN Tier-1, besides participating in the EU project HNSciCloud, we have also pledged a small amount of computing resources (∼ 2000 cores) located at the Bari ReCaS[2] data center for the WLCG experiments for 2016, and we are testing the use of resources provided by a commercial cloud provider. While the Bari ReCaS data center is directly connected to the GARR network[3], with the obvious advantage of a low-latency, high-bandwidth connection, in the case of the commercial provider we rely only on the General Purpose Network. In this paper we describe the set-up phase and the first results of these installations, started in the last quarter of 2015, focusing on the issues we have had to cope with and discussing the measured results in terms of efficiency.
DOI: 10.1088/1742-6596/898/5/052033
2017
Geographically distributed Batch System as a Service: the INDIGO-DataCloud approach exploiting HTCondor
One of the challenges a scientific computing center has to face is to keep delivering well-consolidated computational frameworks (i.e. the batch computing farm) while conforming to modern computing paradigms. The aim is to ease system administration at all levels (from hardware to applications) and to provide a smooth end-user experience. Within the INDIGO-DataCloud project, we adopt two different approaches to implement a PaaS-level, on-demand Batch Farm Service based on HTCondor and Mesos. In the first approach, described in this paper, the various HTCondor daemons are packaged inside pre-configured Docker images and deployed as Long Running Services through Marathon, profiting from its health checks and failover capabilities. In the second approach, we are going to implement an ad-hoc HTCondor framework for Mesos. Container-to-container communication and isolation have been addressed by exploring a solution based on overlay networks (based on the Calico Project). Finally, we have studied the possibility of deploying an HTCondor cluster that spans different sites, exploiting the Condor Connection Broker component, which allows communication across a private network boundary or firewall, as in the case of multi-site deployments. In this paper we describe and motivate our implementation choices and show the results of the first tests performed.
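A Marathon Long Running Service of this kind is declared as a JSON application spec. A minimal sketch for a containerized HTCondor daemon follows; the image name is hypothetical, and the health-check values are illustrative rather than taken from the deployment described here:

```json
{
  "id": "/htcondor/schedd",
  "instances": 1,
  "cpus": 1,
  "mem": 2048,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "example/htcondor-schedd:latest",
      "network": "BRIDGE",
      "portMappings": [ { "containerPort": 9618, "hostPort": 0 } ]
    }
  },
  "healthChecks": [
    { "protocol": "TCP", "portIndex": 0, "gracePeriodSeconds": 60, "intervalSeconds": 30 }
  ]
}
```

Marathon restarts the container whenever the TCP health check on the HTCondor port fails, which is what provides the failover behaviour mentioned above.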
2017
MiniSymposium on Energy Aware Scientific Computing on Low Power and Heterogeneous Architectures.
DOI: 10.1109/nssmic.2017.8533143
2017
A container-based solution to generate HTCondor Batch Systems on demand exploiting heterogeneous Clouds for data analysis
This paper describes the Dynamic On Demand Analysis Service (DODAS), an automated system that simplifies the process of provisioning, creating, managing and accessing a pool of heterogeneous computing and storage resources by generating clusters to run batch systems, thereby implementing the "Batch System as a Service" paradigm. DODAS is built on several INDIGO-DataCloud services, among which the PaaS Orchestrator, the Infrastructure Manager, and the Identity and Access Manager are the most important. The paper also describes a successful integration of DODAS with the computing infrastructure of the Compact Muon Solenoid (CMS) experiment installed at the LHC.
2017
COmputing on SoC Architectures: the INFN COSA project
DOI: 10.48550/arxiv.1711.00552
2017
Exploiting Apache Spark platform for CMS computing analytics
DOI: 10.1088/1742-6596/119/7/072015
2008
Real-time dataflow and workflow with the CMS tracker data
The Tracker detector took data with cosmic rays at the Tracker Integration Facility (TIF) at CERN. First on-line monitoring tasks were executed at the Tracker Analysis Centre (TAC), a dedicated Control Room at the TIF with limited computing resources. A set of software agents was developed to perform real-time data conversion into a standard format, to archive data on tape at CERN, and to publish them in the official CMS data bookkeeping systems. According to the CMS computing and analysis model, most of the subsequent data processing has to be done in remote Tier-1 and Tier-2 sites, so data were automatically transferred from CERN to the sites interested in analyzing them, currently Fermilab, Bari and Pisa. Official reconstruction in the distributed environment was triggered in real time using the tool currently used for the processing of simulated events. Automatic end-user analysis of the data was performed in a distributed environment, in order to derive the distributions of important physics variables. The tracker data processing is currently migrating to the CERN Tier-0 as a prototype for the global data-taking chain. Tracker data were also registered into the most recent version of the data bookkeeping system, DBS-2, profiting from its new features for handling real data. A description of the dataflow/workflow and of the tools developed is given, together with results on the performance of the real-time chain. Almost 7.2 million events were officially registered, moved, reconstructed and analyzed at remote sites using the distributed environment.
DOI: 10.1088/1742-6596/110/9/092003
2008
CMS offline and computing preparation for data taking
The LHC accelerator in Geneva is expected to start colliding proton-proton beams at an energy of 14 TeV by Spring 2008. The CMS Collaboration is finalizing the construction not only of the apparatus, but also of the offline and computing infrastructure, in order to efficiently analyze the first collected data. Offline operations during the pilot run will include reconstruction at the T0, re-reconstruction with updated calibrations and alignment, sample skimming and analysis tasks. On the computing side, this implies an efficient infrastructure able to deal with 300 Hz of DAQ output, and to move in an organized way the multi-TB samples from the T0 to the T2s, where analyses will take place using Grid-like facilities. The Offline and Computing infrastructure has already been tested at a 25% scale during the CSA06 challenge; a 50% challenge, CSA07, is going to take place during the summer.
DOI: 10.1016/j.nima.2022.167434
2022
High Energy Physics computing for the next decade
The next 10 years will be exciting for High Energy Physics, with new experiments entering data taking (High Luminosity LHC) or being designed and possibly approved (FCC, CEPC, ILC, MU_COLL). The computing infrastructure, including the software stacks for selection, simulation, reconstruction and analysis, will be crucial for the success of the physics programs. This contribution addresses the landscape and the state of the art in the field, highlighting the strong and weak points, and the aspects which still need sizeable R&D.
DOI: 10.1016/s0920-5632(99)00359-x
1999
Gluon splitting to bb and cc at the Z resonance
The available experimental measurements of g_bb and g_cc are reviewed, including some very recent results. The measurements are combined, with particular care taken over the cross-correlations between the two quantities. The combined values can be used to rescale the central value and the error of R_b.
DOI: 10.1016/j.nima.2006.09.089
2007
First performance studies of a pixel-based trigger in the CMS experiment
An important tool for the discovery of new physics at the LHC is the design of a low-level trigger with a high power of background rejection. The contribution of the pixel detector to the lowest-level trigger in CMS is studied, focusing on low-energy jet identification by matching the information from the calorimeters and the pixel detector. In addition, primary vertex algorithms are investigated. The performance is evaluated in terms of, respectively, QCD background rejection and efficiency for multihadronic jet final states.
DOI: 10.1145/3373376.3380612
2020
Current and Projected Needs for High Energy Physics Experiments (with a Particular Eye on CERN LHC)
The High Energy Physics (HEP) experiments at particle colliders need complex computing infrastructures in order to extract knowledge from the large datasets collected, with over 1 Exabyte of data stored by the experiments by now. The computing needs of the world's top machine, the Large Hadron Collider (LHC) at CERN/Geneva, seeded the large-scale GRID R&D and deployment efforts during the first decade of the 2000s, a posteriori proven to be adequate for LHC data processing. The upcoming upgrade of the LHC collider, called High Luminosity LHC (HL-LHC), is foreseen to require an increase in computing resources by a factor between 10x and 100x, currently expected to be beyond the scalability of the existing distributed infrastructure. Current lines of R&D are presented and discussed. With the start of big scientific endeavours of a computing complexity similar to HL-LHC (SKA, CTA, Dune, ...), these are expected to be valid for science fields outside HEP.
DOI: 10.1051/epjconf/202024503014
2020
New developments in cost modeling for the LHC computing
The increase in the scale of LHC computing during Run 3 and Run 4 (HL-LHC) will certainly require radical changes to the computing models and the data processing of the LHC experiments. The working group established by WLCG and the HEP Software Foundation to investigate all aspects of the cost of computing, and how to optimise them, has continued producing results and improving our understanding of this process. In particular, experiments have developed more sophisticated ways to calculate their resource needs, and we now have a much more detailed process to calculate infrastructure costs. This includes studies on the impact of HPC- and GPU-based resources on meeting the computing demands. We have also developed and perfected tools to quantitatively study the performance of experiment workloads, and we are actively collaborating with other activities related to data access, benchmarking and technology cost evolution. In this contribution we present our recent developments and results and outline the directions of future work.
DOI: 10.5281/zenodo.4062292
2020
Turning Open Science and Open Innovation into reality: ICDI Position paper on EOSC Partnership Strategic Research and Innovation Agenda
DOI: 10.48550/arxiv.1802.08640
2018
HEP Community White Paper on Software trigger and event reconstruction: Executive Summary
Realizing the physics programs of the planned and upgraded high-energy physics (HEP) experiments over the next 10 years will require the HEP community to address a number of challenges in the area of software and computing. For this reason, the HEP software community has engaged in a planning process over the past two years, with the objective of identifying and prioritizing the research and development required to enable the next generation of HEP detectors to fulfill their full physics potential. The aim is to produce a Community White Paper which will describe the community strategy and a roadmap for software and computing research and development in HEP for the 2020s. The topics of event reconstruction and software triggers were considered by a joint working group and are summarized together in this document.
DOI: 10.48550/arxiv.1802.08638
2018
HEP Community White Paper on Software trigger and event reconstruction
Realizing the physics programs of the planned and upgraded high-energy physics (HEP) experiments over the next 10 years will require the HEP community to address a number of challenges in the area of software and computing. For this reason, the HEP software community has engaged in a planning process over the past two years, with the objective of identifying and prioritizing the research and development required to enable the next generation of HEP detectors to fulfill their full physics potential. The aim is to produce a Community White Paper which will describe the community strategy and a roadmap for software and computing research and development in HEP for the 2020s. The topics of event reconstruction and software triggers were considered by a joint working group and are summarized together in this document.
DOI: 10.1109/nssmic.2005.1596421
2006
The CMS Object-Oriented Simulation
The CMS object oriented Geant4-based program is used to simulate the complete central CMS detector (over 1 million geometrical volumes) and the forward systems such as the Totem telescopes, Castor calorimeter, zero degree calorimeter, Roman pots, and the luminosity monitor. The simulation utilizes the full set of electromagnetic and hadronic physics processes provided by Geant4 and detailed particle tracking in the 4 tesla magnetic field. Electromagnetic shower parameterization can be used instead of full tracking of high-energy electrons and positrons, allowing significant gains in speed without detrimental precision losses. The simulation physics has been validated by comparisons with test beam data and previous simulation results. The system has been in production for almost two years and has delivered over 100 million events for various LHC physics channels. Productions are run on the US and EU grids at a rate of 3-5 million events per month. At the same time, the simulation has evolved to fulfill emerging requirements for new physics simulations, including very large heavy ion events and a variety of SUSY scenarios. The software has also undergone major technical upgrades. The framework and core services have been ported to the new CMS offline software architecture and event data model. In parallel, the program is subjected to ever more stringent quality assurance procedures, including a recently commissioned automated physics validation suite
DOI: 10.1063/1.2125668
2005
Data Analysis Techniques at LHC
A review of the recent developments on data analysis techniques for the upcoming LHC experiments is presented, with the description of early tests (“Data Challenges”), which are being performed before the start‐up, to validate the overall design.
DOI: 10.1142/9789812701961_0010
2005
HIGGS PHYSICS WITH CMS
DOI: 10.1142/9789812791351_0006
2003
RECENT RESULTS IN HEAVY FLAVOUR PHYSICS
2003
Vertex reconstruction framework and its implementation for CMS
The class framework developed for vertex reconstruction in CMS is described. We emphasize how we proceed to develop a flexible, efficient and reliable piece of reconstruction software. We describe the decomposition of the algorithms into logical parts, the mathematical toolkit, and the way vertex reconstruction integrates into the CMS reconstruction project ORCA. We discuss the tools that we have developed for algorithm evaluation and optimization and for code release.
DOI: 10.1051/epjconf/202125102045
2021
First experiences with a portable analysis infrastructure for LHC at INFN
The challenges posed by the HL-LHC era are not limited to the sheer amount of data to be processed: the capability of optimizing the analyser's experience will also bring important benefits for the LHC communities, in terms of total resource needs, user satisfaction, and reduction of time to publication. At the Italian National Institute for Nuclear Physics (INFN) a portable software stack for analysis has been proposed, based on cloud-native tools and capable of providing users with a fully integrated analysis environment for the CMS experiment. The main characterizing traits of the solution are its user-driven design and its portability to any cloud resource provider. All this is made possible by an evolution towards a "python-based" framework that enables the usage of a set of open-source technologies largely adopted in both cloud-native and data-science environments. In addition, a "single sign-on"-like experience is available thanks to the standards-based integration of INDIGO-IAM with all the tools. The integration of compute resources is done through the customization of a JupyterHub solution, able to spawn identity-aware user instances ready to access data with no further setup actions. Integration with GPU resources is also available, designed to sustain increasingly widespread ML-based workflows. Seamless connections between the user UI and batch/big-data processing frameworks (Spark, HTCondor) are possible. Finally, the experiment data access latency is reduced thanks to the integrated deployment of a scalable set of caches, developed in the context of the ESCAPE project and as such compatible with future scenarios where a data lake will be available for the research community. The outcome of the evaluation of this solution in action is presented, showing how a real CMS analysis workflow can use the infrastructure to achieve its results.
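Wiring JupyterHub to an external OAuth identity provider of this kind is done in its configuration file. A minimal sketch follows; the client credentials and endpoint URLs are placeholders, and the exact option names depend on the oauthenticator version deployed:

```python
# jupyterhub_config.py sketch (placeholder values throughout)
c.JupyterHub.authenticator_class = "oauthenticator.generic.GenericOAuthenticator"
c.GenericOAuthenticator.client_id = "PLACEHOLDER_CLIENT_ID"
c.GenericOAuthenticator.client_secret = "PLACEHOLDER_SECRET"
c.GenericOAuthenticator.authorize_url = "https://iam.example.org/authorize"
c.GenericOAuthenticator.token_url = "https://iam.example.org/token"
c.GenericOAuthenticator.userdata_url = "https://iam.example.org/userinfo"
```

With such a fragment in place, every spawned single-user instance carries the identity established at login, which is what enables token-based data access with no further setup.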
DOI: 10.22323/1.378.0003
2021
Enabling HPC systems for HEP: the INFN-CINECA Experience
In this report we describe a successful integration exercise between the CINECA (PRACE Tier-0) Marconi KNL system and LHC processing. A production-level system has been deployed using a 30 Mhour grant from the 18th Call for PRACE Project Access; thanks to CINECA, more than 3x the granted hours were eventually made available. Modifications at multiple levels were needed: on the experiments' WMS layers, on site-level access policies and routing, and on virtualization. The success of the integration process paves the way to integration with additional local systems, and in general shows how the requirements of an HPC center can coexist with the needs of data-intensive, complex distributed workflows.
2021
Dynamic Distribution of High-Rate Data Processing from CERN to Remote HPC Data Centers
DOI: 10.1016/s0920-5632(00)01062-8
2001
b quark fragmentation functions at the Z peak
The latest measurements about b quark fragmentation functions are reviewed, taking into account the differences in methods and in what is actually measured.
1999
Electroweak heavy flavour results presented at the 1999 winter conferences