
D. Spiga

DOI: 10.1007/s10723-018-9453-3
2018
Cited 56 times
INDIGO-DataCloud: a Platform to Facilitate Seamless Access to E-Infrastructures
This paper describes the achievements of the H2020 project INDIGO-DataCloud. The project has provided e-infrastructures with tools, applications and cloud framework enhancements to manage the demanding requirements of scientific communities, either locally or through enhanced interfaces. The middleware developed makes it possible to federate hybrid resources and to easily write, port and run scientific applications in the cloud. In particular, we have extended existing PaaS (Platform as a Service) solutions, allowing public and private e-infrastructures, including those provided by EGI, EUDAT, and Helix Nebula, to integrate their existing services and make them available through AAI services compliant with GEANT interfederation policies, thus guaranteeing transparency and trust in the provisioning of such services. Our middleware facilitates the execution of applications using containers on Cloud- and Grid-based infrastructures, as well as on HPC clusters. Our developments are freely downloadable as open source components, and are already being integrated into many scientific applications.
DOI: 10.1007/s11869-023-01495-x
2024
Air quality changes during the COVID-19 pandemic guided by robust virus-spreading data in Italy
DOI: 10.1007/978-3-540-77220-0_52
2008
Cited 31 times
The CMS Remote Analysis Builder (CRAB)
DOI: 10.1088/1742-6596/2438/1/012039
2023
Extending the distributed computing infrastructure of the CMS experiment with HPC resources
Abstract Particle accelerators are an important tool to study the fundamental properties of elementary particles. Currently the highest energy accelerator is the LHC at CERN, in Geneva, Switzerland. Each of its four major detectors, such as the CMS detector, produces dozens of Petabytes of data per year to be analyzed by a large international collaboration. The processing is carried out on the Worldwide LHC Computing Grid, which spans more than 170 compute centers around the world and is used by a number of particle physics experiments. Recently the LHC experiments were encouraged to make increasing use of HPC resources. While Grid resources are homogeneous with respect to the Grid middleware used, HPC installations can be very different in their setup. In order to integrate HPC resources into the highly automated processing setups of the CMS experiment, a number of challenges need to be addressed. For processing, access to primary data and metadata as well as access to the software is required. At Grid sites all this is achieved via a number of services that are provided by each center. However, at HPC sites many of these capabilities cannot be easily provided and have to be enabled in user space or by other means. HPC centers also often restrict network access to remote services, which is a further severe limitation. The paper discusses a number of solutions and recent experience of the CMS experiment in including HPC resources in processing campaigns.
DOI: 10.1016/j.cpc.2023.108965
2024
Prototyping a ROOT-based distributed analysis workflow for HL-LHC: The CMS use case
The challenges expected for the next era of the Large Hadron Collider (LHC), both in terms of storage and computing resources, provide LHC experiments with a strong motivation for evaluating ways of rethinking their computing models at many levels. Great efforts have been put into optimizing the computing resource utilization for data analysis, which leads both to lower hardware requirements and faster turnaround for physics analyses. In this scenario, the Compact Muon Solenoid (CMS) collaboration is involved in several activities aimed at benchmarking different solutions for running High Energy Physics (HEP) analysis workflows. A promising solution is evolving software towards more user-friendly approaches featuring a declarative programming model and interactive workflows. The computing infrastructure should keep up with this trend by offering modern interfaces on the one hand, and by hiding the complexity of the underlying environment on the other, while efficiently leveraging the already deployed grid infrastructure and scaling toward opportunistic resources like public cloud or HPC centers. This article presents the first example of using the ROOT RDataFrame technology to exploit such next-generation approaches for a production-grade CMS physics analysis. A new analysis facility is created to offer users a modern interactive web interface based on JupyterLab that can leverage HTCondor-based grid resources at different geographical sites. The physics analysis is converted from a legacy iterative approach to the modern declarative approach offered by RDataFrame and distributed over multiple computing nodes. The new scenario offers not only an overall improved programming experience, but also an order-of-magnitude speedup with respect to the previous approach.
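To make the contrast between a legacy iterative event loop and the declarative RDataFrame model more concrete, here is a minimal PyROOT sketch. The tree name, file name and branch names are invented for the example (they are not taken from the CMS analysis in the paper), and the closing comment only hints at the distributed, Dask-backed RDataFrame variant; the exact distributed API depends on the ROOT version.

```python
# Minimal sketch of a declarative RDataFrame analysis (hypothetical tree/branches).
import ROOT

# Build a computation graph instead of writing an explicit event loop.
df = ROOT.RDataFrame("Events", "data.root")            # hypothetical tree and file
sel = df.Filter("nMuon >= 2", "at least two muons") \
        .Define("pt_lead", "Muon_pt[0]")               # derived column

hist = sel.Histo1D(("pt_lead", "Leading muon pT", 50, 0.0, 200.0), "pt_lead")

# The event loop runs lazily, only when a result is actually requested.
hist.GetValue().Print()

# In the distributed setup described above, a similar graph could be created with
# the experimental distributed RDataFrame (e.g. a Dask backend on top of
# HTCondor-provisioned workers); the exact module path varies between ROOT releases.
```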
DOI: 10.48550/arxiv.2404.02100
2024
Analysis Facilities White Paper
This white paper presents the current status of the R&D for Analysis Facilities (AFs) and attempts to summarize the views on the future direction of these facilities. These views have been collected through the High Energy Physics (HEP) Software Foundation's (HSF) Analysis Facilities forum, established in March 2022, the Analysis Ecosystems II workshop, that took place in May 2022, and the WLCG/HSF pre-CHEP workshop, that took place in May 2023. The paper attempts to cover all the aspects of an analysis facility.
DOI: 10.1051/epjconf/202429508013
2024
ML_INFN project: Status report and future perspectives
The ML_INFN initiative (“Machine Learning at INFN”) is an effort to foster Machine Learning (ML) activities at the Italian National Institute for Nuclear Physics (INFN). In recent years, artificial-intelligence-inspired activities have flourished bottom-up in many areas of Physics, both at the experimental and theoretical level. Many researchers have procured desktop-level devices with consumer-oriented GPUs and have trained themselves in a variety of ways, through webinars, books, and tutorials. ML_INFN aims to support and systematize this effort in multiple ways: by offering state-of-the-art hardware for ML, leveraging the INFN Cloud provisioning solutions to share GPUs more efficiently and level the access to such resources for all INFN researchers, and by organizing and curating Knowledge Bases with production-grade examples from successful activities already in production. Moreover, training events have been organized for beginners, based on existing INFN ML research and focused on flattening the learning curve. In this contribution, we update the status of the project, reporting in particular on the development of tools to take advantage of High-Performance Computing resources provisioned by the CNAF and ReCaS computing centers for interactive support to activities, and on the organization of the first in-person advanced-level training event, with a GPU-equipped cloud-based environment provided to each participant.
DOI: 10.1051/epjconf/202429507040
2024
Progress on cloud native solution of Machine Learning as a Service for HEP
Nowadays Machine Learning (ML) techniques are successfully used in many areas of High-Energy Physics (HEP) and will also play a significant role in the upcoming High-Luminosity LHC upgrade foreseen at CERN, when a huge amount of data will be produced by the LHC and collected by the experiments, facing challenges at the exascale. To favor the usage of ML in HEP analyses, it would be useful to have a service that allows the entire ML pipeline to be performed (in terms of reading the data, processing data, training a ML model, and serving predictions) directly using ROOT files of arbitrary size from local or remote distributed data sources. The Machine Learning as a Service for HEP (MLaaS4HEP) solution we have already proposed aims to provide such a service and to be HEP-experiment agnostic. To provide users with a real service and to integrate it into the INFN Cloud, we started working on the MLaaS4HEP cloudification. This allows cloud resources to be used and work to be carried out in a distributed environment. In this work, we provide updates on this topic and discuss a working prototype of the service running on INFN Cloud. It includes an OAuth2 proxy server as authentication/authorization layer, a MLaaS4HEP server, an XRootD proxy server for enabling access to remote ROOT data, and the TensorFlow as a Service (TFaaS) service in charge of the inference phase. With this architecture a HEP user, after being authenticated and authorized, can submit ML pipelines using local or remote ROOT files through simple HTTP calls.
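As a rough client-side illustration of the "submit a pipeline over HTTP through an OAuth2 proxy" idea, here is a hedged Python sketch. The endpoint path, payload fields and token source are assumptions made up for this example; they do not reflect the actual MLaaS4HEP API.

```python
# Hypothetical client sketch: submit an ML pipeline to a MLaaS4HEP-like service
# sitting behind an OAuth2 proxy. Endpoint and payload schema are illustrative only.
import json
import os
import requests

SERVICE_URL = "https://mlaas.example.infn.it/api/v1/pipelines"   # hypothetical endpoint
TOKEN = os.environ["OIDC_ACCESS_TOKEN"]   # bearer token obtained out of band from the IAM provider

payload = {
    "data": ["root://xrootd.example.org//store/user/sample.root"],  # remote ROOT file (placeholder)
    "labels": "target",
    "model": "keras_sequential",            # hypothetical model identifier
    "params": {"epochs": 5, "batch_size": 256},
}

resp = requests.post(
    SERVICE_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    data=json.dumps(payload),
    timeout=60,
)
resp.raise_for_status()
print("submitted pipeline:", resp.json())
```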
DOI: 10.1051/epjconf/202429503036
2024
Repurposing of the Run 2 CMS High Level Trigger Infrastructure as a Cloud Resource for Offline Computing
The former CMS Run 2 High Level Trigger (HLT) farm is one of the largest contributors to CMS compute resources, providing about 25k job slots for offline computing. This CPU farm was initially employed as an opportunistic resource, exploited during inter-fill periods, in the LHC Run 2. Since then, it has become a nearly transparent extension of the CMS capacity at CERN, being located on-site at the LHC interaction point 5 (P5), where the CMS detector is installed. This resource has been configured to support the execution of critical CMS tasks, such as prompt detector data reconstruction. It can therefore be used in combination with the dedicated Tier 0 capacity at CERN, in order to process and absorb peaks in the stream of data coming from the CMS detector. The initial configuration for this resource, based on statically configured VMs, provided the required level of functionality. However, regular operations of this cluster revealed certain limitations compared to the resource provisioning and use model employed at WLCG sites. A new configuration, based on a vacuum-like model, has been implemented for this resource in order to solve the detected shortcomings. This paper reports on this redeployment work on the permanent cloud for enhanced support of CMS offline computing, comparing the former and new models’ respective functionalities, along with the commissioning effort for the new setup.
DOI: 10.1051/epjconf/202429511012
2024
KServe inference extension for an FPGA vendor-free ecosystem
Field Programmable Gate Arrays (FPGAs) are playing an increasingly important role in the sampling and data processing industry due to their intrinsically highly parallel architecture, low power consumption, and flexibility to execute custom algorithms. In particular, the use of FPGAs to perform Machine Learning (ML) inference is growing rapidly thanks to the development of High-Level Synthesis (HLS) projects that abstract away the complexity of Hardware Description Language (HDL) programming. In this work we describe our experience extending KServe predictors, an emerging standard for ML model inference as a service on Kubernetes. This project supports a custom workflow capable of loading and serving models on demand on top of FPGAs. A key aspect of the proposed approach is to make the firmware generation, often an obstacle to widespread FPGA adoption, transparent. We detail how the proposed system automates both the synthesis of the HDL code and the generation of the firmware, starting from a high-level language and user-friendly machine learning libraries. The ecosystem is then completed with the adoption of a common language for sharing user models and firmware, based on a dedicated Open Container Initiative artifact definition, thus leveraging the well-established practices for managing resources in a container registry.
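From the client's point of view, a KServe predictor is reached through the standard KServe REST inference protocol regardless of the backend; the sketch below shows a v1 ":predict" call from Python. The host, model name and input vector are chosen purely for illustration and are not from the paper.

```python
# Query a KServe predictor over the v1 REST protocol (model name, host and
# input data are hypothetical; the FPGA backend is transparent to the client).
import requests

MODEL = "hls4ml-demo"                                    # hypothetical model name
URL = f"https://kserve.example.infn.it/v1/models/{MODEL}:predict"

payload = {"instances": [[0.1, 0.4, 0.7, 0.2]]}          # one feature vector

resp = requests.post(URL, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["predictions"])
```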
DOI: 10.1051/epjconf/202429510004
2024
INFN and the evolution of distributed scientific computing in Italy
INFN has been running a distributed infrastructure (the Tier-1 at Bologna-CNAF and 9 Tier-2 centres) for more than 20 years, which currently offers about 150,000 CPU cores and 120 PB of space across tape and disk storage, serving more than 40 international scientific collaborations. This Grid-based infrastructure was augmented in 2019 with the INFN Cloud: a production-quality multi-site federated Cloud infrastructure, composed of a core backbone and able to integrate other INFN sites as well as public or private Clouds. The INFN Cloud provides a customizable and extensible portfolio offering computing and storage services spanning the IaaS, PaaS and SaaS layers, with dedicated solutions to serve special purposes, such as ISO-certified regions for the handling of sensitive data. INFN is now revising and expanding its infrastructure to tackle the challenges expected in the next 10 years of scientific computing, adopting a “cloud-first” approach through which all the INFN data centres will be federated via the INFN Cloud middleware and integrated with key HPC centres, such as the pre-exascale Leonardo machine at CINECA. In such a process, which involves both the infrastructures and the higher-level services, initiatives and projects such as the "Italian National Centre on HPC, Big Data and Quantum Computing" (funded in the context of the Italian "National Recovery and Resilience Plan") and the Bologna Technopole are precious opportunities that will be exploited to offer advanced resources and services to universities, research institutions and industry. In this paper we describe how INFN is evolving its computing infrastructure, with the ambition to create and operate a national vendor-neutral, open, scalable, and flexible "datalake" able to serve much more than just INFN users and experiments.
DOI: 10.1051/epjconf/202429511006
2024
Enabling INFN–T1 to support heterogeneous computing architectures
The INFN–CNAF Tier-1 located in Bologna (Italy) is a center of the WLCG e-Infrastructure providing computing power to the four major LHC collaborations; it also supports the computing needs of about fifty more groups, including some from non-HEP research domains. The CNAF Tier-1 has historically been very active in the integration of computing resources, proposing and prototyping solutions both for extension through Cloud resources, public and private, and with remotely owned sites, as well as developing an integrated HTC+HPC system with the PRACE CINECA supercomputer center, located 8 km from the CNAF Tier-1 in Bologna. In order to meet the requirements for the new Tecnopolo center, where the CNAF Tier-1 will be hosted, the resource integration activities continue to progress. In particular, this contribution details the challenges that have recently been addressed in providing opportunistic access to non-standard CPU architectures, such as PowerPC, and to hardware accelerators (GPUs). We explain the approach adopted both to transparently provision x86_64, ppc64le and NVIDIA V100 GPUs from the Marconi 100 HPC cluster managed by CINECA and to access data from the Tier-1 storage system at CNAF. The solution adopted is general enough to enable the seamless integration, at the same time, of other computing architectures from different providers, such as ARM CPUs from the TEXTAROSSA project, and we report on the integration of these within the computing model of the CMS experiment. Finally, we discuss the results of this early experience.
DOI: 10.1109/tns.2009.2028076
2009
Cited 18 times
CRAB: A CMS Application for Distributed Analysis
Beginning in 2009, the CMS experiment will produce several petabytes of data each year which will be distributed over many computing centres geographically distributed in different countries. The CMS computing model defines how the data is to be distributed and accessed to enable physicists to efficiently run their analyses over the data. The analysis will be performed in a distributed way using Grid infrastructure. CRAB (CMS remote analysis builder) is a specific tool, designed and developed by the CMS collaboration, that allows the end user to transparently access distributed data. CRAB interacts with the local user environment, the CMS data management services and with the Grid middleware; it takes care of the data and resource discovery; it splits the user's task into several processes (jobs) and distributes and parallelizes them over different Grid environments; it performs process tracking and output handling. Very limited knowledge of the underlying technical details is required of the end user. The tool can be used as a direct interface to the computing system or can delegate the task to a server, which takes care of the job handling, providing services such as automatic resubmission in case of failures and notification to the user of the task status. Its current implementation is able to interact with gLite and OSG Grid middlewares. Furthermore, with the same interface, it enables access to local data and batch systems such as load sharing facility (LSF). CRAB has been in production and in routine use by end users since Spring 2004. It has been extensively used in studies to prepare the Physics Technical Design Report, in the analysis of reconstructed event samples generated during the Computing Software and Analysis Challenges and in the preliminary cosmic ray data taking. The CRAB architecture and the usage inside the CMS community will be described in detail, as well as the current status and future development.
DOI: 10.1088/1742-6596/396/3/032026
2012
Cited 14 times
CRAB3: Establishing a new generation of services for distributed analysis at CMS
In CMS Computing the highest priorities for analysis tools are the improvement of the end users' ability to produce and publish reliable samples and analysis results, as well as a transition to a sustainable development and operations model. To achieve these goals CMS decided to incorporate analysis processing into the same framework as data and simulation processing. This strategy foresees that all workload tools (Tier-0, Tier-1, production, analysis) share a common core with long-term maintainability, as well as the standardization of the operator interfaces. The re-engineered analysis workload manager, called CRAB3, makes use of newer technologies, such as RESTful web services and NoSQL databases, aiming to increase the scalability and reliability of the system. As opposed to CRAB2, in CRAB3 all work is centrally injected and managed in a global queue. A pool of agents, which can be geographically distributed, consumes work from the central services, serving the user tasks. The new architecture of CRAB substantially changes the deployment model and operations activities. In this paper we present the implementation of CRAB3, emphasizing how the new architecture improves the workflow automation and simplifies maintainability. In particular, we highlight the impact of the new design on daily operations.
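For orientation, a CRAB3 task is described by a small Python configuration object on the user side. The fragment below is an illustrative sketch based on the public CRAB3 client interface; the request name, CMSSW parameter set, dataset and storage site are placeholders invented for the example.

```python
# Illustrative CRAB3 task configuration (dataset, pset and site are placeholders).
from CRABClient.UserUtilities import config

config = config()

config.General.requestName = 'example_analysis'
config.General.workArea = 'crab_projects'

config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset_analysis.py'        # CMSSW configuration file (placeholder)

config.Data.inputDataset = '/SingleMuon/Run2099X-Example/MINIAOD'  # placeholder dataset
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 10
config.Data.publication = False

config.Site.storageSite = 'T2_XX_Example'           # placeholder storage site
```

The task is then submitted with the CRAB client, which hands it to the central services and the distributed agents described in the abstract above.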
DOI: 10.1016/j.nuclphysbps.2007.11.124
2008
Cited 14 times
CRAB: the CMS distributed analysis tool development and design
Starting from 2007 the CMS experiment will produce several Pbytes of data each year, to be distributed over many computing centers located in many different countries. The CMS computing model defines how the data are to be distributed such that CMS physicists can access them in an efficient manner in order to perform their physics analysis. CRAB (CMS Remote Analysis Builder) is a specific tool, designed and developed by the CMS collaboration, that facilitates access to the distributed data in a very transparent way. The tool's main feature is the possibility of distributing and parallelizing the local CMS batch data analysis processes over different Grid environments without any specific knowledge of the underlying computational infrastructures. More specifically, CRAB allows the transparent usage of WLCG, gLite and OSG middleware. CRAB interacts with the local user environment, with the CMS Data Management services and with the Grid middleware.
DOI: 10.1007/s10723-010-9152-1
2010
Cited 12 times
Distributed Analysis in CMS
The CMS experiment expects to manage several Pbytes of data each year during the LHC programme, distributing them over many computing sites around the world and enabling data access at those centers for analysis. CMS has identified the distributed sites as the primary location for physics analysis to support a wide community with thousands of potential users. This represents an unprecedented experimental challenge in terms of the scale of distributed computing resources and the number of users. An overview of the computing architecture, the software tools and the distributed infrastructure is reported. Summaries of the experience in establishing efficient and scalable operations in preparation for CMS distributed analysis are presented, followed by the user experience in current analysis activities.
DOI: 10.1088/1742-6596/396/3/032047
2012
Cited 10 times
Implementing data placement strategies for the CMS experiment based on a popularity model
During the first two years of data taking, the CMS experiment has collected over 20 PetaBytes of data and processed and analyzed it on the distributed, multi-tiered computing infrastructure of the Worldwide LHC Computing Grid. Given the increasing data volume that has to be stored and efficiently analyzed, it is a challenge for several LHC experiments to optimize and automate the data placement strategies in order to fully profit from the available network and storage resources and to facilitate daily computing operations. Building on previous experience acquired by ATLAS, we have developed the CMS Popularity Service that tracks file accesses and user activity on the grid and will serve as the foundation for the evolution of CMS data placement. A fully automated, popularity-based site-cleaning agent has been deployed in order to scan Tier-2 sites that are reaching their space quota and suggest obsolete, unused data that can be safely deleted without disrupting analysis activity. Future work will be to demonstrate dynamic data placement functionality based on this popularity service and to integrate it in the data and workload management systems: as a consequence the pre-placement of data will be minimized and additional replication of hot datasets will be requested automatically. This paper will give an insight into the development, validation and production process and will analyze how the framework has influenced resource optimization and daily operations in CMS.
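A purely conceptual sketch (not the actual Popularity Service or cleaning-agent code) of how a popularity-driven cleaning pass might work: rank replicas by recent accesses and propose the least-popular ones for deletion until the site drops back under a target fraction of its quota. All data structures and thresholds here are invented for illustration.

```python
# Conceptual sketch of a popularity-based cleaning pass for one site
# (structures and thresholds are illustrative, not the production algorithm).
from dataclasses import dataclass

@dataclass
class Replica:
    dataset: str
    size_tb: float
    accesses_last_90d: int

def suggest_deletions(replicas, used_tb, quota_tb, target_fraction=0.85):
    """Propose never-accessed replicas for deletion until usage would drop
    below target_fraction * quota (or candidates run out)."""
    to_free = used_tb - quota_tb * target_fraction
    if to_free <= 0:
        return []

    # Least popular first; among equally unpopular replicas, free the largest first.
    candidates = sorted(replicas, key=lambda r: (r.accesses_last_90d, -r.size_tb))
    proposal, freed = [], 0.0
    for rep in candidates:
        if freed >= to_free:
            break
        if rep.accesses_last_90d == 0:
            proposal.append(rep)
            freed += rep.size_tb
    return proposal

site = [Replica("/A/Run1/AOD", 30.0, 0),
        Replica("/B/Run1/AOD", 55.0, 120),
        Replica("/C/MC/AODSIM", 20.0, 0)]
print([r.dataset for r in suggest_deletions(site, used_tb=480.0, quota_tb=500.0)])
```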
DOI: 10.1016/j.ejmp.2021.10.005
2021
Cited 7 times
Enhancing the impact of Artificial Intelligence in Medicine: A joint AIFM-INFN Italian initiative for a dedicated cloud-based computing infrastructure
Artificial Intelligence (AI) techniques have been implemented in the field of Medical Imaging for more than forty years. Medical Physicists, Clinicians and Computer Scientists have been collaborating since the beginning to realize software solutions to enhance the informative content of medical images, including AI-based support systems for image interpretation. Despite the recent massive progress in this field due to the current emphasis on Radiomics, Machine Learning and Deep Learning, there are still some barriers to overcome before these tools are fully integrated into the clinical workflows to finally enable a precision medicine approach to patients' care. Nowadays, as Medical Imaging has entered the Big Data era, innovative solutions to efficiently deal with huge amounts of data and to exploit large and distributed computing resources are urgently needed. In the framework of a collaboration agreement between the Italian Association of Medical Physicists (AIFM) and the National Institute for Nuclear Physics (INFN), we propose a model of an intensive computing infrastructure, especially suited for training AI models, equipped with secure storage systems, compliant with data protection regulation, which will accelerate the development and extensive validation of AI-based solutions in the Medical Imaging field of research. This solution can be developed and made operational by Physicists and Computer Scientists working on complementary fields of research in Physics, such as High Energy Physics and Medical Physics, who have all the necessary skills to tailor the AI technology to the needs of the Medical Imaging community and to shorten the pathway towards the clinical applicability of AI-based decision support systems.
DOI: 10.1051/epjconf/201921407027
2019
Cited 8 times
Exploiting private and commercial clouds to generate on-demand CMS computing facilities with DODAS
Minimising time and cost is key to exploiting private or commercial clouds. This can be achieved by increasing setup and operational efficiencies. Success and sustainability are thus obtained by reducing the learning curve, as well as the operational cost of managing community-specific services running on distributed environments. The greatest beneficiaries of this approach are communities willing to exploit opportunistic cloud resources. DODAS builds on several EOSC-hub services developed by the INDIGO-DataCloud project and allows on-demand container-based clusters to be instantiated. These execute software applications on potentially “any cloud provider”, generating sites on demand with almost zero effort. DODAS provides ready-to-use solutions to implement a “Batch System as a Service” as well as a Big Data platform for a “Machine Learning as a Service”, offering a high level of customization to integrate specific scenarios. A description of the DODAS architecture will be given, including the CMS integration strategy adopted to connect it with the experiment’s HTCondor Global Pool. Performance and scalability results of DODAS-generated tiers processing real CMS analysis jobs will be presented. The Instituto de Física de Cantabria and Imperial College London use cases will be sketched. Finally, a high-level strategy overview for optimizing data ingestion in DODAS will be described.
DOI: 10.1088/1742-6596/219/7/072013
2010
Cited 7 times
Use of glide-ins in CMS for production and analysis
With the evolution of various grid federations, the Condor glide-ins represent a key feature in providing a homogeneous pool of resources using late-binding technology. The CMS collaboration uses the glide-in based Workload Management System, glideinWMS, for production (ProdAgent) and distributed analysis (CRAB) of the data. The Condor glide-in daemons are submitted to the worker nodes via Condor-G. Once activated, they preserve the Master-Worker relationship, with the worker first validating the execution environment on the worker node before pulling jobs sequentially until the expiry of its lifetime. The combination of late binding and validation significantly reduces the overall failure rate visible to CMS physicists. We discuss the extensive use of the glideinWMS since the CCRC-08 computing challenge, in order to prepare for the forthcoming LHC data-taking period. The key features essential to the success of large-scale production and analysis on CMS resources across major grid federations, including EGEE, OSG and NorduGrid, are outlined. Use of glide-ins via the CRAB server mechanism and ProdAgent, as well as first-hand experience of using the next-generation CREAM computing element within the CMS framework, is discussed.
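The late-binding pattern described above can be summarised in a few lines: the pilot first validates its execution environment, then pulls user jobs from a central queue until its lifetime expires, so misconfigured nodes never receive payloads. The sketch below is purely conceptual and does not reflect the actual glideinWMS internals; all names and the lifetime value are invented.

```python
# Conceptual late-binding pilot loop (illustrative only, not glideinWMS code).
import time

PILOT_LIFETIME_S = 6 * 3600          # illustrative pilot lifetime

def environment_ok():
    """Validate the worker node before accepting any payload
    (software area, scratch space, outbound connectivity, ...). Site specific."""
    return True

def pull_next_job():
    """Ask the central queue for the next matching user job; returns None when
    the queue is empty. Placeholder for the real matchmaking step."""
    return None

def run_pilot():
    if not environment_ok():
        return                        # the failure never reaches an end user's job
    deadline = time.time() + PILOT_LIFETIME_S
    while time.time() < deadline:
        job = pull_next_job()
        if job is None:
            break                     # nothing to do, release the slot
        job.run()                     # execute the user payload sequentially

if __name__ == "__main__":
    run_pilot()
```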
DOI: 10.1088/1742-6596/219/5/052022
2010
Cited 6 times
The CMS CERN Analysis Facility (CAF)
The CMS CERN Analysis Facility (CAF) was primarily designed to host a large variety of latency-critical workflows. These break down into alignment and calibration, detector commissioning and diagnosis, and high-interest physics analysis requiring fast turnaround. In addition to the low-latency requirement on the batch farm, another mandatory condition is efficient access to the RAW detector data stored at the CERN Tier-0 facility. The CMS CAF also foresees resources for interactive login by a large number of CMS collaborators located at CERN, as an entry point for their day-to-day analysis. These resources will run on a separate partition in order to protect the high-priority use cases described above. While the CMS CAF represents only a modest fraction of the overall CMS resources on the WLCG Grid, an appropriately sized user-support service needs to be provided. We describe the building, commissioning and operation of the CMS CAF during the year 2008. The facility was heavily and routinely used by almost 250 users during multiple commissioning and data challenge periods. It reached a CPU capacity of 1.4 MSI2K and a disk capacity at the Petabyte scale. In particular, we focus on the performance in terms of networking, disk access and job efficiency, and extrapolate prospects for the upcoming first year of LHC data taking. We also present the experience gained and the limitations observed in operating such a large facility, in which well-controlled workflows are combined with more chaotic analysis activity by a large number of physicists.
DOI: 10.1088/1742-6596/219/6/062007
2010
Cited 5 times
Use of the gLite-WMS in CMS for production and analysis
The CMS experiment at the LHC started using the Resource Broker (by the EDG and LCG projects) to submit Monte Carlo production and analysis jobs to distributed computing resources of the WLCG infrastructure over 6 years ago. Since 2006 the gLite Workload Management System (WMS) and Logging & Bookkeeping (LB) have been used. The interaction with the gLite-WMS/LB happens through the CMS production and analysis frameworks, respectively ProdAgent and CRAB, through a common component, BOSSLite. The important improvements recently made in the gLite-WMS/LB as well as in the CMS tools, and the intrinsic independence of different WMS/LB instances, allow CMS to reach the stability and scalability needed for LHC operations. In particular the use of a multi-threaded approach in BOSSLite allowed the scalability of the system to be increased significantly. In this work we present the operational setup of CMS production and analysis based on the gLite-WMS, and the performance obtained in past data challenges and in the daily Monte Carlo production and user analysis activity in the experiment.
DOI: 10.1088/1742-6596/219/7/072007
2010
Cited 5 times
CMS analysis operations
During normal data taking CMS expects to support potentially as many as 2000 analysis users. Since the beginning of 2008 there have been more than 800 individuals who submitted a remote analysis job to the CMS computing infrastructure. The bulk of these users will be supported at the over 40 CMS Tier-2 centres. Supporting a globally distributed community of users on a globally distributed set of computing clusters is a task that requires reconsidering the normal methods of user support for Analysis Operations. In 2008 CMS formed an Analysis Support Task Force in preparation for large-scale physics analysis activities. The charge of the task force was to evaluate the available support tools, the user support techniques, and the direct feedback of users with the goal of improving the success rate and user experience when utilizing the distributed computing environment. The task force determined the tools needed to assess and reduce the number of non-zero exit code applications submitted through the grid interfaces and worked with the CMS experiment dashboard developers to obtain the necessary information to quickly and proactively identify issues with user jobs and data sets hosted at various sites. Results of the analysis group surveys were compiled. Reference platforms for testing and debugging problems were established in various geographic regions. The task force also assessed the resources needed to make the transition to a permanent Analysis Operations task. In this presentation the results of the task force will be discussed as well as the CMS Analysis Operations plans for the start of data taking.
DOI: 10.1088/1742-6596/219/7/072019
2010
Cited 5 times
Automation of user analysis workflow in CMS
CMS has a distributed computing model, based on a hierarchy of tiered regional computing centres. However, the end physicist is not interested in the details of the computing model nor in the complexity of the underlying infrastructure, but only in accessing and using the remote services easily and efficiently. The CMS Remote Analysis Builder (CRAB) is the official CMS tool that allows access to the distributed data in a transparent way. We present the current development direction, which is focused on improving the interface presented to the user and on adding intelligence to CRAB such that it can automate more and more of the work done on behalf of the user. We also present the status of deployment of the CRAB system and the lessons learnt in deploying this tool to the CMS collaboration.
DOI: 10.1088/1742-6596/396/4/042050
2012
Cited 4 times
Using Hadoop File System and MapReduce in a small/medium Grid site
Data storage and data access are key to CPU-intensive and data-intensive high-performance Grid computing. Hadoop is an open-source data processing framework that includes a fault-tolerant and scalable distributed data-processing model and execution environment, named MapReduce, and a distributed file system, named the Hadoop Distributed File System (HDFS).
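For readers unfamiliar with the MapReduce model, a job can be expressed for Hadoop Streaming as a plain mapper and reducer reading stdin and writing stdout. The classic word-count example below is generic illustration, not anything specific to the Grid-site deployment discussed in the paper; the hadoop invocation in the comment is indicative and depends on the local installation.

```python
# wordcount.py -- mapper and reducer for Hadoop Streaming (generic example).
# Indicative invocation (paths depend on the installation):
#   hadoop jar hadoop-streaming.jar -files wordcount.py \
#       -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#       -input /data/in -output /data/out
import sys

def mapper():
    # Emit "<word>\t1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for the same word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "map":
        mapper()
    else:
        reducer()
```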
DOI: 10.1088/1742-6596/219/4/042024
2010
Cited 4 times
Job life cycle management libraries for CMS workflow management projects
Scientific analysis and simulation require the processing and generation of millions of data samples. These tasks often comprise multiple smaller tasks divided over multiple (computing) sites. This paper discusses the Compact Muon Solenoid (CMS) workflow infrastructure, and specifically the Python-based workflow library which is used for so-called task lifecycle management. The CMS workflow infrastructure consists of three layers: high-level specification of the various tasks based on input/output data sets, lifecycle management of task instances derived from the high-level specification, and execution management. The workflow library is the result of a convergence of three CMS sub-projects that respectively deal with scientific analysis, simulation and real-time data aggregation from the experiment. This will reduce duplication and hence development and maintenance costs.
DOI: 10.1088/1742-6596/396/3/032113
2012
Cited 4 times
The CMS workload management system
CMS has started the process of rolling out a new workload management system. This system is currently used for reprocessing and Monte Carlo production with tests under way using it for user analysis.
DOI: 10.1051/epjconf/202024504024
2020
Cited 4 times
Smart Caching at CMS: applying AI to XCache edge services
The projected Storage and Compute needs for the HL-LHC will be a factor of up to 10 above what can be achieved by the evolution of current technology within a flat budget. The WLCG community is studying possible technical solutions to evolve the current computing in order to cope with the requirements; one of the main focuses is resource optimization, with the ultimate aim of improving performance and efficiency, as well as simplifying and reducing operation costs. As of today, storage consolidation based on a Data Lake model is considered a good candidate for addressing HL-LHC data access challenges. The Data Lake model under evaluation can be seen as a logical system that hosts a distributed working set of analysis data. Compute power can be “close” to the lake, but also remote and thus completely external. In this context we expect data caching to play a central role as a technical solution to reduce the impact of latency and reduce network load. A geographically distributed caching layer will be functional to many satellite computing centers that might appear and disappear dynamically. In this talk we propose a system of caches, distributed at the national level, describing both the deployment and the results of the studies made to measure the impact on CPU efficiency. In this contribution, we also present early results on a novel caching strategy beyond the standard XRootD approach, which will serve as a baseline for an AI-based smart caching system.
DOI: 10.1051/epjconf/202024509009
2020
Cited 4 times
Extension of the INFN Tier-1 on a HPC system
The INFN Tier-1 located at CNAF in Bologna (Italy) is a center of the WLCG e-Infrastructure, supporting the 4 major LHC collaborations and more than 30 other INFN-related experiments. After multiple tests towards elastic expansion of CNAF compute power via Cloud resources (provided by Azure, Aruba and in the framework of the HNSciCloud project), and building on the experience gained with the production-quality extension of the Tier-1 farm on remote owned sites, the CNAF team, in collaboration with experts from the ALICE, ATLAS, CMS, and LHCb experiments, has been working to put into production an integrated HTC+HPC system with the PRACE CINECA center, located near Bologna. This extension will be implemented on the Marconi A2 partition, equipped with Intel Knights Landing (KNL) processors. A number of technical challenges were faced and solved in order to successfully run on low-RAM nodes, as well as to overcome the closed environment (network, access, software distribution, ...) that HPC systems deploy with respect to standard GRID sites. We show preliminary results from a large-scale integration effort, using resources secured via the successful PRACE grant N. 2018194658, for 30 million KNL core hours.
DOI: 10.1088/1742-6596/664/2/022017
2015
Cited 3 times
CMS@home: Enabling Volunteer Computing Usage for CMS
Volunteer computing remains a largely untapped opportunistic resource for the LHC experiments. The use of virtualization in this domain was pioneered by the Test4Theory project and enabled the running of high energy particle physics simulations on home computers. This paper describes the model for CMS to run workloads using a similar volunteer computing platform. It is shown how the original approach is exploited to map onto the existing CMS workflow, and missing functionality is identified along with the components and changes that are required. The final implementation of the prototype is detailed along with the identification of areas that would benefit from further development.
DOI: 10.1016/j.nuclphysbps.2007.08.061
2007
Cited 5 times
CMS workload management
From September 2007 the LHC accelerator will start its activity and CMS, one of the four experiments, will begin to take data. The CMS computing model is based on the Grid paradigm, where data are deployed and accessed on a number of geographically distributed computing centers. In addition to real data events, a large number of simulated ones will be produced in a similar, distributed manner. Both real and simulated data will be analyzed by physicists, at an expected rate of 100000 jobs per day submitted to the Grid infrastructure. In order to reach these goals, CMS is developing two tools for the workload management (plus a set of services): ProdAgent and CRAB. ProdAgent deals with the Monte Carlo production system: it creates and configures jobs, interacts with the Framework, merges outputs to a reasonable file size and publishes the simulated data back into the CMS data bookkeeping and data location services. CRAB (CMS Remote Analysis Builder) is the tool deployed ad hoc by CMS to access those remote data. CRAB allows a generic user, without specific knowledge of the Grid infrastructure, to access data and perform their analysis as simply as in a local environment. CRAB takes care of interacting with all Data Management services, from data discovery and location to output file management. An overview of the current implementation of the components of the CMS workload management is presented in this work.
2008
Cited 4 times
Track Reconstruction with Cosmic Ray Data at the Tracker Integration Facility
The subsystems of the CMS silicon strip tracker were integrated and commissioned at the Tracker Integration Facility (TIF) in the period from November 2006 to July 2007. As part of the commissioning, large samples of cosmic ray data were recorded under various running conditions in the absence of a magnetic field. Cosmic rays detected by scintillation counters were used to trigger the readout of up to 15% of the final silicon strip detector, and over 4.7 million events were recorded. This document describes the cosmic track reconstruction and presents results on the performance of track and hit reconstruction, as obtained from dedicated analyses.
DOI: 10.1088/1742-6596/331/7/072030
2011
Cited 3 times
Large scale and low latency analysis facilities for the CMS experiment: development and operational aspects
While a majority of CMS data analysis activities rely on the distributed computing infrastructure on the WLCG Grid, dedicated local computing facilities have been deployed to address particular requirements in terms of latency and scale.
DOI: 10.1109/cts.2014.6867614
2014
Cited 3 times
A cloud-based solution for public administrations: The experience of the Regione Marche
Cloud computing is perceived as the next wave of ICT, and many real deployments are already on the commercial scene. However, this kind of architecture has open legal issues which make it a demanding endeavor for Public Administrations, despite its potential impact on the efficiency, effectiveness and transparency of administrative initiatives. In this paper we present the experience gained in the deployment of a Cloud solution in the Regione Marche local Public Administration, which represents one of the pilot experiences at the national level.
DOI: 10.22323/1.162.0107
2012
Cited 3 times
Optimizing the usage of multi-Petabyte storage resources for LHC experiments
In the last two years of Large Hadron Collider (LHC) [1] operation, the experiments have made considerable use of Grid resources for data storage and offline analysis. To achieve the successful exploitation of these resources a significant operational human effort has been put in place, and it is now time to improve the usage of the available infrastructure. In this respect, the Compact Muon Solenoid (CMS) [2] Popularity project aims to track the experiment's data access patterns (frequency of data access, access protocols, users, sites and CPU), providing the basis for the automation of data cleaning and data placement activities on Grid sites. In addition, the popularity-based Site Cleaning Agent (Victor) has been developed to monitor the evolution in time of the used and pledged space and to remove unused data replicas at full Tier-2s.
DOI: 10.1007/s41781-018-0006-z
2018
Cited 3 times
CMS@home: Integrating the Volunteer Cloud and High-Throughput Computing
Volunteer computing has the potential to provide significant additional computing capacity for the LHC experiments. Initiatives such as the CMS@home project are aiming to integrate volunteer computing resources into the experiment’s computational frameworks to support their scientific workloads. This is especially important, as over the next few years the demands on computing capacity will increase beyond what can be supported by general technology trends. This paper describes how a volunteer computing project that uses virtualization to run high energy physics simulations can integrate those resources into its computing infrastructure. The concept of the volunteer cloud is introduced and how this model can simplify the integration is described. An architecture for implementing the volunteer cloud model is presented along with an implementation for the CMS@home project. Finally, the submission of real CMS workloads to this volunteer cloud is compared to identical workloads submitted to the grid.
DOI: 10.22323/1.327.0024
2018
Cited 3 times
DODAS: How to effectively exploit heterogeneous clouds for scientific computations
Dynamic On Demand Analysis Service (DODAS) is a Platform as a Service tool built by combining several solutions and products developed by the INDIGO-DataCloud H2020 project. DODAS allows on-demand container-based clusters to be instantiated. Both an HTCondor batch system and a platform for Big Data analysis based on Spark, Hadoop, etc. can be deployed on any cloud-based infrastructure with almost zero effort. DODAS acts as a cloud enabler designed for scientists seeking to easily exploit distributed and heterogeneous clouds to process data. Aiming to reduce the learning curve as well as the operational cost of managing community-specific services running on distributed clouds, DODAS completely automates the process of provisioning, creating, managing and accessing a pool of heterogeneous computing and storage resources. DODAS was selected as one of the Thematic Services that will provide multidisciplinary solutions in the EOSC-hub project, an integration and management system of the European Open Science Cloud starting in January 2018. The main goals of this contribution are to provide a comprehensive overview of the overall technical implementation of DODAS, as well as to illustrate two distinct real examples of usage: the integration within the CMS Workload Management System and the extension of the AMS computing model.
DOI: 10.1088/1742-6596/1525/1/012057
2020
Cited 3 times
Using DODAS as deployment manager for smart caching of CMS data management system
Abstract DODAS stands for Dynamic On Demand Analysis Service and is a Platform as a Service toolkit built around several EOSC-hub services designed to instantiate and configure on-demand container-based clusters over public or private Cloud resources. It automates the whole workflow from service provisioning to the configuration and setup of software applications. Therefore, such a solution allows using “any cloud provider”, with almost zero effort. In this paper, we demonstrate how DODAS can be adopted as a deployment manager to set up and manage the compute resources and services required to develop an AI solution for smart data caching. The smart caching layer may reduce the operational cost and increase flexibility with respect to the regular centrally managed storage of the current CMS computing model. The cache space should be dynamically populated with the most requested data. In addition, clustering such caching systems will make it possible to operate them as a Content Delivery System between data providers and end users. Moreover, a geographically distributed caching layer will also serve a data-lake based model, where many satellite computing centers might appear and disappear dynamically. In this context, our strategy is to develop a flexible and automated AI environment for smart management of the content of such a clustered cache system. In this contribution, we describe the computational phases identified as required for the AI environment implementation, as well as the related DODAS integration. We start with an overview of the architecture of the pre-processing step, based on Spark, whose role is to prepare data for a Machine Learning technique. A focus is given to the automation implemented through DODAS. Then, we show how to train an AI-based smart cache and how we implemented a training facility managed through DODAS. Finally, we provide an overview of the inference system, based on the CMS TensorFlow as a Service (TFaaS) and also deployed as a DODAS service.
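To give a flavour of the Spark-based pre-processing step mentioned above, the PySpark fragment below sketches how raw cache-access records could be aggregated into per-file features for later training. The column names, input path and output path are assumptions invented for this example, not the project's actual schema.

```python
# Hypothetical PySpark pre-processing of cache access logs into per-file features.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-log-preprocessing").getOrCreate()

# Columns (filename, size_bytes, timestamp, site) are assumptions for this sketch.
logs = spark.read.json("s3a://cache-logs/2020/*.json")

features = (
    logs.groupBy("filename")
        .agg(
            F.count("*").alias("num_accesses"),
            F.max("size_bytes").alias("size_bytes"),
            F.countDistinct("site").alias("num_sites"),
            F.max("timestamp").alias("last_access"),
        )
)

# Persist the feature table for the training step.
features.write.mode("overwrite").parquet("s3a://cache-features/2020/")
spark.stop()
```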
DOI: 10.1016/j.nima.2006.09.081
2007
Cited 3 times
First level trigger using pixel detector for the CMS experiment
A proposal for a pixel-based Level 1 trigger for the Super-LHC is presented. The trigger is based on fast track reconstruction using the full pixel granularity exploiting a readout which connects different layers in specific trigger towers. The trigger will implement the current CMS high level trigger functionality in a novel concept of intelligent detector. A possible layout is discussed and implications on data links are evaluated.
2008
Cited 3 times
Distributed analysis with CRAB: The client-server architecture evolution and commissioning
CRAB (CMS Remote Analysis Builder) is the tool used by CMS to enable running physics analysis in a transparent manner over data distributed across many sites. It abstracts out the interaction with the underlying batch farms, grid infrastructure and CMS workload management tools, such that it is easily usable by non-experts. CRAB can be used as a direct interface to the computing system or can delegate the user task to a server. Major efforts have been dedicated to the client-server system development, allowing the user to deal only with a simple and intuitive interface and to delegate all the work to a server. The server takes care of handling the user's jobs during the whole lifetime of the user's task. In particular, it takes care of the data and resources discovery, process tracking and output handling. It also provides services such as automatic resubmission in case of failures, notification to the user of the task status, and automatic blacklisting of sites showing evident problems beyond what is provided by existing grid infrastructure. The CRAB Server architecture and its deployment will be presented, as well as the current status and future development. In addition, the experience in using the system for initial detector commissioning activities and data analysis will be summarized.
DOI: 10.1088/1742-6596/396/3/032048
2012
The “Common Solutions” Strategy of the Experiment Support group at CERN for the LHC Experiments
After two years of LHC data taking, processing and analysis and with numerous changes in computing technology, a number of aspects of the experiments' computing, as well as WLCG deployment and operations, need to evolve. As part of the activities of the Experiment Support group in CERN's IT department, and reinforced by effort from the EGI-InSPIRE project, we present work aimed at common solutions across all LHC experiments. Such solutions allow us not only to optimize development manpower but also offer lower long-term maintenance and support costs. The main areas cover Distributed Data Management, Data Analysis, Monitoring and the LCG Persistency Framework. Specific tools have been developed including the HammerCloud framework, automated services for data placement, data cleaning and data integrity (such as the data popularity service for CMS, the common Victor cleaning agent for ATLAS and CMS and tools for catalogue/storage consistency), the Dashboard Monitoring framework (job monitoring, data management monitoring, File Transfer monitoring) and the Site Status Board. This talk focuses primarily on the strategic aspects of providing such common solutions and how this relates to the overall goals of long-term sustainability and the relationship to the various WLCG Technical Evolution Groups.
DOI: 10.1088/1742-6596/2438/1/012031
2023
Enabling CMS Experiment to the utilization of multiple hardware architectures: a Power9 Testbed at CINECA
Abstract The CMS software stack (CMSSW) is built on a nightly basis for multiple hardware architectures and compilers, in order to benefit from the diverse platforms. In practice, however, only x86_64 binaries are used in production and are supported by the workload management tools in charge of production and analysis job delivery to the distributed computing infrastructure. Profiting from an INFN grant at CINECA, a PRACE Tier-0 Center, tests have been carried out using IBM Power9 nodes from the Marconi100 HPC system. A first study of the modifications needed to the standard CMS WMS systems is shown, and very positive proof-of-concept tests have been conducted up to thousands of computing cores, also including an initial utilization of the GPUs hosted by the nodes. The current status of the tests, including plans to support multi-architecture workflows, is presented and discussed.
DOI: 10.5281/zenodo.7883082
2023
A dynamic and extensible web portal enabling the deployment of scientific virtual computational environments on hybrid e-infrastructures
DOI: 10.5281/zenodo.8036983
2023
interTwin D5.1 First Architecture design and Implementation Plan
DOI: 10.1007/s10723-023-09664-z
2023
Smart Caching in a Data Lake for High Energy Physics Analysis
Abstract The continuous growth of data production in almost all scientific areas raises new problems in data access and management, especially in a scenario where the end users, as well as the resources that they can access, are distributed worldwide. This work focuses on data caching management in a Data Lake infrastructure in the context of the High Energy Physics field. We propose an autonomous method, based on Reinforcement Learning techniques, to improve the user experience and to contain the maintenance costs of the infrastructure.
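A heavily simplified sketch of the reinforcement-learning idea: an epsilon-greedy agent learns whether admitting a requested file into the cache pays off in terms of later hits. The state encoding, actions and reward used here are toy choices for illustration and are much simpler than the method developed in the paper.

```python
# Toy epsilon-greedy agent for cache-admission decisions (illustrative only).
import random
from collections import defaultdict

ACTIONS = ("skip", "admit")

class AdmissionAgent:
    def __init__(self, epsilon=0.1, alpha=0.05):
        self.q = defaultdict(lambda: [0.0, 0.0])   # state -> Q-value per action
        self.epsilon, self.alpha = epsilon, alpha

    def state(self, file_size_gb, recent_requests):
        # Discretise a couple of simple features into a state key.
        return (min(int(file_size_gb), 10), min(recent_requests, 5))

    def act(self, state):
        if random.random() < self.epsilon:
            return random.randrange(len(ACTIONS))   # explore
        qvals = self.q[state]
        return max(range(len(ACTIONS)), key=qvals.__getitem__)   # exploit

    def update(self, state, action, reward):
        # One-step update: reward is positive if an admitted file was reused later,
        # negative if it occupied space without further hits.
        self.q[state][action] += self.alpha * (reward - self.q[state][action])

agent = AdmissionAgent()
s = agent.state(file_size_gb=2.5, recent_requests=3)
a = agent.act(s)
agent.update(s, a, reward=1.0 if a == 1 else 0.0)
print(ACTIONS[a], agent.q[s])
```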
DOI: 10.48550/arxiv.2307.12579
2023
Prototyping a ROOT-based distributed analysis workflow for HL-LHC: the CMS use case
DOI: 10.2139/ssrn.4529970
2023
Prototyping a ROOT-Based Distributed Analysis Workflow for HL-LHC: The CMS Use Case
DOI: 10.22323/1.351.0014
2019
Integration of the Italian cache federation within the CMS computing model
The next decades at the HL-LHC will be characterized by a huge increase in both storage and computing requirements (between one and two orders of magnitude). Moreover, we foresee a shift in resource provisioning towards dynamic solutions on private or public clouds and HPC facilities. In this scenario the computing model of the CMS experiment is pushed to evolve, optimizing both the amount of storage that is managed centrally and the CPU efficiency of the jobs that run on "storage-less" resources. In particular, most of the computing resources of the Tier-2 layer can be instrumented to read data from a geographically distributed cache storage based on unmanaged resources, reducing the operational effort by a large fraction and adding flexibility. The objective of this contribution is to present the first implementation of an INFN federation of cache servers, developed also in collaboration with the eXtreme DataCloud EU project. The CNAF Tier-1 plus the Bari and Legnaro Tier-2s provide unmanaged storage organized under a common namespace. This distributed cache federation has been seamlessly integrated into the CMS computing infrastructure; the technical implementation is based on XRootD, which is widely adopted in the CMS computing model through the "Any data, Anytime, Anywhere" (AAA) project. Results in terms of CMS workflow performance will be shown. In addition, a complete simulation of the effects of the described model under several scenarios, including dynamic hybrid cloud resource provisioning, will be discussed. Finally, a plan for upgrading such a prototype towards a stable INFN setup seamlessly integrated with the production CMS computing infrastructure will be presented.
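To give an idea of how an analysis job would read data through such a cache layer, here is a small sketch that opens a file via a hypothetical cache-redirector endpoint with PyROOT and falls back to the global federation; the hostnames and the file path are placeholders.

```python
# Illustrative sketch: read a file through a cache redirector, fall back to AAA.
# The hostnames and the LFN below are placeholders, not real endpoints.
import ROOT

CACHE_REDIRECTOR = "root://xcache-redirector.example.infn.it/"   # hypothetical
ORIGIN_FEDERATION = "root://cms-xrd-global.cern.ch/"             # global redirector
LFN = "/store/data/sample/file.root"                              # placeholder LFN

def open_with_cache(lfn):
    """Try the regional cache first; fall back to the global federation."""
    for prefix in (CACHE_REDIRECTOR, ORIGIN_FEDERATION):
        f = ROOT.TFile.Open(prefix + lfn)
        if f and not f.IsZombie():
            return f
    raise IOError("could not open %s from any endpoint" % lfn)

f = open_with_cache(LFN)
print("opened", f.GetName())
```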
DOI: 10.1109/nssmic.2008.4774652
2008
CRAB: A CMS application for distributed analysis
Starting from 2008, the CMS experiment will produce several petabytes of data every year, to be distributed over many computing centers located in different countries. The CMS computing model defines how the data are to be distributed and accessed in order to enable physicists to run their analyses efficiently over the data. The analysis will thus be performed in a distributed way using the Grid infrastructure. CRAB (CMS Remote Analysis Builder) is a specific tool, designed and developed by the CMS collaboration, that gives end physicists transparent access to distributed data. CRAB interacts with the local user environment, the CMS Data Management services and the Grid middleware: it takes care of data and resource discovery; it splits the user task into several analysis processes (jobs) and distributes and parallelizes them over different Grid environments; and it takes care of process tracking and output handling. Very limited knowledge of the underlying technical details is required of the end user. The tool can be used as a direct interface to the computing system or can delegate the task to a server, which takes care of handling the user jobs, providing services such as automatic resubmission in case of failures and notification of the task status to the user. Its current implementation is able to interact with the WLCG, gLite and OSG Grid middlewares. Furthermore, it allows access to local data and batch systems such as LSF in the very same way. CRAB has been in production and in routine use by end users since Spring 2004. It has been extensively used in studies to prepare the Physics Technical Design Report, in the analysis of reconstructed event samples generated during the Computing, Software and Analysis Challenges, and in the preliminary cosmic-ray data taking. The CRAB architecture and its usage inside the CMS community will be described in detail, as well as the current status and future developments.
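The splitting step mentioned above can be illustrated with a few lines of Python; this is only a toy sketch of file-based splitting, not the actual CRAB implementation, and the dataset file list is invented.

```python
# Toy sketch of file-based task splitting, in the spirit of what CRAB does:
# the user task is divided into N jobs, each processing a slice of the input files.
def split_task(input_files, files_per_job):
    """Group the input files into per-job chunks."""
    return [input_files[i:i + files_per_job]
            for i in range(0, len(input_files), files_per_job)]

# Hypothetical dataset made of 10 files, 3 files per job -> 4 jobs.
dataset = ["/store/user/dataset/file_%d.root" % i for i in range(10)]
jobs = split_task(dataset, files_per_job=3)
for n, chunk in enumerate(jobs):
    print("job", n, "will process", chunk)
```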
DOI: 10.1051/epjconf/202024507033
2020
The DODAS Experience on the EGI Federated Cloud
The EGI Cloud Compute service offers a multi-cloud IaaS federation that brings together research clouds as a scalable computing platform for research, accessible with OpenID Connect federated identity. The federation is not limited to single sign-on; it also introduces features to facilitate the portability of applications across providers: i) a common VM image catalogue with VM image replication to ensure these images will be available at providers whenever needed; ii) a GraphQL information discovery API to understand the capacities and capabilities available at each provider; and iii) integration with orchestration tools (such as the Infrastructure Manager) to abstract the federation and facilitate the use of heterogeneous providers. EGI also monitors the correct functioning of every provider and collects usage information across the whole infrastructure. DODAS (Dynamic On Demand Analysis Service) is an open-source Platform-as-a-Service tool that allows software applications to be deployed over heterogeneous and hybrid clouds. DODAS is one of the so-called Thematic Services of the EOSC-hub project and it instantiates on-demand container-based clusters offering a high level of abstraction to users, allowing distributed cloud infrastructures to be exploited with very limited knowledge of the underlying technologies. This work presents a comprehensive overview of the DODAS integration with the EGI Cloud Federation, reporting the experience of the integration with the CMS experiment submission infrastructure.
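A client of the GraphQL information-discovery API mentioned above could look roughly like the sketch below; the endpoint URL and the query fields are assumptions for illustration, not the actual EGI schema.

```python
# Hedged sketch of querying a GraphQL information-discovery endpoint with requests.
# The URL and the field names in the query are illustrative, not the real EGI schema.
import requests

ENDPOINT = "https://is.example.egi.eu/graphql"   # hypothetical endpoint

QUERY = """
{
  siteCloudComputingEndpoints {
    items { gocEndpointUrl }
  }
}
"""

resp = requests.post(ENDPOINT, json={"query": QUERY}, timeout=30)
resp.raise_for_status()
print(resp.json())
```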
DOI: 10.1007/978-3-030-64583-0_57
2020
Caching Suggestions Using Reinforcement Learning
Big data is usually processed in a decentralized computational environment, with a number of distributed storage systems and processing facilities that enable both online and offline data analysis. In such a context, data access is fundamental to enhance processing efficiency as well as the experience of users inspecting the data, and caching systems are a solution widely adopted in many diverse domains. Here, the optimization of cache management plays a central role in sustaining the growing demand for data. In this article, we propose an autonomous approach based on a Reinforcement Learning technique to implement an agent that manages file storing decisions. Moreover, we test the proposed method in a real context using information on the data analysis workflows of the CMS experiment at CERN.
DOI: 10.22323/1.378.0002
2021
A possible solution for HEP processing on network secluded Computing Nodes
The computing needs of the LHC experiments in the next decades (the so-called High Luminosity LHC, HL-LHC) are expected to increase substantially, due to the concurrent increases in accelerator luminosity, in selection rates and in detector complexity. Many funding agencies are aiming at a consolidation of the national LHC computing infrastructures via a merge with other large-scale computing facilities such as HPC and cloud centers. The LHC experiments started test and production activities on such centers long ago, with intermittent success. The biggest obstacle comes from their typically stricter network policies with respect to our standard centers, which do not allow an easy merge with the distributed LHC computing infrastructure. A possible solution for such centers is presented here, able to satisfy three main goals: be user deployable, be a catch-all solution for all the protocols and services, and be transparent to the experiment software stack. It is based on the integration of existing tools like tsocks, tunsocks, openconnect, cvmfsexec and singularity. We present results from an early experimentation, which positively show that the solution is indeed usable. Large-scale testing on thousands of nodes is the next step in our agenda.
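The spirit of the tsocks/tunsocks approach, i.e. transparently redirecting the outbound connections of unmodified clients through a gateway, can be sketched in Python with the PySocks module; the proxy host and port are placeholders, and this is not the tool chain used in the paper, only an illustration of the mechanism.

```python
# Illustration of transparent SOCKS redirection (the mechanism behind tools
# like tsocks/tunsocks): all outbound sockets go through a gateway node.
# The proxy host/port are placeholders; requires the PySocks package.
import socket
import socks
import urllib.request

# Route every new socket through a SOCKS5 proxy reachable from the secluded node.
socks.set_default_proxy(socks.SOCKS5, "gateway.example.org", 1080)
socket.socket = socks.socksocket

# From here on, ordinary clients work unchanged, e.g. a plain HTTP request:
with urllib.request.urlopen("http://example.org/") as r:
    print(r.status)
```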
DOI: 10.1016/j.parco.2021.102873
2022
The BondMachine, a moldable computer architecture
Future systems will be characterized by the presence of many computing cores in a single device, in large-scale data centers or even at the level of IoT devices. The ability to fully exploit the heterogeneity and concurrency of computational architectures will be a key point. In this manuscript we present the BondMachine (BM), an innovative prototype software ecosystem aimed at creating facilities where hardware and software are co-designed, guaranteeing a full exploitation of fabric capabilities (both in terms of concurrency and heterogeneity) with several hardware optimization possibilities. The fundamental innovation of the BM is to provide a new kind of computer architecture, where the hardware dynamically adapts to the specific computational problem rather than being static and generic, as in standard CPUs synthesized in silicon. Hardware can be designed to fit precisely the needs of any computational task, implementing only the processing units needed and discarding generic solutions. By using BMs within FPGA technologies, end-to-end solutions can be realized in which the creation of domain-specific hardware is part of the development process as much as the software stack. FPGA technology allows independent processing units to be created on a single low-power board, and their interconnections to be designed "in silicon" to maximally fit the design needs. The processors of the BMs are suitable for computational structures like neural networks and tensor processing models, whose popularity in Machine Learning (ML) and Deep Learning (DL) keeps increasing in scientific and industrial areas.
DOI: 10.7287/peerj.preprints.2206v1
2016
Evaluation of the parallel performance of the TRIGRS v2.1 model for rainfall-induced landslides
The widespread availability of high resolution digital elevation models (DEM) opens the possibility of applying physically based models of landslide initiation to large areas. With increasing size of the study area and resolution of the DEM, the required computing time for each run of the model increases proportionally to the number of grid cells in the study area. The aim of this work is to present a new parallel implementation of TRIGRS (Alvioli and Baum, 2016), an open-source FORTRAN program (software available for download at http://geomorphology.irpi.cnr.it/tools/trigrs and https://github.com/usgs/landslides-trigrs) designed for modeling the timing and distribution of shallow, rainfall-induced landslides, and to discuss its parallel performance.
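The parallel metrics discussed here (speedup and efficiency) follow the usual definitions S(p) = T(1)/T(p) and E(p) = S(p)/p; the timings below are invented numbers, used only to show the computation.

```python
# Speedup and parallel efficiency from measured wall-clock times (invented values).
def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    s = t_serial / t_parallel      # S(p) = T(1) / T(p)
    e = s / n_procs                # E(p) = S(p) / p
    return s, e

timings = {1: 3600.0, 4: 950.0, 16: 260.0, 64: 80.0}   # seconds, illustrative only
for p, t in sorted(timings.items()):
    s, e = speedup_and_efficiency(timings[1], t, p)
    print(f"p={p:3d}  speedup={s:6.2f}  efficiency={e:5.2f}")
```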
DOI: 10.1117/12.2629985
2022
The control software of the BEaTriX x-ray beam calibration facility: problems and solutions
In the context of the ATHENA mission, the BEaTriX (Beam Expander Testing X-ray) facility has been developed for the test and acceptance of the Silicon Pore Optics Mirror Modules (MM) that, once assembled, will compose the mirror of the X-ray telescope. This paper describes the software developed to control the entire facility. The language employed is LabVIEW, a control language commonly used for data acquisition, instrument control, and industrial automation. The software is composed of two independent sections: the first one is dedicated to the management of the facility during the tests of the mirror modules, as it incorporates an automated control of all the functionalities of the facility. The second one will be used for the maintenance of the facility, permitting independent access to every single component of the system for functional checks. In the paper, the program and its functionalities are described, presenting what we have implemented to address specific problems.
DOI: 10.22323/1.093.0029
2011
Tools to use heterogeneous Grid schedulers and storage system
DOI: 10.1088/1742-6596/396/3/032025
2012
A gLite FTS based solution for managing user output in CMS
The CMS distributed data analysis workflow assumes that jobs run in a different location from where their results are finally stored. Typically the user output must be transferred across the network from one site to another, possibly on a different continent or over links not necessarily validated for high-bandwidth, high-reliability transfer. This step is named stage-out and in CMS was originally implemented as a synchronous step of the analysis job execution. However, our experience showed the weakness of this approach, both in terms of low total job execution efficiency and in failure rates, wasting precious CPU resources. The nature of analysis data makes it inappropriate to use PhEDEx, the core data placement system for CMS. As part of the new generation of CMS Workload Management tools, the Asynchronous Stage-Out system (AsyncStageOut) has been developed to enable third-party copy of the user output. The AsyncStageOut component manages gLite FTS transfers of data from the temporary store at the site where the job ran to the final location of the data, on behalf of the data owner. The tool uses Python daemons, built using the WMCore framework, and CouchDB to manage the queue of work and the FTS transfers. CouchDB also provides the platform for a dedicated operations monitoring system. In this paper, we present the motivations for the asynchronous stage-out system. We give an insight into the design and the implementation of key features, describing how it is coupled with the CMS workload management system. Finally, we show the results and the commissioning experience.
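Since CouchDB exposes a plain HTTP/JSON API, queueing a transfer request as described above can be sketched with nothing more than the requests library; the database name, document fields and URL are hypothetical, not the actual AsyncStageOut schema.

```python
# Hedged sketch: enqueue a transfer request as a CouchDB document over HTTP.
# Database name, URL and document fields are illustrative, not the real
# AsyncStageOut data model.
import requests

COUCH_URL = "http://localhost:5984"      # hypothetical CouchDB instance
DB = "asyncstageout_queue"               # hypothetical database name

doc = {
    "type": "transfer_request",
    "user": "jdoe",
    "source_lfn": "/store/temp/user/jdoe/output_1.root",
    "destination_site": "T2_IT_Bari",
    "state": "new",
}

# POST to /<db> creates a new document with an auto-generated id.
r = requests.post(f"{COUCH_URL}/{DB}", json=doc, timeout=10)
r.raise_for_status()
print(r.json())   # contains the generated id and revision
```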
2017
INDIGO-DataCloud: A data and computing platform to facilitate seamless access to e-infrastructures.
This paper describes the achievements of the H2020 project INDIGO-DataCloud. The project has provided e-infrastructures with tools, applications and cloud framework enhancements to manage the demanding requirements of scientific communities, either locally or through enhanced interfaces. The middleware developed allows to federate hybrid resources, to easily write, port and run scientific applications to the cloud. In particular, we have extended existing PaaS (Platform as a Service) solutions, allowing public and private e-infrastructures, including those provided by EGI, EUDAT, and Helix Nebula, to integrate their existing services and make them available through AAI services compliant with GEANT interfederation policies, thus guaranteeing transparency and trust in the provisioning of such services. Our middleware facilitates the execution of applications using containers on Cloud and Grid based infrastructures, as well as on HPC clusters. Our developments are freely downloadable as open source components, and are already being integrated into many scientific applications.
DOI: 10.1109/escience.2018.00082
2018
Distributed and On-demand Cache for CMS Experiment at LHC
In the CMS [1] computing model the experiment owns dedicated resources around the world that, for the most part, are located in computing centers organized in a well-defined Tier hierarchy. The geo-distributed storage is controlled centrally by CMS Computing Operations. In this architecture data are distributed and replicated across the centers following a pre-placement model that is mostly human controlled. Analysis jobs are then mostly executed on computing resources close to the data location. This avoids wasting CPU due to I/O latency, although it does not allow the available job slots to be used optimally.
DOI: 10.22323/1.351.0020
2019
The BondMachine toolkit: Enabling Machine Learning on FPGA
The BondMachine (BM) is an innovative prototype software ecosystem aimed at creating facilities where both hardware and software are co-designed, guaranteeing a full exploitation of fabric capabilities (both in terms of concurrency and heterogeneity) with the smallest possible power dissipation. In the present paper we will provide a technical overview of the key aspects of the BondMachine toolkit, highlighting the advancements brought about by the porting of Go code to hardware. We will then show a cloud-based BM as a Service deployment. Finally, we will focus on TensorFlow, and in this context we will show how we plan to benchmark the system with an ML tracking reconstruction from pp collisions at the LHC.
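As an example of the kind of small TensorFlow model that could serve as a benchmark target for such a toolkit, here is a minimal Keras network; the layer sizes and the input dimension are purely illustrative and not taken from the paper.

```python
# Minimal Keras model, illustrative of a small network one might benchmark
# on specialized hardware; sizes and inputs are invented, not from the paper.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                 # e.g. a few hit/track features
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```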
DOI: 10.1007/978-3-030-58802-1_24
2020
An Intelligent Cache Management for Data Analysis at CMS
In this work, we explore a score-based approach to managing a cache system. With the proposed method, the cache can better discriminate among input requests and improve the overall performance. We created a score-based discriminator using file statistics: the score represents the weight of a file. We tested several functions to compute the file weight used to determine whether a file has to be stored in the cache or not. We developed and tested the solution on a real cache manager named XCache, which is used within the Compact Muon Solenoid (CMS) data analysis workflow. The aim of this work is to reduce the maintenance costs of the cache system without compromising the user experience.
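A toy version of the score-based admission logic described above is sketched below; the particular weight function (combining request frequency and file size) is an invented example, not one of the functions evaluated in the paper.

```python
# Toy sketch of score-based cache admission: compute a weight from file
# statistics and admit the file only if the weight passes a threshold.
# The weight function and threshold are invented for illustration.
def file_weight(n_requests, size_gb, days_since_last_access):
    frequency = n_requests / (1.0 + days_since_last_access)
    return frequency / size_gb          # favor small, frequently requested files

def admit(stats, threshold=0.5):
    return file_weight(**stats) >= threshold

example = {"n_requests": 12, "size_gb": 2.5, "days_since_last_access": 3}
print("admit file:", admit(example))
```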
DOI: 10.1109/icmla51294.2020.00231
2020
Effective Big Data Caching through Reinforcement Learning
In the era of big data, data volumes continue to grow in several different domains, from business to scientific fields. Sensors, edge devices, scientific applications and detectors generate huge amounts of data that are distributed by their nature. Extracting value from such data requires a typical pipeline made of two main steps: first the processing and then the data access. One of the main requirements for data access is a fast response time, whose order of magnitude can vary a lot depending on the specific type of processing as well as on the processing patterns. The optimization of the access layer becomes more and more important when dealing with a geographically distributed environment where data must be retrieved from the remote servers of a data lake. From the infrastructural perspective, caching systems are used to mitigate latency and to better serve popular data. Thus, the role of the cache becomes key to effective and efficient data access. In this article, we propose a Reinforcement Learning approach, using the Q-Learning technique, to improve the performance of a cache system in terms of data management. The proposed method uses two agents with different objectives and actions to control the addition and the eviction of files in the cache. The aim of this system is to increase the throughput while reducing the cache costs, such as the amount of data written and the network utilization. Moreover, we tested our method in the context of data analysis, with information taken from High Energy Physics (HEP) workflows.
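The two-agent Q-Learning idea can be illustrated with a tiny tabular agent for the admission decision alone; the states, actions and reward below are drastically simplified placeholders and do not reproduce the paper's formulation.

```python
# Tiny tabular Q-Learning sketch for a cache *admission* agent.
# The state encoding, actions and reward are simplified placeholders,
# not the formulation used in the paper.
import random
from collections import defaultdict

ACTIONS = ("store", "skip")
q_table = defaultdict(float)              # (state, action) -> Q value
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount, exploration

def choose_action(state):
    if random.random() < epsilon:                              # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])     # exploit

def update(state, action, reward, next_state):
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])

# One illustrative step: the state could be a coarse (popularity, size) bucket,
# the reward e.g. +1 for a later cache hit, -1 for a wasted write.
state, next_state = ("popular", "small"), ("popular", "small")
action = choose_action(state)
update(state, action, reward=1.0, next_state=next_state)
print(action, dict(q_table))
```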
DOI: 10.22323/1.378.0009
2021
Reinforcement Learning for Smart Caching at the CMS experiment
In the near future, High Energy Physics experiments' storage and computing needs will go far beyond what can be achieved by only scaling current computing models or current infrastructures. Considering the LHC case, for 10 years a federated infrastructure (the Worldwide LHC Computing Grid, WLCG) has been successfully developed. Nevertheless, the High Luminosity (HL-LHC) scenario is forcing the WLCG community to dig for innovative solutions. In this landscape, one of the initiatives is the exploitation of Data Lakes as a solution to improve data and storage management. The current Data Lake model foresees data caching playing a central role as a technical solution to reduce the impact of latency and network load. Moreover, even higher efficiency can be achieved through a smart caching algorithm: this motivates the development of an AI-based approach to the caching problem. In this work, a Reinforcement Learning-based cache model (named QCACHE) is applied in the CMS experiment context. More specifically, we focused our attention on the optimization of both cache performance and cache management costs. The QCACHE system is based on two distinct Q-Learning (or Deep Q-Learning) agents seeking to find the best action to take given the current state. More explicitly, they try to learn a policy that maximizes the total reward (i.e. the hits or misses occurring in a given time span). While the addition agent takes care of all the cache write requests, the eviction agent deals with the decision to keep or delete files in the cache. We will present an overview of the QCACHE framework, and the results in terms of cache performance, obtained using real-world data (historical data-request aggregations, used to predict dataset popularity, filtered for the Italian region), will be compared with standard replacement policies. Moreover, we will show the planned subsequent evolution of the framework.
DOI: 10.1088/1742-6596/513/3/032079
2014
CMS users data management service integration and first experiences with its NoSQL data storage
The distributed data analysis workflow in CMS assumes that jobs run in a different location from where their results are finally stored. Typically the user outputs must be transferred from one site to another by a dedicated CMS service, AsyncStageOut. This service was originally developed to address the inefficient use of CMS computing resources caused by transferring analysis job outputs synchronously from the job execution node to the remote site as soon as they are produced.
DOI: 10.1088/1742-6596/513/3/032064
2014
Experience in CMS with the common analysis framework project
ATLAS, CERN-IT, and CMS embarked on a project to develop a common system for analysis workflow management, resource provisioning and job scheduling. This distributed computing infrastructure was based on elements of PanDA and prior CMS workflow tools. After an extensive feasibility study and development of a proof-of-concept prototype, the project now has a basic infrastructure that supports the analysis use cases of both experiments via common services. In this paper we will discuss the state of the current solution and give an overview of all the components of the system.
DOI: 10.7287/peerj.preprints.2206v2
2016
Evaluation of the parallel performance of the TRIGRS v2.1 model for rainfall-induced landslides
The widespread availability of high resolution digital elevation models (DEM) opens the possibility of applying physically based models of landslide initiation to large areas. With increasing size of the study area and resolution of the DEM, the required computing time for each run of such models increases proportionally to the number of grid cells in the study area. The aim of this work is to present a new parallel implementation of TRIGRS (Alvioli and Baum, 2016), an open-source FORTRAN program designed for modeling the timing and distribution of shallow, rainfall-induced landslides, and to discuss its parallel performance. We investigated the parallel performance by evaluating running time, speedup and efficiency of the code on a commonly available multi-core machine, on a high-end multi-node machine and on a cloud computing environment, showing the advantages and limitations of each case and discussing the possible weak points of using a general-purpose cloud environment
DOI: 10.1088/1742-6596/664/2/022013
2015
Monitoring the delivery of virtualized resources to the LHC experiments
The adoption of cloud technologies by the LHC experiments places the fabric-management burden of monitoring virtualized resources upon the VO. In addition to monitoring the status of the virtual machines and triaging the results, it must be understood whether the resources actually provided match any agreements relating to their supply. Monitoring the instantiated virtual machines is therefore a fundamental activity, and hence this paper describes how the Ganglia monitoring system can be used for the cloud computing resources of the LHC experiments. Expanding upon this, it is then shown how the integral of the time-series monitoring data obtained can be re-purposed to provide a consumer-side accounting record, which can then be compared with the concrete agreements that exist between the supplier of the resources and the consumer. From this alone, however, it is not clear how the performance of the resources differs both within and between providers. Hence, the case is made for a benchmarking metric to normalize the data, along with some results from a preliminary investigation on obtaining such a metric.
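Re-purposing the monitoring time series as an accounting record amounts to integrating a usage metric over time; the sketch below does this with a trapezoidal rule over invented samples (e.g. the number of cores in use, sampled every half hour).

```python
# Sketch: turn a monitoring time series into a consumer-side accounting figure
# by integrating it over time (trapezoidal rule). The samples are invented.
import numpy as np

# Timestamps in hours and the number of cores reported in use at each sample.
t_hours = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
cores_in_use = np.array([0, 32, 64, 64, 48])

core_hours = np.trapz(cores_in_use, t_hours)   # integral of cores over time
print(f"consumed ~{core_hours:.1f} core-hours in {t_hours[-1]:.1f} hours")
```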
DOI: 10.7287/peerj.preprints.2206
2016
Evaluation of the parallel performance of the TRIGRS v2.1 model for rainfall-induced landslides
The widespread availability of high resolution digital elevation models (DEM) opens the possibility of applying physically based models of landslide initiation to large areas. With increasing size of the study area and resolution of the DEM, the required computing time for each run of such models increases proportionally to the number of grid cells in the study area. The aim of this work is to present a new parallel implementation of TRIGRS (Alvioli and Baum, 2016), an open-source FORTRAN program designed for modeling the timing and distribution of shallow, rainfall-induced landslides, and to discuss its parallel performance. We investigated the parallel performance by evaluating running time, speedup and efficiency of the code on a commonly available multi-core machine, on a high-end multi-node machine and on a cloud computing environment, showing the advantages and limitations of each case and discussing the possible weak points of using a general-purpose cloud environment
2011
CMS experiment: development and operational aspects
While the majority of CMS data analysis activities rely on the distributed computing infrastructure of the WLCG Grid, dedicated local computing facilities have been deployed to address particular requirements in terms of latency and scale. The CMS CERN Analysis Facility (CAF) was primarily designed to host a large variety of latency-critical workflows. These break down into alignment and calibration, detector commissioning and diagnosis, and high-interest physics analysis requiring fast turnaround. In order to reach the goal for fast-turnaround tasks, the Workload Management group has designed a CRABServer-based system to fit two main needs: to provide a simple, familiar interface to the user (as used in the CRAB Analysis Tool [7]) and to allow an easy transition to the Tier-0 system. While the CRABServer component had been initially designed for Grid analysis by CMS end users, with a few modifications it turned out to also be a very powerful service to manage and monitor local submissions on the CAF. The transition to Tier-0 has been guaranteed by the usage of WMCore, a library developed by CMS to be the common core of workload management tools, for handling data-driven workflow dependencies. This system is now being used with the first use cases, and important experience is being acquired. In addition to the CERN CAF facility, FNAL hosts CMS-dedicated analysis resources at the FNAL LHC Physics Center (LPC). In the first few years of data collection FNAL has been able to accept a large fraction of CMS data. The remote centre is not well suited for the extremely low-latency work expected of the CAF, but the presence of substantial analysis resources, a large resident community, and a large fraction of the data make the LPC a strong facility for resource-intensive analysis. We present the building, commissioning and operation of these dedicated analysis facilities in the first year of LHC collisions; we also present the specific developments to our software needed to allow the use of these computing facilities in the special use cases of fast-turnaround analyses.
DOI: 10.1109/nssmic.2017.8533143
2017
A container-based solution to generate HTCondor Batch Systems on demand exploiting heterogeneous Clouds for data analysis
This paper describes the Dynamic On Demand Analysis Service (DODAS), an automated system that simplifies the process of provisioning, creating, managing and accessing a pool of heterogeneous computing and storage resources by generating clusters to run batch systems, thereby implementing the "Batch System as a Service" paradigm. DODAS is built on several INDIGO-DataCloud services, among which the PaaS Orchestrator, the Infrastructure Manager, and the Identity and Access Manager are the most important. The paper also describes a successful integration of DODAS with the computing infrastructure of the Compact Muon Solenoid (CMS) experiment installed at the LHC.
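The "Batch System as a Service" clusters generated by a DODAS-like service are ordinary HTCondor pools, so a user-side submission could look like the sketch below; it assumes the HTCondor Python bindings (version 9 or later) and a schedd reachable from the environment, and the executable and file names are placeholders.

```python
# Hedged sketch: submit a toy job to an HTCondor pool, such as one instantiated
# on demand by a DODAS-like service. Assumes the htcondor Python bindings
# (>= 9.0) and a reachable schedd; executable and files are placeholders.
import htcondor

job = htcondor.Submit({
    "executable": "/bin/echo",
    "arguments": "hello from an on-demand cluster",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(job, count=1)     # newer bindings return a SubmitResult
print("submitted cluster", result.cluster())
```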
2010
Tools to use heterogeneous Grid schedulers and storage system
DOI: 10.1142/9789812819093_0076
2008
CMS DATA AND WORKFLOW MANAGEMENT SYSTEM
DOI: 10.1016/j.nuclphysbps.2009.10.042
2009
End user analysis model at CMS
The CMS experiment at the LHC has had a distributed computing model since early in the project planning. The geographically distributed computing system is based on a hierarchy of tiered regional computing centers: data reconstructed at the Tier-0 are distributed and archived at the Tier-1 centers, where re-reconstruction of data events is performed and computing resources for skimming and selection are provided. The Tier-2 centers are the primary location for analysis activities. The analysis is thus performed in a distributed way using the Grid infrastructure. The CMS computing model architecture also has the goal of enabling a worldwide collaboration of thousands of physicists (about 2600 from 180 scientific institutes) to access the data. In order to require only very limited knowledge of the underlying technical details from the end user, CMS has developed a set of specific tools using the Grid services. This model is being tested in many Grid Service Challenges of increasing complexity, coordinated with the Worldwide LHC Computing Grid community. In this talk the status, plans, and prospects for CMS analysis using the Grid are presented.
DOI: 10.1109/nssmic.2008.4774773
2008
Distributed computing and data analysis in the CMS Experiment
The CMS experiment at the Large Hadron Collider (LHC) at CERN, Geneva, is expected to start taking data during summer 2008. The CMS Computing, Software and Analysis projects will need to meet the expected performance in terms of data archiving, calibration and reconstruction at the host laboratory, as well as data transfer to many computing centers located around the world, where further archiving and re-processing will take place. Hundreds of physicists will then expect to find the necessary infrastructure to easily access and start analysing the long-awaited LHC data. In recent years, CMS has conducted a series of Computing, Software, and Analysis challenges to demonstrate the functionality, scalability and usability of the relevant components and infrastructure. These challenges have been designed to validate the CMS distributed computing model [1] and to run operations in quasi-real data-taking conditions. We will present the CMS readiness in terms of data archiving, offline processing, data transfer and data analysis. We will particularly focus on the metrics achieved during 2008 and potentially on first data-taking experiences.
2008
CRAB: an Application for Distributed Scientific Analysis in Grid Projects
DOI: 10.1393/ncb/i2008-10542-6
2008
CMS computing and data handling
DOI: 10.1393/ncb/i2008-10689-0
2008
Top properties: Prospects at CMS
2008
Experience in Testing the Grid Based Workload Management System of a LHC Experiment.
DOI: 10.5281/zenodo.6968509
2022
EGI-ACE D7.3 Final version HPC integration Handbook
DOI: 10.48550/arxiv.2208.06437
2022
Smart caching in a Data Lake for High Energy Physics analysis
The continuous growth of data production in almost all scientific areas raises new problems in data access and management, especially in a scenario where the end-users, as well as the resources that they can access, are worldwide distributed. This work is focused on the data caching management in a Data Lake infrastructure in the context of the High Energy Physics field. We are proposing an autonomous method, based on Reinforcement Learning techniques, to improve the user experience and to contain the maintenance costs of the infrastructure.
DOI: 10.22323/1.415.0023
2022
Running Fermi-LAT analysis on Cloud: the experience with DODAS with EGI-ACE Project
The aim of the Fermi-LAT long-term Transient (FLT) monitoring is the routine search for γ-ray sources on monthly time intervals of Fermi-LAT data. The FLT analysis consists of two steps: first the monthly data sets were analyzed using a wavelet-based source detection algorithm that provided the candidate new transient sources; finally these transient candidates were analyzed using the standard Fermi-LAT maximum likelihood analysis method. Only sources with a statistical significance above 4σ in at least one monthly bin were listed in a catalog. The strategy adopted to implement the maximum likelihood analysis pipeline has been based on cloud solutions adopting the Dynamic On Demand Analysis Service (DODAS) [1] as a technology enabler. DODAS represents a solution to transparently exploit cloud computing with almost zero effort for a user community. This contribution will detail the technical implementation providing the point of view of the user community.
DOI: 10.22323/1.415.0012
2022
Cloud native approach for Machine Learning as a Service for High Energy Physics
Nowadays Machine Learning (ML) techniques are widely adopted in many areas of High-Energy Physics (HEP) and will certainly play a significant role also in the upcoming High-Luminosity LHC (HL-LHC) upgrade foreseen at CERN. A huge amount of data will be produced by the LHC and collected by the experiments, facing challenges at the exascale. Here, we present a Machine Learning as a Service solution for HEP (MLaaS4HEP) to perform an entire ML pipeline (in terms of reading data, processing data, training ML models, serving predictions) in a completely model-agnostic fashion, directly using ROOT files of arbitrary size from local or distributed data sources. With the new version of the MLaaS4HEP code based on uproot4, we provide new features to improve users' experience with the framework and their workflows, e.g. users can provide some preprocessing operations to be applied to ROOT data before starting the ML pipeline. Our approach is then extended to use local and cloud resources via an HTTP proxy, which allows physicists to submit their workflows using the HTTP protocol. We discuss how this pipeline could be enabled in the INFN Cloud provider and what the final architecture could be.
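Reading ROOT data with uproot, including a simple pre-selection of the kind mentioned above, can be sketched as follows; the file name, tree name and branches are placeholders rather than the actual MLaaS4HEP interface.

```python
# Sketch of reading ROOT data with uproot (as MLaaS4HEP does), applying a
# simple cut before feeding an ML pipeline. File/tree/branch names are placeholders.
import uproot

with uproot.open("sample.root") as f:
    tree = f["Events"]
    # Read selected branches as arrays, applying a cut expression.
    arrays = tree.arrays(["Muon_pt", "Muon_eta"], cut="nMuon >= 2")

print(len(arrays), "selected events")
```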
DOI: 10.22323/1.415.0022
2022
Open-source and cloud-native solutions for managing and analyzing heterogeneous and sensitive clinical Data
The requirement for effective handling and management of heterogeneous and possibly confidential data continuously increases within multiple scientific domains. PLANET (Pollution Lake ANalysis for Effective Therapy) is an INFN-funded research initiative aiming to implement an observational study to assess a possible statistical association between environmental pollution and Covid-19 infection, symptoms and course. PLANET is built on a "data-centric" approach that takes into account clinical components and environmental and pollution conditions, complementing primary data with many possible confounding factors such as population density, commuter density, socio-economic metrics and more. Besides the scientific one, the main technical challenge of the project concerns collecting, indexing, storing and managing many types of datasets while guaranteeing FAIRness as well as adherence to the prescribed regulatory frameworks, such as those mandated by the General Data Protection Regulation (GDPR). In this contribution we describe the developed open-source Data Lake platform, detailing its key features: the event-based storage system provided by MinIO, which allows automatic metadata processing; the data-ingestion pipeline implemented via Argo Workflows; the GraphQL interface to query object metadata; and finally, the seamless integration of the platform within a multi-user compute environment, showing how all these frameworks are integrated in the Enhanced PrIvacy and Compliance (EPIC) Cloud partition of the INFN Cloud federation.
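The event-based ingestion into the object store can be illustrated with the MinIO Python client; the endpoint, credentials, bucket and file names below are placeholders, and the bucket-notification wiring to the processing pipeline is only hinted at in the comments.

```python
# Hedged sketch: upload an object to a MinIO-backed data lake with the official
# Python client. Endpoint, credentials, bucket and object names are placeholders;
# in a setup like the one described, a bucket notification on this upload would
# then trigger the metadata-processing / ingestion workflow.
from minio import Minio

client = Minio(
    "minio.example.infn.it",         # hypothetical endpoint
    access_key="ACCESS_KEY",          # placeholder credentials
    secret_key="SECRET_KEY",
    secure=True,
)

bucket = "clinical-data"              # hypothetical bucket
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

client.fput_object(bucket, "cohort-A/records.csv", "records.csv")
print("uploaded records.csv to", bucket)
```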
DOI: 10.5281/zenodo.7189370
2022
EGI-ACE D2.7 Technical specifications for compute common services
DOI: 10.22323/1.414.0968
2022
Prototype of a cloud native solution of Machine Learning as Service for HEP
To favor the usage of Machine Learning (ML) techniques in High-Energy Physics (HEP) analyses, it would be useful to have a service allowing the entire ML pipeline (in terms of reading the data, training an ML model, and serving predictions) to be performed directly using ROOT files of arbitrary size from local or remote distributed data sources. The MLaaS4HEP framework aims to provide such a solution. It was successfully validated with a CMS physics use case which gave important feedback about the needs of analysts. For instance, we introduced the possibility for the user to provide pre-processing operations, such as defining new branches and applying cuts. To provide a real service for the user and to integrate it into the INFN Cloud, we started working on the MLaaS4HEP cloudification. This would allow the use of cloud resources and working in a distributed environment. In this work, we provide updates on this topic and, in particular, we discuss our first working prototype of the service. It includes an OAuth2 proxy server as the authentication/authorization layer, an MLaaS4HEP server, an XRootD proxy server for enabling access to remote ROOT data, and the TensorFlow as a Service (TFaaS) service in charge of the inference phase. With this architecture the user is able to submit ML pipelines, after being authenticated and authorized, using local or remote ROOT files simply via HTTP calls.
2008
CRAB: an Application for Distributed Scientific Analysis in Grid Projects
DOI: 10.1016/j.nima.2006.09.089
2007
First performance studies of a pixel-based trigger in the CMS experiment
An important tool for the discovery of new physics at the LHC is the design of a low-level trigger with a high power of background rejection. The contribution of the pixel detector to the lowest-level trigger of CMS is studied, focusing on low-energy jet identification by matching the information from the calorimeters and the pixel detector. In addition, primary vertex algorithms are investigated. The performance is evaluated in terms of QCD rejection and of the efficiency for multi-hadronic jet final states, respectively.
DOI: 10.22323/1.327.0009
2018
Harvesting dispersed computational resources with Openstack: a Cloud infrastructure for the Computational Science community
Harvesting dispersed computational resources is nowadays an important and strategic topic, especially in an environment like computational science where computing needs constantly increase. On the other hand, managing dispersed resources might be neither an easy task nor cost effective. We successfully explored the use of the OpenStack middleware to achieve this objective; our main goal is not only resource harvesting but also providing a modern paradigm for computing and data access. In the present work we illustrate a real example of how to build a geographically distributed cloud to share and manage computing and storage resources owned by heterogeneous cooperating entities.
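Interacting with such an OpenStack-based federation from Python typically goes through the openstacksdk; the sketch below lists the compute instances of a hypothetical cloud defined in the user's clouds.yaml, just to show the flavour of the API.

```python
# Hedged sketch: list compute instances on an OpenStack cloud via openstacksdk.
# The cloud name refers to a hypothetical entry in the user's clouds.yaml.
import openstack

conn = openstack.connect(cloud="my-federated-cloud")   # placeholder cloud name

for server in conn.compute.servers():
    print(server.name, server.status)
```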
DOI: 10.1088/1755-1315/509/1/012034
2020
HUSH app: digital tools to explore the natural patrimony of urban areas
Abstract In recent years, urban trekking and geotourism have gained relevance in the tourism industry. This has been accompanied by an increasing use of digital tools and techniques for developing immersive touristic experiences. The HUSH project aim is to enhance and disseminate the scientific and naturalistic heritage of urban areas by means of a mobile application. The naturalistic and geological components with relevant scientific value are first identified and tagged as Points of Interest (POIs). Then, high quality multimedia contents are created for each of them. Using the HUSH app and implementing Augmented Reality techniques, users can explore these contents by framing the POIs with their device. They can decide which POIs they want to visit by using keyword-search, by selecting a predefined path or by means of ‘intelligent search’, which applies data mining techniques to users’ data. Moreover, they can propose the addition of new POIs by using the Scientific Reporter tool and rate POIs they visit.
DOI: 10.1016/j.nima.2005.11.124
2006
The CMS analysis chain in a distributed environment
The CMS collaboration is undertaking a big effort to define the analysis model and to develop software tools with the purpose of analysing several millions of simulated and real data events by a large number of people in many geographically distributed sites. From the computing point of view, one of the most complex issues when doing remote analysis is the data discovery and access. Some software tools were developed in order to move data, make them available to the full international community and validate them for the subsequent analysis. The batch analysis processing is performed with workload management tools developed on purpose, which are mainly responsible for the job preparation and the job submission. The job monitoring and the output management are implemented as the last part of the analysis chain. Grid tools provided by the LCG project are evaluated to gain access to the data and the resources by providing a user friendly interface to the physicists submitting the analysis jobs. An overview of the current implementation and of the interactions between the previous components of the CMS analysis system is presented in this work.
2021
Enhancing cache content management in a data lake architecture using Reinforcement Learning
DOI: 10.1051/epjconf/202125102045
2021
First experiences with a portable analysis infrastructure for LHC at INFN
The challenges proposed by the HL-LHC era are not limited to the sheer amount of data to be processed: the capability of optimizing the analyser's experience will also bring important benefits for the LHC communities, in terms of total resource needs, user satisfaction and reduction of the time to publication. At the Italian National Institute for Nuclear Physics (INFN) a portable software stack for analysis has been proposed, based on cloud-native tools and capable of providing users with a fully integrated analysis environment for the CMS experiment. The main characterizing traits of the solution consist in the user-driven design and the portability to any cloud resource provider. All this is made possible via an evolution towards a "python-based" framework that enables the usage of a set of open-source technologies largely adopted in both cloud-native and data-science environments. In addition, a "single sign-on"-like experience is available thanks to the standards-based integration of INDIGO-IAM with all the tools. The integration of compute resources is done through the customization of a JupyterHub solution, able to spawn identity-aware user instances ready to access data with no further setup actions. The integration with GPU resources is also available, designed to sustain increasingly widespread ML-based workflows. Seamless connections between the user UI and batch/big-data processing frameworks (Spark, HTCondor) are possible. Eventually, the experiment data access latency is reduced thanks to the integrated deployment of a scalable set of caches, as developed in the context of the ESCAPE project, and as such compatible with future scenarios where a data lake will be available for the research community. The outcome of the evaluation of such a solution in action is presented, showing how a real CMS analysis workflow can make use of the infrastructure to achieve its results.
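The connection between the interactive JupyterLab session and the scale-out backend can be sketched with a Dask client; the scheduler address is a placeholder, and in the setup described it would typically be provisioned by the facility (e.g. on top of HTCondor) rather than created by hand.

```python
# Sketch: offload work from an interactive session to a scale-out backend via
# dask.distributed. The scheduler address is a placeholder; in a facility like
# the one described it would be provided for the user (e.g. on top of HTCondor).
from dask.distributed import Client

client = Client("tcp://dask-scheduler.example.infn.it:8786")   # hypothetical

def partial_sum(chunk):
    return sum(chunk)

chunks = [range(i * 1000, (i + 1) * 1000) for i in range(10)]
futures = client.map(partial_sum, chunks)        # run chunks on remote workers
total = client.submit(sum, futures).result()      # reduce on the cluster
print("total =", total)
```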
DOI: 10.22323/1.378.0019
2021
Machine Learning as a Service for High Energy Physics on heterogeneous computing resources
Figure 1: The current schedule for the LHC and HL-LHC upgrades and runs; the start of the HL-LHC run is currently foreseen at the end of 2027.
DOI: 10.22323/1.378.0003
2021
Enabling HPC systems for HEP: the INFN-CINECA Experience
In this report we describe a successful integration exercise between the CINECA (PRACE Tier-0) Marconi KNL system and LHC processing. A production-level system has been deployed using a 30 Mhours grant from the 18th Call for PRACE Project Access; thanks to CINECA, more than 3x the granted hours were eventually made available. Modifications at multiple levels were needed: on the experiments' WMS layers, on site-level access policies and routing, and on virtualization. The success of the integration process paves the way to the integration of additional local systems, and in general shows how the requirements of an HPC center can coexist with the needs of data-intensive, complex distributed workflows.
DOI: 10.5281/zenodo.5526126
2021
EGI-ACE D2.3 Technical specifications for compute common services
DOI: 10.5281/zenodo.6602270
2021
EGI-ACE D2.3 Technical specifications for compute common services