Evaluation of Power CPU architecture for deep learning

Project goal

We are investigating the performance of distributed training and inference of different deep-learning models on a cluster consisting of IBM Power8 CPUs (with NVIDIA V100 GPUs) installed at CERN. A series of deep neural networks is being developed to reproduce the initial steps in the data-processing chain of the DUNE experiment. More specifically, a combination of convolutional neural networks and graph neural networks is being designed to reduce noise (de-noising) and to select the specific portions of the data to focus on during the reconstruction step (region selection).

R&D topic
Machine learning and data analytics
Project coordinator(s)
Maria Girone and Sofia Vallecorsa
Team members
Marco Rossi
Collaborator liaison(s)
Eric Aquaronne, Oliver Bethmann

Collaborators

Project background

Neutrinos are elusive particles: they have a very low probability of interacting with other matter. In order to maximise the likelihood of detection, neutrino detectors are built as large, sensitive volumes. Such detectors produce very large data sets. Although large in size, these data sets are usually very sparse, meaning dedicated techniques are needed to process them efficiently. Deep-learning methods are being investigated by the community with great success.

Recent progress

We have developed a deep neural network architecture based on a combination of two-dimensional convolutional layers and graphs. These networks can analyse both real and simulated data from ProtoDUNE and perform the region-selection and de-noising tasks, which are usually applied to the raw detector data before any other processing is run.

Both of these methods improve on the classical approaches currently integrated in the experiment software stack. In order to reduce training time and set up hyper-parameter scans, the training process for the networks is parallelised and has been benchmarked on the IBM Minsky cluster.

In accordance with the concept of data-parallel distributed learning, we trained our models on a total of twelve GPUs, distributed over the three nodes that comprise the test Power cluster. Each GPU ingests a unique part of the physics dataset for training the model.
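The data-parallel split described above can be illustrated with a minimal sketch. It is not the project's training code: the function names, the strided partitioning scheme, the sample count and the assumption of four GPUs per node (twelve GPUs spread evenly over three nodes) are all hypothetical choices made for illustration.

```python
NODES = 3           # from the text: three nodes in the test Power cluster
GPUS_PER_NODE = 4   # assumption: twelve GPUs spread evenly over the nodes
WORLD_SIZE = NODES * GPUS_PER_NODE


def global_rank(node, local_gpu, gpus_per_node=GPUS_PER_NODE):
    """Map (node index, GPU index on that node) to a unique global rank."""
    return node * gpus_per_node + local_gpu


def shard_indices(num_samples, rank, world_size=WORLD_SIZE):
    """Return the dataset indices assigned to one GPU.

    A strided split: rank r takes samples r, r + world_size, ...
    Every sample goes to exactly one rank, so the shards are disjoint.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    num_samples = 1000  # hypothetical dataset size
    for node in range(NODES):
        for gpu in range(GPUS_PER_NODE):
            rank = global_rank(node, gpu)
            shard = shard_indices(num_samples, rank)
            print(f"node {node} gpu {gpu} rank {rank}: {len(shard)} samples")
```

In a real data-parallel setup, each GPU would hold a full copy of the model, compute gradients on its own shard, and the gradients would be averaged across all ranks before every weight update.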

Next steps

We will work to further optimise our region-selection and noise-reduction models for the DUNE data. We will test their performance on real data collected from ProtoDUNE, the prototype experiment built at CERN.

Today, high-resolution images (millions of pixels) representing DUNE data are split into a series of small crops (32x32 pixels) before being processed. A new approach based on a U-Net architecture is being investigated to overcome this limitation and process entire images in a single step, thus accelerating the whole data-processing chain.
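To give a sense of the cost of the crop-based approach, the following sketch counts and extracts non-overlapping 32x32 crops from an image. The helper names and the 1024 x 2048 image size are assumptions chosen only for illustration (the text states just "millions of pixels"), and edge crops are left unpadded for simplicity.

```python
import math

CROP_SIZE = 32  # from the text: crops are 32x32 pixels


def crop_origins(height, width, size=CROP_SIZE):
    """Top-left (row, col) corners of the non-overlapping crops
    needed to cover a height x width image."""
    rows = math.ceil(height / size)
    cols = math.ceil(width / size)
    return [(r * size, c * size) for r in range(rows) for c in range(cols)]


def extract_crop(image, origin, size=CROP_SIZE):
    """Cut one crop out of an image stored as a list of pixel rows.

    Crops at the right or bottom edge may be smaller than size x size;
    a real pipeline would typically pad them.
    """
    r0, c0 = origin
    return [row[c0:c0 + size] for row in image[r0:r0 + size]]


if __name__ == "__main__":
    height, width = 1024, 2048  # hypothetical image, about 2 million pixels
    origins = crop_origins(height, width)
    print(f"{height * width} pixels -> {len(origins)} crops")
```

With these assumed dimensions a single image yields 2048 separate crops, each of which has to be passed through the network and stitched back together. This is the overhead that processing the entire image in one step would avoid.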

Our plan is to then extend this approach to perform several other steps in the data-processing chain. Our ultimate, long-term goal is to develop a tool capable of processing the raw data from the DUNE experiment, thus making it possible to replace the entire offline reconstruction approach.

Publications

    M. Rossi, S. Vallecorsa, Deep Learning Strategies for ProtoDUNE Raw Data Denoising. Published by Springer Nature, 2022. cern.ch/go/kzj6

Presentations

    A. Hesam, Evaluating IBM POWER Architecture for Deep Learning in High-Energy Physics (23 January). Presented at CERN openlab Technical Workshop, Geneva, 2018. cern.ch/go/7BsK
    D. H. Cámpora Pérez, ML based RICH reconstruction (8 May). Presented at Computing Challenges meeting, Geneva, 2018. cern.ch/go/xwr7
    D. H. Cámpora Pérez, Millions of circles per second. RICH at LHCb at CERN (7 June). Presented as a seminar in the University of Seville, Seville, 2018.
    M. Rossi, Deep Learning strategies for ProtoDUNE raw data denoising (18 May). Presented at 25th International Conference on Computing in High-Energy and Nuclear Physics, vCHEP2021, Geneva, 2021. cern.ch/go/VK7P
    M. Rossi, Slicing with DL at ProtoDUNE-SP (29 November). Presented at 20th International Workshop on Advanced Computing and Analysis Techniques in Physics Research, ACAT2021, Daejeon, 2021. cern.ch/go/Z6jT