Data analytics in the cloud

Project goal

This project evaluates solutions that combine data-engineering, machine-learning, and deep-learning tools. The solutions run on cloud resources from Oracle Cloud Infrastructure (OCI) and address a number of use cases of interest to CERN’s community. This activity will enable us to compare the performance, maturity, and stability of solutions deployed on CERN’s infrastructure with those deployed in OCI.

R&D topic: Machine learning and data analytics
Project coordinator(s): Eric Grancher and Eva Dafonte Perez
Team members: Luca Canali, Riccardo Castellotti
Collaborator liaison(s): Vincent Leocorbo, Cristobal Pedregal-Martin, David Ebert, Dmitrij Dolgušin

Project background

Big-data tools, particularly those related to data engineering and machine learning, are evolving rapidly. As these tools mature and are adopted more broadly, new opportunities are arising for extracting value from large data sets.

Recent years have seen growing interest from the physics community in machine learning and deep learning. One important activity in this area has been the development of pipelines for real-time classification of particle-collision events recorded by the detectors of the LHC experiments. Filtering events using so-called “trigger” systems is set to become increasingly complex as upgrades to the LHC increase the rate of particle collisions.

Recent progress

SWAN is a platform for performing interactive data analysis in the cloud. It was developed at CERN and integrates software, compute, and storage resources used by CERN physicists and data scientists. In 2020, we deployed a proof-of-concept version of SWAN on OCI resources.

As part of this work, we developed a custom Kubernetes deployment on OCI in order to take advantage of GPU resources. This demonstrated that it is possible to run interactive analytics and machine-learning workflows in OCI while accessing datasets from CERN’s storage systems.

We also performed a distributed machine-learning training exercise, training a recurrent neural network on over 250 GB of data. For this, we used 500 CPU cores and 10 GPUs, managed with Kubernetes on OCI. This exercise showed us that public clouds are particularly convenient for use cases that need a large number of resources for a short amount of time.

Next steps

In 2021, the project’s focus will include integrating CERN’s analytics platform with OCI, enabling users to run their workloads on remote cloud resources through a common interface.

Over the longer term, we plan to add new features to the analytics platform, with a focus on improving the full development lifecycle of machine-learning use cases.

Publications

    M. Bień, Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure. Zenodo, 2019. cern.ch/go/lhH9
    M. Migliorini, R. Castellotti, L. Canali, M. Zanetti, Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics. arXiv:1909.10389 [cs.DC], 2019. cern.ch/go/8CpQ
    T. Nguyen et al., Topology classification with deep learning to improve real-time event selection at the LHC, 2018. cern.ch/go/8trZ
    M. Migliorini, R. Castellotti, L. Canali, M. Zanetti, Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics. Published in Computing and Software for Big Science 4, 2020. cern.ch/go/Z98M
    R. Castellotti, L. Canali, Distributed Deep Learning for Physics with TensorFlow and Kubernetes. Databases at CERN blog, 2020. cern.ch/go/8Tpl
    M. Migliorini, R. Castellotti, L. Canali, M. Zanetti, Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics. Published in Computing and Software for Big Science, 2020. cern.ch/go/S8wV

Presentations

    L. Canali, “Big Data In HEP” - Physics Data Analysis, Machine learning and Data Reduction at Scale with Apache Spark (24 September). Presented at IXPUG 2019 Annual Conference, Geneva, 2019. cern.ch/go/6pr6
    L. Canali, Deep Learning Pipelines for High Energy Physics using Apache Spark with Distributed Keras on Analytics Zoo (16 October). Presented at Spark Summit Europe, Amsterdam, 2019. cern.ch/go/xp77
    R. Castellotti, L. Canali, P. Kothuri, SWAN: Powering CERN’s Data Analytics and Machine Learning Use cases (22 October). Presented at 4th Inter-experiment Machine Learning Workshop, CERN, 2020. cern.ch/go/9XPw