Designing and operating distributed data infrastructures and computing centres poses challenges in areas such as networking, architecture, storage, databases, and cloud. These challenges are amplified, and new ones arise, when operating at the extremely large scales required by major scientific endeavours. CERN is evaluating different models for increasing computing and data-storage capacity, in order to accommodate the growing needs of the LHC experiments over the next decade. All models present different technological challenges. In addition to increasing the on-premises capacity of the systems used for traditional types of data processing and storage, explorations are being carried out into a number of complementary distributed architectures and specialised capabilities offered by cloud and HPC infrastructures. These will add heterogeneity and flexibility to the data centres, and should enable advances in resource optimisation.


Project goal

The aim of this project is to demonstrate the scalability and performance of Kubernetes and Google Cloud, validating this set-up for future computing models. The focus is on taking both existing and new high-energy physics (HEP) use cases and exploring the best suited and most cost-effective set-up for each of them.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Ricardo Manuel Brito da Rocha
Collaborator liaison(s)
Grazia Frontoso, Karan Bhatia, Kevin Kissell

Collaborators

Project background

Looking towards the next-generation tools and infrastructure that will serve HEP use cases, we see that exploring external cloud resources opens up a wide range of new possibilities for improving workload performance. It can also help us to improve efficiency in a cost-effective way.

The project relies on well-established APIs and tools supported by most public cloud providers – particularly Kubernetes and other Cloud Native Computing Foundation (CNCF) projects in its ecosystem – to expand the available on-premises resources to Google Cloud. Special focus is placed on use cases with spiky usage patterns, as well as those that can benefit from scaling out to large numbers of GPUs and other dedicated accelerators, such as TPUs.
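
As an illustration of how such bursting can be driven programmatically, the sketch below submits a GPU-accelerated batch job through the Kubernetes API, assuming a cluster whose GPU node pools can scale out on Google Cloud. The job name, container image, and namespace are hypothetical, not the values used in production.

```python
# Minimal sketch: submitting a GPU-accelerated batch job to a Kubernetes
# cluster whose node pools can scale out on Google Cloud. All names
# (job name, image, namespace) are hypothetical.
from kubernetes import client, config

def submit_gpu_job(name="hep-ml-train", image="gcr.io/example/hep-ml:latest",
                   gpus=1, namespace="default"):
    config.load_kube_config()  # reads the local kubeconfig (e.g. for a GKE cluster)

    container = client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(
            # Requesting NVIDIA GPUs; the cluster autoscaler can add GPU nodes on demand.
            limits={"nvidia.com/gpu": str(gpus)},
        ),
    )
    pod_spec = client.V1PodSpec(containers=[container], restart_policy="Never")
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=2,
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)

if __name__ == "__main__":
    submit_gpu_job(gpus=4)
```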

Both traditional physics analysis and newer computing models based on machine learning are considered by this project.

Recent progress

The year started with consolidation of the previous year's work. New runs of more traditional physics analyses helped to validate Google Cloud as a viable and performant solution for scaling our HEP workloads. A major milestone for this work was the publication of a CERN story on the Google Cloud Platform website (see publications).

However, the main focus in 2020 was on evaluating next-generation workloads at scale. We targeted machine learning in particular, as this places significant demands on GPUs and other types of accelerators, such as TPUs.

In addition to demonstrating that HEP machine-learning workloads can scale out linearly to hundreds of GPUs in parallel, we also demonstrated that public cloud resources have the potential to offer HEP users a way to speed up their workloads by at least an order of magnitude in a cost-effective way.

Next steps

Work is now continuing in the following areas:

  • Further expanding the existing setup to burst on-premises services to public cloud resources, building on cloud-native technologies.
  • Onboarding more use cases where on-demand resource expansion is beneficial, with a special focus on workloads requiring accelerators (such as those for machine learning).
  • Expanding the cost analysis for each of these workloads and improving the feedback to end users on the efficiency of their workloads.

Publications

    R. Rocha, L. Heinrich, Higgs-demo. Published on GitHub. 2019. cern.ch/go/T8QQ
    R. Rocha, Helping researchers at CERN to analyse powerful data and uncover the secrets of our universe. Published on the Google Cloud Platform website, 2020. cern.ch/go/Q7Tn

Presentations

    R. Rocha, L. Heinrich, Reperforming a Nobel Prize Discovery on Kubernetes (21 May). Presented at KubeCon Europe 2019, Barcelona, 2019. cern.ch/go/PlC8
    R. Rocha, L. Heinrich, Higgs Analysis on Kubernetes using GCP (19 September). Presented at Google Cloud Summit, Munich, 2019. cern.ch/go/Dj8f
    R. Rocha, L. Heinrich, Reperforming a Nobel Prize Discovery on Kubernetes (7 November). Presented at the 4th International Conference on Computing in High-Energy and Nuclear Physics (CHEP), Adelaide, 2019. cern.ch/go/6Htg
    R. Rocha, L. Heinrich, Deep Dive into the Kubecon Higgs Analysis Demo (5 July). Presented at CERN IT Technical Forum, Geneva, 2019. cern.ch/go/6zls

Oracle WebLogic on Kubernetes

Project goal

The aim of this project is to improve the deployment of Oracle WebLogic infrastructure at large scale, profiting from new technologies such as Kubernetes and Docker containers. These technologies help to make the infrastructure deployment process portable, repeatable, and faster, enabling CERN service managers to be more efficient in their daily work.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Antonio Nappi
Team members
Lukas Gedvilas
Collaborator liaison(s)
Monica Riccelli, Will Lyons, Maciej Gruszka, Cris Pedregal, David Ebert, Dmitrij Dolgušin

Collaborators

Project background

The Oracle WebLogic service has been active at CERN for many years, offering a very stable way to run applications that are core to the laboratory. However, we would like to reduce the amount of time we spend on maintenance tasks and on creating new environments for our users. We therefore started to explore solutions that could help us to improve how we deploy Oracle WebLogic. Kubernetes has now made our deployment much faster, reducing the time spent on operational tasks and enabling us to focus more on developers’ needs.

Recent progress

In 2020, we focused on building up our monitoring and logging systems; these have dramatically improved the way we work.

Using Prometheus for monitoring, we were able to obtain much more information on both the infrastructure itself and the application layer. We can now easily see resource usage in Kubernetes, as well as how containers are behaving. Thanks to this, it will be much easier to build an efficient alerting system.
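
As a minimal illustration of the kind of information now available, the sketch below queries a Prometheus server for per-pod memory usage over its standard HTTP API; the server endpoint and namespace are assumptions rather than our production values.

```python
# Minimal sketch: querying a Prometheus server for per-pod memory usage
# in a WebLogic namespace. The endpoint URL and namespace are hypothetical.
import requests

PROMETHEUS_URL = "http://prometheus.example.cern.ch:9090"  # assumed endpoint

def container_memory_usage(namespace="weblogic"):
    query = f'sum by (pod) (container_memory_working_set_bytes{{namespace="{namespace}"}})'
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        pod = result["metric"].get("pod", "<unknown>")
        bytes_used = float(result["value"][1])
        print(f"{pod}: {bytes_used / 1024**2:.1f} MiB")

if __name__ == "__main__":
    container_memory_usage()
```

The same queries can later be reused as alerting rules, which is why exposing resource usage through Prometheus makes building the alert system easier.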

We decided to introduce Fluentd as our logging component. This helped us to reduce the number of hosts writing to Elasticsearch and to standardise the logs produced by our systems. Meanwhile, we also managed to migrate more applications to Kubernetes: 70-80% of our current production workload is now running on Kubernetes.
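
The sketch below illustrates the idea: applications emit standardised, structured events to a local Fluentd forwarder (which then ships them to Elasticsearch), so that individual hosts never write to Elasticsearch directly. The tag and field names are hypothetical.

```python
# Minimal sketch: emitting a standardised, structured log event to a local
# Fluentd forwarder (default forward port 24224), which then ships it on to
# Elasticsearch. Tag and field names are hypothetical.
from fluent import sender, event

# One forwarder per node aggregates events, so application hosts
# do not write to Elasticsearch directly.
sender.setup("weblogic.access", host="localhost", port=24224)

event.Event("request", {
    "application": "example-app",      # assumed field naming convention
    "level": "INFO",
    "message": "GET /index completed",
    "duration_ms": 42,
})
```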


Next steps

We aim to complete the migration of production applications by mid-2021. We will also focus on the integration of Prometheus into our new alert system. In addition, we would like to start investigating ArgoCD and Flux2 as solutions for making application and infrastructure deployments more oriented towards GitOps.

Publications

    A. Nappi. HAProxy High Availability Setup. Databases at CERN blog. 2017. cern.ch/go/9vPf
    A. Nappi. HAProxy Canary Deployment. Databases at CERN blog. 2017. cern.ch/go/89ff

Presentations

    A. Nappi, WebLogic on Kubernetes at CERN (16 May). Presented at WebLogic Server Summit, Rome, 2019.
    A. Nappi, One Tool to Rule Them All: How CERN Runs Application Servers on Kubernetes (16 September). Presented at Oracle Code One 2019, San Francisco, 2019. cern.ch/go/DbG9
    D. Ebert (Oracle), M. Martin, A. Nappi, Advancing research with Oracle Cloud (18 September). Presented at Oracle OpenWorld 2019, San Francisco, 2019. cern.ch/go/LH6Z
    E. Screven, A. Nappi, Cloud Platform and Middleware Strategy and Roadmap (17 September). Presented at Oracle OpenWorld 2019, San Francisco, 2019. cern.ch/go/d8PC
    M. Riccelli, A. Nappi, Kubernetes: The Glue Between Oracle Cloud and CERN Private Cloud (17 September). Presented at Oracle OpenWorld 2019, San Francisco, 2019. cern.ch/go/Bp8w
    A. Nappi, L. Rodriguez Fernández, WebLogic on Kubernetes (17 January). Presented at CERN openlab meeting with Oracle in Geneva, Geneva, 2017. cern.ch/go/6Z8R
    S. A. Monsalve, Development of WebLogic 12c Management Tools (15 August). Presented at CERN openlab summer students’ lightning talks, Geneva, 2017. cern.ch/go/V8pM
    A. Nappi, L. Rodriguez Fernández, WebLogic on Kubernetes (16 August). Presented at Oracle Workshop Bristol, Bristol, 2017. cern.ch/go/6Z8R
    A. Nappi, WebLogic on Kubernetes (21 September). Presented at CERN openlab Open Day, Geneva, 2017. cern.ch/go/6Z8R
    A. Nappi, L. Rodriguez Fernández, Oracle Weblogic on Containers: Beyond the frontiers of your Data Centre (21 September). Presented at CERN openlab Open Day, Geneva, 2017. cern.ch/go/nrh8
    A. Nappi, L. Gedvilas, L. Rodríguez Fernández, A. Wiecek, B. Aparicio Cotarelo (9-13 July). Presented at 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP), Sofia, Bulgaria, 2018. cern.ch/go/dW8J
    L. Rodriguez Fernandez, A. Nappi, Weblogic on Kubernetes (11 January). Presented at CERN Openlab Technical Workshop, Geneva, 2018. cern.ch/go/6Z8R
    B. Cotarelo, Oracle Weblogic on Kubernetes (July). Presented at 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP), Sofia, 2018. cern.ch/go/6MVQ
    M. Riccelli, D. Cabelus, A. Nappi, Running a Modern Java EE Server in Containers Inside Kubernetes (23 October). Presented at Oracle OpenWorld 2018, San Francisco, 2018. cern.ch/go/b6nl
    W. Coekaerts, A. Nappi, Cloud Platform and Middleware Strategy and Roadmap (15 January). Presented at Oracle OpenWorld Middle East, Dubai, 2020. cern.ch/go/9kqK
    M. Gruszka, W. Lyons, A. Nappi, Deploying Oracle WebLogic Server on Kubernetes and Oracle Cloud (15 January). Presented at Oracle OpenWorld Middle East, Dubai, 2020. cern.ch/go/6FVF
    W. Coekaerts, M. McMaster, R. Hussain, A. Nappi, Cloud Platform and Middleware Strategy and Roadmap (13 February). Presented at Oracle OpenWorld, London, 2020. cern.ch/go/f8FZ
    M. Gruszka, W. Lyons, A. Nappi, Deploying Oracle WebLogic Server on Kubernetes and Oracle Cloud (13 February). Presented at Oracle OpenWorld, London, 2020. cern.ch/go/6SLl

Hybrid disaster-recovery solution using public cloud

Project goal

In 2020, the Database Services group in the CERN IT department launched this project, in collaboration with Oracle, to explore how integration of commercial cloud platforms with CERN on-premises systems might improve the resilience of services. The aim of this project is to understand the benefits, limitations and cost of the cloud environment for application servers and databases.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Viktor Kozlovzky
Team members
Aimilios Tsouvelekakis, Alina Andreea Grigore, Andrei Dumitru, Antonio Nappi, Arash Khodabandeh, Artur Wiecek, Borja Aparicio Cotarelo, Edoardo Martelli, Ignacio Coterillo Coz, Sebastian Lopienski
Collaborator liaison(s)
Cris Pedregal, Alexandre Reigada, David Ebert, Vincent Leocorbo, Dmitrij Dolgušin

Collaborators

Project background

Today, high availability is a key requirement for most platforms and services. The Database Services group maintains critical services for CERN that are used daily by the majority of users. This project enables us to review the current systems, assess their scalability, and explore further capabilities by integrating them with new technologies.

Recent progress

For the integration exercise, the group is using the Oracle Cloud Infrastructure (OCI).

The project team successfully replicated its process for creating virtual machines on OCI. This involved registering public-cloud virtual machines on the CERN main network, as well as integrating them with CERN’s central configuration management system.
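
A minimal sketch of this kind of automated VM creation, using the OCI Python SDK, is shown below; all OCIDs, the shape, and the display name are placeholders rather than the values used in production.

```python
# Minimal sketch: creating a virtual machine on OCI with the Python SDK,
# mirroring the automated VM-creation procedure described above.
# All OCIDs, the shape and the display name are placeholders.
import oci

config = oci.config.from_file()  # reads ~/.oci/config
compute = oci.core.ComputeClient(config)

details = oci.core.models.LaunchInstanceDetails(
    availability_domain="AD-1",                       # placeholder
    compartment_id="ocid1.compartment.oc1..example",  # placeholder OCID
    shape="VM.Standard2.4",                           # placeholder shape
    display_name="cern-db-node-01",                   # hypothetical name
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="ocid1.subnet.oc1..example",        # subnet reachable from the CERN network
    ),
    source_details=oci.core.models.InstanceSourceViaImageDetails(
        image_id="ocid1.image.oc1..example",          # placeholder image OCID
    ),
)

instance = compute.launch_instance(details).data
print("Launched instance:", instance.id)
```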

The ability to deploy machines on OCI within the CERN network enabled us to run Oracle databases on OCI just like on-premises Oracle databases. Moreover, we automated the procedure for creating standby databases for our on-premises primary databases, and we configured a Data Guard broker for data synchronisation between them. We performed tests with different data sets to evaluate performance on the network and storage sides.
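
As an illustration of how the synchronisation of such a standby can be checked programmatically, the sketch below queries the standard Oracle views v$database and v$dataguard_stats using the python-oracledb driver; the connection details are placeholders.

```python
# Minimal sketch: checking the role and apply lag of a standby database kept
# in sync by Data Guard. Connection details are placeholders.
import oracledb

def standby_status(dsn="standby-host.example.cern.ch/service",
                   user="monitor", password="secret"):
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT database_role, open_mode FROM v$database")
            role, open_mode = cur.fetchone()
            print(f"role={role}, open_mode={open_mode}")

            # Transport and apply lag as reported by Data Guard.
            cur.execute("SELECT name, value FROM v$dataguard_stats "
                        "WHERE name IN ('transport lag', 'apply lag')")
            for name, value in cur.fetchall():
                print(f"{name}: {value}")

if __name__ == "__main__":
    standby_status()
```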

The team maintaining Kubernetes applications investigated Oracle Container Engine for Kubernetes (OKE), Oracle’s managed Kubernetes platform. We extended the application deployment process, which is now capable of deploying applications to OKE clusters running on OCI.

Furthermore, using an Oracle REST Data Services (ORDS) application, we performed a complex integration test, with a proxy server used to forward network traffic. The tests revealed vulnerabilities and integration limitations for the applications.

Next steps

We will continue to evaluate storage and network performance for databases and determine the best fit for our use cases. We will also investigate the cost of running the databases on OCI compared to running them in CERN’s data centre.

The members of the project would like to thank the support teams for Oracle Cloud Infrastructure and Oracle Terraform for their valuable assistance.

EOS productisation

Project goal

This project is focused on the evolution of CERN’s EOS large-scale storage system. The goal is to simplify the usage, installation, and maintenance of the system. In addition, the project aims to add native support for new client platforms, expand documentation, and implement new features and integration with other software packages.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Luca Mascetti
Team members
Fabio Luchetti, Elvin Sindrilaru
Collaborator liaison(s)
Gregor Molan, Branko Blagojević, Ivan Arizanović, Svetlana Milenković

Collaborators

Project background

Within the CERN IT department, a dedicated group is responsible for the operation and development of the storage infrastructure. This infrastructure is used to store the physics data generated by the experiments at CERN, as well as the files of all members of personnel.

EOS is a disk-based, low-latency storage service developed at CERN. It is tailored to handle large data rates from the experiments, while also running concurrent complex production workloads. This high-performance system now provides more than 350 petabytes of raw disk capacity.

EOS is also the key storage component behind CERNBox, CERN’s cloud-storage service. This makes it possible to sync and share files on all major mobile and desktop platforms (Linux, Windows, macOS, Android, iOS), with the aim of providing offline availability to any data stored in the EOS infrastructure.
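
EOS is typically accessed over the XRootD protocol; as a minimal illustration, the sketch below lists a directory on an EOS instance using the XRootD Python bindings, with the endpoint and path given as placeholders.

```python
# Minimal sketch: listing a directory on an EOS instance over the XRootD
# protocol using the XRootD Python bindings. Endpoint and path are placeholders.
from XRootD import client
from XRootD.client.flags import DirListFlags

fs = client.FileSystem("root://eos-instance.example.cern.ch")
status, listing = fs.dirlist("/eos/example/user/j/jdoe", DirListFlags.STAT)

if not status.ok:
    raise RuntimeError(status.message)

for entry in listing:
    print(entry.name, entry.statinfo.size)
```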

Recent progress

In 2020, Comtrade’s team continued to improve and update the EOS technical documentation. Due to the COVID-19 pandemic and the travel restrictions during the year, the hosting of dedicated hardware resources at CERN to support the prototyping of an EOS-based appliance was postponed.

The EOS-Comtrade team therefore focused its efforts on developing a native way for the Windows environment to interact with the distributed storage system. Currently, Windows users access EOS only via dedicated gateways, where the system is exposed through Samba shares.

In order to provide more performant and seamless access to EOS for Windows users, Comtrade started developing a dedicated Windows client, codenamed EOS-wnc.

This new, dedicated, command-line Windows client was developed against the EOS server APIs. It now covers all EOS commands, from the standard user listing and copy commands to the administrative ones.

On top of this, some EOS commands gain additional functionality on Windows, such as improved tab-completion and history features.

Next steps

Together, we will focus on improving the current EOS Windows native client, adding new functionality, and integrating it into the Windows user interface by developing a dedicated driver.

It is still planned to host dedicated hardware resources at CERN to support prototyping of an EOS-based appliance. This will enable Comtrade to create a first version of a full storage solution and to offer it to potential customers in the future.

Publications

    X. Espinal, M. Lamanna, From Physics to industry: EOS outside HEP. Published in Journal of Physics: Conference Series, Vol. 898, 2017. cern.ch/go/7XWH

Presentations

    L. Mascetti, Comtrade EOS productization (23 January). Presented at CERN openlab technical workshop, Geneva, 2019. cern.ch/go/W6SQ
    G. Molan, EOS Documentation and Tesla Data Box (4 February). Presented at CERN EOS workshop, Geneva, 2019. cern.ch/go/9QbM
    L. Mascetti, EOS Comtrade project (23 January). Presented at CERN openlab Technical workshop, Geneva, 2020. cern.ch/go/l9gc
    L. Mascetti, CERN Disk Storage Services (3 February 2020). Presented at CERN EOS workshop, Geneva, 2020. cern.ch/go/pF97
    G. Molan, Preparing EOS for Enterprise Users (27 January 2020). Presented at Cloud Storage Services for Synchronization and Sharing (CS3), Copenhagen, 2020. cern.ch/go/tQ7d
    G. Molan, EOS Documentation for Enterprise Users (3 February 2020). Presented at CERN EOS workshop, Geneva, 2020. cern.ch/go/swX8
    G. Molan, EOS Windows Native Client (3 February 2020). Presented at CERN EOS workshop, Geneva, 2020. cern.ch/go/P7DX
    G. Molan, EOS Storage Appliance Prototype (5 February 2020). Presented at CERN EOS workshop, Geneva, 2020. cern.ch/go/q8qh
    G. Molan, EOS-wnc demo (30 October 2020). Presented at the IT-ST-PDS – Comtrade workshop, Geneva, 2020. cern.ch/go/q8qh
    L. Mascetti, EOS Open Storage for Science (7 December). Presented at Expo, Dubai, 2021.

Oracle Management Cloud

Project goal

We are testing Oracle Management Cloud (OMC) and providing feedback to Oracle, including proposals for the evolution of the platform. We are assessing the merits and suitability of this technology for applications related to databases at CERN, comparing it with our current on-premises infrastructure.

R&D topic
Data-centre technologies and infrastructures
Project coordinator(s)
Eva Dafonte Perez, Eric Grancher
Team members
Aimilios Tsouvelekakis
Collaborator liaison(s)
Simone Indelicato, Vincent Leocorbo, Cristobal Pedregal-Martin, David Ebert, Dmitrij Dolgušin

Collaborators

Project background

The group responsible for database services within CERN’s IT department uses and provides specialised monitoring solutions to teams across the laboratory that use database infrastructure. Since the beginning of 2018, we have had an agreement in place with Oracle to test OMC, which offers a wide variety of monitoring solutions.

At CERN, as at other large organisations, it is very important to be able to monitor at all times what is happening with systems and applications running both locally and in the cloud. Thus, we conducted tests of OMC involving hosts, application servers, databases, and Java applications.

Recent progress

Improvements proposed to Oracle during the previous year were implemented within new releases of the platform in 2019. Initial investigation shows that the platform has been enhanced with features covering most of our needs.

Furthermore, we deployed the registration application for the CERN Open Days in Oracle Cloud Infrastructure (see project ‘Developing a ticket reservation system for the CERN Open Days 2019’ for further details). The application made use of the Oracle Autonomous Transaction Processing (ATP) database to store visitor information. The behaviour of the ATP database was monitored using OMC, providing meaningful insights into the stresses put on the database and the database-hosting system during the period in which registration for the event was open.

Next steps

The next step is to use OMC as the monitoring platform for all the projects that are to be deployed on Oracle Cloud Infrastructure.


Presentations

    A. Tsouvelekakis, Oracle Management Cloud: A unified monitoring platform (23 January). Presented at CERN openlab Technical Workshop, Geneva, 2019. cern.ch/go/Z7j9
    A. Tsouvelekakis, Enterprise Manager and Management Cloud CAB (April). Presented at Oracle Customer Advisory Board, Redwood Shores, 2019. cern.ch/go/tZD8
    A. Tsouvelekakis, CERN: Monitoring Infrastructure with Oracle Management Cloud (September). Presented at Oracle OpenWorld 2019, San Francisco, 2019. cern.ch/go/mMd8