Driven by the need to support the world's largest particle collider, the CERN IT department decided in 2012 to radically change its approach and build an "Agile Infrastructure" centered around an OpenStack-based private cloud. Since then, the CERN cloud has grown to ~300k cores and supports not only the physics programme, but also the majority of administrative and support services.
In this 5-year perspective, we will review some of our operational war stories. Concepts to simplify day-to-day operations, such as automating or outsourcing tasks via a job scheduler/orchestrator and introducing staged rollouts to mitigate deployment risks, will be presented alongside experiences from cloud-wide campaigns, such as the handling of security vulnerabilities, the mass migration of guests due to hardware retirements, and the elimination of a physical/virtual performance gap. The solutions to puzzling issues, such as intermittent VM shutdowns or data loss on reboots, will also be unveiled.
Attendees should expect to:
- get a status overview of the current architecture of the CERN OpenStack deployment;
- learn the techniques and tools that we use for daily operations and that allowed the service to scale;
- understand the way we organise cloud-wide campaigns that affect several thousand users (illustrated by concrete examples, such as the roll-out of security patches and a corresponding complete infrastructure restart);
- have some fun with "exotic" problems we encountered (such as being haunted by mysterious VM shutdowns or unexpected complete data loss on Cinder volumes upon instance reboot)!