Develop, deploy, and maintain the appropriate systems, services, and tooling in Datto’s production environment that provide constant feedback to stakeholders
Implement best practices promoting service availability/reliability and fault tolerance
Add appropriate monitoring to the Datto Cloud infrastructure platform that detects and alerts on issues, recommends potential remediation, and/or takes corrective action
Scale the Datto Cloud infrastructure platform and reduce human intervention as needed by automating any repetitive operational activities and measuring normal operation of the platform
Collaborate with the Product and Software Development teams to determine the core products' reliability strategy, including Service Level Objectives (SLOs) and Service Level Indicators (SLIs); ensure that service reliability best practices are a core tenet of all new software design and development
Collaborate and partner with SREs from the various software application engineering teams to ensure overall consistency and end-to-end reliability
Collect SLI metrics and establish monitoring based on SLO thresholds and other product requirements
Troubleshoot complex issues quickly and effectively; continually improve processes and reliability based on post-mortem analysis
Participate in a rotational on-call program and enhance troubleshooting techniques and utilities to ensure quick resolution of service-impacting issues
Communicate with Users, Support, and Development teams in the event of an incident
Diagnose and develop root cause solutions for failures and performance issues in our production environment
Your Experience:
Bachelor’s degree in Computer Science or equivalent experience
Experience with software development, automation, infrastructure as code, and data-driven analysis
Experience with configuration management tools such as Puppet, Ansible, and Salt
Hands-on experience with mainline programming and scripting languages such as Bash, Python, Perl, and Ruby
Familiar with standard tools and platforms that enable continuous delivery such as GitLab, Jenkins, Kubernetes, Docker, JIRA, and ServiceNow
Significant experience with virtualized and bare-metal infrastructure; KVM and OpenStack experience strongly preferred
Experience with monitoring and management in public cloud is highly desired
Strong root cause analysis and troubleshooting competency
Strong tendency to automate and monitor everything
Excellent communication skills
Ability to operate in a fast-paced environment
Self-motivated and willing to learn
Ability to work independently and as part of a team
Please apply using this link: https://www.datto.com/careers/job-board/post/2454607