Datto, the world’s leading provider of IT solutions delivered through managed service providers, is looking for a Staff Site Reliability Engineer to join a growing team. Datto is a creative company at its core and is an exciting and dynamic workplace. We're 100% focused on our managed service provider partners and believe that with the right technology, managed service providers can change how businesses around the world operate. Datto provides data protection, business continuity, networking, business management, and file backup and sync products that empower and protect the clients of our 17,000+ partners. We're headquartered in Norwalk, Connecticut and have 22 offices worldwide. Learn more at datto.com.
More than someone who checks every box, we’re looking for people who are excited to work and grow at Datto. If that's you, we hope you apply for the role!
You enjoy teamwork
You come with new ideas and a unique point of view. You look forward to collaborating with a diverse team. You eagerly seek and give help. Transparency tops your list of values, and you contribute to a culture of respect and inclusion.
Inquisitive and focused, you see every challenge as an opportunity. You would rather create the future than wait for it.
You’re customer-focused and take pride in your work.
You pay close attention to detail in everything you do. You care about the work you provide to customers and how it reflects on yourself and Datto. When you see something wrong, you work to resolve it. You look for opportunities to better not only yourself but also those around you. You aim to be the best that you can be and always do the right thing.
What You’ll Do
The Staff Site Reliability Engineer will ensure the overall reliability, uptime, health, and performance of the Datto Cloud infrastructure. You’ll work closely with various stakeholders to understand the architecture and design of the Datto Cloud so that you can quickly resolve service-impacting issues, detect and automatically remediate problems before they become service-impacting, and provide valuable data back to the individual teams to improve the long-term reliability of the platform.
Your job function and responsibilities include:
- Develop, deploy, and maintain the appropriate systems, services, and tooling in Datto’s production environment that provide constant feedback to stakeholders
- Implement best practices promoting service availability/reliability and fault tolerance
- Add appropriate monitoring to the Datto Cloud infrastructure platform that detects and alerts on problems, offers remediation recommendations, and/or takes corrective action
- Scale the Datto Cloud infrastructure platform and reduce human intervention as needed by automating any repetitive operational activities and measuring normal operation of the platform
- Collaborate with the Product and Software Development teams to determine the reliability strategy for the core products, including Service Level Objectives (SLOs) and Service Level Indicators (SLIs); ensure that service reliability best practices are a core tenet of all new software design and development
- Partner with SREs from the various software application engineering teams to ensure overall consistency and end-to-end reliability
- Collect SLI metrics and establish monitoring based on SLO thresholds and other product requirements
- Troubleshoot complex issues quickly and effectively; continually improve processes and reliability based on post-mortem analysis
- Participate in a rotational on-call program and enhance troubleshooting techniques and utilities to ensure quick resolution of service-impacting issues
- Communicate with Users, Support, and Development teams in the event of an incident
- Diagnose and develop root cause solutions for failures and performance issues in our production environment
Skills and experience we're looking for:
- Bachelor’s degree in Computer Science or equivalent experience
- Experience with software development, automation, infrastructure as code, and data-driven analysis
- Experience with configuration management tools such as Puppet, Ansible, or Salt
- Hands-on experience with mainstream programming and scripting languages such as Bash, Python, Perl, and Ruby
- Familiarity with standard tools and platforms that enable continuous delivery, such as GitLab, Jenkins, Kubernetes, Docker, Jira, and ServiceNow
- Significant experience with virtualized and bare-metal infrastructure; KVM and OpenStack experience strongly preferred
- Experience with monitoring and management in public cloud is highly desired
- Strong root cause analysis and troubleshooting competency
- Strong tendency to automate and monitor everything
- Excellent communication skills
- Ability to operate in a fast-paced environment
- Self-motivated and willing to learn
- Ability to work independently and as part of a team