Job Description
As the SRE Manager, you will lead our SRE team and work closely with cross-functional teams to establish and enhance our reliability engineering practices. You will be responsible for continuously improving the reliability, scalability, and efficiency of our systems, while ensuring prompt incident response and effective problem resolution. You will also play a key role in setting and achieving service level objectives (SLOs) and in driving the adoption of best practices for monitoring, alerting, and automation. This is a hands-on technical role that requires a thorough understanding of every layer of a modern web application stack: front end, back end, database, networking, and systems.
Responsibilities:
- Lead and mentor the SRE team, fostering a collaborative and high-performing culture
- Establish and refine SRE and DevOps practices, processes, and methodologies within the organization
- Drive the automation of build, release, and deployment processes
- Collaborate with development, operations, and product teams to optimize the reliability, scalability, and performance of our systems
- Define and monitor service level objectives (SLOs) to ensure the availability and performance of our services
- Implement effective incident management and problem resolution processes, ensuring minimal impact to customers
- Develop and maintain monitoring and alerting systems to proactively identify and mitigate potential issues
- Automate infrastructure provisioning and routine operational tasks to reduce manual work
- Perform post-incident reviews to identify root causes, implement preventive measures, and share lessons learned
- Stay up to date with industry trends and emerging technologies, and assess their potential impact on our SRE practices
Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- 2-6 years of experience in Site Reliability Engineering or a related role, with demonstrated experience in leading and managing teams
- Strong knowledge of SRE and DevOps principles, practices, and methodologies
- Proficiency in scripting and automation using languages such as Python or Node.js
- Experience with cloud platforms (AWS, Azure, GCP) and infrastructure-as-code (IaC) tools like Terraform
- Expertise in monitoring and observability tools (e.g., Prometheus, Datadog, New Relic, ELK stack)
- Expertise with containerization and orchestration technologies (Docker, Kubernetes)
- Familiarity with incident response and post-incident analysis processes
- Strong analytical and problem-solving skills
- Excellent communication and leadership skills