RemoteOtter

Site Reliability Engineering (SRE) Manager - Remote

Posted 7 weeks ago
DevOps / Sysadmin
Full Time
USA
$180,000 - $210,000/year

Overview

RunPod is pioneering the future of AI and machine learning, offering cloud infrastructure for full-stack AI applications. Founded in 2022, we are a rapidly growing, well-funded, remote-first company with a globally distributed team. Our mission is to empower innovators and enterprises to unlock AI's true potential and transform industries. Join us as we shape the future of AI.

In Short

  • Lead and mentor a team of Site Reliability Engineers.
  • Develop and implement strategic plans for infrastructure reliability and scalability.
  • Collaborate with cross-functional teams on SRE initiatives.
  • Establish SLIs, SLOs, and SLAs for critical systems.
  • Drive best practices in automation and incident response.
  • Manage large-scale bare-metal fleets across data centers.
  • Ensure robust security practices in infrastructure.
  • Manage on-call rotations and critical incident leadership.
  • Contribute to capacity planning for infrastructure growth.
  • Participate in hiring and team growth initiatives.

Requirements

  • 5+ years of experience in Site Reliability Engineering.
  • 3+ years in a technical leadership or management role.
  • Deep understanding of Linux systems and networking technologies.
  • Experience managing large-scale distributed systems.
  • Expertise in infrastructure-as-code tools.
  • Proficiency in Python or Golang.
  • Experience with cloud platforms (AWS, GCP, Azure).
  • Strong knowledge of monitoring and observability systems.
  • Excellent problem-solving skills.
  • Strong communication skills.

Benefits

  • Competitive base pay of $180,000 to $210,000 per year.
  • Stock options.
  • Flexibility of remote work.
  • Opportunity to grow with an innovative company.
  • Generous vacation policy.
  • Contribute to a company with a global impact.

RunPod

RunPod is a pioneering platform that empowers developers to build, run, and scale AI models efficiently. With the ability to deploy AI models to 37 global data centers in just 78 seconds, RunPod has become the go-to choice for over 100,000 developers looking to enhance their applications with AI capabilities. The company is focused on creating a robust PaaS ecosystem that bridges frontend applications and cloud systems, ensuring seamless interaction and scalability. RunPod is committed to innovation, user-centric design, and fostering a diverse and inclusive workplace.

Similar Jobs:

Site Reliability Engineering (SRE) Manager - Remote

Vercel

9 weeks ago

Vercel is seeking a Site Reliability Engineering (SRE) Manager to lead their SRE team and ensure high standards of quality and reliability across engineering.

USA
Full-time
DevOps / Sysadmin
$220,000 - $330,000/year

Site Reliability Engineering Manager - Remote

Klaviyo

7 weeks ago

The Site Reliability Engineering Manager will lead a team to enhance system reliability and productivity at Klaviyo.

USA
Full-time
DevOps / Sysadmin
$188,000 - $282,000 USD

Site Reliability Engineering Manager - Remote

TextNow

10 weeks ago

Join TextNow as a Site Reliability Engineering Manager to lead a critical team and enhance system reliability and performance.

USA, CA
Full-time
DevOps / Sysadmin

Engineering Manager - Site Reliability Engineering - Remote

Customer.io

10 weeks ago

Join Customer.io as an Engineering Manager to lead the SRE squad and ensure the reliability of their products.

Worldwide
Full-time
DevOps / Sysadmin
$140,000 - $190,000/year

Manager, Site Reliability Engineering - Remote

Axon

11 weeks ago

Axon is seeking a Manager for Site Reliability Engineering to lead a team in managing large-scale cloud platforms.

Canada
Full-time
DevOps / Sysadmin