RemoteOtter

Site Reliability Engineering (SRE) Manager - Remote

Posted 21 weeks ago
DevOps / Sysadmin
Full-time
USA
$180,000 - $210,000/year

Overview

RunPod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full-stack AI applications. Founded in 2022, we are a rapidly growing, well-funded company with a remote-first organization spread globally. Our mission is to empower innovators and enterprises to unlock AI's true potential, driving technology and transforming industries. Join us as we shape the future of AI.

In Short

  • Lead and mentor a team of Site Reliability Engineers.
  • Develop and implement strategic plans for infrastructure reliability and scalability.
  • Collaborate with cross-functional teams on SRE initiatives.
  • Establish SLIs, SLOs, and SLAs for critical systems.
  • Drive best practices in automation and incident response.
  • Manage large-scale bare-metal fleets across data centers.
  • Ensure robust security practices in infrastructure.
  • Manage on-call rotations and critical incident leadership.
  • Contribute to capacity planning for infrastructure growth.
  • Participate in hiring and team growth initiatives.

Requirements

  • 5+ years of experience in Site Reliability Engineering.
  • 3+ years in a technical leadership or management role.
  • Deep understanding of Linux systems and networking technologies.
  • Experience managing large-scale distributed systems.
  • Expertise in infrastructure-as-code tools.
  • Proficiency in Python or Golang.
  • Experience with cloud platforms (AWS, GCP, Azure).
  • Strong knowledge of monitoring and observability systems.
  • Excellent problem-solving skills.
  • Strong communication skills.

Benefits

  • Competitive base pay of $180,000 to $210,000 per year.
  • Stock options.
  • Flexibility of remote work.
  • Opportunity to grow with an innovative company.
  • Generous vacation policy.
  • Contribute to a company with a global impact.

RunPod

Founded in 2022, RunPod provides cloud infrastructure for full-stack AI and machine learning applications. A rapidly growing, well-funded, remote-first organization, RunPod aims to help innovators and enterprises harness the potential of AI, advancing technology and transforming industries. The company fosters a collaborative and inclusive culture that prioritizes learning and ownership, and it offers competitive compensation and benefits.

Similar Jobs:

Site Reliability Engineering (SRE) Manager - Remote

Vercel

22 weeks ago

Vercel is seeking a Site Reliability Engineering (SRE) Manager to lead their SRE team and ensure high standards of quality and reliability across engineering.

USA
Full-time
DevOps / Sysadmin
$220,000 - $330,000/year

Site Reliability Engineering Manager - Remote

Klaviyo

20 weeks ago

The Site Reliability Engineering Manager will lead a team to enhance system reliability and productivity at Klaviyo.

USA
Full-time
DevOps / Sysadmin
$188,000 - $282,000 USD

Site Reliability Engineering Manager - Remote

TextNow

24 weeks ago

Join TextNow as a Site Reliability Engineering Manager to lead a critical team and enhance system reliability and performance.

USA, CA
Full-time
DevOps / Sysadmin

Engineering Manager - Site Reliability Engineering - Remote

Customer.io

24 weeks ago

Join Customer.io as an Engineering Manager to lead the SRE squad and ensure the reliability of their products.

Worldwide
Full-time
DevOps / Sysadmin
$140,000 - $190,000/year

Manager, Site Reliability Engineering - Remote

Axon

24 weeks ago

Axon is seeking a Manager for Site Reliability Engineering to lead a team in managing large-scale cloud platforms.

Canada
Full-time
DevOps / Sysadmin