RemoteOtter

Senior Site Reliability Engineer - Control Plane - Remote

Posted 1 week ago
DevOps / Sysadmin
Full Time
CA, USA
$245,000 - $385,000/year

Overview

In 2012, Lambda started with a crew of AI engineers publishing research at top machine-learning conferences. We began as an AI company built by AI engineers. That hasn't changed. Today, we're on a mission to be the world's top AI computing platform. We equip engineers with the tools to deploy AI that is fast, secure, affordable, and built to scale. Whether they need powerhouse GPU hardware on-site or the flexibility of cloud-based solutions, we've got the horsepower to make it happen. Lambda’s AI Cloud has been adopted by the world’s leading companies and research institutions including Anyscale, Rakuten, The AI Institute, and multiple enterprises with over a trillion dollars of market capitalization. Our goal is to make computation as effortless and ubiquitous as electricity.

In Short

  • Design and implement cloud-native architectures that deliver "four nines" (99.99%) reliability while balancing performance and cost efficiency.
  • Develop comprehensive monitoring and alerting systems with actionable dashboards that provide real-time visibility into system health.
  • Implement SLIs, SLOs, and SLAs across services and maintain error budgets to guide development priorities.
  • Automate deployments using tools like Argo and Terraform.
  • Create robust incident management processes, escalation paths, and documentation.
  • Architect fault-tolerant systems with graceful degradation capabilities to handle component failures.
  • Design and implement disaster recovery solutions with regular testing procedures.
  • Lead post-incident reviews that focus on systemic improvements rather than individual blame.
  • Champion reliability best practices and system design principles.
  • Build automated, auditable, and compliant processes to improve efficiency and productivity.

Requirements

  • 5+ years of experience in Site Reliability Engineering or DevOps roles.
  • Strong understanding of cloud platforms (AWS, GCP, Azure) and their core services.
  • Experience designing and implementing monitoring and observability solutions at scale.
  • Proven track record managing production incidents and driving root cause analysis.
  • Proficiency with Infrastructure as Code tools and CI/CD pipeline implementation.
  • Strong understanding of network architecture, load balancing, and content delivery.
  • Expertise in performance tuning and system optimization techniques.
  • Experience with container orchestration platforms like Kubernetes.
  • Knowledge of database administration and optimization strategies.
  • Solid coding skills in at least one language (Python, Go, Bash) for automation.

Benefits

  • Founded in 2012, ~350 employees (2024) and growing fast.
  • We offer generous cash & equity compensation.
  • Health, dental, and vision coverage for you and your dependents.
  • Commuter and work-from-home stipends for select roles.
  • 401(k) plan with 2% company match (USA employees).
  • Flexible Paid Time Off Plan that we all actually use.

Lambda

Founded in 2012, Lambda is a rapidly growing AI computing platform that originated from a team of AI engineers dedicated to advancing machine learning. The company focuses on providing engineers with robust tools for deploying AI solutions that are fast, secure, and scalable, whether through powerful on-site GPU hardware or flexible cloud-based options. Lambda's AI Cloud is trusted by leading companies and research institutions, aiming to make computation as accessible and essential as electricity. With a commitment to innovation and high demand for its systems, Lambda offers competitive compensation, comprehensive benefits, and a collaborative work environment.
