Site Reliability Engineer — GPU Infrastructure - Remote

Posted 5 weeks ago
DevOps / Sysadmin
Contract
USA

Overview

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

In Short

  • Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.
  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.
  • Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.
  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
  • Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.
  • Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.
  • Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Requirements

  • BS/MS/PhD in CS, EE, or related field.
  • 3+ years SRE/DevOps in production; 2+ years managing large Kubernetes fleets.
  • Expert‑level Kubernetes experience.
  • Hands‑on with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).
  • Experience with GPU schedulers such as Slurm or Kueue.
  • Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).
  • Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

  • Multi‑cluster / multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.
  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
  • Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Genmo

Genmo is a research lab focused on developing open, state-of-the-art models for video generation, with the goal of advancing artificial general intelligence (AGI). The company is dedicated to pushing the boundaries of what is possible in AI and video technology, and invites talented individuals to join its mission of shaping the future of this field.
