Site Reliability Engineer — GPU Infrastructure - Remote

Posted 44 weeks ago

DevOps / Sysadmin

Contract

USA

Kubernetes

Infrastructure-as-Code

Overview

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

In Short

Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.
Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.
Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.
Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.
Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.
Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.
Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.
Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Requirements

BS/MS/PhD in CS, EE, or related field.
3+ years SRE/DevOps in production; 2+ years managing large Kubernetes fleets.
Expert‑level Kubernetes experience.
Hands‑on with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).
Experience with GPU workload schedulers such as Slurm or Kueue.
Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).
Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

Multi‑cluster / multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.
Familiarity with CI/CD tooling (GitHub Actions, BuildKit).
Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Genmo

Genmo is a research lab focused on developing open, state-of-the-art models for video generation, with the goal of advancing artificial general intelligence (AGI). The company is dedicated to pushing the boundaries of what is possible in AI and video technology, and invites talented individuals to join its mission of shaping the future of this field.
