RemoteOtter

HPC Engineer - Research Infrastructure - Remote

Posted 36 weeks ago
DevOps / Sysadmin
Full Time
CA, USA

Overview

Luma's mission is to build multimodal AI by pushing the boundaries of what is possible with large-scale supercomputing. We are building some of the biggest and fastest AI clusters in the world, and this role is at the very heart of that effort. This requires a deep, first-principles understanding of how hardware and software intersect to unlock maximum performance.

In Short

  • Architect & Optimize Supercomputers: Design, build, and tune systems that combine CPUs, GPUs (NVIDIA and AMD), and high-performance networking into world-class clusters.
  • Master Low-Level Performance: Dive deep into the Linux OS, device drivers, and user-space code to optimize performance at every level of the stack.
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.
  • Manage HPC Schedulers: Architect and manage modern HPC job management frameworks like Kubernetes, designing queue and partition setups to maximize throughput and utilization for mixed research workloads.
  • Build Automation for Scale: Write code to automate the monitoring, diagnostics, and healing of thousands of servers, enabling a massive infrastructure footprint with a small, elite team.

Requirements

  • 8+ years of experience as an Infrastructure, DevOps, or HPC engineer working on large, complex distributed systems.
  • Deep, hands-on experience managing and troubleshooting large GPU clusters from provisioning to monitoring.
  • Expert in high-performance networking, with practical experience in InfiniBand, RDMA, or RoCE.
  • Extensive knowledge of Linux systems, including performance tuning, debugging, and configuration.
  • Deep understanding of modern HPC job management systems based on Kubernetes, and familiarity with workflow orchestration frameworks like Ray or Flyte.
  • Experience architecting, building, and maintaining large-scale Kubernetes clusters from first principles.
  • Independently driven, tenacious problem-solver who can own issues from end to end.

Preferred Qualifications

  • Experience at national labs, research universities, or companies known for their large-scale, on-prem supercomputing infrastructure.
  • Deep expertise with GPU tooling for NVIDIA and AMD GPUs, like DCGM or ROCm.

Luma AI

Luma AI is dedicated to advancing the field of artificial intelligence through the development of multimodal systems that enhance human creativity and capabilities. The company believes that integrating various forms of data, particularly visual information, is essential for creating more intelligent and interactive AI systems. Luma AI focuses on training and scaling multimodal foundation models that can perceive, understand, and engage with the world, aiming to deliver high-performance AI solutions across diverse hardware platforms.
