RemoteOtter

HPC Engineer - Research Infrastructure - Remote

Posted 30 weeks ago
DevOps / Sysadmin
Full Time
CA, USA

Overview

Luma's mission is to build multimodal AI by pushing the boundaries of what is possible with large-scale supercomputing. We are building some of the biggest and fastest AI clusters in the world, and this role is at the very heart of that effort. This requires a deep, first-principles understanding of how hardware and software intersect to unlock maximum performance.

In Short

  • Architect & Optimize Supercomputers: Design, build, and tune systems that combine CPUs, GPUs (NVIDIA and AMD), and high-performance networking into world-class clusters.
  • Master Low-Level Performance: Dive deep into the Linux OS, device drivers, and user-space code to optimize performance at every level of the stack.
  • Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues, often collaborating directly with hardware vendors like NVIDIA.
  • Manage HPC Schedulers: Architect and manage modern HPC job management frameworks like Kubernetes, designing queue and partition setups to maximize throughput and utilization for mixed research workloads.
  • Build Automation for Scale: Write code to automate the monitoring, diagnostics, and healing of thousands of servers, enabling a massive infrastructure footprint with a small, elite team.

Requirements

  • 8+ years of experience as an Infrastructure, DevOps, or HPC engineer working on large, complex distributed systems.
  • Deep, hands-on experience managing and troubleshooting large GPU clusters from provisioning to monitoring.
  • Expert in high-performance networking, with practical experience in InfiniBand, RDMA, or RoCE.
  • Extensive knowledge of Linux systems, including performance tuning, debugging, and configuration.
  • Deep understanding of modern HPC job management systems based on Kubernetes, and familiarity with workflow orchestration frameworks like Ray or Flyte.
  • Experience architecting, building, and maintaining large-scale Kubernetes clusters from first principles.
  • Independently driven, tenacious problem-solver who can own issues end to end.

Preferred Qualifications

  • Experience at national labs, research universities, or companies known for their large-scale, on-prem supercomputing infrastructure.
  • Deep expertise with GPU tooling for NVIDIA and AMD GPUs, like DCGM or ROCm.

Luma AI

Luma AI is dedicated to advancing multimodal artificial intelligence to enhance human creativity and capabilities. The company believes that integrating various modalities is essential for developing intelligent systems that surpass traditional language models. Luma AI focuses on training and scaling multimodal foundation models that can perceive, comprehend, and interact with the world, aiming to create systems that are not only aware but also capable of effecting meaningful change. The team is committed to optimizing performance across diverse hardware platforms, ensuring that their state-of-the-art models are accessible to a wide audience at the best performance-to-cost ratio.
