Infrastructure Engineer (InfiniBand / NCCL) - Remote

Posted 3 weeks ago
DevOps / Sysadmin
Full Time
USA

Overview

We are seeking an Infrastructure Engineer with a focus on InfiniBand/NCCL to join our Infrastructure Engineering team. Our engineers design and build automation, tooling, and systems that bridge the gap between physical infrastructure and the platforms that power large-scale AI/ML and HPC workloads.

This role combines the breadth of a core infrastructure engineer with a specialty in high-performance networking and GPU communication. You’ll help keep our InfiniBand fabric and NCCL stack tuned, reliable, and efficient at scale, supporting some of the world’s largest GPU clusters.

This is a fully remote position, but candidates must be based in the continental United States. We are unable to provide visa sponsorship for this role.

In Short

  • Design, build, and maintain automation, APIs, and frameworks to manage physical infrastructure at scale.
  • Develop and extend systems for server lifecycle management.
  • Implement and tune InfiniBand networking and NCCL configurations for multi-GPU communication.
  • Collaborate with Network, Platform, and Infrastructure Operations teams to support new infrastructure rollouts.
  • Diagnose and improve performance across GPU, NVSwitch, PCIe, and InfiniBand layers.
  • Write clear design documents and technical documentation to capture best practices.

Requirements

  • 8+ years of professional experience in infrastructure engineering, HPC, or related domains.
  • Strong experience with Linux in production environments.
  • Proficiency in Python or similar languages for automation.
  • Deep understanding of InfiniBand networking (ConnectX-7 (CX7) HCAs, fabrics, partitioning, GPUDirect).
  • Familiarity with NCCL, CUDA, and GPU topology optimization.
  • Knowledge of containerization and orchestration concepts.
  • Strong written and verbal communication skills.

Who You Are

  • Enjoy collaborating with a motivated, execution-focused team.
  • Comfortable operating with autonomy while aligning to company objectives.
  • Value precision, documentation, and knowledge-sharing.
  • Excited to grow as both a domain specialist (InfiniBand/NCCL) and a generalist infrastructure engineer.

Voltage Park

Voltage Park is a pioneering company dedicated to democratizing access to machine learning infrastructure for a diverse range of clients, including large enterprises, research universities, seed-stage startups, and nonprofits. The company stands out as the only cloud provider that offers a platform showcasing all available GPUs for rent, complete with transparent, market-based pricing and long-term reserve contracts. As a rapidly growing startup in the AI infrastructure sector, Voltage Park is committed to providing seamless compute access and fostering innovation in the field of artificial intelligence.
