RemoteOtter

Lead Cluster Operations Support Engineer - Remote

Posted 2 days ago

Overview

We are seeking a highly skilled Lead Cluster Operations Support Engineer with extensive experience in cloud infrastructure, Kubernetes, and GPU clusters. The ideal candidate will possess a strong background in operations, cloud architecture, and managing large-scale environments, particularly in the context of machine learning model training and high-performance computing.

In Short

  • You will help shape and iterate on this new white-glove model training support service for large GPU clusters.
  • You will work in a collaborative team with Machine Learning Engineers and Infrastructure Engineers.
  • You will contribute to accelerator development: identify gaps in tooling, missing automation, and recurring patterns for which we could build accelerators that make the next engagement faster and more efficient.
  • You will help assess model training readiness and data preparation.
  • You will provide model training support during rotating daytime weekend shifts.
  • You will facilitate collaborative problem-solving within the team.
  • You will proactively identify and address challenges related to the white-glove service for continued pre-training.

Requirements

  • Deep expertise in Kubernetes administration and debugging at scale.
  • Extensive experience managing large Kubernetes clusters with thousands of nodes.
  • Knowledge of running training workloads on thousands of GPUs.
  • Familiarity with the Lustre filesystem is a plus.
  • Experience working with the NVIDIA NeMo Framework.
  • Proficiency with cloud platforms such as GCP, AWS, and Azure.
  • Experience with Infrastructure-as-Code tools such as Terraform or Pulumi, and with Helm charts.

Benefits

  • Support for career development and learning opportunities.
  • Hybrid working model with remote work options.
  • Equal opportunity employer policies.
