RemoteOtter

Lead Cluster Operations Support Engineer - Remote

Posted 6 weeks ago
DevOps / Sysadmin
Full Time
USA
$125,330 - $208,880 USD/year

Overview

We are seeking a highly skilled Lead Cluster Operations Support Engineer with extensive experience in cloud infrastructure, Kubernetes, and GPU clusters. The ideal candidate will possess a strong background in operations, cloud architecture, and managing large-scale environments, particularly in the context of machine learning model training and high-performance computing.

In Short

  • You will help shape and iterate on this new white-glove model training support service for large GPU clusters.
  • You will work in a collaborative team with Machine Learning Engineers and Infrastructure Engineers.
  • You will contribute to accelerator development: identify gaps in tooling, missing automation, or recurring patterns for which we could build accelerators that make the next engagement faster and more efficient.
  • You will help assess model training readiness and data preparation.
  • You will provide model training support during rotating daytime weekend shifts.
  • You will facilitate collaborative problem-solving within the team.
  • You will proactively identify and address challenges related to the white-glove service for continued pre-training.

Requirements

  • Deep expertise in Kubernetes administration and debugging at scale.
  • Extensive experience managing large clusters with thousands of nodes using Kubernetes.
  • Knowledge of running training workloads on thousands of GPUs.
  • Familiarity with the Lustre filesystem is a plus.
  • Experience working with the NVIDIA NeMo Framework.
  • Proficiency with cloud platforms such as GCP, AWS, and Azure.
  • Experience with Terraform/Pulumi, Helm Charts, and Infrastructure-as-Code tools.

Benefits

  • Support for career development and learning opportunities.
  • Hybrid working model with remote work options.
  • Equal opportunity employer policies.

Referrals Only

Thoughtworks is a global technology consultancy that specializes in integrating strategy, design, and engineering to drive digital innovation. With over 30 years of experience, Thoughtworks has built a reputation for delivering impactful technology solutions to clients in various sectors, including banking and financial services. The company fosters a collaborative culture where diverse teams, including computer science graduates and seasoned technologists, work together to challenge conventional thinking and create innovative solutions. Thoughtworks is committed to supporting the career development of its employees through interactive tools and numerous development programs, making it a dynamic environment for personal and professional growth.
