Senior Site Reliability Engineer, DGX Cloud - Remote

Posted Yesterday
DevOps / Sysadmin
Full Time
India

Overview

NVIDIA is seeking a Senior Site Reliability Engineer for its DGX Cloud team, responsible for maintaining high-performance DGX Cloud clusters for AI researchers and enterprise clients worldwide.

In Short

  • Build, implement and support operational aspects of large-scale Kubernetes clusters.
  • Define SLOs/SLIs and monitor error budgets.
  • Support services before they launch through system design consulting.
  • Maintain services once live by monitoring availability and system health.
  • Operate and optimize GPU workloads across major cloud providers.
  • Lead triage and root-cause analysis of high-severity incidents.
  • Participate in on-call rotation to support production services.
  • Work in a diverse, innovative environment at NVIDIA.
  • Utilize tools for monitoring, logging, and observability.
  • Engage in blameless postmortems for incident response.

Requirements

  • BS in Computer Science or related field, or equivalent experience.
  • 10+ years of experience operating production services.
  • Expert knowledge of Kubernetes administration.
  • Experience with infrastructure automation tools.
  • Proficiency in at least one high-level programming language.
  • In-depth knowledge of Linux OS and networking fundamentals.
  • Proficient in SRE principles and incident handling.
  • Experience with observability stacks using various tools.
  • Experience with GPU-accelerated clusters is a plus.
  • Ability to apply generative-AI techniques for operational efficiency.

Benefits

  • Work in a supportive and diverse environment.
  • Opportunity to drive innovation in AI and computing.
  • Engage with cutting-edge technology and projects.
  • Contribute to high-performance computing solutions.
  • Collaborate with talented professionals in the field.

NVIDIA USA

VN01 NVIDIA Vietnam Company Limited is a subsidiary of NVIDIA, a global leader in accelerated computing. The company focuses on pioneering technologies in AI and digital twins, transforming major industries and making a significant impact on society. With a commitment to innovation, NVIDIA Vietnam plays a crucial role in the manufacturing and engineering processes, ensuring high standards of manufacturability and production capabilities in a fast-paced environment. The team collaborates closely with global contract manufacturers and engineering teams to enhance production efficiency and drive continuous improvement.

Similar Jobs:

Senior Cloud Site Reliability Engineer - Remote

CI&T

3 weeks ago

Join our team as a Senior Cloud Site Reliability Engineer, focusing on cloud technologies and infrastructure automation.

Brazil
Full-time
DevOps / Sysadmin

Senior Site Reliability Engineer (Cloud) - Remote

Kentik

22 weeks ago

Kentik is seeking a Senior Site Reliability Engineer (Cloud) to enhance its cloud product lines in a fully remote role.

USA
Full-time
DevOps / Sysadmin
$159,000 - $215,000/year

Senior Cloud Site Reliability Engineer - Remote

Serve Robotics

29 weeks ago

Join Serve Robotics as a Senior Cloud Site Reliability Engineer to enhance system resiliency and availability while leading SRE practices.

Worldwide
Full-time
DevOps / Sysadmin

Senior Site Reliability Engineer - Remote

Rithum LinkedIn Board

5 days ago

Join Rithum as a Senior Site Reliability Engineer to build and maintain large-scale, fault-tolerant systems while collaborating with cross-functional teams.

Worldwide
Full-time
DevOps / Sysadmin

Senior Site Reliability Engineer - Remote

Fullsteam Personnel

7 days ago

Join Fullsteam as a Senior Site Reliability Engineer to ensure the reliability and performance of our infrastructure and applications.

USA
Full-time
DevOps / Sysadmin