RemoteOtter

Site Reliability Engineer - Remote

Posted 4 weeks ago
DevOps / Sysadmin
Full Time
USA, United Kingdom

Overview

The Site Reliability Engineer (SRE) at FluidStack keeps the company's GPU cloud infrastructure reliable and performant, working with other teams to optimize systems for AI workloads.

In Short

  • Work on deploying and managing GPU clusters for AI applications.
  • Collaborate with networking, platform engineering, and data center operations.
  • Tackle complex production issues and improve system stability.
  • Participate in an on-call rotation.
  • Write clean, well-documented code.
  • Experience in deploying Kubernetes and SLURM clusters.
  • Utilize automation tools like Ansible and Terraform.
  • Strong communication skills are essential.
  • Accountability and a customer-centric mindset are key.
  • Adapt to the dynamic nature of AI workloads.

Requirements

  • 2+ years of experience in SRE, DevOps, or Sysadmin roles.
  • Proficient in Go, Python, and Bash.
  • Experience with Kubernetes and SLURM.
  • Strong engineering background in related fields.
  • Excellent verbal and written communication skills.

Benefits

  • Competitive compensation package.
  • Health, dental, and vision insurance.
  • Generous PTO policy.
  • Retirement or pension plan.
  • Remote-first work environment with access to WeWork.

FluidStack

FluidStack is an AI cloud company that works with leading AI firms worldwide, including Poolside, Meta, Modal, and Reka. It provides high-performance computing (HPC) as a service, keeping its GPU infrastructure at peak performance and offering close support to its customers. The company scales its operations through automation and efficient deployment of new clusters.
