RemoteOtter

Site Reliability Engineer - Remote

Posted 44 weeks ago
DevOps / Sysadmin
Full Time
USA, United Kingdom

Overview

The Site Reliability Engineer (SRE) at Fluidstack plays a crucial role in ensuring the reliability and performance of the company's GPU cloud infrastructure, collaborating with various teams to optimize systems for AI workloads.

In Short

  • Work on deploying and managing GPU clusters for AI applications.
  • Collaborate with networking, platform engineering, and data center operations.
  • Tackle complex production issues and improve system stability.
  • Participate in an on-call rotation.
  • Write clean, well-documented code.
  • Experience in deploying Kubernetes and SLURM clusters.
  • Utilize automation tools like Ansible and Terraform.
  • Strong communication skills are essential.
  • Accountability and a customer-centric mindset are key.
  • Adapt to the dynamic nature of AI workloads.

Requirements

  • 2+ years of experience in SRE, DevOps, or Sysadmin roles.
  • Proficient in Go, Python, and Bash.
  • Experience with Kubernetes and SLURM.
  • Strong engineering background in related fields.
  • Excellent verbal and written communication skills.

Benefits

  • Competitive compensation package.
  • Health, dental, and vision insurance.
  • Generous PTO policy.
  • Retirement or pension plan.
  • Remote-first work environment with access to WeWork.

FluidStack

Fluidstack is at the forefront of building infrastructure for advanced artificial intelligence, collaborating with leading AI labs, governments, and enterprises to provide high-speed computing solutions. The company is dedicated to making artificial general intelligence (AGI) a reality, driven by a team that values excellence and customer outcomes. Fluidstack's People team is committed to creating an exceptional work environment, focusing on systems and support that empower employees to tackle meaningful challenges. The organization emphasizes thoughtful leadership support, ensuring executives can concentrate on critical decisions while fostering a culture of collaboration and continuous improvement.

Similar Jobs:

Site Reliability Engineer - Remote

Panopto

41 weeks ago

Join Pano AI as a Site Reliability Engineer to enhance the reliability and performance of software systems in a dynamic startup environment.

CA, USA
Full-time
DevOps / Sysadmin

Site Reliability Engineer - Remote

Arbor Education

42 weeks ago

Join Arbor as a Site Reliability Engineer to enhance platform resilience and performance in a remote role.

Worldwide
Full-time
DevOps / Sysadmin
£55,000 - £65,000/year

Site Reliability Engineer - Remote

Arbor Education

42 weeks ago

Join Arbor as a Site Reliability Engineer and enhance platform resilience and performance.

United Kingdom
Full-time
DevOps / Sysadmin
£55,000 - £65,000/year

Site Reliability Engineer - Remote

Roadie

42 weeks ago

Roadie is seeking a Site Reliability Engineer to support the reliability and performance of their logistics platform.

USA
Full-time
DevOps / Sysadmin

Site Reliability Engineer - Remote

Weekday AI

43 weeks ago

We are seeking a skilled Site Reliability Engineer to automate operations and enhance system performance in a full-time role.

India
Full-time
DevOps / Sysadmin