Site Reliability Engineer - Remote

Posted 15 weeks ago

Overview

Voltage Park’s mission is to make AI infrastructure accessible to all. Today, we own 24,000+ H100s and operate 7+ data centers across the US. We serve customers of all sizes, from small research labs to large enterprises. As part of this effort, we’re hiring a Site Reliability Engineer to build out and operate our core infrastructure, including bare metal provisioning, telemetry, storage, and container / VM orchestration.

In Short

  • Design, build, and roll out new platforms to minimize incidents.
  • Deploy updates and improvements for internal and customer use cases.
  • Collaborate with network engineering, software development, and customer support.
  • Participate in the SRE on-call rotation.

Requirements

  • 8+ years working with Linux, preferably Ubuntu.
  • 5+ years of experience with AWS.
  • 2+ years of experience with Kubernetes.
  • 2+ years of experience with Terraform and Ansible.
  • 2+ years of experience managing network-attached storage.
  • Experience in a Slack-first, asynchronous remote work environment.
  • Experience with monitoring systems such as Prometheus and the ELK stack.
  • Familiarity with GitOps workflows.
  • Software development experience using Python, Go, or Bash.
  • Deep networking fundamentals.
  • Experience architecting and delivering complex systems.
  • Strong written and oral communication skills.

Benefits

  • Work with a small group of friendly, motivated colleagues.
  • High degree of autonomy in work.
  • Opportunity to wear multiple hats and venture outside your comfort zone.
  • A culture that values good documentation.
