RemoteOtter

Site Reliability Engineer - Remote

Posted Yesterday
DevOps / Sysadmin
Full Time
LATAM

Overview

In this role, you will ensure the reliability, performance, and scalability of our MarTech SaaS platform, which serves millions of users running thousands of marketing campaigns daily.

In Short

  • Monitor systems, respond to incidents, and implement automation to improve platform reliability.
  • Design, implement, and maintain comprehensive monitoring and alerting systems using tools such as Prometheus, Grafana, and Datadog.
  • Lead incident response efforts, conduct root cause analyses, and implement preventive measures.
  • Build and maintain automation tools and processes to reduce manual work and enhance system resilience.
  • Identify and implement reliability improvements across our platform.
  • Monitor system performance trends and plan for scaling needs.
  • Create and maintain runbooks, procedures, and system documentation.

Requirements

  • 3+ years of hands-on experience in site reliability engineering, DevOps, or similar roles.
  • Strong knowledge of SRE best practices including SLIs/SLOs, error budgets, and reliability engineering principles.
  • Cloud platform experience with services such as Compute Engine, Kubernetes, and Cloud SQL, plus related infrastructure components.
  • Expertise with Datadog or a similar tool for monitoring, alerting, and observability.
  • Backend development experience with Java, PHP, and/or Node.js.
  • Incident management skills including on-call experience and troubleshooting under pressure.
  • Automation mindset with experience in scripting and Infrastructure as Code principles.
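
For readers unfamiliar with the SLI/SLO and error-budget terms in the requirements above, the following is a minimal sketch of the usual request-based arithmetic. It is an illustration only, not a description of SproutLoud's tooling: the function name `error_budget`, its parameters, and the sample numbers are all hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute a request-based SLI and the error budget implied by an SLO.

    All names and inputs here are hypothetical and chosen for illustration.
    slo_target is a fraction strictly between 0 and 1, for example 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the fraction of requests that succeeded in the window.
    sli = (total_requests - failed_requests) / total_requests

    # Error budget: the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        # Share of the budget already spent; above 1.0 means the SLO is missed.
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Hypothetical window: a 99.9% SLO over one million requests with 250 failures.
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    for key, value in report.items():
        print(f"{key}: {value}")
```

Teams that work this way typically compare `budget_consumed` against the share of the window that has elapsed to decide whether to slow releases; the threshold for doing so is a policy choice rather than something the arithmetic dictates.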

Benefits

  • Remote-first culture with flexible working arrangements.
  • High-impact role in a small, collaborative team.
  • Growth opportunities as we scale our platform and expand our engineering team.
  • Competitive compensation and benefits package.
  • Learning budget for professional development and certifications.
  • Modern tech stack with opportunities to work with cutting-edge solutions.

SproutLoud Latam S.A.S

SproutLoud Latam S.A.S is a dynamic MarTech SaaS company that specializes in providing innovative marketing technology solutions to businesses. With a focus on reliability, performance, and scalability, the company serves millions of users and supports thousands of marketing campaigns daily. SproutLoud fosters a remote-first culture, promoting flexibility and collaboration within a small, high-impact team. The company is committed to professional development, offering growth opportunities and a modern tech stack to its employees.

Similar Jobs:

Site Reliability Engineer - Remote

PubNub

5 days ago

Join PubNub as a Site Reliability Engineer to support and improve real-time data streaming systems.

Poland
Contract
DevOps / Sysadmin
PLN14,000 - 20,300/month

Site Reliability Engineer - Remote

MWDN

5 days ago

Join MWDN as a Site Reliability Engineer, focusing on cybersecurity and utilizing advanced technology to protect businesses from cyber threats.

Worldwide
Full-time
DevOps / Sysadmin

Site Reliability Engineer - Remote

Everbridge

6 days ago

Join the Everbridge Federal Platform team as a Site Reliability Engineer to ensure service quality and availability.

USA
Full-time
DevOps / Sysadmin

Site Reliability Engineer - Remote

Seedify

1 week ago

Join Seedify as a Site Reliability Engineer to optimize and manage their AWS infrastructure and Kubernetes clusters.

Brazil
Full-time
DevOps / Sysadmin

Site Reliability Engineer - Remote

Pythian

1 week ago

Join Pythian as a Site Reliability Engineer to design and operate large-scale distributed systems in a remote work environment.

Worldwide
Full-time
DevOps / Sysadmin