RemoteOtter

Principal Site Reliability Engineer (AI-first SRE) - Remote

Posted 20 hours ago
DevOps / Sysadmin
Full Time
Argentina, Brazil, Chile, Colombia, Ecuador, Mexico, Peru, Uruguay

Overview

Groupon is a marketplace where customers discover new experiences and services every day and local businesses thrive. To date, we have worked with over a million merchant partners worldwide, connecting over 16 million customers with deals across various categories. In a world often dominated by e-commerce giants, we stand out as one of the few platforms uniquely committed to helping local businesses succeed on a performance basis.

Groupon is on a radical journey to transform our business with a relentless pursuit of results. Even with thousands of employees spread across multiple continents, we still maintain a culture that inspires innovation, rewards risk-taking and celebrates success. The impact here can be immediate due to our scale and the speed of our transformation. We’re a "best of both worlds" kind of company. We’re big enough to have the resources and scale, but small enough that a single person has a surprising amount of autonomy and can make a meaningful impact.

In Short

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
  • Build AIOps-based observability and auto-remediation pipelines.
  • Apply predictive modeling to forecast failures before they impact users.
  • Lead chaos, performance, and resilience testing programs.
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
  • Mentor engineers and drive reliability standards across teams.
  • Partner with platform, data, and product teams to ensure stability aligns with business goals.
  • Support major incident response and incident review, and participate in on-call rotations.

Requirements

  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability.
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform.
  • Proficiency in Python or Go for automation and tooling.
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy).
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations.
  • Strong communication and influencing skills — data over hierarchy.

Benefits

  • The opportunity to work with cutting-edge technologies in a transformative environment.
  • Professional growth and leadership development pathways tailored to your aspirations.
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems.

Groupon

Groupon is a leading marketplace that connects customers with local businesses, offering a platform for discovering new experiences and services. With over a million merchant partners worldwide and more than 16 million customers, Groupon is dedicated to helping local businesses thrive in a competitive e-commerce landscape. The company fosters a culture of innovation and autonomy, allowing employees to make significant impacts while benefiting from the resources and scale of a large organization. Groupon is committed to transforming its business and enhancing customer experiences through a focus on performance and operational excellence.
