
Experienced Site Reliability Engineer (SRE) - Remote

Posted 13 weeks ago
DevOps / Sysadmin
Full Time
Worldwide

Overview

Our Company is where we transform vision into reality. It's where ideas become technologies, and cutting-edge technologies become solutions for animal care and management.

We support farmers by providing real-time, actionable information to help them manage their herds. We provide pet owners with smart devices and data that give them a better understanding of their pets’ activity and health needs, enriching relationships. We help conservationists safeguard natural environments and wildlife.

Drawing on decades of technological research and development experience across many markets, technologies, and species, along with established development environments and quality assurance procedures, we are always inventing new ways to look after the health and well-being of animals. That experience keeps us ahead of the curve, with advanced technological solutions that range from enhancing the precious bond between people and their pets to advancing animal healthcare and wildlife preservation.


We are looking for an exceptional Senior Site Reliability Engineer (SRE) to help establish and lead the technical practices of SRE within our CloudOps team. This is a hands-on role for an experienced professional who can implement SRE principles, build frameworks and tools to ensure system reliability, and mentor others in adopting these practices.

If you are passionate about operational excellence, love solving complex technical challenges, and thrive in highly collaborative environments, this is the role for you.


What You’ll Do:

Define and Build the SRE Function

·      Help define and implement SRE principles and practices.

·      Partner with development and DevOps teams to create Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical services.

·      Advocate for and implement system architectures that prioritize reliability, scalability, and fault tolerance.

Develop Automation and Resilience

·      Build automation tools to reduce toil, streamline operations, and improve reliability using Infrastructure as Code (IaC) tools like Terraform and Crossplane.

·      Implement self-healing systems, automate incident detection and response, and integrate chaos engineering practices to test system resilience.

Drive Observability and Monitoring Excellence

·      Create and maintain advanced observability systems with tools like Datadog, Prometheus, and Grafana to ensure uptime and system health.

·      Develop efficient alerting and monitoring strategies, including synthetic tests and automated anomaly detection.

·      Bring strong, proven experience with AWS services and with IaC using Terraform.

·      Analyze system logs and telemetry data to detect patterns, identify issues, and optimize system performance.

Incident Response and Problem Solving

·      Take ownership of incident response processes, ensuring swift recovery of services and conducting thorough Root Cause Analysis (RCA) for long-term improvements.

·      Document incident learnings and collaborate with teams to enhance on-call processes and system documentation.

Contribute to Continuous Improvement

·      Improve deployment pipelines (CI/CD) using tools like GitHub Actions, Azure DevOps, or Argo CD, ensuring smooth and reliable releases.

·      Continuously evaluate and refine operational processes to reduce manual effort and increase efficiency.



In Short

  • Help define and implement SRE principles and practices.
  • Partner with teams to create SLOs, SLIs, and SLAs.
  • Implement system architectures for reliability and scalability.
  • Build automation tools using IaC tools like Terraform.
  • Implement self-healing systems and automate incident response.
  • Create observability systems with Datadog, Prometheus, and Grafana.
  • Take ownership of incident response processes.
  • Improve CI/CD pipelines using GitHub Actions and Azure DevOps.
  • Continuously refine operational processes.

Requirements

  • 5+ years of hands-on experience in Site Reliability Engineering.
  • Proven expertise in AWS services and distributed architectures.
  • Experience with GitOps workflows and tools.
  • Advanced skills in automation tools like Terraform.
  • Exceptional problem-solving skills.
  • Effective communicator and collaborator.
  • Strong analytical skills in troubleshooting complex systems.
  • Familiarity with chaos engineering tools like Gremlin or LitmusChaos.

Benefits

  • Work in a collaborative environment.
  • Opportunity to lead technical practices.
  • Engage in innovative projects.
  • Contribute to animal care and management solutions.

MSD Animal Health Technology Labs

MSD Animal Health Technology Labs is a pioneering company dedicated to transforming innovative ideas into advanced technologies that enhance animal care and management. By providing farmers with actionable insights for herd management and offering pet owners smart devices to monitor their pets' health, the company enriches the human-animal bond. With decades of experience in technological research and development, MSD Animal Health is committed to advancing animal healthcare and wildlife preservation through cutting-edge solutions and quality assurance practices.
