Experienced Site Reliability Engineer (SRE) - Remote

Infrastructure as Code (IaC)

AWS Services

Terraform

Overview

Our Company is where we transform vision into reality. It's where ideas become technologies, and cutting-edge technologies become solutions for animal care and management.

We support farmers by providing real-time, actionable information to help them manage their herds. We provide pet owners with smart devices and data that give them a better understanding of their pets’ activity and health needs, enriching relationships. We help conservationists safeguard natural environments and wildlife.

Drawing on decades of technological research and development experience across many markets, technologies, and species, along with established development environments and quality assurance procedures, we're always inventing new ways to look after the health and well-being of animals. That experience keeps us ahead of the curve, from enhancing the precious bond between people and their pets to advancing animal healthcare and wildlife preservation.

We are looking for an exceptional Senior Site Reliability Engineer (SRE) to help establish and lead the technical practices of SRE within our CloudOps team. This is a hands-on role for an experienced professional who can implement SRE principles, build frameworks and tools to ensure system reliability, and mentor others in adopting these practices.

If you are passionate about operational excellence, love solving complex technical challenges, and thrive in highly collaborative environments, this is the role for you.

What You’ll Do:

Define and Build the SRE Function

· Help define and implement SRE principles and practices.

· Partner with development and DevOps teams to create Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical services.

· Advocate for and implement system architectures that prioritize reliability, scalability, and fault tolerance.

Develop Automation and Resilience

· Build automation tools to reduce toil, streamline operations, and improve reliability using Infrastructure as Code (IaC) tools like Terraform and Crossplane.

· Implement self-healing systems, automate incident detection and response, and integrate chaos engineering practices to test system resilience.

Drive Observability and Monitoring Excellence

· Create and maintain advanced observability systems with tools like Datadog, Prometheus, and Grafana to ensure uptime and system health.

· Develop efficient alerting and monitoring strategies, including synthetic tests and automated anomaly detection.

· Apply strong, proven experience with AWS services and with IaC using Terraform.

· Analyze system logs and telemetry data to detect patterns, identify issues, and optimize system performance.

Incident Response and Problem Solving

· Take ownership of incident response processes, ensuring swift recovery of services and conducting thorough Root Cause Analysis (RCA) for long-term improvements.

· Document incident learnings and collaborate with teams to enhance on-call processes and system documentation.

Contribute to Continuous Improvement

· Improve deployment pipelines (CI/CD) using tools like GitHub Actions, Azure DevOps, or Argo CD, ensuring smooth and reliable releases.

· Continuously evaluate and refine operational processes to reduce manual effort and increase efficiency.

In Short

Help define and implement SRE principles and practices.
Partner with teams to create SLOs, SLIs, and SLAs.
Implement system architectures for reliability and scalability.
Build automation tools using IaC tools like Terraform.
Implement self-healing systems and automate incident response.
Create observability systems with Datadog, Prometheus, and Grafana.
Take ownership of incident response processes.
Improve CI/CD pipelines using GitHub Actions and Azure DevOps.
Continuously refine operational processes.

Requirements

5+ years of hands-on experience in Site Reliability Engineering.
Proven expertise in AWS services and distributed architectures.
Experience with GitOps workflows and tools.
Advanced skills in automation tools like Terraform.
Exceptional problem-solving skills.
Effective communicator and collaborator.
Strong analytical skills in troubleshooting complex systems.
Familiarity with chaos engineering tools like Gremlin or LitmusChaos.

Benefits

Work in a collaborative environment.
Opportunity to lead technical practices.
Engage in innovative projects.
Contribute to animal care and management solutions.

MSD Animal Health Technology Labs

MSD Animal Health Technology Labs is a pioneering company dedicated to transforming innovative ideas into advanced technologies that enhance animal care and management. By providing farmers with actionable insights for herd management and offering pet owners smart devices to monitor their pets' health, the company enriches the human-animal bond. With decades of experience in technological research and development, MSD Animal Health is committed to advancing animal healthcare and wildlife preservation through cutting-edge solutions and quality assurance practices.
