Senior Site Reliability Engineer - Remote

Posted 71 weeks ago

DevOps / Sysadmin

Full Time

India

Service Level Indicators

Observability

Overview

Dremio is the unified lakehouse platform for self-service analytics and AI, serving hundreds of global enterprises, including Maersk, Amazon, Regeneron, NetApp, and S&P Global. Customers rely on Dremio for cloud, hybrid, and on-prem lakehouses to power their data mesh, data warehouse migration, data virtualization, and unified data access use cases. Based on open source technologies, including Apache Iceberg and Apache Arrow, Dremio provides an open lakehouse architecture enabling the fastest time to insight and platform flexibility at a fraction of the cost.

In Short

Drive continuous improvements to our usage of Kubernetes, our Operators, and the GitOps deployment paradigm.
Extend our networking, service mesh and Kubernetes systems to support connectivity between GCP, AWS and Azure.
Collaborate with Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning, production readiness and service reviews.
Help define and instrument Service Level indicators and objectives (SLIs/SLOs) with service owners in the Engineering teams.
Collaborate within our virtual Observability team: develop and improve observability of the Dremio Cloud product.
Debug and optimize code written by others, and automate routine tasks.
Evangelize and advocate for resilience engineering and reliability practices across our organization.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Join an on-call rotation for systems and services that the SRE team owns.
Practice sustainable incident response and post-incident investigation and analysis.

Requirements

10+ years of relevant experience in SRE, DevOps, Distributed Systems, Cloud Operations, or Software Engineering.
Expertise in Kubernetes, Istio, Terraform, Terragrunt, ArgoCD/Flux.
Expertise with software-defined networking infrastructure.
Excellent command of cloud services on GCP/AWS/Azure and of CI/CD pipelines.
Moderate to advanced experience in Python/Go, and at least reading knowledge of Java.
Systematic problem-solving approach with strong communication skills.
Ability to debug and optimize code and automate routine tasks.
Solid background in software development and architecting resilient applications.

Benefits

Workplace Wednesdays to improve cross-team communication.
Hybrid work environment.
Lunch catering and meal credits provided in the office.
Local socials aligned with Workplace Wednesdays.

Dremio

Dremio is a leading unified lakehouse platform designed for self-service analytics and AI, catering to a diverse range of global enterprises such as Maersk, Amazon, and Regeneron. The company specializes in providing cloud, hybrid, and on-prem lakehouses that facilitate data mesh, data warehouse migration, and data virtualization. Leveraging open-source technologies like Apache Iceberg and Apache Arrow, Dremio offers an open lakehouse architecture that ensures rapid insights and platform flexibility at a competitive cost. Dremio is committed to high standards of communication, accountability, and respect among its employees, fostering a dynamic and innovative work environment.
