In 2012, Lambda started with a crew of AI engineers publishing research at top machine-learning conferences. We began as an AI company built by AI engineers. That hasn't changed. Today, we're on a mission to be the world's top AI computing platform. We equip engineers with the tools to deploy AI that is fast, secure, affordable, and built to scale. Whether they need powerhouse GPU hardware on-site or the flexibility of cloud-based solutions, we've got the horsepower to make it happen. Lambda’s AI Cloud has been adopted by the world’s leading companies and research institutions including Anyscale, Rakuten, The AI Institute, and multiple enterprises with over a trillion dollars of market capitalization. Our goal is to make computation as effortless and ubiquitous as electricity.

In Short

Design and implement cloud-native architectures that deliver "four nines" (99.99%) reliability while balancing performance and cost efficiency.
Develop comprehensive monitoring and alerting systems with actionable dashboards that provide real-time visibility into system health.
Implement SLIs, SLOs, and SLAs across services and maintain error budgets to guide development priorities.
Automate deployments using tools like Argo and Terraform.
Create robust incident management processes, escalation paths, and documentation.
Architect fault-tolerant systems with graceful degradation capabilities to handle component failures.
Design and implement disaster recovery solutions with regular testing procedures.
Lead post-incident reviews that focus on systemic improvements rather than individual blame.
Champion reliability best practices and system design principles.
Build automated, auditable, and compliant processes to improve efficiency and productivity.
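
For a sense of scale, the 99.99% target and the error budgets mentioned above can be worked through with a little arithmetic. The following is a minimal sketch, not Lambda's tooling: the function names, the 30-day window, and the request counts are illustrative assumptions.

```python
from datetime import timedelta


def allowed_downtime(slo: float, window: timedelta) -> timedelta:
    """Downtime that an availability target permits over a window.

    `slo` is a fraction such as 0.9999 for "four nines".
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return window * (1.0 - slo)


def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns 1.0 when no budget has been used; negative values mean
    the budget is overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    allowed_failures = total_requests * (1.0 - slo)
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 30-day window and one million requests.
    print(allowed_downtime(0.9999, timedelta(days=30)))
    print(error_budget_remaining(0.9999, 1_000_000, 25))
```

Under these assumptions, four nines over 30 days leaves about 259 seconds (roughly 4 minutes 19 seconds) of downtime, and 25 failures in a million requests spends about a quarter of the budget.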

Requirements

5+ years of experience in Site Reliability Engineering or DevOps roles.
Strong understanding of cloud platforms (AWS, GCP, Azure) and their core services.
Experience designing and implementing monitoring and observability solutions at scale.
Proven track record of managing production incidents and driving root cause analysis.
Proficiency with Infrastructure as Code tools and CI/CD pipeline implementation.
Strong understanding of network architecture, load balancing, and content delivery.
Expertise in performance tuning and system optimization techniques.
Experience with container orchestration platforms like Kubernetes.
Knowledge of database administration and optimization strategies.
Solid coding skills in at least one language (Python, Go, Bash) for automation.

Benefits

Founded in 2012, ~350 employees (2024) and growing fast.
We offer generous cash & equity compensation.
Health, dental, and vision coverage for you and your dependents.
Commuter and work-from-home stipends for select roles.
401(k) plan with 2% company match (USA employees).
Flexible Paid Time Off Plan that we all actually use.

Lambda

Founded in 2012, Lambda is a rapidly growing AI computing platform started by a team of AI engineers dedicated to advancing machine learning. The company provides engineers with robust tools for deploying AI solutions that are fast, secure, and scalable, whether through powerful on-site GPU hardware or flexible cloud-based options. Lambda's AI Cloud is trusted by leading companies and research institutions, and the company aims to make computation as accessible and essential as electricity. With a commitment to innovation and high demand for its systems, Lambda offers competitive compensation, comprehensive benefits, and a collaborative work environment.
