RemoteOtter

Site Reliability Engineer - Remote

Posted 22 weeks ago
DevOps / Sysadmin
Full Time
Taiwan

Overview

Aethir is the only enterprise-grade, AI-focused GPU-as-a-service provider in the market. Its decentralized cloud computing infrastructure connects GPU providers (containers) with enterprise clients who need powerful GPUs for professional AI/ML tasks. With a constantly growing network of more than 40,000 high-end GPUs, including 3,000 NVIDIA H100s, Aethir can provide enterprise-grade GPU computing wherever it is needed, at scale.

In Short

  • Monitor, review, and respond to faults in the production system.
  • Continuously assess system architecture and performance.
  • Coordinate with the business team to resolve operational issues.
  • Respond promptly to production failures.
  • Organize teams to collaboratively solve problems.
  • Ensure timely resolution of issues.
  • Conduct case studies on production issues for optimization.
  • Maintain documentation of system architecture and processes.
  • Identify and implement improvements in operations.

Requirements

  • Experience in monitoring and troubleshooting production systems.
  • Strong understanding of system architecture and performance metrics.
  • Ability to coordinate with cross-functional teams.
  • Proven problem-solving skills in high-pressure situations.
  • Experience with documentation and process optimization.

Benefits

  • Opportunity to work with cutting-edge AI and GPU technologies.
  • Collaborative and innovative work environment.
  • Competitive salary and benefits package.
  • Flexible working arrangements.
  • Professional development opportunities.

Aethir

Aethir is a pioneering provider of enterprise-grade, AI-focused GPU-as-a-service. Its decentralized cloud computing infrastructure connects GPU providers with enterprise clients that need powerful GPUs for AI and machine learning tasks. With a network of more than 40,000 high-performance GPUs, including 3,000 NVIDIA H100s, Aethir delivers scalable and reliable GPU computing. Backed by prominent Web3 investors and having raised over $130 million, Aethir works at the forefront of decentralized computing and fosters a collaborative, dynamic work environment for its team.
