RemoteOtter

Member of Engineering - Pre-training and Inference Fault Tolerance - Remote

Posted 2 weeks ago
Software Development
Full Time
Worldwide

Overview

As a member of the engineering team at Poolside, you will build distributed training and inference systems for Large Language Models (LLMs), with a focus on software reliability and fault tolerance.

In Short

  • Work in a remote-first team across Europe and North America.
  • Focus on distributed training and inference of LLMs.
  • Hands-on role with an emphasis on software reliability.
  • Debugging Linux kernel modules is part of the job.
  • Strong engineering skills and knowledge of Torch and NVIDIA GPU architecture required.
  • Design and develop tools to enhance training recovery.
  • Minimize GPU idle time during faults.
  • Write high-quality code in Python, Cython, C/C++, and CUDA.
  • Collaborate with a diverse team focused on quality systems.
  • Access to thousands of GPUs for testing changes.

Requirements

  • Understanding of Large Language Models (LLMs) and Transformers.
  • Strong engineering background.
  • Programming experience with the Linux API and the Linux kernel.
  • Familiarity with Python, PyTorch, C/C++, and NCCL.
  • Experience with distributed systems and reliability concepts.
  • Strong algorithmic skills.
  • Critical thinking, including a willingness to question code quality policies.
  • Ability to work well in a fast-paced environment.
  • Prepared for a steep learning curve.
  • Strong communication skills.

Benefits

  • Fully remote work and flexible hours.
  • 37 days/year of vacation and holidays.
  • Health insurance allowance for you and dependents.
  • Company-provided equipment.
  • Wellbeing and home office allowances.
  • Frequent team get-togethers.
  • Diverse and inclusive culture.

poolside

Poolside is a forward-thinking company dedicated to advancing artificial intelligence to human-level intelligence and beyond. With a focus on software development, Poolside aims to create tools that empower developers and broaden access to software creation for billions of people worldwide. The company operates with a remote-first approach, fostering a collaborative and inclusive culture among its diverse team across Europe and North America. Poolside is committed to innovation in people operations, ensuring that its team can focus on their core missions while streamlining processes and enhancing productivity.
