RemoteOtter

Machine Learning Engineer (Distributed Training) - Remote

Posted 2 days ago
Software Development
Full Time
Brazil

Overview

CloudWalk is seeking a Machine Learning Engineer to enhance our distributed training pipeline for large language models, focusing on optimizing and scaling training processes.

In Short

  • Own and maintain the distributed training pipeline.
  • Train LLMs using DeepSpeed, FSDP, and Hugging Face Accelerate.
  • Design and debug multi-node/multi-GPU training runs.
  • Optimize training runs for memory usage and speed.
  • Manage experiment tracking and artifact storage.
  • Build scalable training templates for internal use.
  • Collaborate with researchers to improve training scripts.

Requirements

  • Hands-on expertise in distributed training on real-world setups.
  • Strong background in PyTorch.
  • Experience with the Hugging Face ecosystem.
  • Understanding of GPUs, containers, and job schedulers.
  • Ability to write resilient code for training processes.
  • Collaborative mindset for improving team scripts.

Benefits

  • Opportunity to work in a fast-paced fintech environment.
  • Collaborate with innovative teams on cutting-edge technology.
  • Contribute to impactful projects that support entrepreneurs.

CloudWalk

CloudWalk is one of the fastest-growing fintech companies globally, recognized as a unicorn with millions of satisfied customers and substantial funding and revenue. The company prides itself on its dynamic and innovative culture, attracting talented individuals who embody grit and creativity. With a focus on building and learning rapidly, CloudWalk is not your typical startup; it fosters a collaborative environment where hackers, artists, and crafters can thrive. The mobile team is dedicated to developing high-quality applications for a vast user base, emphasizing deep collaboration with product and design experts. CloudWalk values diversity and inclusion, promoting a welcoming workplace where every employee can be their authentic self.

Similar Jobs:

Machine Learning Engineer Trainee - Remote

Sajix Software Solution Private Limited

13 weeks ago

Join Sajix as a Machine Learning Engineer Trainee to assist in building and deploying ML models that enhance healthcare delivery.

USA
Internship
Software Development

Machine Learning/AI Engineer Trainee - Remote

FocusKPI

34 weeks ago

Join our 3-month AI Trainee Program to gain hands-on experience in machine learning and software engineering.

USA
Internship
Software Development

Machine Learning Engineer - Remote

Jobgether

3 days ago

Join Nimble Gravity as a Machine Learning Engineer to design and deploy impactful AI solutions in LATAM.

Colombia
Full-time
Software Development

Machine Learning Engineer - Remote

Toogeza

5 days ago

We are seeking a Machine Learning Engineer to develop and maintain machine learning solutions for iGaming platforms.

Worldwide
Full-time
Software Development

Machine Learning Engineer - Remote

Nimble Gravity

5 days ago

Join our Data & AI team as a Machine Learning Engineer to design and deploy intelligent models that address real-world challenges.

LATAM, USA
Full-time
Software Development