Machine Learning Infrastructure Engineer - Remote

Posted 36 weeks ago
DevOps / Sysadmin
Full Time
Worldwide

Overview

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is looking for a strong candidate to help develop and maintain our ML infrastructure, including large GPU training and inference clusters.

In Short

  • Design, deploy, and maintain cloud-based ML training (Slurm) and inference (Kubernetes) clusters
  • Implement and manage network-based cloud file systems and blob/S3 storage solutions
  • Develop and maintain Infrastructure as Code (IaC) for resource provisioning
  • Implement and optimize CI/CD pipelines for ML workflows
  • Design and implement custom autoscaling solutions for ML workloads
  • Ensure security best practices across the ML infrastructure
  • Provide developer-friendly tools and practices for efficient ML operations

Requirements

  • Strong proficiency in cloud platforms (AWS, Azure, or GCP) with focus on ML/AI services
  • Extensive experience with Kubernetes and Slurm cluster management
  • Expertise in Infrastructure as Code tools (e.g., Terraform, Ansible)
  • Proven track record in managing and optimizing network-based cloud file systems and object storage
  • Experience with CI/CD tools and practices (e.g., CircleCI, GitHub Actions, ArgoCD)
  • Strong understanding of security principles and best practices in cloud environments
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Loki)
  • Familiarity with ML workflows and GPU infrastructure management
  • Demonstrated ability to handle complex migrations and breaking changes in production environments

Nice to Have

  • Experience with custom autoscaling solutions for ML workloads
  • Knowledge of cost optimization strategies for cloud-based ML infrastructure
  • Familiarity with MLOps practices and tools
  • Experience with high-performance computing (HPC) environments
  • Understanding of data versioning and experiment tracking for ML
  • Knowledge of network optimization for distributed ML training
  • Experience with multi-cloud or hybrid cloud architectures
  • Familiarity with container security and vulnerability scanning tools

Black Forest Labs

Black Forest Labs is an innovative startup at the forefront of generative image and video technology. Known for developing groundbreaking models such as Stable Diffusion and Stable Video Diffusion, the company is dedicated to creating advanced AI media solutions. With a focus on building intuitive user interfaces and enhancing user experiences, Black Forest Labs collaborates closely with machine learning researchers and engineers. The company operates from key hubs in San Francisco, Germany, and London, while also considering remote work arrangements. Their mission is to revolutionize the way users interact with AI-generated content.
