Inference Optimization Engineer - Remote

Posted 51 weeks ago

Software Development

Full Time

CA, USA

Inference Optimization

Large Language Models

Overview

As an Inference Optimization Engineer, you will improve the speed and efficiency of large language models at every layer of the stack: GPU kernels, the inference engine, and distributed serving architectures.

In Short

Identify bottlenecks and optimize inference efficiency.
Build repeatable tests that model production traffic.
Reduce memory use and compute cost with mixed precision.
Improve batching, caching, load balancing, and model-parallel execution.
Write technical posts and contribute to the open-source community.

Requirements

Deep understanding of transformer architecture.
Hands-on experience with model serving optimizations.
Experience with inference engines like vLLM, SGLang, or TRT-LLM.
Proficiency in CUDA and profiling tools.
Track record of blog posts or conference talks on ML systems.

Benefits

Direct impact on distributed LLM inference.
Work remotely from anywhere.
Competitive salary and equity.
Learning budget and paid conference travel.

BentoML

BentoML provides inference platforms that help AI teams run large language models and generative AI workloads efficiently at scale. Backed by investors including DCM, the company serves enterprises worldwide, with a focus on scalability and consistent performance in production. BentoML offers both open-source and commercial products, and its mission is to help teams build competitive advantages with AI.
