Research Engineer, Agentic AI Evals - Remote

Posted 2 weeks ago
Software Development
Full Time
Worldwide

Overview

HUD is developing agentic evals for Computer Use Agents (CUAs) that browse the web, providing the detailed evaluations AI agents need in order to function effectively in real-world scenarios.

In Short

  • Build environments for CUA evaluation datasets.
  • Create custom CUA datasets and evaluation pipelines (a minimal pipeline is sketched after this list).
  • Proficiency in Python, Docker, and Linux environments required.
  • Experience with React for frontend development preferred.
  • Production-level software development experience is a plus.
  • Hands-on experience with LLM evaluation frameworks is beneficial.
  • Startup experience in early-stage tech companies is a plus.
  • Strong communication skills for remote collaboration needed.
  • Familiarity with AI tools and LLM capabilities is a plus.
  • Understanding of safety and alignment considerations in AI systems is preferred.
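
For context on the day-to-day work, here is a minimal sketch of a custom CUA evaluation pipeline in plain Python: a small dataset of browsing tasks, a stubbed agent, and a success-rate scorer. Every name here (BrowsingTask, run_agent, evaluate) is hypothetical and for illustration only; it is not HUD's actual stack.

    # Hypothetical sketch of a custom CUA evaluation pipeline.
    # All names are illustrative, not part of HUD's codebase.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class BrowsingTask:
        prompt: str                   # instruction given to the agent
        check: Callable[[str], bool]  # predicate over the agent's final answer

    def run_agent(prompt: str) -> str:
        """Stub standing in for a real CUA run (observe/act loop in a browser)."""
        return "Example Domain"       # placeholder answer

    def evaluate(tasks: list[BrowsingTask]) -> float:
        """Run each task once and return the fraction solved."""
        passed = sum(task.check(run_agent(task.prompt)) for task in tasks)
        return passed / len(tasks)

    tasks = [
        BrowsingTask(
            prompt="Open https://example.com and report the page title.",
            check=lambda answer: "Example Domain" in answer,
        ),
    ]
    print(f"success rate: {evaluate(tasks):.0%}")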

Requirements

  • Proficiency in Python, Docker, and Linux environments.
  • React experience for frontend development (preferred).
  • Production-level software development experience preferred.
  • Technical aptitude and problem-solving ability.
  • Experience with LLM evaluation frameworks and methodologies.
  • Contributions to evaluation harnesses (e.g., EleutherAI's lm-evaluation-harness, Inspect); a minimal Inspect task is sketched after this list.
  • Experience building custom evaluation pipelines or datasets.
  • Experience with agentic or multimodal AI evaluation systems.
  • Strong communication skills for remote collaboration.
  • Evidence of rapid learning and adaptability in technical environments.
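
The Inspect harness named above is open source, and the sketch below shows roughly what a task definition there looks like. The sample content and scorer choice are placeholders, and API details may differ across framework versions.

    # Minimal sketch of a task in the open-source Inspect framework
    # (inspect_ai). Sample content and scorer choice are placeholders;
    # the API shown here may vary by version.
    from inspect_ai import Task, task
    from inspect_ai.dataset import Sample
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate

    @task
    def web_title_eval():
        return Task(
            dataset=[
                Sample(
                    input="What is the title of the page at https://example.com?",
                    target="Example Domain",
                ),
            ],
            solver=generate(),  # one model turn; real CUA evals run agent loops
            scorer=match(),     # grade by matching the target string
        )

A task like this is typically run from Inspect's CLI, e.g. inspect eval web_title_eval.py --model <provider/model>.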

Benefits

  • Remote-friendly work environment.
  • Support for relocation and visas for strong full-time candidates.
  • Opportunity to work with a talented team of AI researchers.
  • Fast-paced, dynamic work environment.
  • Rolling application process with a quick interview timeline.
HUD

HUD (YC W25) develops agentic evaluations for Computer Use Agents (CUAs) that browse the web. Its CUA Evals framework is the first comprehensive evaluation tool designed specifically for CUAs, addressing the critical need for detailed evaluations that ensure AI agents function effectively in real-world scenarios. Backed by Y Combinator, HUD works closely with leading AI labs to provide scalable agent-evaluation infrastructure. The team includes international Olympiad medallists and experienced AI startup founders dedicated to advancing the field of AI evaluation.

