Research Engineer, Agentic AI Evals - Remote

Posted 2 weeks ago
Software Development
Full Time
Worldwide

Overview

HUD is developing agentic evals for Computer Use Agents (CUAs) that browse the web, providing the detailed evaluations AI agents need in order to function effectively in real-world scenarios.

In Short

  • Build environments for CUA evaluation datasets.
  • Create custom CUA datasets and evaluation pipelines (a minimal pipeline is sketched after this list).
  • Proficiency in Python, Docker, and Linux environments required.
  • Experience with React for frontend development preferred.
  • Production-level software development experience is a plus.
  • Hands-on experience with LLM evaluation frameworks is beneficial.
  • Startup experience in early-stage tech companies is a plus.
  • Strong communication skills for remote collaboration needed.
  • Familiarity with AI tools and LLM capabilities is a plus.
  • Understanding of safety and alignment considerations in AI systems is preferred.
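
For context on the day-to-day work, here is a minimal sketch of a custom CUA evaluation pipeline in plain Python: a small dataset of browsing tasks, a stubbed agent, and a success-rate scorer. Every name here (BrowsingTask, run_agent, evaluate) is hypothetical and for illustration only; it is not HUD's actual stack.

    # Hypothetical sketch of a custom CUA evaluation pipeline.
    # All names are illustrative, not part of HUD's codebase.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class BrowsingTask:
        prompt: str                   # instruction given to the agent
        check: Callable[[str], bool]  # predicate over the agent's final answer

    def run_agent(prompt: str) -> str:
        """Stub standing in for a real CUA run (observe/act loop in a browser)."""
        return "Example Domain"       # placeholder answer

    def evaluate(tasks: list[BrowsingTask]) -> float:
        """Run each task once and return the fraction solved."""
        passed = sum(task.check(run_agent(task.prompt)) for task in tasks)
        return passed / len(tasks)

    tasks = [
        BrowsingTask(
            prompt="Open https://example.com and report the page title.",
            check=lambda answer: "Example Domain" in answer,
        ),
    ]
    print(f"success rate: {evaluate(tasks):.0%}")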

Requirements

  • Proficiency in Python, Docker, and Linux environments.
  • React experience for frontend development (preferred).
  • Production-level software development experience preferred.
  • Technical aptitude and problem-solving ability.
  • Experience with LLM evaluation frameworks and methodologies.
  • Contributions to evaluation harnesses (e.g., EleutherAI's lm-evaluation-harness, Inspect); a minimal Inspect task is sketched after this list.
  • Experience building custom evaluation pipelines or datasets.
  • Experience with agentic or multimodal AI evaluation systems.
  • Strong communication skills for remote collaboration.
  • Evidence of rapid learning and adaptability in technical environments.
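
The Inspect harness named above is open source, and the sketch below shows roughly what a task definition there looks like. The sample content and scorer choice are placeholders, and API details may differ across framework versions.

    # Minimal sketch of a task in the open-source Inspect framework
    # (inspect_ai). Sample content and scorer choice are placeholders;
    # the API shown here may vary by version.
    from inspect_ai import Task, task
    from inspect_ai.dataset import Sample
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate

    @task
    def web_title_eval():
        return Task(
            dataset=[
                Sample(
                    input="What is the title of the page at https://example.com?",
                    target="Example Domain",
                ),
            ],
            solver=generate(),  # one model turn; real CUA evals run agent loops
            scorer=match(),     # grade by matching the target string
        )

A task like this is typically run from Inspect's CLI, e.g. inspect eval web_title_eval.py --model <provider/model>.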

Benefits

  • Remote-friendly work environment.
  • Support for relocation and visas for strong full-time candidates.
  • Opportunity to work with a talented team of AI researchers.
  • Fast-paced, dynamic work environment.
  • Rolling application process with a quick interview timeline.
HUD

HUD (YC W25) develops agentic evaluations for Computer Use Agents (CUAs) that browse the web. Its CUA Evals framework is the first comprehensive evaluation tool designed specifically for CUAs, addressing the critical need for detailed evaluations that ensure AI agents function effectively in real-world scenarios. Backed by Y Combinator, HUD works closely with leading AI labs to provide scalable agent-evaluation infrastructure. The team includes international Olympiad medallists and experienced AI startup founders dedicated to advancing the field of AI evaluation.

