RemoteOtter

AI Evaluation Dataset Creator - Remote

Posted Yesterday
Software Development
Contract
USA

Overview

Mercor is collaborating with a leading AI research lab to develop a next-generation evaluation dataset for frontier AI models. We are seeking experts with advanced domain knowledge across diverse fields to design extremely challenging prompts that cannot be solved by existing AI systems without internet search or browsing capabilities.

In Short

  • Create original, expert-level prompts that require tool use (e.g., web search, browsing, or code execution).
  • Ensure prompts are objective, self-contained, and yield clear, unambiguous answers.
  • Test prompts against advanced AI models and document failures/successes.
  • Provide reasoning steps and solutions for each prompt.
  • Classify prompts into subject domains for dataset organization.
  • Collaborate with reviewers for expert validation and prompt refinement.
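The deliverables above (prompt, domain classification, reasoning steps, answer, and model test results) could plausibly be captured in a record like the following sketch. All field names and values here are illustrative assumptions, not the lab's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PromptRecord:
    """Hypothetical shape of one dataset entry; field names are assumptions."""
    prompt: str                   # original, expert-level question
    domain: str                   # subject classification, e.g. "law"
    requires_tools: list          # tools needed, e.g. ["search", "browse"]
    answer: str                   # clear, unambiguous expected answer
    reasoning: str                # step-by-step solution for reviewers
    model_results: dict = field(default_factory=dict)  # model name -> outcome


# Example entry with placeholder content
record = PromptRecord(
    prompt="An expert-level question that cannot be answered without search.",
    domain="law",
    requires_tools=["search"],
    answer="A single unambiguous answer.",
    reasoning="1. Locate the primary source via search. 2. Verify the detail.",
    model_results={"model-a": "fail"},
)
print(json.dumps(asdict(record), indent=2))
```

Keeping each entry self-contained like this would make the classification and expert-validation steps straightforward to organize.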

Requirements

  • Advanced academic or professional expertise in a specialized subject (STEM, law, finance, history, cultural studies, etc.).
  • Strong ability to design precise, high-difficulty questions requiring deep knowledge and external references.
  • Experience in academic research, benchmarking, or test question design preferred.
  • Attention to detail and ability to provide concise reasoning explanations.
  • Familiarity with AI models and their limitations is a plus.

Benefits

  • Remote and asynchronous — set your own hours.
  • Expected commitment: ~10–20 hours/week.
  • Project duration: ~2 months, with possible extensions based on dataset needs.
  • Opportunity to contribute to high-impact AI safety and evaluation research.

