Machine Learning Eval Engineer


Job Location:

San Francisco, CA - USA

Salary: $150,000 - $300,000
Posted on: Yesterday
Vacancies: 1 Vacancy

Job Summary

Recruiting from Scratch is a premier talent firm that focuses on placing the best product managers, software, and hardware talent at innovative companies. Our team is 100% remote, and we work with teams across the United States to help them hire.

Machine Learning Eval Engineer

Location - San Francisco, CA (Onsite 5 Days per Week)

Compensation - $150,000 to $300,000 Base, plus Competitive Equity

Visa - Visa Sponsorship Available (Case-by-Case)

Company Stage - Series B ($100M Raised)

Industry - AI Infrastructure, Machine Learning, LLM Evaluation, Document Intelligence, Enterprise AI

About the Company

The company is building an AI-native document intelligence platform that enables enterprises to process, understand, and reason over complex unstructured documents at massive scale.

Its platform combines proprietary document understanding models, frontier large language models, and enterprise-grade AI infrastructure to power highly accurate document workflows across industries. Already trusted by leading AI companies, Fortune-scale enterprises, and quantitative trading firms, the platform processes billions of documents while continuously improving model quality through sophisticated evaluation systems.

Backed by world-class venture investors and built by an elite engineering team from companies including Stripe, Discord, Scale AI, and leading quantitative trading firms, the company is rapidly expanding its machine learning organization while maintaining an exceptionally high technical bar.

As a Machine Learning Eval Engineer, you'll build the evaluation infrastructure that determines model quality, identifies failure modes, and drives improvements across production AI systems while working directly with machine learning, platform, and customer-facing teams.

This is a rare opportunity to join one of the fastest-growing AI infrastructure companies, where you'll directly influence how enterprise AI systems are measured, improved, and deployed at internet scale.

What You'll Do

  • Design and build scalable evaluation systems for production LLM applications
  • Develop benchmarks, metrics, and automated evaluation pipelines measuring model quality
  • Build workflows that identify failure modes across large-scale unstructured datasets
  • Design statistical evaluation methodologies using precision, recall, and model quality metrics
  • Build internal tooling and lightweight applications for model visualization and evaluation analysis
  • Work hands-on with enterprise documents including PDFs, spreadsheets, OCR outputs, and unstructured data
  • Partner closely with ML engineers to prioritize model improvements using evaluation insights
  • Build customer-specific benchmarks demonstrating model performance across real-world workflows
  • Design evaluation infrastructure supporting production AI systems operating at massive scale
  • Collaborate with GTM, Product, and Engineering teams to communicate model performance
  • Prototype new evaluation techniques leveraging LLM-as-a-Judge methodologies
  • Own evaluation systems from initial design through production deployment

Ideal Candidate Background

Experience Requirements

  • 1-5 years of experience in Machine Learning Engineering, Software Engineering, or ML Infrastructure
  • Strong sweet spot around 2-4 years of experience
  • Experience building evaluation systems, ML tooling, or data infrastructure from zero to one
  • Experience working at high-bar technology companies, AI startups, quantitative firms, or leading research organizations
  • Experience working with production LLM applications
  • Experience building customer-facing ML tooling or internal AI platforms
  • Startup experience strongly preferred
  • Demonstrated ownership of high-impact technical initiatives

Technical Requirements

  • Strong Python engineering skills
  • Deep understanding of LLM evaluation methodologies including LLM-as-a-Judge
  • Strong prompt engineering experience
  • Strong understanding of precision, recall, statistical evaluation, and ML metrics
  • Experience building evaluation pipelines or benchmarking systems
  • Comfortable building lightweight web applications using Flask, TypeScript, or similar frameworks
  • Experience working with unstructured data including documents, PDFs, OCR, or document extraction
  • Familiarity with AWS S3, OLAP systems, Tinybird, or analytics infrastructure preferred
  • Experience working with Vision-Language Models or document AI preferred
  • Strong debugging, experimentation, and software engineering fundamentals

Education

  • Bachelor's degree in Computer Science, Mathematics, Physics, Machine Learning, or related technical field preferred
  • Strong academic background from a top engineering or quantitative program preferred
  • Formal machine learning education or research experience preferred

Soft Skills

  • Strong analytical thinking
  • High ownership mentality
  • Excellent communication skills
  • Comfortable explaining technical concepts to non-technical stakeholders
  • Comfortable operating in ambiguity
  • Self-directed and proactive
  • Bias toward execution with technical precision
  • Startup mentality
  • Strong engineering craftsmanship

Preferred Backgrounds

  • AI infrastructure startups
  • LLM platform companies
  • Document AI companies
  • Machine learning platform teams
  • Quantitative trading firms
  • AI research organizations
  • Early-stage venture-backed startups
  • Evaluation infrastructure teams
  • Data infrastructure organizations
  • Engineers building production AI systems

Compensation & Benefits

  • Base Salary: $150,000 to $300,000
  • Competitive Equity Package
  • Direct collaboration with ML and founding teams
  • Significant ownership over evaluation infrastructure
  • Opportunity to define model quality across enterprise AI systems
  • High-impact engineering role
  • Exposure to cutting-edge LLM and document AI technologies
  • Rapid career growth opportunities
  • Onsite collaboration with a world-class engineering team
  • Visa Sponsorship Available (Case-by-Case)

Why Join

This is an opportunity to define how one of the industry's leading AI infrastructure platforms measures, improves, and scales model quality.

You'll build evaluation systems, benchmarks, and tooling that directly influence production AI performance while collaborating closely with machine learning engineers, platform teams, and enterprise customers solving challenging real-world document intelligence problems.

As one of the early ML Evaluation Engineers, you'll have outsized ownership over model quality infrastructure while helping build AI systems trusted by leading enterprises and frontier AI companies.


Required Experience:

IC


About Company

Senior software engineering jobs at top AI-native startups. Recruiting from Scratch advocates for candidates — 300+ placements, 29-day avg time to hire, 90+ NPS. Browse open roles.
