Machine Learning Eval Engineer

Job Location:

San Francisco, CA - USA

Salary: $150,000 - $300,000

Posted on: Yesterday

Vacancies: 1 Vacancy

Job Summary

Who is Recruiting from Scratch:

Recruiting from Scratch is a premier talent firm that focuses on placing the best product managers, software, and hardware talent at innovative companies. Our team is 100% remote, and we work with teams across the United States to help them hire.

Machine Learning Eval Engineer

Location - San Francisco, CA (Onsite 5 Days per Week)

Compensation - $150,000 to $300,000 Base, Competitive Equity

Visa - Visa Sponsorship Available (Case-by-Case)

Company Stage - Series B ($100M Raised)

Industry - AI Infrastructure, Machine Learning, LLM Evaluation, Document Intelligence, Enterprise AI

About the Company

The company is building an AI-native document intelligence platform that enables enterprises to process, understand, and reason over complex unstructured documents at massive scale.

Its platform combines proprietary document understanding models, frontier large language models, and enterprise-grade AI infrastructure to power highly accurate document workflows across industries. Already trusted by leading AI companies, Fortune-scale enterprises, and quantitative trading firms, the platform processes billions of documents while continuously improving model quality through sophisticated evaluation systems.

Backed by world-class venture investors and built by an elite engineering team from companies including Stripe, Discord, Scale AI, and leading quantitative trading firms, the company is rapidly expanding its machine learning organization while maintaining an exceptionally high technical bar.

As a Machine Learning Eval Engineer, you'll build the evaluation infrastructure that determines model quality, identifies failure modes, and drives improvements across production AI systems while working directly with machine learning, platform, and customer-facing teams.

This is a rare opportunity to join one of the fastest-growing AI infrastructure companies, where you'll directly influence how enterprise AI systems are measured, improved, and deployed at internet scale.

What You'll Do

Design and build scalable evaluation systems for production LLM applications
Develop benchmarks, metrics, and automated evaluation pipelines measuring model quality
Build workflows that identify failure modes across large-scale unstructured datasets
Design statistical evaluation methodologies using precision, recall, and model quality metrics
Build internal tooling and lightweight applications for model visualization and evaluation analysis
Work hands-on with enterprise documents including PDFs, spreadsheets, OCR outputs, and unstructured data
Partner closely with ML engineers to prioritize model improvements using evaluation insights
Build customer-specific benchmarks demonstrating model performance across real-world workflows
Design evaluation infrastructure supporting production AI systems operating at massive scale
Collaborate with GTM, Product, and Engineering teams to communicate model performance
Prototype new evaluation techniques leveraging LLM-as-a-Judge methodologies
Own evaluation systems from initial design through production deployment

Ideal Candidate Background

Experience Requirements

1-5 years of experience in Machine Learning Engineering, Software Engineering, or ML Infrastructure
Strong sweet spot around 2-4 years of experience
Experience building evaluation systems, ML tooling, or data infrastructure from zero-to-one
Experience working at high-bar technology companies, AI startups, quantitative firms, or leading research organizations
Experience working with production LLM applications
Experience building customer-facing ML tooling or internal AI platforms
Startup experience strongly preferred
Demonstrated ownership of high-impact technical initiatives

Technical Requirements

Strong Python engineering skills
Deep understanding of LLM evaluation methodologies including LLM-as-a-Judge
Strong prompt engineering experience
Strong understanding of precision, recall, statistical evaluation, and ML metrics
Experience building evaluation pipelines or benchmarking systems
Comfortable building lightweight web applications using Flask, TypeScript, or similar frameworks
Experience working with unstructured data including documents, PDFs, OCR, or document extraction
Familiarity with AWS S3, OLAP systems, Tinybird, or analytics infrastructure preferred
Experience working with Vision-Language Models or document AI preferred
Strong debugging, experimentation, and software engineering fundamentals

Education

Bachelor's degree in Computer Science, Mathematics, Physics, Machine Learning, or a related technical field preferred
Strong academic background from a top engineering or quantitative program preferred
Formal machine learning education or research experience preferred

Soft Skills

Strong analytical thinking
High ownership mentality
Excellent communication skills
Comfortable explaining technical concepts to non-technical stakeholders
Comfortable operating in ambiguity
Self-directed and proactive
Bias toward execution with technical precision
Startup mentality
Strong engineering craftsmanship

Preferred Backgrounds

AI infrastructure startups
LLM platform companies
Document AI companies
Machine learning platform teams
Quantitative trading firms
AI research organizations
Early-stage venture-backed startups
Evaluation infrastructure teams
Data infrastructure organizations
Engineers building production AI systems

Compensation & Benefits

Base Salary: $150,000 to $300,000
Competitive Equity Package
Direct collaboration with ML and founding teams
Significant ownership over evaluation infrastructure
Opportunity to define model quality across enterprise AI systems
High-impact engineering role
Exposure to cutting-edge LLM and document AI technologies
Rapid career growth opportunities
Onsite collaboration with a world-class engineering team
Visa Sponsorship Available (Case-by-Case)

Why Join

This is an opportunity to define how one of the industry's leading AI infrastructure platforms measures, improves, and scales model quality.

You'll build evaluation systems, benchmarks, and tooling that directly influence production AI performance while collaborating closely with machine learning engineers, platform teams, and enterprise customers solving challenging real-world document intelligence problems.

As one of the early ML Evaluation Engineers, you'll have outsized ownership over model quality infrastructure while helping build AI systems trusted by leading enterprises and frontier AI companies.

About Company

Recruiting From Scratch

Senior software engineering jobs at top AI-native startups. Recruiting from Scratch advocates for candidates: 300+ placements, 29-day average time to hire, 90+ NPS.