Build the systems that power large-scale machine learning training.
We're partnering with a highly respected AI research and engineering organization to hire a Full Stack Engineer focused on building end-to-end systems that support machine learning model training, experimentation, and evaluation - including work on large language models.
This role sits at the intersection of software engineering and applied ML. You'll design and ship internal tools used daily by researchers and engineers to move faster, debug training issues, and ship higher-quality models.
What You'll Work On
Build and scale internal ML workflow systems: dataset creation, training orchestration, experiment tracking, evaluation, and model/version management
Contribute hands-on to model training and fine-tuning efforts (LLMs preferred; other deep learning experience welcome)
Develop backend services and APIs in Python supporting training and evaluation pipelines
Build TypeScript-based UIs that allow users to launch runs, compare experiments, inspect metrics, and debug failures
Design efficient SQL schemas and queries with attention to performance and indexing tradeoffs
Improve reliability and reproducibility of ML systems through testing, CI/CD, monitoring, and safe rollouts
Partner closely with ML researchers, infrastructure, and product teams to turn ambiguous needs into shipped systems
What We're Looking For
8 years of professional software engineering experience owning complex systems in production
Strong full-stack experience across Python backend, SQL data layers, and TypeScript frontends
Hands-on experience with machine learning training workflows (beyond calling model APIs)
Experience training or fine-tuning ML models (LLMs a plus)
Comfortable operating in ambiguous, fast-moving environments with high ownership
Nice to Have
Experience building internal ML developer tooling (experiment tracking, eval frameworks, model registries)
Familiarity with training infrastructure concepts (orchestration, checkpointing, failure recovery)
Exposure to LLM evaluation methods and quality metrics
Experience in startup or research-adjacent engineering roles
Why This Role
Work on cutting-edge machine learning systems with real research impact
High ownership and technical depth - not a narrow product role
Collaborative team environment focused on quality, speed, and experimentation