Senior AI Researcher - Pre-training (f/m/d)

Aleph Alpha


Job Location:

Heidelberg - Germany

Monthly Salary: Not Disclosed
Posted on: 30+ days ago
Vacancies: 1 Vacancy

Job Summary

Our Mission

Aleph Alpha is one of the few companies in Europe doing serious foundation model pre-training. Our customers in finance, manufacturing, and public administration need models that understand German, meet European regulatory requirements, and work reliably in high-stakes settings. We're building that in Heidelberg.

We are hiring a Senior AI Researcher to join our Pre-training team and advance the architecture and training of our next generation of foundation models. If you are excited about designing inference-efficient architectures, optimising training recipes that scale reliably, and training models on a large-scale cluster (thousands of NVIDIA Blackwell GPUs), we would love to hear from you.

Team Culture

We foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organisational structure with efficient, supportive management that enables quick decision-making, open communication, and a strong sense of shared purpose. We collaborate closely on complex technical problems, working in pairs or using mob programming to resolve challenging issues.

About the Role

As a Senior AI Researcher in Pre-training (f/m/d), you will own the critical technical levers that determine the success of our next-generation models: architecture, optimization, stability, and scaling.

Working at the high-leverage intersection of research and engineering, you will translate mathematical reasoning and empirical observations into principled training decisions - from small-scale proxy experiments to multi-thousand-GPU runs.

We are looking for an expert who can combine rigorous experimental design with high-quality production code, directly influencing model quality, run reliability, and the efficiency of the models we ship.

Your Responsibilities

  • Recipe & Architecture Optimization: Own core elements of the training recipe (optimizers, schedules, initialization) and design PyTorch-based architectural improvements to maximize convergence stability and training efficiency.

  • Scaling Strategy & Predictability: Develop hyperparameter scaling laws and scale-up methodologies, using small-scale proxy experiments to reliably predict multi-thousand-GPU behavior and de-risk major training decisions.

  • Stability Diagnostics & Debugging: Investigate complex convergence issues (loss spikes, divergence) and resolve hard-to-reproduce distributed system failures like communication bottlenecks, race conditions, and synchronization errors.

  • System-Model Co-Design: Partner with Compute Performance Data Evaluation and Post-Training teams to align the model lifecycle with hardware constraints, memory bandwidth, and communication topologies.

Core Qualifications

  • You are proficient in Python and deeply familiar with PyTorch-based training workflows.

  • You have a strong track record in machine learning research and software engineering, demonstrated through shipped models, impactful open-source contributions, or published research.

  • You have a strong mathematical foundation and are comfortable reasoning formally about optimisation, scaling behaviour, and training dynamics.

  • You deeply understand transformer training dynamics, optimisation, and the behaviour of large distributed training jobs.

  • You can design rigorous experiments, reason clearly from noisy results, and translate empirical observations into robust training decisions.

  • You have hands-on experience pre-training large models (e.g. 7B parameters) on substantial infrastructure (e.g. 100-GPU clusters).

  • You apply strong software engineering practices, including writing maintainable, well-tested code and supporting reproducible experimentation workflows.

  • You are able to implement complex model architectures efficiently and reliably, and to debug complex issues across model code, training dynamics, and distributed systems.

  • You collaborate effectively within a research and engineering team and communicate clearly about your work across Pre-training and the broader AAR/AA organization.

  • You are able to work in Germany and collaborate regularly on site in Heidelberg as part of the Pre-training team.

Preferred Qualifications

(We encourage you to apply even if you don't check every box!)

  • Large-Scale Training: Hands-on experience training LLMs or multimodal models on large GPU clusters using distributed frameworks (e.g. Megatron-LM, DeepSpeed, torchtitan).

  • Predictive Scaling: Familiarity with scaling laws, hyperparameter transfer, or methods for predicting large-scale training behavior from smaller proxy runs.

  • Stability & Performance: Experience profiling distributed jobs and diagnosing training anomalies like loss spikes, numerical instability, or optimizer pathologies.

  • Advanced Architectures: Exposure to sparse training approaches (e.g. Mixture-of-Experts) and an understanding of their routing and systems trade-offs.

  • Track Record of Impact: Demonstrated research excellence through top-tier publications (NeurIPS, ICML, ICLR), impactful open-source contributions, or significant shipped technical work.

  • Systems Curiosity: Low-level kernel optimization is not required, but we highly value a strong curiosity about the hardware and systems constraints that shape scale.

What we offer

  • Become part of an AI revolution!

  • 30 days of paid vacation

  • Access to a variety of fitness & wellness offerings via Wellhub

  • Mental health support through

  • Substantially subsidized company pension plan for your future security

  • Subsidized Germany-wide transportation ticket

  • Budget for additional technical equipment

  • Flexible working hours for better work-life balance and hybrid working model

  • Virtual Stock Option Plan

  • JobRad Bike Lease


Required Experience:

Senior IC

About Company

Pioneering sovereign, European AI technology to transform human-machine interaction that can find solutions for the challenges of tomorrow.