Our Mission
Reflection is a research lab making intelligence open and accessible for everyone to use, customize, and build on. We build open models that let anyone control their intelligence and help shape the future of AI. Our mission: make intelligence open and accessible to all.
About the Role
The web is one of the most important sources of information for frontier AI systems. The quality, coverage, freshness, and diversity of web data directly influence model capabilities.
As a member of the Data Team, your mission is to build and operate large-scale web crawling systems that continuously discover, acquire, and process content from across the internet. You will own the infrastructure that powers web-scale data collection, from URL discovery and scheduling to distributed crawling, content extraction, and dataset delivery.
You will work directly with world-class researchers to understand which parts of the web matter most for model performance and build systems that efficiently acquire high-value content at scale.
This role is ideal for engineers who love building distributed systems, optimizing large-scale crawlers, and solving the unique technical challenges of collecting data from the modern web.
What You'll Do
Working closely with our pre-training infrastructure and data quality teams, you will:
Build and operate web-scale crawling infrastructure capable of continuously collecting data across billions of URLs
Design and optimize URL discovery, prioritization, scheduling, and crawl orchestration systems
Develop distributed crawlers that efficiently acquire content while respecting site constraints and operational requirements
Build systems for content extraction, rendering, parsing, and normalization across diverse web formats
Improve crawl coverage, freshness, efficiency, and quality through measurement and experimentation
Design infrastructure for large-scale recrawling, change detection, and incremental updates
Develop specialized crawlers for high-value domains, dynamic websites, and difficult-to-access content sources
Analyze crawl performance and web coverage to identify gaps, inefficiencies, and opportunities for improvement
Build observability, monitoring, and reliability systems for large-scale crawl operations
Debug production issues and continuously improve the performance, scalability, and resilience of crawling infrastructure
About You
Passionate about web-scale systems and the challenges of collecting information from the internet
Curious about how web data influences model capabilities and willing to iterate based on downstream results
Comfortable balancing crawl quality, coverage, freshness, and operational efficiency
Enjoy working at the intersection of distributed systems, data infrastructure, and AI
Able to collaborate closely with researchers, infrastructure engineers, and data quality teams
Skills and Qualifications
Experience building large-scale web crawling, search indexing, content acquisition, or internet-scale data collection systems
Strong understanding of crawling architectures, URL frontier management, scheduling, and distributed crawl coordination
Experience with large-scale distributed systems using technologies such as Ray, Spark, Beam, Flink, or similar frameworks
Familiarity with content extraction, HTML parsing, browser automation, rendering systems, and modern web technologies
Experience operating systems that process petabyte-scale datasets
Strong systems engineering skills, including reliability, observability, performance optimization, and debugging
Experience designing experiments and using data to improve crawl quality, coverage, and efficiency
Excellent communication skills and the ability to reason clearly about system tradeoffs and operational constraints
Nice to Have
Experience building search engines, web indexes, or internet-scale crawling platforms
Familiarity with anti-bot systems, dynamic web content, browser automation, and large-scale extraction pipelines
Understanding of how web data is used in training and evaluating large language models
Experience with distributed storage systems, content deduplication, and web-scale dataset management
What We Offer:
We believe that to make intelligence open and accessible to all, you need to start at the foundation. Joining Reflection means building from the ground up as part of a talent-dense team. You will help define our future as a company and help define the future of open foundational models.
We want you to do the most impactful work of your career with the confidence that you and the people you care about most are supported.
Top-tier compensation: Salary and equity structured to recognize and retain our talent globally.
Stock options: Everyone who joins and contributes to Reflection's success gets to share in the upside through stock options.
Health & wellness: Comprehensive medical, dental, vision, and life, with an annual wellness allowance.
Meals: Lunch and dinner are provided in the office daily.
Life & family: 22 weeks of paid parental leave for all new birthing and non-birthing parents, including adoptive and surrogate journeys.
Vacation days: Unlimited paid time off in the U.S. and 30 days in the U.K.
Sponsorship support: We sponsor visas to help exceptional talent join our team and support long-term immigration pathways where applicable.
Team building: We have regular off-sites, happy hours, and team celebrations.
Required Experience:
Staff IC