Apple is where individual imaginations gather together, committing to the values that lead to great work. Every new product we build, service we create, or Apple Store experience we deliver is the result of us making each other's ideas stronger. That happens because every one of us shares a belief that we can make something wonderful and share it with the world, changing lives for the better. It's the diversity of our people and their thinking that inspires the innovation that runs through everything we do. When we bring everybody in, we can do the best work of our lives. Here, you'll do more than join something; you'll add something!
As a Senior/Staff Engineer on the Foundation Model Compute Infrastructure team, you will lead the design and development of scheduling and orchestration systems for large-scale TPU workloads across multiple regions. You will work on distributed systems that manage thousands of accelerators and enable reliable, efficient execution of large-scale training and inference jobs. This role spans scheduling algorithms, cluster lifecycle management, workload orchestration, reliability engineering, and performance optimization.
Design and evolve large-scale scheduling systems for TPU-based training and inference workloads across multi-region clusters
Build topology-aware, quota-aware, and fault-tolerant schedulers to improve utilization, fairness, startup latency, and reliability
Develop orchestration systems for distributed ML workloads running on Kubernetes and accelerator infrastructure
Improve cluster efficiency and operational scalability through automation of provisioning, resource management, quota workflows, and recovery handling
Collaborate closely with foundation model teams to support advanced distributed training and inference frameworks such as Pathways, Ray, and JAX-based workloads
Mentor engineers and influence architectural direction across Apple's distributed AI compute platform

7 years of industry experience building large-scale distributed systems or cloud infrastructure
Strong programming skills in Python, Go, C, or similar systems languages
Extensive experience with compute infrastructure and workload scheduling
Strong expertise in distributed systems, scalability, reliability, and performance engineering
Experience with Kubernetes, container orchestration, or large-scale cluster management systems
Experience designing backend services or infrastructure platforms operating at production scale
Strong communication and collaboration skills across engineering and research teams
Bachelor's degree in Computer Science, Engineering, or a related field

Experience building schedulers, resource managers, or orchestration systems for distributed workloads
Experience with accelerator infrastructure such as TPU or GPU
Experience with distributed ML training or inference systems
Familiarity with frameworks such as JAX, PyTorch, TensorFlow, Ray, and Pathways
Experience operating large-scale, multi-tenant infrastructure in cloud or hybrid environments
Background in performance optimization, fault tolerance, or resource efficiency for large distributed systems
MS or PhD in Computer Science, Engineering, or a related field
Required Experience:
Staff IC
Ask Siri to name the most successful company in the world and it might respond: Apple. And it's not just out of familial pride. Apple consistently ranks highly in profit, revenue, market capitalization, and consumer cachet. In 2018, the company became the first to reach a trillion dollar ...