The AI Search & Knowledge Platform Cloud Infrastructure Team within Apple's Services organization designs, builds, and scales the foundational systems that power Search and next-generation machine learning workloads. We are reimagining how infrastructure is managed through agentic, event-driven workflows, Crossplane compositions, and self-healing control planes.
This Software Development Engineer role encompasses the entire lifecycle of ML compute platform reliability engineering. The engineer will address user queries and tickets, triage and mitigate issues, and convert diagnosis processes and solutions from ad hoc to systematic, reactive to proactive, and manual to automatic. They will visualize ML platform scalability and stability, assessing their impact on development velocity and compute resource utilization. Based on the actual impact, they will prioritize engineering efforts across teams to improve the system's key performance indicators.
Ensure user queries and tickets are responded to in a timely manner
Evaluate current visibility into the state and performance of the system
Define and monitor system key performance indicators
Design and implement operational tools and protocols, along with CI/CD processes
As a technical leader, motivate and communicate across teams, driving a shared understanding of problems and best-practice solutions
Ability to analyze problems in depth, determine root causes, articulate them clearly, and propose solutions
Solid understanding of system architecture and of large-scale ML service and computational platform operations
Ability to drive a project from problem statement, through requirement and criteria definition, solution design, implementation, and deployment, to post-deployment operations, achieving the goal through teamwork or cross-team collaboration
Proficiency in coding with scripting and programming languages, including but not limited to Bash, Python, and Golang
7 years of experience in software development for compute infrastructure or its operational stack, commensurate with operating cutting-edge hybrid cloud platforms
Knowledge of ML, including LLMs, as well as experience developing real large-scale ML jobs
Knowledge of ML training and production workflows, with an understanding of the dependencies among architectural building blocks
Knowledge of analytics methods and pipelines, and the ability to use them to visualize platform KPIs
Experience designing and implementing systems to support ML applications
Experience in large-scale service and job deployment using an orchestration framework (Kubernetes) and cloud services for large-scale projects
Experience in observability of system behaviors, including deciding what should be visible based on the actual needs of a specific problem
Experience with and knowledge of quality assurance and A/B testing for large-scale systems
Required Experience:
Senior IC
Ask Siri to name the most successful company in the world and it might respond: Apple. And it's not just out of familial pride. Apple consistently ranks highly in profit, revenue, market capitalization, and consumer cachet. In 2018, the company became the first to reach a trillion dollar ...