About Graphcore
At Graphcore, we're building the future of AI: a team of semiconductor, software, and AI experts with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale. As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem. To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world. We are bringing together the brightest minds to solve the toughest problems in a place where everyone has the opportunity to make an impact on the company, our products, and the future of artificial intelligence.
Job Summary
Responsible for system-level reliability of AI servers with liquid cooling and HVDC architectures, owning reliability validation, shock & vibration robustness, and failure analysis from board to rack level to ensure safe transport, deployment, and long-term datacenter operation.
Key Responsibilities and Skills
Plan and execute reliability validation across board, server, and rack levels.
Define and run environmental, accelerated, and mechanical tests, including thermal/power cycling, humidity, corrosion, shock & vibration, and HALT/HASS.
Lead shock & vibration validation for transportation, handling, seismic, and operational conditions.
Assess reliability risks for liquid cooling systems (leakage, fatigue, pump life, corrosion, coolant stability).
Evaluate HVDC mechanical and electrical robustness (busbars, connectors, power interfaces).
Perform reliability prediction and life data analysis (Weibull, MTBF).
Lead cross-functional design reviews and drive risk mitigation.
Conduct failure analysis and RCA using standard FA methodologies.
Define and maintain reliability and S&V test specifications (JEDEC, Telcordia GR-63, JESD22, MIL-STD-810, ISTA, ASHRAE, UL, IEC).
Implement Ongoing Reliability Test (ORT) for production quality.
Document results and support customer audits and certifications.
Qualifications
Bachelor's or Master's degree in Mechanical, Electrical, Reliability, Materials, or related Engineering.
10 years of reliability engineering experience in AI servers, datacenter systems, HPC, or complex electronics.
Hands-on experience with environmental, shock, and vibration testing.
Strong knowledge of reliability methodologies and statistical analysis.
Practical experience with liquid cooling and HVDC systems.
Proven failure analysis and RCA capability.
Strong communication skills in English; Mandarin a plus.
Preferred Experience
AI server architecture and large-scale liquid cooling systems.
FEA/modal analysis and test correlation.
Datacenter, telecom, and transportation standards knowledge.
Reliability certification (e.g., ASQ CRE).
Benefits
In addition to a competitive salary, Graphcore offers a competitive benefits package. We welcome people of different backgrounds and experiences; we're committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interviews and encourage you to talk to us if you require any reasonable adjustments.
Required Experience:
IC