Infrastructure Engineer (Kernel API)

Diligent Tec Inc


Job Location:

San Jose, CA - USA

Monthly Salary: Not Disclosed
Posted on: 3 hours ago
Vacancies: 1 Vacancy

Job Summary

Job Title: Infrastructure Engineer (Kernel API)
Location: San Jose, CA; onsite from day one (4 days)


Job description:

Problem Solving and Deep-Level Troubleshooting: Investigating and troubleshooting problems and hardware faults within our GPU platforms that our automation can't diagnose. This will involve taking data from system logs, kernel logs, and BMC Redfish APIs and, if the data is not there, working with hardware and kernel engineers to add the information you need to make accurate determinations.
Coordination and Collaboration: Working closely with our Data Centre Operations, Hardware Engineering, and Capacity Planning teams to repair and remediate failed hardware, ensure consistent delivery of new hardware to customers, and roll out new upgrades across the fleet.
Automation and Tool Development: Automate routine processes and build hardware diagnostics, provisioning, and repair tooling.
Build Processes and Documentation: When you figure out the best way to do something, you'll be working on building processes, documentation, and tooling to help the next person who finds this problem.
Validate and Test New Hardware: Crusoe is often the first company in the world to get the latest-generation AI hardware, before it's fully tested. You'll conduct rigorous testing and validation on such cutting-edge hardware and on servers that come back from repair.
On-Call: Participate in our on-call rotation, partnering with our US teams to provide follow-the-sun coverage.
What You'll Bring to the Team
Strong analytical, troubleshooting, and problem-solving skills: Our automation takes care of the easy problems; you'll be digging deep to figure out the hard ones.
Linux experience: You'll have a solid understanding of Linux internals and feel at home working in a terminal.
Server Hardware and Provisioning: Exposure to server-class hardware and provisioning.
Fundamentals of Hardware and Networking: You don't need to be an expert, but you should be able to tell, without escalating, whether an error message is due to a failed hardware component, a firmware bug, or a networking misconfiguration.
Excellent communication and collaboration skills: You'll be working with many different people across a lot of different teams, so communication is critical.
Education: Bachelor's degree in Computer Science or a related field, or self-educated in computer science fundamentals.
Bonus Points
Large-scale GPU operations: We work with cutting-edge hardware and software, so we understand most people won't have worked with it - but it would be nice if you have!
Programming Proficiency: Proficiency with at least one programming language (Python, Go, or similar).

Required Skills:

Python, API, Kernel, Kernel-based Virtual Machine (KVM)
