AI Infrastructure Support Specialist
Role Summary
Responsible for day-to-day support of a high-availability AI/ML platform, ensuring stability across Linux systems, Kubernetes environments, and enterprise infrastructure. The role focuses on operational support, incident management, and basic troubleshooting in a regulated production environment.
Key Responsibilities
Perform routine platform operations (access requests, monitoring, health checks)
Handle ticket triage, follow runbooks, and escalate when needed
Manage Linux systems (logs, services, basic administration)
Support Kubernetes workloads (pods, services, basic troubleshooting)
Troubleshoot network connectivity issues (TCP/IP, DNS)
Maintain documentation, runbooks, and incident records
Support AI/ML platform operations in collaboration with data teams
Required Skills
Experience in IT infrastructure / platform support (NOC/SRE/Support)
Strong Linux fundamentals
Basic knowledge of networking (TCP/IP, DNS)
Understanding of Kubernetes & containerization
Scripting skills in Python and/or JavaScript
Preferred Skills
Hands-on exposure to Docker and Kubernetes
Familiarity with Ansible/Puppet (automation tools)
Awareness of AI/ML platforms or GPU environments
Key Attributes
Strong troubleshooting and analytical skills
Ability to work in 24x7/shift environments
Good communication and documentation skills
Experience
4-6 years in infrastructure/platform support roles
Required Experience:
IC
Key Skills
- Disaster Recovery
- Active Directory
- Production Environment
- OS
- Windows
- AIX
- Asset Management
- ITIL
- Linux
- Perl
- Java
- Business Units
- UAT
- UNIX
- Architecture