Senior AI Infrastructure, Platform Operations Engineer
Mirantis
Senior AI Infrastructure Engineer at Mirantis managing large-scale AI infrastructure powered by NVIDIA GPUs and Kubernetes. Leading technical operations and incident management with a focus on platform reliability and automation.
Tech Stack
Tools & technologies: Cloud, Distributed Systems, Grafana, Kubernetes, Linux, Prometheus
About the role
Key responsibilities & impact
- Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
- Act as a senior escalation point for operational teams during critical service-impacting events.
- Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
- Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
- Analyze platform performance, capacity, stability, and reliability trends to proactively identify risks.
- Lead root cause analysis activities and drive long-term corrective actions.
- Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve complex technical challenges.
- Participate in major incident management and service restoration activities.
- Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
- Drive improvements in platform reliability, observability, monitoring, and operational processes.
- Identify opportunities to automate repetitive operational activities and improve operational efficiency.
- Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
- Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
- Evaluate emerging technologies and operational practices to improve service delivery and platform resilience.
- Mentor and support AI Infrastructure & Platform Operations Engineers.
- Share technical knowledge through documentation, training sessions, and operational reviews.
- Develop and maintain operational standards, runbooks, troubleshooting guides, and best practices.
- Help define operational processes, escalation paths, and service reliability standards.
- Act as a trusted technical advisor during operational planning and service improvement initiatives.
Requirements
What you’ll need
- 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or related technical roles.
- Expert-level Linux administration and troubleshooting skills.
- Strong networking expertise, including experience diagnosing complex performance, connectivity, and reliability issues.
- Strong experience operating Kubernetes in production environments.
- Experience supporting large-scale production infrastructure and distributed systems.
- Proven experience leading technical investigations and managing complex incidents.
- Experience performing root cause analysis and driving long-term operational improvements.
- Strong understanding of observability, monitoring, and service reliability practices.
- Excellent troubleshooting and analytical skills across multiple infrastructure domains.
- Strong communication, collaboration, and stakeholder management skills.
- Experience in one or more of the following areas is highly desirable:
  - NVIDIA GPU infrastructure and accelerated computing platforms
  - InfiniBand networking and NVIDIA UFM
  - AI infrastructure environments
  - HPC environments
  - Platform Engineering or Site Reliability Engineering (SRE)
  - Large-scale Kubernetes operations
  - Infrastructure automation technologies and Infrastructure-as-Code practices
  - Observability platforms such as Grafana, Prometheus, ELK, or OpenTelemetry
  - Performance analysis and optimisation of distributed infrastructure platforms
  - Technical leadership, mentoring, or team lead responsibilities
Benefits
Comp & perks
- Operate some of the most advanced AI infrastructure environments in production today.
- Work with the latest NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
- Help define operational standards and reliability practices for next-generation AI infrastructure services.
- Influence the adoption of AI-powered operational capabilities through k0rdent AI.
- Work alongside highly skilled engineers solving complex infrastructure and platform challenges at scale.
- Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux Administration, Kubernetes, Networking Troubleshooting, Root Cause Analysis, Performance Analysis, Infrastructure Operations, Site Reliability Engineering, Cloud Operations, Distributed Systems, Operational Improvements
Soft Skills
Communication Skills, Collaboration Skills, Stakeholder Management, Mentoring Skills, Analytical Skills