Mirantis

Senior AI Infrastructure, Platform Operations Engineer

Senior AI Infrastructure Engineer at Mirantis, managing large-scale AI infrastructure powered by NVIDIA GPUs and Kubernetes. The role leads technical operations and incident management, with a focus on platform reliability and automation.

Posted 7/1/2026 • Full-time • Remote • 🇪🇺 Anywhere in Europe • Senior

Tech Stack

Tools & technologies
Cloud, Distributed Systems, Grafana, Kubernetes, Linux, Prometheus

About the role

Key responsibilities & impact
  • Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
  • Act as a senior escalation point for operational teams during critical service-impacting events.
  • Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
  • Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
  • Analyze platform performance, capacity, stability, and reliability trends to proactively identify risks.
  • Lead root cause analysis activities and drive long-term corrective actions.
  • Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve complex technical challenges.
  • Participate in major incident management and service restoration activities.
  • Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
  • Drive improvements in platform reliability, observability, monitoring, and operational processes.
  • Identify opportunities to automate repetitive operational activities and improve operational efficiency.
  • Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
  • Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
  • Evaluate emerging technologies and operational practices to improve service delivery and platform resilience.
  • Mentor and support AI Infrastructure & Platform Operations Engineers.
  • Share technical knowledge through documentation, training sessions, and operational reviews.
  • Develop and maintain operational standards, runbooks, troubleshooting guides, and best practices.
  • Help define operational processes, escalation paths, and service reliability standards.
  • Act as a trusted technical advisor during operational planning and service improvement initiatives.

Requirements

What you’ll need
  • 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or related technical roles.
  • Expert-level Linux administration and troubleshooting skills.
  • Strong networking expertise, including experience diagnosing complex performance, connectivity, and reliability issues.
  • Strong experience operating Kubernetes in production environments.
  • Experience supporting large-scale production infrastructure and distributed systems.
  • Proven experience leading technical investigations and managing complex incidents.
  • Experience performing root cause analysis and driving long-term operational improvements.
  • Strong understanding of observability, monitoring, and service reliability practices.
  • Excellent troubleshooting and analytical skills across multiple infrastructure domains.
  • Strong communication, collaboration, and stakeholder management skills.
  • Experience in one or more of the following areas is highly desirable:
      • NVIDIA GPU infrastructure and accelerated computing platforms
      • InfiniBand networking and NVIDIA UFM
      • AI infrastructure environments
      • HPC environments
      • Platform Engineering or Site Reliability Engineering (SRE)
      • Large-scale Kubernetes operations
      • Infrastructure automation technologies and Infrastructure-as-Code practices
      • Observability platforms such as Grafana, Prometheus, ELK, or OpenTelemetry
      • Performance analysis and optimisation of distributed infrastructure platforms
      • Technical leadership, mentoring, or team lead responsibilities

Benefits

Comp & perks
  • Operate some of the most advanced AI infrastructure environments in production today.
  • Work with the latest NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
  • Help define operational standards and reliability practices for next-generation AI infrastructure services.
  • Influence the adoption of AI-powered operational capabilities through k0rdent AI.
  • Work alongside highly skilled engineers solving complex infrastructure and platform challenges at scale.
  • Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux Administration, Kubernetes, Networking Troubleshooting, Root Cause Analysis, Performance Analysis, Infrastructure Operations, Site Reliability Engineering, Cloud Operations, Distributed Systems, Operational Improvements
Soft Skills
Communication Skills, Collaboration Skills, Stakeholder Management, Mentoring Skills, Analytical Skills