
Staff Infrastructure Engineer – Kubernetes Platform

TensorWave

Staff Infrastructure Engineer for the Kubernetes platform, responsible for the design and operational reliability of the Kubernetes control plane architecture for a leading AI cloud platform.

Posted 6/10/2026 • full-time • Las Vegas • Nevada • 🇺🇸 United States • Lead

Tech Stack

Tools & technologies
Distributed Systems, Grafana, Kubernetes, Linux, Prometheus

About the role

Key responsibilities & impact
  • Own the design, evolution, and operational reliability of our Kubernetes control plane architecture
  • Design and evolve Kubernetes control plane architecture across regions
  • Define and implement multi-tenant cluster models, including shared control planes and virtual cluster approaches (e.g., vcluster, Kamaji)
  • Drive transition from standalone clusters to regionally managed platform models
  • Define standards for isolation boundaries, resource segmentation, and policy enforcement
  • Own the reliability and behavior of Kubernetes platforms in production
  • Participate in on-call rotation and lead incident response
  • Diagnose and resolve control plane instability, API server saturation, and scheduling and resource contention issues
  • Ensure consistent lifecycle management across clusters: provisioning, upgrades, and scaling
  • Design and implement strategies for regional scaling and multi-data center cluster deployments
  • Ensure consistent behavior and reliability across environments
  • Define cluster topology and failure domain strategies
  • Design ingress and egress architectures at the cluster and regional levels
  • Troubleshoot and optimize pod-to-pod networking, north-south traffic flows, and CNI behavior (Cilium preferred)
  • Collaborate with network engineering on high-performance networking integration
  • Improve observability across control plane components, cluster health, and performance
  • Define and implement resilience strategies aligned with platform goals
  • Lead root cause analysis for production incidents
  • Work closely with DevOps engineers (automation and CI/CD) and Infrastructure teams (compute, storage, networking)
  • Align Kubernetes platform design with underlying infrastructure capabilities

Requirements

What you’ll need
  • 7+ years of experience in infrastructure, platform engineering, or distributed systems
  • Deep experience operating Kubernetes at scale in production environments
  • Experience in CSP, hyperscale, or equivalent large-scale environments strongly preferred
  • Proven experience scaling Kubernetes across:
    - Multiple clusters
    - Multiple regions or data centers
  • Strong understanding of Kubernetes internals:
    - API server
    - Scheduler
    - Controller manager
    - etcd
  • Experience designing or evolving:
    - Control plane architectures
    - Multi-tenant cluster models
  • Strong Linux systems expertise
  • Deep troubleshooting ability across:
    - Kubernetes
    - Container runtime
    - Networking stack
  • Experience with CNI plugins (Cilium preferred)
  • Strong understanding of:
    - Networking and traffic patterns
    - Resource isolation and scheduling
  • Experience with virtual cluster technologies (vcluster, Kamaji, or similar) (preferred)
  • Experience supporting GPU workloads in Kubernetes (preferred)
  • Familiarity with:
    - NUMA-aware scheduling (preferred)
    - Topology-aware workloads (preferred)
    - RDMA and high-throughput networking environments (preferred)
  • Experience with observability platforms (Prometheus, Grafana, etc.) (preferred)

Benefits

Comp & perks
  • Stock Options
  • 100% paid Medical, Dental, and Vision insurance for Employees
  • Company Health Savings Account Contributions
  • 100% paid Short Term and Long Term Disability Insurance for Employees
  • Life and Voluntary Supplemental Insurance Options
  • Other Insurance Options, such as Pet & Legal Insurance
  • Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
  • Flexible Spending Account
  • 401(k)
  • Employee Assistance Program
  • Flexible PTO
  • Paid Holidays
  • Parental Leave
  • Other In-Office Perks

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Kubernetes, control plane architecture, multi-tenant cluster models, API server, scheduler, controller manager, etcd, Linux systems, CNI plugins, observability
Soft Skills
troubleshooting, incident response, collaboration, leadership, root cause analysis