Staff Infrastructure Engineer – Kubernetes Platform
TensorWave
Kubernetes Platform Staff Infrastructure Engineer responsible for the design and operational reliability of the Kubernetes control plane architecture for a leading AI cloud platform.
Tech Stack
Tools & technologies: Distributed Systems, Grafana, Kubernetes, Linux, Prometheus
About the role
Key responsibilities & impact
- Own the design, evolution, and operational reliability of our Kubernetes control plane architecture
- Design and evolve Kubernetes control plane architecture across regions
- Define and implement multi-tenant cluster models, including shared control planes and virtual cluster approaches (e.g., vcluster, Kamaji)
- Drive transition from standalone clusters to regionally managed platform models
- Define standards for isolation boundaries, resource segmentation, and policy enforcement
- Own the reliability and behavior of Kubernetes platforms in production
- Participate in on-call rotation and lead incident response
- Diagnose and resolve control plane instability, API server saturation, and scheduling and resource contention issues
- Ensure consistent lifecycle management across clusters: provisioning, upgrades, and scaling
- Design and implement strategies for regional scaling and multi-data-center cluster deployments
- Ensure consistent behavior and reliability across environments
- Define cluster topology and failure domain strategies
- Design ingress and egress architectures at cluster level and regional level
- Troubleshoot and optimize pod-to-pod networking, north-south traffic flows, and CNI behavior (Cilium preferred)
- Collaborate with network engineering on high-performance networking integration
- Improve observability across control plane components, cluster health, and performance
- Define and implement resilience strategies aligned with platform goals
- Lead root cause analysis for production incidents
- Work closely with DevOps engineers (automation and CI/CD) and Infrastructure teams (compute, storage, networking)
- Align Kubernetes platform design with underlying infrastructure capabilities
Requirements
What you’ll need
- 7+ years of experience in infrastructure, platform engineering, or distributed systems
- Deep experience operating Kubernetes at scale in production environments
- Experience in CSP, hyperscale, or equivalent large-scale environments strongly preferred
- Proven experience scaling Kubernetes across:
  - Multiple clusters
  - Multiple regions or data centers
- Strong understanding of Kubernetes internals:
  - API server
  - Scheduler
  - Controller manager
  - etcd
- Experience designing or evolving:
  - Control plane architectures
  - Multi-tenant cluster models
- Strong Linux systems expertise
- Deep troubleshooting ability across:
  - Kubernetes
  - Container runtime
  - Networking stack
- Experience with CNI plugins (Cilium preferred)
- Strong understanding of:
  - Networking and traffic patterns
  - Resource isolation and scheduling
- Experience with virtual cluster technologies (vcluster, Kamaji, or similar) (preferred)
- Experience supporting GPU workloads in Kubernetes (preferred)
- Familiarity with:
  - NUMA-aware scheduling (preferred)
  - Topology-aware workloads (preferred)
  - RDMA and high-throughput networking environments (preferred)
- Experience with observability platforms (Prometheus, Grafana, etc.) (preferred)
Benefits
Comp & perks
- Stock Options
- 100% paid Medical, Dental, and Vision insurance for Employees
- Company Health Savings Account Contributions
- 100% paid Short Term and Long Term Disability Insurance for Employees
- Life and Voluntary Supplemental Insurance Options
- Other Insurance Options, such as Pet & Legal Insurance
- Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
- Flexible Spending Account
- 401(k)
- Employee Assistance Program
- Flexible PTO
- Paid Holidays
- Parental Leave
- Other In-Office Perks
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Kubernetes, control plane architecture, multi-tenant cluster models, API server, scheduler, controller manager, etcd, Linux systems, CNI plugins, observability
Soft Skills
troubleshooting, incident response, collaboration, leadership, root cause analysis