Tech Stack
Ansible, Cloud, Grafana, Kubernetes, Linux, OpenStack, Prometheus, Python, Terraform
About the role
- Design GPU-optimized Kubernetes clusters and build multi-tenant GPU infrastructure as part of the GPUaaS offering.
- Architect and deploy OpenStack and Kubernetes clusters designed for GPU scheduling, high performance, and multi-tenant workloads.
- Automate deployment pipelines for cloud infrastructure using Terraform, Ansible, Helm, and Kubernetes Operators.
- Build and manage GPU-ready container runtimes, NVIDIA device plugins, and Kubernetes-native GPU provisioning frameworks.
- Ensure high availability and performance of OpenStack and Kubernetes clusters using tools such as Prometheus, Grafana, Loki, and Thanos.
- Implement secure namespace isolation, RBAC, and network policies across OpenStack and Kubernetes layers.
- Collaborate cross-functionally with DevOps, AI, Support, and Product teams to align infrastructure services with platform goals.
- Contribute to automation, observability, and CI/CD tooling across the platform and provide guidance on Kubernetes best practices.
Requirements
- 5+ years of experience with OpenStack in production environments.
- 3+ years of experience managing production-grade Kubernetes clusters, including bare-metal or private cloud environments.
- Strong hands-on expertise with Kubernetes operators, Helm, and custom resource definitions (CRDs).
- Experience with GPU orchestration in Kubernetes using NVIDIA tools.
- Experience with multi-cluster or federated Kubernetes.
- Proficiency in Linux, Ceph, networking (Calico/Cilium), and infrastructure scripting (Python, Bash).
- Strong knowledge of cloud-native security, policy frameworks, and service meshes.
- Experience with CI/CD pipelines, GitOps, and infrastructure-as-code tooling (Terraform, Ansible, ArgoCD).
- Deep OpenStack and strong Kubernetes expertise.
- Good to have: Experience integrating Kubernetes with OpenStack.
- Good to have: Prior contributions to Kubernetes SIGs or CNCF projects.
- Good to have: Knowledge of GPU metering, billing, and quota enforcement.
- Good to have: Familiarity with HPC environments, InfiniBand/RoCEv2 networking, or Slurm integration.