Tech Stack
AWS, Cloud, DNS, Docker, Elasticsearch, Firewalls, Grafana, Kubernetes, Linux, Postgres, Prometheus, Python, Redis, Terraform, Vault
About the role
- Multi-Environment Kubernetes Architecture - Manage five distinct environments (NMS, Sandbox, Development, Staging, Production), each with its own security requirements, and design redundancy and failover mechanisms
- Infrastructure as Code Excellence - Develop and maintain Pulumi-based infrastructure using Python, managing complex cross-environment dependencies and VPC peering relationships
- Zero-Trust Security Implementation - Implement certificate-based VPN access with internal DNS resolution, configure WAF/security groups, and manage HashiCorp Vault integration
- Comprehensive Observability - Deploy and configure Prometheus, Grafana, Loki, Jaeger, and CloudWatch to provide unified monitoring across distributed infrastructure
- API Platform Management - Deploy and maintain a centralized API that manages all environments from the NMS hub, and implement automation for training jobs and inference optimization
Requirements
- Must-Have Qualifications:
- 5+ years in DevOps, SRE, or infrastructure engineering
- Expert-level Kubernetes experience with EKS and multi-cluster management
- Strong Python programming skills for infrastructure automation and API development
- Infrastructure as Code expertise with Pulumi, Terraform, or similar tools
- Deep AWS knowledge: VPC, EKS, ECR, S3, CloudWatch, IAM, and networking
- Linux system administration and containerization with Docker
- Hands-on experience with Prometheus, Grafana, and centralized logging systems
- Network security experience including VPN, firewalls, and certificate management
- Nice-to-Have Qualifications:
- Machine Learning infrastructure experience (GPU clusters, model serving, ML pipelines)
- HashiCorp Vault administration and integration
- GitOps experience with ArgoCD or similar tools
- Service mesh experience (Istio, Linkerd)
- Database administration (PostgreSQL, Redis, Elasticsearch)
- CI/CD pipeline design and multi-cloud infrastructure experience