Tech Stack
Ansible, Cloud, Distributed Systems, Go, Grafana, Kubernetes, Prometheus, Python, Terraform
About the role
- Define and execute Nscale’s observability strategy across all infrastructure and services.
- Build and manage a global observability engineering team.
- Deploy, operate, and scale observability platforms (Prometheus, Grafana, Loki, tracing/alerting systems).
- Ensure comprehensive instrumentation across GPU clusters, networking fabrics, Kubernetes (NKS/NKS Lite), and Slurm orchestration.
- Establish and track reliability metrics (SLIs, SLOs, error budgets) to guide service health.
- Integrate observability with incident management and fleet automation.
- Drive down MTTD and MTTR through proactive monitoring and automated remediation.
- Deliver executive-level reporting on system health, capacity, and reliability trends.
- Stay ahead of industry trends in observability, AIOps, and AI workload telemetry.
Requirements
- 10+ years of experience in large-scale infrastructure, SRE, or observability roles.
- Leadership experience managing distributed engineering teams.
- Strong expertise with observability tools (Prometheus, Grafana, Loki, alerting systems).
- Deep understanding of distributed systems, networking, and cloud-native architectures.
- Proficiency in automation and scripting (Python, Go, Bash).
- Hands-on experience with Kubernetes and container orchestration.
- Experience improving incident response processes and operational reliability.
- Experience with GPU/AI workload observability (e.g. DCGM, model telemetry, prompt analytics) (Nice to Have).
- Familiarity with HPC environments (Slurm, RDMA, InfiniBand) (Nice to Have).
- Knowledge of Infrastructure-as-Code (Terraform, Pulumi, Ansible) (Nice to Have).
- Awareness of sustainability and efficiency practices in data centre observability (Nice to Have).