Director of Observability

Nscale

full-time

Posted on: 9/22/2025

Origin: 🇬🇧 United Kingdom

Lead

Ansible, Cloud, Distributed Systems, Go, Grafana, Kubernetes, Prometheus, Python, Terraform

About the role

Define and execute Nscale’s observability strategy across all infrastructure and services.
Build and manage a global observability engineering team.
Deploy, operate, and scale observability platforms (Prometheus, Grafana, Loki, tracing/alerting systems).
Ensure comprehensive instrumentation across GPU clusters, networking fabrics, Kubernetes (NKS/NKS Lite), and Slurm orchestration.
Establish and track reliability metrics (SLIs, SLOs, error budgets) to guide service health.
Integrate observability with incident management and fleet automation.
Drive down MTTD and MTTR through proactive monitoring and automated remediation.
Deliver executive-level reporting on system health, capacity, and reliability trends.
Stay ahead of industry trends in observability, AIOps, and AI workload telemetry.

Requirements

10+ years of experience in large-scale infrastructure, SRE, or observability roles.
Leadership experience managing distributed engineering teams.
Strong expertise with observability tools (Prometheus, Grafana, Loki, alerting systems).
Deep understanding of distributed systems, networking, and cloud-native architectures.
Proficiency in automation and scripting (Python, Go, Bash).
Hands-on experience with Kubernetes and container orchestration.
Experience in improving incident response processes and operational reliability.
Experience with GPU/AI workload observability (e.g. DCGM, model telemetry, prompt analytics) (Nice to Have).
Familiarity with HPC environments (Slurm, RDMA, InfiniBand) (Nice to Have).
Knowledge of Infrastructure-as-Code (Terraform, Pulumi, Ansible) (Nice to Have).
Awareness of sustainability and efficiency practices in data centre observability (Nice to Have).