Nscale

Director of Observability

full-time

Origin: 🇬🇧 United Kingdom

Job Level

Lead

Tech Stack

Ansible, Cloud, Distributed Systems, Go, Grafana, Kubernetes, Prometheus, Python, Terraform

About the role

  • Define and execute Nscale’s observability strategy across all infrastructure and services.
  • Build and manage a global observability engineering team.
  • Deploy, operate, and scale observability platforms (Prometheus, Grafana, Loki, tracing/alerting systems).
  • Ensure comprehensive instrumentation across GPU clusters, networking fabrics, Kubernetes (NKS/NKS Lite), and Slurm orchestration.
  • Establish and track reliability metrics (SLIs, SLOs, error budgets) to guide service health.
  • Integrate observability with incident management and fleet automation.
  • Drive down MTTD and MTTR through proactive monitoring and automated remediation.
  • Deliver executive-level reporting on system health, capacity, and reliability trends.
  • Stay ahead of industry trends in observability, AIOps, and AI workload telemetry.

Requirements

  • 10+ years of experience in large-scale infrastructure, SRE, or observability roles.
  • Leadership experience managing distributed engineering teams.
  • Strong expertise with observability tools (Prometheus, Grafana, Loki, alerting systems).
  • Deep understanding of distributed systems, networking, and cloud-native architectures.
  • Proficiency in automation and scripting (Python, Go, Bash).
  • Hands-on experience with Kubernetes and container orchestration.
  • Experience improving incident response processes and operational reliability.
  • Experience with GPU/AI workload observability (e.g. DCGM, model telemetry, prompt analytics) (Nice to Have).
  • Familiarity with HPC environments (Slurm, RDMA, InfiniBand) (Nice to Have).
  • Knowledge of Infrastructure-as-Code (Terraform, Pulumi, Ansible) (Nice to Have).
  • Awareness of sustainability and efficiency practices in data centre observability (Nice to Have).

Docusign

Principal Product Manager - Site Reliability

Docusign
Lead · full-time · $174k–$328k / year · 🇺🇸 United States
Posted: 37 days ago · Source: uscareers-docusign.icims.com
Ansible, AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Terraform

Sharetec Systems

Senior DevOps Engineer

Sharetec Systems
Senior · full-time · $110k–$130k / year · Alabama, Arizona, Colorado, Florida, Idaho, Illinois, Iowa, Kentucky · 🇺🇸 United States
Posted: 5 days ago · Source: recruiting.paylocity.com
Ansible, AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Grafana, JavaScript, Prometheus, Python, Terraform

Sinch

Senior Site Reliability Engineer

Sinch
Senior · full-time · $143k–$179k / year · Colorado, Illinois · 🇺🇸 United States
Posted: 18 days ago · Source: apply.workable.com
Ansible, AWS, Cassandra, Cloud, Distributed Systems, ElasticSearch, Go, Google Cloud Platform, Grafana, Linux, Microservices, Prometheus, +2 more

Red Cell Partners

Senior Site Reliability Engineer

Red Cell Partners
Senior · full-time · 🇺🇸 United States
Posted: 38 days ago · Source: boards.greenhouse.io
Ansible, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform

Highspot

Senior Software Development Engineer

Highspot
Senior · full-time · 🇮🇳 India
Posted: 19 days ago · Source: jobs.lever.co
Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Microservices, +5 more