TELUS Digital

Lead DevOps Engineer

The Lead DevOps Engineer at TELUS Digital oversees infrastructure and reliability practices for AI-powered systems, collaborating across global teams to ensure robust performance and observability.

Posted 6/22/2026 • Full-time • São Paulo, 🇧🇷 Brazil • Senior

Tech Stack

Tools & technologies
AWS, Cloud, Distributed Systems, Google Cloud Platform, JavaScript, Kubernetes, Python, Terraform

About the role

Key responsibilities & impact
  • Lead the architecture and maintenance of the infrastructure and reliability practices that keep AI-powered systems performant, observable, and trustworthy under real production load, including redundancy, latency, and cost management.
  • Help define SLOs/SLIs for AI-powered services, including latency and quality SLOs for LLM inference paths, and build the error-budget discipline that lets product teams ship fast without breaking trust.
  • Design scalable, secure infrastructure for distributed AI services, event-driven workloads, and multi-LLM-provider integrations.
  • Build metrics, tracing, and alerting that surface not just 'is it up' but 'is it behaving correctly' for LLM-powered features (drift, regression, hallucination rates, tool-call failures).
  • Define and enforce PRR-style (production readiness review) standards across teams launching new AI products and features.
  • Mentor engineers, drive architecture reviews, and shape the broader engineering culture around reliability.

Requirements

What you’ll need
  • Significant infrastructure engineering experience combining DevOps and SRE disciplines at scale
  • Deep GCP expertise (AWS a strong plus); relevant cloud certifications welcome
  • Production experience with SRE fundamentals: SLO/SLI design, error budgets, toil reduction, blameless incident review
  • Strong background in distributed systems failure modes and resilience patterns
  • Expert-level infrastructure-as-code (Terraform), container orchestration (Kubernetes), and CI/CD
  • Hands-on experience with modern observability stacks (e.g., OpenTelemetry, Sentry) and AI-specific observability tooling (Arize, LangSmith, Braintrust, or similar)
  • Experience with API management and serverless platforms, particularly Apigee and Cloud Run
  • Comfort working across Python, JavaScript, and Bash for infra tooling
  • Strong spoken and written communication in English with teams and stakeholders

Benefits

Comp & perks
  • WFN culture designed to foster in-person innovation, collaboration, and connection with team members local and visiting from other global offices.
