Lead DevOps Engineer

TELUS Digital

Lead DevOps Engineer at TELUS Digital, overseeing infrastructure and reliability practices for AI-powered systems. The role collaborates across global teams to ensure robust performance and observability.

Posted 6/22/2026 • Full-time • São Paulo, 🇧🇷 Brazil • Senior

Tech Stack

Tools & technologies

AWS, Cloud, Distributed Systems, Google Cloud Platform, JavaScript, Kubernetes, Python, Terraform

About the role

Key responsibilities & impact

Lead the architecture and maintenance of the infrastructure and reliability practices that keep AI-powered systems performant, observable, and trustworthy under real production load, including redundancy, latency, and cost management.
Help define SLOs/SLIs for AI-powered services, including latency and quality SLOs for LLM inference paths, and build the error-budget discipline that lets product teams ship fast without breaking trust.
Design scalable, secure infrastructure for distributed AI services, event-driven workloads, and multi-LLM-provider integrations.
Build metrics, tracing, and alerting that surface not just 'is it up' but 'is it behaving correctly' for LLM-powered features (drift, regression, hallucination rates, tool-call failures).
Define and enforce Production Readiness Review (PRR)-style standards across teams launching new AI products and features.
Mentor engineers, drive architecture reviews, and shape the broader engineering culture around reliability.

Requirements

What you’ll need

Significant infrastructure engineering experience combining DevOps and SRE disciplines at scale
Deep GCP expertise (AWS a strong plus); relevant cloud certifications welcome
Production experience with SRE fundamentals: SLO/SLI design, error budgets, toil reduction, blameless incident review
Strong background in distributed systems failure modes and resilience patterns
Expert-level infrastructure-as-code (Terraform), container orchestration (Kubernetes), and CI/CD
Hands-on with modern observability stacks (e.g., OpenTelemetry, Sentry) and AI-specific observability tooling (Arize, LangSmith, Braintrust, or similar)
Experience with API management platforms, particularly Apigee, and with Cloud Run
Comfort working across Python, JavaScript, and Bash for infra tooling
Strong spoken and written communication in English with teams and stakeholders

Benefits

Comp & perks

WFN culture designed to foster in-person innovation, collaboration, and connection with team members local and visiting from other global offices.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

infrastructure engineering, DevOps, SRE, SLO design, SLI design, error budgets, infrastructure-as-code, Terraform, Kubernetes, CI/CD

Soft Skills

mentoring, architecture reviews, engineering culture, communication