Lead DevOps Engineer
TELUS Digital
Lead DevOps Engineer at TELUS Digital, overseeing infrastructure and reliability practices for AI-powered systems and collaborating across global teams to ensure robust performance and observability.
Tech Stack
Tools & technologies
AWS, Cloud, Distributed Systems, Google Cloud Platform, JavaScript, Kubernetes, Python, Terraform
About the role
Key responsibilities & impact
- Lead the architecture and maintenance of the infrastructure and reliability practices that keep AI-powered systems performant, observable, and trustworthy under real production load, including redundancy, latency, and cost management.
- Help define SLOs/SLIs for AI-powered services, including latency and quality SLOs for LLM inference paths, and build the error-budget discipline that lets product teams ship fast without breaking trust.
- Design scalable, secure infrastructure for distributed AI services, event-driven workloads, and multi-LLM-provider integrations.
- Build metrics, tracing, and alerting that surface not just 'is it up' but 'is it behaving correctly' for LLM-powered features (drift, regression, hallucination rates, tool-call failures).
- Define and enforce PRR-style standards across teams launching new AI products and features.
- Mentor engineers, drive architecture reviews, and shape the broader engineering culture around reliability.
Requirements
What you’ll need
- Significant infrastructure engineering experience combining DevOps and SRE disciplines at scale
- Deep GCP expertise (AWS a strong plus); relevant cloud certifications welcome
- Production experience with SRE fundamentals: SLO/SLI design, error budgets, toil reduction, blameless incident review
- Strong background in distributed systems failure modes and resilience patterns
- Expert-level infrastructure-as-code (Terraform), container orchestration (Kubernetes), and CI/CD
- Hands-on experience with modern observability stacks (e.g., OpenTelemetry, Sentry) and AI-specific observability tooling (Arize, LangSmith, Braintrust, or similar)
- Experience with API management platforms, particularly Apigee and Cloud Run
- Comfort working across Python, JavaScript, and Bash for infra tooling
- Strong spoken and written communication in English with teams and stakeholders
Benefits
Comp & perks
- WFN culture designed to foster in-person innovation, collaboration, and connection with team members local and visiting from other global offices.
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
infrastructure engineering, DevOps, SRE, SLO design, SLI design, error budgets, infrastructure-as-code, Terraform, Kubernetes, CI/CD
Soft Skills
mentoring, architecture reviews, engineering culture, communication