Senior Site Reliability Engineer

Akamai Technologies

Own operational reliability of cloud load balancing infrastructure serving global customers. Design and implement frameworks reflecting customer experience for reliability management.

Posted 6/11/2026 • Full-time • Remote • 🇨🇦 Canada • Senior • 💰 CA$120,400 - CA$216,600 per year

ATS Keywords

Tailor your resume

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

SRE, infrastructure engineering, platform engineering, Linux networking, load balancing, SLO/SLI frameworks, observability platforms, Kubernetes, Python, Go

Soft Skills

technical incident response, root cause analysis, preventive follow-through, actionable input, capacity implications, operational risks, team-wide operational capability

Tools & Technologies

tcpdump, netstat, SaltStack, Ansible, Terraform, feature flag management, observability pipelines, global datacenter rollouts, bake-period monitoring, incident management

Industry Keywords

large-scale distributed systems, high availability, failure modes, containerization, workload scheduling, resource management, stateful workloads, network-intensive workloads

Tech Stack

Tools & technologies

Ansible, Distributed Systems, Go, Kubernetes, Linux, Python, SaltStack, Terraform

About the role

Key responsibilities & impact

Owning the SRE infrastructure lifecycle from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management
Designing and implementing frameworks that reflect customer experience for load balancing services and driving action when error budgets are at risk
Building and maintaining observability pipelines from load-balancing components and system-level sources to dashboards that enable rapid incident triage
Leading technical incident response for complex NB/NLB failures, acting as the technical commander and driving root cause analysis and preventive follow-through
Developing and automating safe deployment workflows for phased releases, including bake-period monitoring, feature flag management, and validation across global datacenter rollouts
Reviewing design documents and product-requirement documents, and producing actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps
Building automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability

Requirements

What you’ll need

Have 8+ years of experience in SRE, infrastructure engineering, or platform engineering, working with large-scale distributed systems
Demonstrate deep expertise in Linux networking fundamentals and in diagnosing issues at the packet level using tcpdump, netstat, and similar tools
Have hands-on experience with L4/L7 load balancing technologies covering configuration, health checking, high availability, and failure modes at scale
Show a track record of defining SLO/SLI frameworks, building observability platforms from scratch, and running incident management processes at scale
Demonstrate expertise in Kubernetes and containerization at scale including workload scheduling, networking, resource management, and operating stateful or network-intensive workloads in a cluster environment
Build automation and tooling using Python or Go, with infrastructure-as-code experience (SaltStack, Ansible, or Terraform) and strong deployment-safety instincts

Benefits

Comp & perks

healthcare
RRSP
company holidays
vacation (in the form of PTO)
sick time
family-friendly benefits, including an employee assistance program with a focus on mental and financial wellness