Senior Site Reliability Engineer
Akamai Technologies
Own operational reliability of cloud load balancing infrastructure serving global customers. Design and implement frameworks reflecting customer experience for reliability management.
Tech Stack
Tools & technologies: Ansible, Distributed Systems, Go, Kubernetes, Linux, Python, SaltStack, Terraform
About the role
Key responsibilities & impact
- Owning the SRE infrastructure lifecycle, from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management
- Designing and implementing frameworks that reflect customer experience for load balancing services and driving action when error budgets are at risk
- Building and maintaining observability pipelines from load-balancing components and system-level sources to dashboards that enable rapid incident triage
- Leading technical incident response for complex NB/NLB failures, acting as the technical commander and driving root cause analysis and preventive follow-through
- Developing and automating safe deployment workflows for phased releases, including bake-period monitoring, feature flag management, and validation across global datacenter rollouts
- Reviewing design documents and product requirement documents, and producing actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps
- Building automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability
Requirements
What you’ll need
- 8+ years of experience in SRE, infrastructure engineering, or platform engineering, working with large-scale distributed systems
- Demonstrate deep expertise in Linux networking fundamentals and packet-level diagnosis using tcpdump, netstat, and similar tools
- Have hands-on experience with L4/L7 load balancing technologies covering configuration, health checking, high availability, and failure modes at scale
- Show a track record of defining SLO/SLI frameworks, building observability platforms from scratch, and running incident management processes at scale
- Demonstrate expertise in Kubernetes and containerization at scale, including workload scheduling, networking, resource management, and operating stateful or network-intensive workloads in a cluster environment
- Build automation and tooling using Python or Go, with infrastructure-as-code experience (SaltStack, Ansible, or Terraform) and deployment safety instincts.
Benefits
Comp & perks
- healthcare
- RRSP
- company holidays
- vacation (in the form of PTO)
- sick time
- family-friendly benefits, including an employee assistance program with a focus on mental and financial wellness
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
SRE, infrastructure engineering, platform engineering, Linux networking, load balancing, SLO/SLI frameworks, observability platforms, Kubernetes, Python, Go
Soft Skills
technical incident response, root cause analysis, preventive follow-through, actionable input, capacity implications, operational risks, team-wide operational capability