Senior Site Reliability Engineer – Cloud and Networking
Akamai Technologies
Senior Site Reliability Engineer responsible for the reliability of Akamai's load balancing infrastructure. The role covers designing SLO frameworks, leading incident management, and mentoring junior engineers.
Tech Stack
Tools & technologies
Ansible, Cloud, Distributed Systems, Go, Grafana, Kubernetes, Linux, Prometheus, Python, SaltStack, Terraform
About the role
Key responsibilities & impact
- Owning the SRE lifecycle for NodeBalancer and Network Load Balancer, from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management
- Designing and implementing SLO/SLI frameworks that reflect true customer experience for L4 and L7 load balancing services, and driving action when error budgets are at risk
- Building and maintaining observability pipelines for NB/NLB infrastructure, including Prometheus metrics from load balancing components and system-level sources, and Grafana dashboards that enable rapid incident triage
- Leading technical incident response for complex NB/NLB failures — BGP/VIP issues, failover failures, data plane degradations, and configuration problems — acting as the technical commander and driving root cause analysis and preventive follow-through
- Developing and automating safe deployment workflows for phased NB/NLB releases, including bake period monitoring, feature flag management, and GO/NO-GO validation across global datacenter rollouts
- Reviewing design documents and product requirements documents, and producing actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps
- Building automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability
- Mentoring SRE II engineers on the NB team, providing hands-on technical guidance, code/config reviews, and raising the bar for the team's SRE practice
- Participating in a scheduled, daytime-only on-call rotation for NB/NLB production systems, leading technical incident response and driving resolution of complex failures in customer-facing load balancing infrastructure
Requirements
What you’ll need
- Have extensive experience in SRE, platform engineering, or infrastructure engineering, working with large-scale distributed systems
- Demonstrate deep expertise with Linux networking fundamentals — routing, BGP, nftables/iptables, ARP, VXLAN — and comfort diagnosing at the packet level using tcpdump, netstat, and similar tools
- Have hands-on experience with L4/L7 load balancing technologies — including proxy-based or kernel-level load balancers — covering configuration, health checking, high availability, and failure modes at scale
- Show a track record of defining SLO/SLI frameworks, building observability platforms from scratch, and running incident management processes at scale
- Demonstrate expertise in Kubernetes and containerization at scale — including workload scheduling, networking (CNI, Services, ingress), resource management, and operating stateful or network-intensive workloads in a cluster environment
- Build automation and tooling using Python or Go, with infrastructure-as-code experience (SaltStack, Ansible, or Terraform) and strong deployment safety instincts
- Demonstrate 4+ years in SRE or infrastructure engineering, with at least 2 years at cloud scale
Benefits
Comp & perks
- Your health
- Your finances
- Your family
- Your time at work
- Your time pursuing other endeavors
ATS Keywords
Hard Skills & Tools
SRE, platform engineering, infrastructure engineering, Linux networking, BGP, load balancing, Kubernetes, Python, Go, infrastructure-as-code
Soft Skills
mentoring, incident management, technical guidance, root cause analysis, communication