
Principal Site Reliability Engineer
Walmart
full-time
Location Type: Hybrid
Location: Bentonville • California • United States
Salary
💰 $110,000 - $220,000 per year
About the role
- Drive the design and evolution of monitoring and observability frameworks that enable proactive detection, root cause analysis, and rapid resolution of customer-impacting incidents.
- Lead the development and integration of automation tools to streamline operational workflows, reduce toil, and enhance the reliability of customer service platforms.
- Participate in on-call rotations, applying deep technical expertise to swiftly diagnose and mitigate production issues, ensuring high availability and minimal disruption to customer support experiences.
- Collaborate closely with engineering teams to embed reliability into the software development lifecycle, championing a culture of shared ownership and “you build it, you run it.”
- Define and manage SLIs, SLOs, and SLAs to align service reliability with business expectations and continuously improve system performance.
- Apply proven reliability patterns and practices, leveraging hands-on experience to architect resilient systems that scale with customer demand.
- Lead post-incident reviews and blameless retrospectives, identifying systemic improvements and fostering a culture of continuous learning and operational excellence.
- Analyze system performance and advocate for cost-effective optimizations, balancing infrastructure efficiency with world-class service reliability.
Requirements
- 10+ years of experience engineering and scaling highly available, customer-facing systems with a focus on reliability and operational excellence.
- A proven ability to lead the design and implementation of resilient infrastructure and automation solutions that solve complex reliability challenges.
- Strong judgment in making architectural trade-offs, balancing long-term system health with short-term delivery needs.
- Deep expertise in distributed systems, service ownership models, CI/CD pipelines, and observability practices.
- Exceptional communication and collaboration skills, with a track record of influencing cross-functional teams and driving consensus on reliability strategies.
- Experience mentoring engineers in incident response, reliability patterns, and career growth within SRE disciplines.
- A curious mindset and eagerness to explore new technologies and domains that enhance customer support platforms at scale.
Benefits
- Health benefits include medical, vision and dental coverage.
- Financial benefits include 401(k), stock purchase and company-paid life insurance.
- Paid time off benefits include PTO (including sick leave), parental leave, family care leave, and leave for bereavement, jury duty, and voting.
- Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
- Walmart-paid education benefit program for full-time and part-time associates, covering tuition, books, and fees.
Applicant Tracking System Keywords
Hard Skills & Tools
monitoring frameworks, observability frameworks, automation tools, root cause analysis, SLIs, SLOs, SLAs, distributed systems, CI/CD pipelines, reliability patterns
Soft Skills
leadership, communication, collaboration, influencing, mentoring, judgment, curiosity, problem-solving, continuous learning, operational excellence