
Principal Site Reliability Engineer
Walmart
full-time
Location Type: Office
Location: Bentonville • United States
Salary
💰 $110,000 - $220,000 per year
About the role
- Drive the design and evolution of monitoring and observability frameworks that enable proactive detection, root cause analysis, and rapid resolution of customer-impacting incidents.
- Lead the development and integration of automation tools to streamline operational workflows, reduce toil, and enhance the reliability of customer service platforms.
- Participate in on-call rotations, applying deep technical expertise to swiftly diagnose and mitigate production issues, ensuring high availability and minimal disruption to customer support experiences.
- Collaborate closely with engineering teams to embed reliability into the software development lifecycle, championing a culture of shared ownership and “you build it, you run it.”
- Define and manage SLIs, SLOs, and SLAs to align service reliability with business expectations and continuously improve system performance.
- Apply proven reliability patterns and practices, leveraging hands-on experience to architect resilient systems that scale with customer demand.
- Lead post-incident reviews and blameless retrospectives, identifying systemic improvements and fostering a culture of continuous learning and operational excellence.
- Analyze system performance and advocate for cost-effective optimizations, balancing infrastructure efficiency with world-class service reliability.
Requirements
- Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 5 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
- Option 2: 7 years’ experience in site reliability engineering, site and system administration, infrastructure management, or related area.
- 10+ years of experience engineering and scaling highly available, customer-facing systems with a focus on reliability and operational excellence.
- A proven ability to lead the design and implementation of resilient infrastructure and automation solutions that solve complex reliability challenges.
- Strong judgment in making architectural trade-offs, balancing long-term system health with short-term delivery needs.
- Deep expertise in distributed systems, service ownership models, CI/CD pipelines, and observability practices.
- Experience mentoring engineers in incident response, reliability patterns, and career growth within SRE disciplines.
- A curious mindset and eagerness to explore new technologies and domains that enhance customer support platforms at scale.
Benefits
- Health benefits include medical, vision and dental coverage.
- Financial benefits include 401(k), stock purchase and company-paid life insurance.
- Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
- Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
- Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
site reliability engineering, infrastructure management, automation tools, monitoring frameworks, observability practices, distributed systems, CI/CD pipelines, system performance analysis, reliability patterns, resilient infrastructure
Soft Skills
leadership, collaboration, judgment, mentoring, curiosity, problem-solving, communication, continuous learning, operational excellence, shared ownership
Certifications
Bachelor's degree in computer science, Bachelor's degree in computer engineering, Bachelor's degree in computer information systems, Bachelor's degree in software engineering