Walmart

Principal Site Reliability Engineer

full-time

Location Type: Hybrid

Location: Bentonville; California; United States

Salary

💰 $110,000 - $220,000 per year

About the role

  • Drive the design and evolution of monitoring and observability frameworks that enable proactive detection, root cause analysis, and rapid resolution of customer-impacting incidents.
  • Lead the development and integration of automation tools to streamline operational workflows, reduce toil, and enhance the reliability of customer service platforms.
  • Participate in on-call rotations, applying deep technical expertise to swiftly diagnose and mitigate production issues, ensuring high availability and minimal disruption to customer support experiences.
  • Collaborate closely with engineering teams to embed reliability into the software development lifecycle, championing a culture of shared ownership and “you build it, you run it.”
  • Define and manage SLIs, SLOs, and SLAs to align service reliability with business expectations and continuously improve system performance.
  • Apply proven reliability patterns and practices, leveraging hands-on experience to architect resilient systems that scale with customer demand.
  • Lead post-incident reviews and blameless retrospectives, identifying systemic improvements and fostering a culture of continuous learning and operational excellence.
  • Analyze system performance and advocate for cost-effective optimizations, balancing infrastructure efficiency with world-class service reliability.

Requirements

  • 10+ years of experience engineering and scaling highly available, customer-facing systems with a focus on reliability and operational excellence.
  • A proven ability to lead the design and implementation of resilient infrastructure and automation solutions that solve complex reliability challenges.
  • Strong judgment in making architectural trade-offs, balancing long-term system health with short-term delivery needs.
  • Deep expertise in distributed systems, service ownership models, CI/CD pipelines, and observability practices.
  • Exceptional communication and collaboration skills, with a track record of influencing cross-functional teams and driving consensus on reliability strategies.
  • Experience mentoring engineers in incident response, reliability patterns, and career growth within SRE disciplines.
  • A curious mindset and eagerness to explore new technologies and domains that enhance customer support platforms at scale.

Benefits

  • Health benefits include medical, vision and dental coverage.
  • Financial benefits include 401(k), stock purchase and company-paid life insurance.
  • Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
  • Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
  • Walmart-paid education benefit program for full-time and part-time associates, covering tuition, books, and fees.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
monitoring frameworks, observability frameworks, automation tools, root cause analysis, SLIs, SLOs, SLAs, distributed systems, CI/CD pipelines, reliability patterns
Soft Skills
leadership, communication, collaboration, influencing, mentoring, judgment, curiosity, problem-solving, continuous learning, operational excellence