Walmart

Senior Site Reliability Engineer

full-time

Location Type: Office

Location: Sunnyvale, California, United States

Salary

💰 $117,000 - $234,000 per year

About the role

  • Detect and document defects, bugs, and errors for the assigned component or module, and conduct analysis to determine their sources under guidance.
  • Troubleshoot performance and availability bottlenecks for the assigned application under guidance.
  • Utilize established criteria (for example, probability of failure and frequency of failure) to measure site reliability.
  • Monitor site reliability conditions and new reliability requirements.
  • Assist in the design and development of a reliability program plan for a specific site environment.
  • Apply appropriate tools, services, or applications for reliability prediction and other site improvements.
  • Research and assess various reliability models for different site environments.
  • Assist in the creation of simple, modular, extensible, and functional designs for the product or solution in adherence to the requirements.
  • Evaluate tradeoffs while designing across multiple components in a system based on the business requirements.
  • Convert high-level design (HLD) into detailed design for specific modules or components of a product or system.
  • Understand nuances of designing for disaster recovery.
  • Undertake infrastructure coding and automation.
  • Create and configure minimalistic, less complex, highly robust, and high-quality code for a component or module under guidance.
  • Maintain records by documenting program development and revisions.
  • Stay updated on the prevalent coding languages and frameworks in the industry outside the immediate scope of delivery.
  • Identify repetitive and routine tasks in Continuous Integration/Continuous Delivery (CI/CD), testing, or any other process that can be automated.
  • Implement telemetry features as required under guidance.
  • Apply security policy requirements to the component or module during code development and configuration.
  • Work with business partners to identify and document critical applications.
  • Interpret and follow procedures in contingency plans.
  • Explain the contingency and disaster recovery plans for the assigned environment.
  • Execute established procedures necessary to continue operations in an emergency.
  • Participate in the design of a minimum operating environment for a computer-based facility.
  • Suggest metrics to monitor software or system performance.
  • Monitor current performance data to ensure compliance with defined SLOs for multiple applications and systems.
  • Determine thresholds for monitoring metrics and trigger alerts based on those thresholds.
  • Supervise specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software.

Requirements

  • Master’s degree or equivalent in computer science, computer engineering, computer information systems, software engineering, or related area and 1 year of experience in site reliability engineering, site and system administration, infrastructure management, or related area; OR Bachelor’s degree or equivalent in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years of experience in site reliability engineering, site and system administration, infrastructure management, or related area.
  • Experience designing and implementing performance test strategies for complex web, mobile, API, and backend systems for Jira and Confluence Data Center instances.
  • Experience building and maintaining automated performance test scripts using tools including JMeter, Gatling, LoadRunner, and k6.
  • Experience performing root cause analysis of performance issues in production and test environments for Jira and Confluence Data Center instances, identifying CPU, memory, database, thread, and network bottlenecks.
  • Experience monitoring system health, performance, and usage using tools including Grafana, Splunk, and Dynatrace, and ensuring compliance with internal SLAs.
  • Experience designing and implementing observability (monitoring, logging, alerting) and ensuring SLAs and SLOs are met.
  • Experience designing, implementing, and supporting large-scale Jira Software, Jira Service Management, and Confluence instances.
  • Experience performing upgrades, patching, plugin management, and performance tuning for Atlassian platforms.
  • Experience integrating enterprise platforms with CI/CD pipelines and observability tools to automate workflows, improve incident response, and enhance system reliability.
  • Experience managing infrastructure components including Linux servers, databases, and storage supporting Atlassian tools in both on-prem and cloud environments.
  • Experience working with scripting languages including Groovy, Bash, and PowerShell to automate tasks on Linux and Windows.
  • Experience implementing and maintaining backup, recovery, and disaster recovery plans for Atlassian tools.

Benefits

  • Health benefits include medical, vision, and dental coverage.
  • Financial benefits include 401(k), stock purchase, and company-paid life insurance.
  • Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
  • Other benefits include short-term and long-term disability, education assistance with 100% company-paid college degrees, company discounts, military service pay, adoption expense reimbursement, and more.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
site reliability engineering, performance testing, root cause analysis, infrastructure management, disaster recovery, scripting languages, automation, observability, monitoring, logging
Soft Skills
troubleshooting, analytical skills, communication, collaboration, problem-solving, attention to detail, organizational skills, adaptability, critical thinking, time management