
Senior Site Reliability Engineer
Walmart
full-time
Location Type: Office
Location: Sunnyvale • California • United States
Salary
💰 $117,000 - $234,000 per year
About the role
- Detect and document defects, bugs, and errors for the assigned component or module, and conduct analysis to determine their sources, under guidance.
- Troubleshoot performance and availability bottlenecks for assigned application under guidance.
- Use established criteria (for example, probability of failure and frequency of failure) to measure site reliability.
- Monitor site reliability conditions and new reliability requirements.
- Assist in the design and development of a reliability program plan for a specific site environment.
- Apply appropriate tools, services, or applications for reliability prediction and other site improvements.
- Research and assess various reliability models for different site environments.
- Assist in creating simple, modular, extensible, and functional designs for the product or solution in adherence to the requirements.
- Evaluate tradeoffs while designing across multiple components in a system based on the business requirements.
- Convert high-level design (HLD) into detailed designs for specific modules or components of a product or system.
- Understand nuances of designing for disaster recovery.
- Undertake infrastructure coding and automation.
- Create and configure minimalistic, less complex, highly robust, and high-quality code for a component or module, under guidance.
- Maintain records by documenting program development and revisions.
- Stay updated on the prevalent coding languages and frameworks in the industry outside the immediate scope of delivery.
- Identify repetitive and routine tasks in Continuous Integration/Continuous Delivery (CI/CD), testing, or any other process that can be automated.
- Implement telemetry features as required under guidance.
- Apply security policy requirements to the component or module during code development and configuration.
- Work with business partners to identify and document critical applications.
- Interpret and follow procedures in contingency plans.
- Explain the contingency and disaster recovery plans for the assigned environment.
- Execute established procedures necessary to continue operations in an emergency.
- Participate in the design of a minimum operating environment for a computer-based facility.
- Suggest metrics to monitor software or system performance.
- Monitor current performance data to ensure compliance with defined SLOs for multiple applications and systems.
- Determine thresholds for monitoring metrics and trigger alerts based on those thresholds.
- Supervise specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software.
Requirements
- Master’s degree or equivalent in computer science, computer engineering, computer information systems, software engineering, or related area and 1 year of experience in site reliability engineering, site and system administration, infrastructure management, or related area; OR Bachelor’s degree or equivalent in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years of experience in site reliability engineering, site and system administration, infrastructure management, or related area.
- Experience designing and implementing performance test strategies for complex web, mobile, API, and backend systems for Jira and Confluence Data Center instances.
- Experience building and maintaining automated performance test scripts using tools including JMeter, Gatling, LoadRunner, and k6.
- Experience performing root cause analysis of performance issues in production and test environments for Jira and Confluence Data Center instances, identifying CPU, memory, database, thread, and network bottlenecks.
- Experience monitoring system health, performance, and usage using tools including Grafana, Splunk, and Dynatrace, and ensuring compliance with internal SLAs.
- Experience designing and implementing observability (monitoring, logging, alerting) and ensuring SLAs and SLOs are met.
- Experience designing, implementing, and supporting large-scale Jira Software, Jira Service Management, and Confluence instances.
- Experience performing upgrades, patching, plugin management, and performance tuning for Atlassian platforms.
- Experience integrating enterprise platforms with CI/CD pipelines and observability tools to automate workflows, improve incident response, and enhance system reliability.
- Experience managing infrastructure components including Linux servers, databases, and storage supporting Atlassian tools in both on-prem and cloud environments.
- Experience working with scripting languages including Groovy, Bash, and PowerShell to automate tasks on Linux and Windows.
- Experience implementing and maintaining backup, recovery, and disaster recovery plans for Atlassian tools.
Benefits
- Health benefits include medical, vision, and dental coverage.
- Financial benefits include 401(k), stock purchase, and company-paid life insurance.
- Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
- Other benefits include short-term and long-term disability, education assistance with 100% company-paid college degrees, company discounts, military service pay, adoption expense reimbursement, and more.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
site reliability engineering, performance testing, root cause analysis, infrastructure management, disaster recovery, scripting languages, automation, observability, monitoring, logging
Soft Skills
troubleshooting, analytical skills, communication, collaboration, problem-solving, attention to detail, organizational skills, adaptability, critical thinking, time management