
Systems and Infrastructure Engineer II
Walmart
full-time
Location Type: Office
Location: Bangalore • 🇮🇳 India
Job Level
Mid-Level, Senior
Tech Stack
Azure, Cloud, Distributed Systems, DNS, Go, Google Cloud Platform, Grafana, Graphite, Java, Linux, OpenStack, Python, Ruby, Splunk, TCP/IP, Unix
About the role
- As a Store Reliability Operations Engineer within the Global Technology Platforms (GTP) CCC team, you will work with other CCC, TDO, SRE, DevOps and Engineering practitioners to pro-actively maintain mission-critical infrastructure, cloud platforms, micro-services, tools, and processes that ensure the highest levels of availability and reliability across our Global Technology platforms.
- You're right for the job if you are comfortable leading our major incident response team as part of a technical team of engineers laser-focused on restoring service across complex distributed systems.
- You'll excel if you have enthusiasm for digging deep, and a flair for sharp technical communication, prioritization and organization.
- You will work directly with our SRE, Engineering and DevOps teams to support our next generation “always up” cloud-based e-commerce platforms.
- The CCC Site Reliability Operations Engineer is responsible for pro-actively monitoring, detecting and resolving site issues before they impact customers and availability.
- Technically, you will understand the full end-to-end stack and use this knowledge to detect errors and failures and take corrective action to mitigate them.
- During a major incident, you will draw on your technical skills and knowledge to triage and troubleshoot, differentiating between symptom and cause, to help resolve impacting issues.
- Your ability to continuously challenge yourself and develop a strong network within your peer group will see you excel in this role.
- Our goal is to protect the customer experience and deliver outstanding levels of availability.
- xMatters workflow integration with scalability, resiliency and performance.
- Assist Walmart Store/Distribution Center associates with their day-to-day issues related to store functions.
- Provide call-based support for internal escalations.
- Follow the SOP for troubleshooting and resolve issues reported by the Store Operations team.
- Must be able to multitask when needed.
- Flexible with shift and support hours.
- Expert level understanding of incident management processes and procedures.
- Calm under pressure when participating in major incident response.
- Deep technical understanding of core infrastructure, cloud services, platforms and micro-services.
- Ability to understand and capture key data from logs at an expert level.
- Ability to understand traffic flows and key dependencies between services.
- Ability to effectively triage – be able to detect and determine symptom vs cause.
- Detect and quantify impact.
- Expert level troubleshooting skills using a diverse set of tools and methods.
- Analyze trends to pro-actively prevent incidents.
- Focus on immediate restoration vs root cause.
- Research and recommend alternative actions for incident resolution – Develop procedures and documentation to support this.
- Create and maintain procedural documentation.
- Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
- Absorb knowledge and understand complex distributed systems - ability to share and impart this knowledge into your peer group and beyond.
- Build tools to improve visibility, pro-actively detect issues and restore system availability.
- Develop automation and self-healing with DevOps, Engineering and SRE partners.
- Strong focus on collecting and inferring metrics.
- Clear communication skills.
- Ability to contribute to multiple incidents at any given time.
- Analyze systems and make recommendations to prevent possible problems.
- Takes lead on issue resolution activities using knowledge of complex and company-wide systems.
- Scripting and software development to automate and help enhance existing solutions.
- Experience owning, developing and evangelizing a product.
- Ability to gather requirements and build solutions into a product.
- Evangelize operational excellence.
- Actively provide data for and participate in root cause analysis.
- Define CCC onboarding processes and ensure they are adhered to when accepting new systems into service.
- Share knowledge globally between CCC teams.
- Analyze systems and make recommendations to prevent possible incidents.
- Strive for continuous improvement and make recommendations based on CCC process.
- Act as a technical focal point for the CCC team.
- Transition observability projects into the command centre for better visibility.
- Other duties and responsibilities as assigned.
Requirements
- Experience building and scaling distributed, highly available systems
- Experience developing applications for a cloud environment such as Google Cloud Platform or Microsoft Azure
- Experience with frameworks/tools such as Git, xMatters workflow integration, ServiceNow integration, etc.
- Comfortable building metrics, monitoring, and alerting for micro-services
- 3+ years in an infrastructure, systems, engineering or development environment delivering operational excellence to highly complex distributed systems.
- Bachelor's Degree in Computer Science or a related field, or relevant work experience.
- Strong and demonstrable incident management skills with relevant experience in an enterprise organization.
- 3+ years of relevant experience in Major Incident Management, with ITIL 4 certification.
- Experience and exposure working in a 24/7 operations support environment.
- Methodical and systematic problem-solving approach, combined with a solid awareness of ownership, initiative and drive.
- Experience investigating, analyzing and troubleshooting large scale enterprise systems.
- Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell.
- Experience administering Unix/Linux in a production environment.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
- Experience working with and developing enterprise monitoring/tooling solutions like Grafana, Kibana, Splunk, Graphite, Nagios, New Relic and Dynatrace.
- Working knowledge of one or more cloud technologies such as Azure, GCP and OpenStack.
- Working knowledge of CI/CD pipelines.
Benefits
- Beyond our great compensation package, you can receive incentive awards for your performance.
- Other great perks include a host of best-in-class benefits: maternity and parental leave, PTO, health benefits, and much more.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
incident management, cloud environment, programming, monitoring, troubleshooting, automation, distributed systems, networking, scripting, DevOps
Soft skills
technical communication, prioritization, organization, calm under pressure, problem-solving, initiative, drive, multitasking, continuous improvement, collaboration
Certifications
ITIL4 Certification, Bachelor's Degree in Computer Science