Walmart

Systems and Infrastructure Engineer II

full-time

Location Type: Office

Location: Bangalore • 🇮🇳 India

Job Level

Mid-Level, Senior

Tech Stack

Azure, Cloud, Distributed Systems, DNS, Go, Google Cloud Platform, Grafana, Graphite, Java, Linux, OpenStack, Python, Ruby, Splunk, TCP/IP, Unix

About the role

  • As a Store Reliability Operations Engineer within the Global Technology Platforms (GTP) CCC team, you will work with other CCC, TDO, SRE, DevOps and Engineering practitioners to pro-actively maintain the mission-critical infrastructure, cloud platforms, micro-services, tools, and processes that ensure the highest levels of availability and reliability across our Global Technology platforms.
  • You're right for the job if you are comfortable leading our major incident response team as part of a technical team of engineers laser-focused on restoring service across complex distributed systems.
  • You'll excel if you have enthusiasm for digging deep and a flair for sharp technical communication, prioritization and organization.
  • You will work directly with our SRE, Engineering and DevOps teams to support our next generation “always up” cloud-based e-commerce platforms.
  • The CCC Site Reliability Operations Engineer is responsible for pro-actively monitoring, detecting and resolving site issues before they affect customers or availability.
  • Technically, you will understand the full end-to-end stack and use this knowledge to detect errors and failures and take corrective action to mitigate them.
  • During a major incident, you will draw on your technical skills and knowledge to triage and troubleshoot, differentiating between symptom and cause, to help restore impacting issues.
  • Your ability to continuously challenge yourself and develop a strong network within your peer group will see you excel in this role.
  • Our goal is to protect the customer experience and deliver outstanding levels of availability.
  • xMatters workflow integration with scalability, resiliency and performance.
  • Assist Walmart Store/Distribution Center associates with their day-to-day issues related to store functions.
  • Provide call-based support for internal escalations.
  • Follow the SOP for troubleshooting and resolve issues reported by the Store Operations team.
  • Must be able to multitask when needed.
  • Flexible with shift and support hours.
  • Expert level understanding of incident management processes and procedures.
  • Calm under pressure when participating in major incident response.
  • Deep technical understanding of core infrastructure, cloud services, platforms and micro-services.
  • Ability to understand and capture key data from logs at an expert level.
  • Ability to understand traffic flows and key dependencies between services.
  • Ability to effectively triage – be able to detect and determine symptom vs cause.
  • Detect and quantify impact.
  • Expert level troubleshooting skills using a diverse set of tools and methods.
  • Analyze trends to pro-actively prevent incidents.
  • Focus on immediate restoration vs root cause.
  • Research and recommend alternative actions for incident resolution, and develop procedures and documentation to support them.
  • Create and maintain procedural documentation.
  • Identify and drive continuous improvement efforts to reduce waste (eliminate, automate or streamline).
  • Absorb knowledge and understand complex distributed systems - ability to share and impart this knowledge into your peer group and beyond.
  • Build tools to improve visibility, pro-actively detect issues and restore system availability.
  • Develop automation and self-healing with DevOps, Engineering and SRE partners.
  • Strong focus on collecting and inferring metrics.
  • Clear communication skills.
  • Ability to contribute to multiple incidents at any given time.
  • Analyze systems and make recommendations to prevent possible problems.
  • Take the lead on issue resolution activities using knowledge of complex and company-wide systems.
  • Scripting and software development to automate and help enhance existing solutions.
  • Experience owning, developing and evangelizing a product.
  • Ability to gather requirements and build solutions into a product.
  • Evangelize operational excellence.
  • Actively provide data for and participate in root cause analysis.
  • Define the CCC onboarding process and ensure it is adhered to when accepting new systems into service.
  • Share knowledge globally between CCC teams.
  • Analyze systems and make recommendations to prevent possible incidents.
  • Strive for continuous improvement and make recommendations based on CCC process.
  • Act as a technical focal point for the CCC team.
  • Transition observability projects in the command centre for better visibility.
  • Other duties and responsibilities as assigned.

Requirements

  • Experience building and scaling distributed, highly available systems
  • Experience developing applications for a cloud environment such as Google Cloud Platform or Microsoft Azure
  • Experience with frameworks/tools such as Git, xMatters workflow integration, ServiceNow integration, etc.
  • Comfortable building metrics, monitoring, and alerting for micro-services
  • 3+ years in an infrastructure, systems, engineering or development environment delivering operational excellence to highly complex distributed systems.
  • Bachelor's Degree in Computer Science or a related field, or relevant work experience.
  • Strong and demonstrable incident management skills with relevant experience in an enterprise organization.
  • 3+ years of relevant experience in major incident management, with ITIL4 certification.
  • Experience and exposure working in a 24/7 operations support environment.
  • Methodical and systematic problem-solving approach, combined with a solid awareness of ownership, initiative and drive.
  • Experience investigating, analyzing and troubleshooting large scale enterprise systems.
  • Networking knowledge and understanding of network concepts, such as different protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
  • Programming experience in one or more of the following languages: Go, Java, Python, Ruby, Shell.
  • Experience administering Unix/Linux in a production environment.
  • Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
  • Experience working with and developing enterprise monitoring/tooling solutions like Grafana, Kibana, Splunk, Graphite, Nagios, New Relic and Dynatrace.
  • Working knowledge of one or more cloud technologies such as Azure, GCP and OpenStack.
  • Working knowledge of CI/CD pipelines.

Benefits

  • Beyond our great compensation package, you can receive incentive awards for your performance.
  • Other great perks include a host of best-in-class benefits, such as maternity and parental leave, PTO, health benefits, and much more.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
incident management, cloud environment, programming, monitoring, troubleshooting, automation, distributed systems, networking, scripting, DevOps
Soft skills
technical communication, prioritization, organization, calm under pressure, problem-solving, initiative, drive, multitasking, continuous improvement, collaboration
Certifications
ITIL4 Certification, Bachelor's Degree in Computer Science