Walmart

Software Engineer II – Site Reliability Operations Engineer

full-time

Location Type: Hybrid

Location: Sunnyvale • California • 🇺🇸 United States

Salary

💰 $104,000 - $202,000 per year

Job Level

Junior, Mid-Level

Tech Stack

Azure, Cloud, Distributed Systems, DNS, Firewalls, Go, Google Cloud Platform, Grafana, Graphite, Java, JavaScript, Linux, Node.js, OpenStack, Prometheus, Python, React, ServiceNow, Splunk, TCP/IP, Unix

About the role

  • Acquire in-depth technical knowledge of omnichannel cloud platforms, web traffic flows, micro-services, and service dependencies for major incident resolution.
  • Provide support for Unix and Linux systems from Kernel to Shell and beyond, taking into consideration system libraries, file systems, and client-server protocols.
  • Leverage knowledge of network technologies, including protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, CDN, OSI layers, firewalls, gateways, proxies, and load balancers.
  • Provide L1 and L2 production support for multiple cloud technologies such as OpenStack, Cloud Native platform, Microsoft Azure, and Google Cloud Platform, triaging critical issues using various internal and vendor tools.
  • Analyze monitoring graphs and alerts to identify the systems causing production impact, using tools like Grafana, Prometheus, MMS, Kibana, Graphite, ServiceNow, JIRA, Dynatrace, New Relic, Omniture, Splunk, and CDN logs.
  • Triage site-impacting production issues by quantifying impact, severity and urgency, analyzing systems for quick remediation, engaging the right teams for recovery, and focusing on immediate restoration of large-scale enterprise systems.
  • Develop enterprise monitoring and use tooling solutions such as Grafana, Kibana, Splunk, Graphite, and New Relic to improve visibility, proactively detect issues, and restore system availability.
  • Design and implement JavaScript integrations between alerting tools and service API endpoints, using tools like ServiceNow, Spotlight, and xMatters.
  • Design and develop solutions for widespread internal communications covering cloud application support and infrastructure availability workflows, working with internal applications and multiple programming languages such as Java, JavaScript (React, Node.js), Python, and Shell, along with technologies like Prometheus and database query languages.
  • Demonstrate knowledge of scripting and software development for automation and self-healing of multi-cloud environments.

Requirements

  • 2+ years in an infrastructure, systems, engineering or development environment delivering operational excellence to highly complex distributed systems.
  • Bachelor's Degree in Computer Science or a related field, or relevant work experience.
  • Strong and demonstrable incident management skills with relevant experience in an enterprise organization.
  • Experience and exposure working in a 24/7 operations support environment.
  • Methodical and systematic problem-solving approach, combined with a strong sense of ownership, initiative, and drive.
  • Experience investigating, analyzing and troubleshooting large scale enterprise systems.
  • Networking knowledge and understanding of network concepts, such as protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
  • Experience administering Unix/Linux in a production environment.
  • Experience working with and developing enterprise monitoring, tooling, and logging solutions like Grafana, Kibana, Splunk, OpenObserve, Graphite, Nagios, New Relic, Dynatrace, and Prometheus.
  • Working knowledge of one or more cloud technologies such as Azure, GCP, or OpenStack.
  • Experience with distributed version control such as Git or similar.
  • Experience designing and implementing JavaScript integrations between alerting tools and service API endpoints, using tools like ServiceNow, Spotlight, Splunk, and xMatters.
  • Programming experience in one or more of the following languages: Go, Java, Python, Shell, etc.
  • Experience in data science/machine learning would be advantageous.

Benefits

  • Health benefits including medical, vision and dental coverage
  • 401(k)
  • Stock purchase and company-paid life insurance
  • PTO (including sick leave, parental leave, family care leave, bereavement, jury duty, and voting)
  • Short-term and long-term disability
  • Company discounts
  • Military Leave Pay
  • Adoption and surrogacy expense reimbursement
  • Live Better U education benefit program for associates
