Tech Stack
Ansible, Firewalls, Go, Grafana, Linux, Prometheus, Python, SaltStack, ServiceNow, Switching
About the role
- Owning the operational aspect of the network infrastructure, ensuring its high availability and reliability.
- Partnering with architecture and deployment teams to guarantee that new implementations are supportable and align with production standards.
- Advocating for and implementing automation to reduce toil and enhance operational efficiency.
- Monitoring network performance, identifying areas for improvement, and coordinating with relevant teams to execute enhancements.
- Collaborating with SMEs to resolve production issues swiftly and effectively, maintaining customer satisfaction.
- Identifying opportunities for operational improvements and partnering with teams to develop solutions that drive excellence and sustainability in network operations.
- Minimizing manual labor, achieving Service Level Objectives (SLOs), writing KB articles for bots, following through on RCAs, and conducting blameless postmortems.
- Performing hands-on troubleshooting, network automation, observability, and documentation work in pursuit of operational excellence.
- Mentoring and fostering professional development and growth within the team.
Requirements
- BS degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent experience.
- Minimum of 8 years of industry experience in network site reliability engineering, network automation, network operations, or related areas.
- Experience with both campus and data center networks.
- Familiarity with network management tools such as Prometheus, Grafana, Alertmanager, Nautobot/NetBox, and BigPanda.
- Expertise in automating networks using frameworks such as Salt, Ansible, or similar.
- In-depth experience in one or more of the following: Python, Go.
- Knowledge of network technologies such as TCP/UDP, IPv4/IPv6, wireless, BGP, VPN, L2 switching, firewalls, load balancers, EVPN, VXLAN, and Segment Routing.
- Proven track record in network operations.
- Proficiency with ServiceNow and Jira.
- Knowledge of Linux system fundamentals is a plus.
- Systematic problem-solving approach, coupled with excellent communication skills and a sense of ownership and drive.
- Ways to stand out: experience collecting operational signals via SNMP, syslog, and streaming telemetry; debugging and optimizing code; automating routine tasks; experience with Mellanox/Cumulus Linux, Palo Alto firewalls, and NetScaler and F5 load balancers; previous SRE experience.