Salary
$140,000 - $215,000 per year
Tech Stack
AWS, Cloud, Go, Google Cloud Platform, Perl, Python
About the role
- Set the direction for, and improve, the reliability and efficiency of the network
- Contribute to maintaining a high-performance, fault-tolerant, and scalable network
- Define metrics, develop tools, and create approaches that improve how we monitor and operate the network
- Develop, track, and report on KPIs and metrics that measure network capacity, performance, and availability
- Build tools and monitoring systems that provide granular, real-time observability
- Develop automation to continuously assess and detect suboptimal network state and identify potential points of failure
- Review designs and traffic patterns to continually assess network capacity and availability
- Work with other engineering groups to close the feedback loop on areas for improvement
- Lead resolution of network incidents, conduct internal post-mortems, perform root cause analysis, and ensure corrective actions are taken in a timely manner
- Diagnose and solve complex network and application problems, and recommend improvements
- Participate in a 24x7 on-call rotation
Requirements
- U.S. citizenship or permanent residency is required to retain access to the resources used in this role (no security clearance required)
- 7+ years deploying and managing network infrastructure
- Experience leading a sustaining engineering or SRE team
- 7+ years of experience working with network protocols and architectures such as BGP, MPLS (TE, Auto-BW), VXLAN, EVPN, and Clos fabrics
- Experience building and maintaining network monitoring and graphing tools, as well as streaming telemetry
- Programming experience in Python, Perl, Go, or a similar language
- Experience with cloud providers such as AWS and GCP
- Ability to participate in a 24x7 on-call rotation
- Willingness to periodically undergo and pass additional background and fingerprint check(s) consistent with government customer requirements
- (Bonus) Strong track record of developing and improving tools, platforms, and infrastructure
- (Bonus) Experience with network simulation and testing tools (NS-3, NetSim, Batfish, Ixia)
- (Bonus) Production-level experience supporting large-scale network infrastructure
- (Bonus) Experience automating systems to reduce operational toil