Salary
💰 $97,900 - $150,600 per year
Tech Stack
Ansible, AWS, Azure, Cloud, Firewalls, Google Cloud Platform, Grafana, Prometheus, Python, ServiceNow, TCP/IP
About the role
- Lead and Mentor: Build, lead, and inspire a high-performing team of network observability engineers and specialists, fostering a culture of continuous learning, ownership, and technical excellence.
- Strategy & Roadmap: Define and execute the strategic roadmap for network observability, aligning with overall network engineering and business objectives. This includes evolving our monitoring, alerting, logging, and tracing capabilities.
- Platform Ownership: Serve as the primary owner of the network observability platform, optimizing its configuration, performance, and integrations to ensure it provides a unified, real-time view of network health. Drive feature adoption and define future enhancements.
- Metrics & Alerts: Establish, refine, and enforce comprehensive network health metrics (SLIs, SLOs, KPIs) and develop intelligent, actionable alerting strategies to minimize noise and improve incident response.
- Incident Management & Post-Mortems: Collaborate closely with NOC and SRE teams to improve incident detection, triage, and resolution processes. Drive blameless post-mortems and ensure lessons learned are translated into system and process improvements.
- Automation: Champion and implement automation initiatives within observability, leveraging tools and scripting (e.g., Python, Ansible) to automate new-service onboarding, data collection, analysis, reporting, and remediation workflows.
- Cross-Functional Collaboration: Partner effectively with Network Engineering, Software Engineering, NOC, ISP/OSP, Security, and Product teams to understand their observability needs, provide necessary insights, and ensure seamless integration of monitoring solutions.
- Tooling & Ecosystem: Evaluate, select, and integrate supplementary observability tools and technologies as needed to complement Assure1 and enhance our overall network visibility.
- Reporting & Insights: Develop and deliver insightful reports and dashboards that provide clear visibility into network performance, reliability, and trends for various stakeholders, from operations to executive leadership.
- Vendor Management: Manage relationships with key observability vendors, including Assure1, to ensure optimal licensing, support, and feature development.
Requirements
- Bachelor's degree in Computer Science, Electrical Engineering, or a related technical field; or equivalent practical experience.
- Minimum of five (5) years of experience in network operations or site reliability engineering.
- Minimum of two (2) years in a leadership, management, or product owner role.
- Hands-on experience with network monitoring and event management platforms.
- Strong understanding of networking protocols (TCP/IP, BGP, OSPF), network services, and common network devices (routers, switches, firewalls, load balancers).
- Proven ability to define, implement, and optimize network health metrics (SLIs, SLOs) and alerting strategies.
- Experience with scripting and automation (e.g., Python or Ansible).
- Excellent analytical and problem-solving skills, with a track record of driving root cause analysis on complex issues.
- Strong communication and interpersonal skills, with the ability to articulate technical concepts clearly to both technical and non-technical audiences.
- Experience with incident management processes and tools, including ServiceNow and troubleshooting tools.