Salary
💰 $150,000 - $180,000 per year
Tech Stack
Azure, Switching, TypeScript
About the role
- Maintain 24×7 situational awareness of network health (availability, latency, loss, jitter, capacity, link/path status)
- Triage, contain, and restore service during incidents using documented runbooks/playbooks; coordinate swarming with Routing/Switching, Boundary Security, Platform, and Cyber teams
- Execute initial impact assessment, user/stakeholder communications, workarounds, and continuity steps; capture timelines and decisions for post-incident review
- Operate dashboards, alerts, and synthetic/active tests; tune thresholds to reduce noise while protecting SLOs
- Correlate telemetry (syslog, SNMP/streaming telemetry, NetFlow/IPFIX, route/adjacency state) with actionable escalation paths
- Validate monitoring for new or changed network services (health checks and alerts in place before go-live)
- Enforce CAB/CCB decisions, maintenance windows, freeze periods, and back-out plans for network changes
- Verify pre-change readiness (peer review, approvals, rollback tested, comms prepared) and confirm post-change health
- Keep CMDB/CIs and topology/dependency maps current; record changes and relationships for traceability
- Lead problem investigations (recurring incidents, trend spikes); perform root cause analysis (RCA) and implement durable corrective actions
- Track availability, MTTR/MTBF, error budgets, and capacity trends; recommend scaling, policy tuning, or resiliency patterns (path diversity, ECMP, QoS adjustments)
- Drive backlog items with owners and verify closure via effectiveness checks
- Execute patch/vulnerability windows for boundary devices per plan; validate policy deployments and change results
- Operate within RMF/DISA STIG constraints; preserve audit trails and evidence for assessments and ATO/cATO sustainment
- Coordinate with Cyber/Blue Team on detections, containment steps, and after-action improvements
- Measure and report SLAs/SLOs (site availability, ticket KPIs, change success, incident-induced change rate) with daily/weekly/monthly executive rollups
- Maintain stakeholder communications for incidents, maintenance, and changes throughout the event lifecycle
- Author/maintain runbooks, troubleshooting guides, operational standards, KEDB (Known Error DB), and service catalogs; keep knowledge articles current
- Contribute to readiness reviews, go-live checklists, and lessons learned; coach junior controllers and cross-train peers
- Other duties as assigned
Requirements
- Bachelor’s Degree in Computer Science or Information Technology preferred
- DoD 8570 Level I certification (e.g., Security+) required, or must be obtained within 90 days of hire
- ITIL® 4 Foundation certification preferred
- CCNA (or JNCIA), with progress toward CCNP preferred
- Fortinet NSE 4+ and/or associate-level firewall/gateway credential preferred
- 14+ years of professional experience in the required task area
- Higher educational degree may be partially substituted for experience
- 7+ years of network-specific experience
- 3+ years of experience in network operations/NOC for large enterprise or DoD environments
- 1+ years of professional experience working in a management or leadership role
- Must have a solid grasp of L2/L3 fundamentals (VLANs, HSRP/VRRP, routing protocols such as OSPF/BGP, ACLs/NAT), SD-WAN concepts, QoS basics, and boundary security operations
- Familiarity with telemetry sources (syslog, SNMP, NetFlow/IPFIX, streaming telemetry) and how to interpret them during incidents
- Familiarity with Microsoft 365, Azure, Active Directory, or similar enterprise platforms
- Strong experience with incident/problem/change management discipline (ITIL 4), clear communications, and calm, structured decision making under pressure
- Familiarity with ITIL-based service operations and ticketing systems preferred
- Must be a US citizen
- Must hold at least a DoD-issued Secret clearance and be eligible for TS/SCI
- Candidates with an active TS/SCI preferred
- Able to occasionally reach with hands and arms
- Prolonged periods of computer screen use, while sitting or standing at a desk
- Adhere to safety protocols when in work areas requiring use of PPE (e.g., eyewear, gloves, masks, hearing protection, steel-toed shoes)
- Able to safely lift and carry up to 20 pounds at a time