Work on a dynamic, customer-focused team, interacting with customers, partners, and internal teams to analyze, define, and implement large-scale networking projects
Build AI/HPC infrastructure for large CSP customers and their end users
Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting
Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement
Maintain services once live by measuring and monitoring availability, latency, and overall system health
Provide feedback to internal teams: open bugs, document workarounds, drive customer feature requirements, and suggest improvements
Serve as the primary customer-facing contact for networking solutions
Requirements
BS/MS/PhD or equivalent experience in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, or related fields
8+ years of professional experience with networking fundamentals, the TCP/IP stack, InfiniBand fundamentals, and data center architecture
Proficiency in configuring, testing, validating, and resolving issues in Ethernet and InfiniBand networks, especially in medium to large-scale HPC/AI environments
Advanced knowledge of HPC/AI networking protocols
Hands-on experience with network switch/router platforms such as Cumulus Linux, SONiC, IOS, Junos OS, and EOS
Strong focus on customer needs and satisfaction
Self-motivated with leadership skills to work collaboratively with customers and internal teams
Strong written, verbal, and listening skills
Familiarity with cloud networks (AWS, GCP, Azure) is a plus
Linux or networking certifications (preferred)
Knowledge of link-level performance and diagnostics (preferred)
Experience with high-performance computing architectures (preferred)
Experience with GPU hardware/software (preferred)
Benefits
Eligible for equity
Remote work option