Salary
💰 $148,000 - $287,500 per year
Tech Stack
Grafana, Prometheus, Python
About the role
- Drive next-generation fleet management solutions for scaling AI infrastructure using GPUs and Grace solutions from NVIDIA
- Work with customers, product management and architects to narrow requirements for implementation
- Design architecture for fleet health monitoring and fault-remediation at scale
- Understand customers' health monitoring requirements and address them using in-band and out-of-band capabilities
- Produce detailed architecture and run POCs to validate designs
- Educate customers about product architecture and incorporate feedback
- Write architecture specs and design documents; own end-to-end product delivery
- Review code produced from architecture specs and ensure code quality
- Ensure product is properly tested; enhance unit testing and test plans
- Drive product life cycles with QA teams and act as product owner
- Articulate requirements in Jira and manage end-to-end execution plans
- Contribute across all phases: product definition, architecture, design, implementation, debugging, testing, and early customer support
Requirements
- BS, MS, or PhD in EE/CS or related field (or equivalent experience)
- 5+ years hands-on coding experience
- Strong knowledge of time series databases like InfluxDB and Prometheus
- Strong knowledge of building and consuming REST APIs (Redfish is a big plus)
- Strong knowledge of telemetry visualization solutions like Grafana & Influx
- Strong knowledge of firmware architecture, including optimizing firmware for low-latency APIs
- Strong ability to analyze algorithms for time and space complexity and to project system resource requirements
- Proven record of delivering scalable solutions
- Strong and demonstrable skill in C/C++ and Python
- Experience programming and debugging on server platforms
- Experience in SCM (e.g., Git, Perforce) and project management tools like Jira
- Excellent written and oral communication skills; strong teamwork and work ethic
- Self-starter who is hands-on with coding and committed to delivering quality work
- (Nice-to-have) Experience building telemetry collection & analysis engines
- (Nice-to-have) Experience with Redfish
- (Nice-to-have) Experience with notification systems like PagerDuty
- (Nice-to-have) Active Open Compute Project (OCP) and DMTF contributor in relevant areas
- (Nice-to-have) Hands-on experience with x86 or ARM system architecture
- (Nice-to-have) Familiarity with Confidential Compute
- (Nice-to-have) Experience with ML and multi-variable optimization techniques