Salary
💰 $148,000 - $287,500 per year
Tech Stack
Grafana, Prometheus, Python
About the role
- Drive next-generation fleet management solutions for scaling AI infrastructure built on NVIDIA GPUs and Grace solutions
- Work with customers, product management, and other architects to narrow down requirements for implementation
- Design the architecture for a fleet health monitoring and fault-remediation solution at scale
- Work with customers and other architects to understand health monitoring requirements and leverage in-band and out-of-band capabilities
- Create detailed architecture and perform POCs to validate architecture
- Educate customers about product architecture and incorporate feedback
- Write architecture specs and design documents; own end-to-end delivery across teams
- Perform code reviews for code produced from architecture specs
- Ensure the product is properly tested; enhance unit testing and establish proper test plans
- Drive product life cycles with QA teams to productize code and act as product owner
- Articulate requirements in Jira and bug management tools and coordinate execution plans with managers
- Contribute to all phases of product development: definition, architecture, design, implementation, debugging, testing, and early customer support
Requirements
- BS, MS, or PhD in EE/CS or related field of education (or equivalent experience)
- 5+ years hands-on coding experience
- Strong knowledge of time series databases like InfluxDB & Prometheus
- Strong knowledge of building and consuming REST APIs (Redfish is a big plus)
- Strong knowledge of telemetry visualization solutions like Grafana & Influx
- Strong knowledge of firmware architecture and of optimizing firmware for low-latency APIs
- Strong knowledge of analyzing algorithms for time & space complexity and projecting system resource requirements
- Proven record of delivering scalable solutions
- Strong and demonstrable skill in C/C++ and Python
- Programming and debugging experience on server platforms
- Experience in SCM (e.g., Git, Perforce) and project management tools like Jira
- Excellent written and oral communication skills
- Excellent work ethic, teamwork, and commitment to finishing tasks
- Self-starter with hands-on coding ability
Ways to stand out
- Experience building telemetry collection & analysis engines
- Experience with Redfish
- Experience with notification systems like PagerDuty
- Active OCP and DMTF contribution
- Hands-on experience with x86 or ARM system architecture
- Familiarity with Confidential Compute
- Experience with ML and multi-variable optimization techniques