Salary
💰 $184,000 - $356,500 per year
Tech Stack
Cloud, Kubernetes, Node.js
About the role
- Platform fundamentals: design, build, and operate core services and node/cluster foundations for the Lepton platform; automate deployments, upgrades, and day-2 operations.
- Vulnerability & patch management: own intake, prioritization, rollout, and rollback rhythms across OS, drivers/firmware, and platform components for the Lepton product.
- Security as a product quality: define, deliver, and maintain secure-by-default baselines (host hardening, workload isolation, network segmentation, least-privilege access) for AI infrastructure at scale.
- Identity & access stewardship: standardize patterns for service identity, role scoping, secrets handling, and certificate hygiene.
- Trusted releases: drive change control and release practices that ensure traceability and integrity of what runs in production.
- Monitoring & incident practice: establish health signals and SLOs; lead investigations, root-cause analysis, and follow-through actions that improve both reliability and security.
- Risk & readiness: partner with product, SRE, and security stakeholders to assess risks for new features and close gaps with pragmatic controls.
- Documentation & mentorship: publish runbooks and standards; review designs and coach engineers on secure operational practices.
Requirements
- 7+ years in systems/platform engineering, operating large-scale production environments.
- Demonstrated ability to deliver secure, reliable platforms (hardening, access control, isolation, monitoring, and strong operational runbooks).
- Experience with containerized/managed cluster environments; familiarity with GPU-accelerated platforms or the ability to ramp quickly.
- Automation mindset with infrastructure-as-code and CI/CD; disciplined change management.
- Clear communication and documentation skills; ability to turn requirements into practical, supportable designs.
- Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).