Salary
💰 $127,890 - $211,180 per year
Tech Stack
Cloud, Distributed Systems, Grafana, Kubernetes, Linux, OpenShift, Open Source, Prometheus, Python, PyTorch
About the role
- Lead fault-injection and resilience-at-scale efforts for AI workloads built on vLLM and llm-d, running on Kubernetes/OpenShift
- Design and automate failure and resilience experiments across vLLM, llm-d, and heterogeneous AI accelerators
- Define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
- Build an automated harness (preferably extending krkn-chaos) to run controlled experiments with a scoped blast radius and evidence capture (a minimal sketch follows this list)
- Integrate fault signals into CI/CD pipelines as resilience gates alongside performance gates (see the gate sketch below)
- Develop detection and diagnostics: dashboards and alerts for pre-fault signals (queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
- Triage and root-cause resilience regressions; upstream bugs and fixes to vLLM and llm-d
- Publish failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present findings internally/externally
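For illustration, a minimal Python sketch of the kind of controlled experiment such a harness would automate: it deletes at most one vLLM pod selected by label within a single namespace (the blast-radius scope) and records basic evidence of the run. The namespace, label selector, and victim count are assumptions for the example, and this is not the krkn-chaos API; a real harness would extend krkn-chaos scenarios rather than call the Kubernetes client directly.

```python
"""Minimal sketch of a scoped pod-kill experiment with evidence capture.

Illustrative only: not the krkn-chaos API. The namespace, label selector,
and victim count below are assumptions for the example.
"""
import json
import time

from kubernetes import client, config

NAMESPACE = "llm-serving"        # assumed namespace hosting the vLLM pods
LABEL_SELECTOR = "app=vllm"      # assumed label; scopes the blast radius
MAX_VICTIMS = 1                  # delete at most one replica per run


def run_pod_kill_experiment() -> dict:
    """Delete a bounded number of matching pods and record evidence."""
    config.load_kube_config()
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    victims = [p.metadata.name for p in pods[:MAX_VICTIMS]]

    evidence = {"start": time.time(), "victims": victims, "events": []}
    for name in victims:
        v1.delete_namespaced_pod(name, NAMESPACE)
        evidence["events"].append({"action": "delete_pod", "pod": name, "ts": time.time()})

    # Give the controller time to reconcile, then snapshot pod phases as evidence.
    time.sleep(60)
    after = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    evidence["post_phases"] = {p.metadata.name: p.status.phase for p in after}
    evidence["end"] = time.time()
    return evidence


if __name__ == "__main__":
    print(json.dumps(run_pod_kill_experiment(), indent=2))
```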
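And a similarly minimal sketch of a resilience gate a pipeline step could run after the fault window: it queries Prometheus for one pre-fault signal (queue depth here) and returns a non-zero exit code on a no-go. The Prometheus URL, the metric name, and the threshold are assumptions for the example; substitute the metrics your deployment actually exports.

```python
"""Minimal sketch of a CI resilience gate driven by a pre-fault signal.

Illustrative only: PROM_URL, the queue-depth metric name, and the threshold
are assumptions for the example.
"""
import sys

import requests

PROM_URL = "http://prometheus:9090"            # assumed Prometheus endpoint
QUERY = "max(vllm:num_requests_waiting)"       # assumed queue-depth metric
QUEUE_DEPTH_LIMIT = 50                         # assumed go/no-go threshold


def queue_depth() -> float:
    """Return the current value of the queue-depth query from Prometheus."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


if __name__ == "__main__":
    depth = queue_depth()
    print(f"queue depth during fault window: {depth}")
    # Fail the pipeline step (no-go) if the pre-fault signal breached the gate.
    sys.exit(1 if depth > QUEUE_DEPTH_LIMIT else 0)
```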
Requirements
- 3+ years in reliability and/or performance engineering on large-scale distributed systems
- Expertise in systems‑level software design
- Expertise with Kubernetes and modern LLM inference serving stacks (e.g., vLLM, TensorRT-LLM, TGI)
- Observability & forensics skills with Prometheus/Grafana, OpenTelemetry tracing, eBPF/BPFTrace/perf, Nsight Systems, PyTorch Profiler
- Fluency in Python (data & ML) and strong Bash/Linux skills
- Exceptional communication skills to translate raw data into customer and executive narratives
- Commitment to open‑source values and upstream collaboration
- (Plus) Master’s or PhD in Computer Science, AI, or a related field
- (Plus) History of upstream contributions, public talks, or blog posts on resilience or chaos engineering
- (Plus) Experience with competitive benchmarking and failure characterization at scale