Salary
💰 $127,890 - $211,180 per year
Tech Stack
Cloud, Distributed Systems, Grafana, Kubernetes, Linux, OpenShift, Open Source, Prometheus, Python, PyTorch
About the role
- Lead fault-injection and resilience-at-scale efforts for AI workloads built on vLLM and llm-d, running on Kubernetes/OpenShift
- Design and automate failure and resilience experiments across vLLM, llm-d, and heterogeneous AI accelerators
- Define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
- Build an automated harness (preferably extending krkn-chaos) to run controlled experiments with a scoped blast radius and evidence capture (a minimal sketch follows this list)
- Integrate fault signals into CI/CD pipelines as resilience gates alongside performance gates (see the gate sketch below)
- Develop detection and diagnostics: dashboards and alerts for pre-fault signals (queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
- Triage and root-cause resilience regressions; upstream bugs and fixes to vLLM and llm-d
- Publish failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present findings internally/externally
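For illustration, a minimal Python sketch of the kind of controlled experiment such a harness would automate: it deletes at most one vLLM pod selected by label within a single namespace (the blast-radius scope) and records basic evidence of the run. The namespace, label selector, and victim count are assumptions for the example, and this is not the krkn-chaos API; a real harness would extend krkn-chaos scenarios rather than call the Kubernetes client directly.

```python
"""Minimal sketch of a scoped pod-kill experiment with evidence capture.

Illustrative only: not the krkn-chaos API. The namespace, label selector,
and victim count below are assumptions for the example.
"""
import json
import time

from kubernetes import client, config

NAMESPACE = "llm-serving"        # assumed namespace hosting the vLLM pods
LABEL_SELECTOR = "app=vllm"      # assumed label; scopes the blast radius
MAX_VICTIMS = 1                  # delete at most one replica per run


def run_pod_kill_experiment() -> dict:
    """Delete a bounded number of matching pods and record evidence."""
    config.load_kube_config()
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    victims = [p.metadata.name for p in pods[:MAX_VICTIMS]]

    evidence = {"start": time.time(), "victims": victims, "events": []}
    for name in victims:
        v1.delete_namespaced_pod(name, NAMESPACE)
        evidence["events"].append({"action": "delete_pod", "pod": name, "ts": time.time()})

    # Give the controller time to reconcile, then snapshot pod phases as evidence.
    time.sleep(60)
    after = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    evidence["post_phases"] = {p.metadata.name: p.status.phase for p in after}
    evidence["end"] = time.time()
    return evidence


if __name__ == "__main__":
    print(json.dumps(run_pod_kill_experiment(), indent=2))
```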
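And a similarly minimal sketch of a resilience gate a pipeline step could run after the fault window: it queries Prometheus for one pre-fault signal (queue depth here) and returns a non-zero exit code on a no-go. The Prometheus URL, the metric name, and the threshold are assumptions for the example; substitute the metrics your deployment actually exports.

```python
"""Minimal sketch of a CI resilience gate driven by a pre-fault signal.

Illustrative only: PROM_URL, the queue-depth metric name, and the threshold
are assumptions for the example.
"""
import sys

import requests

PROM_URL = "http://prometheus:9090"            # assumed Prometheus endpoint
QUERY = "max(vllm:num_requests_waiting)"       # assumed queue-depth metric
QUEUE_DEPTH_LIMIT = 50                         # assumed go/no-go threshold


def queue_depth() -> float:
    """Return the current value of the queue-depth query from Prometheus."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


if __name__ == "__main__":
    depth = queue_depth()
    print(f"queue depth during fault window: {depth}")
    # Fail the pipeline step (no-go) if the pre-fault signal breached the gate.
    sys.exit(1 if depth > QUEUE_DEPTH_LIMIT else 0)
```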
Requirements
- 3+ years in reliability and/or performance engineering on large-scale distributed systems
- Expertise in systems‑level software design
- Expertise with Kubernetes and modern LLM inference serving stacks (e.g., vLLM, TensorRT-LLM, TGI)
- Observability & forensics skills with Prometheus/Grafana, OpenTelemetry tracing, eBPF/BPFTrace/perf, Nsight Systems, PyTorch Profiler
- Fluency in Python (data & ML) and strong Bash/Linux skills
- Exceptional communication skills to translate raw data into customer and executive narratives
- Commitment to open‑source values and upstream collaboration
- (Plus) Master’s or PhD in Computer Science, AI, or a related field
- (Plus) History of upstream contributions, public talks, or blog posts on resilience or chaos engineering
- (Plus) Experience with competitive benchmarking and failure characterization at scale