Red Hat

Senior Performance and Resilience Engineer, LLM Inference

full-time

Origin: 🇺🇸 United States • Massachusetts, North Carolina, Virginia, Washington

Salary

💰 $127,890 - $211,180 per year

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Grafana, Kubernetes, Linux, OpenShift, Open Source, Prometheus, Python, PyTorch

About the role

  • Lead fault-injection and resilience-at-scale efforts for AI workloads running on vLLM and llm-d on Kubernetes/OpenShift
  • Design and automate failure and resiliency experiments across vLLM, llm-d, and heterogeneous AI accelerators
  • Define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
  • Build an automated harness (preferably extending krkn-chaos) to run controlled experiments with scoped blast radius and evidence capture
  • Integrate fault signals into pipelines as resilience gates alongside performance gates
  • Develop detection and diagnostics: dashboards and alerts for pre-fault signals (queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
  • Triage and root-cause resilience regressions; upstream bugs and fixes to vLLM and llm-d
  • Publish failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present findings internally/externally

Requirements

  • 3+ years in reliability and/or performance engineering on large-scale distributed systems
  • Expertise in systems‑level software design
  • Expertise with Kubernetes and modern LLM inference server stacks (e.g., vLLM, TensorRT-LLM, TGI)
  • Observability and forensics skills with Prometheus/Grafana, OpenTelemetry tracing, eBPF/bpftrace/perf, Nsight Systems, and PyTorch Profiler
  • Fluency in Python (data & ML) and strong Bash/Linux skills
  • Exceptional communication skills to translate raw data into customer and executive narratives
  • Commitment to open‑source values and upstream collaboration
  • (Plus) Master’s or PhD in Computer Science, AI, or related field
  • (Plus) History of upstream contributions, or public talks or blog posts on resilience or chaos engineering
  • (Plus) Competitive benchmarking and failure characterization at scale