Red Hat

Senior Performance and Resilience Engineer, LLM Inference

full-time

Location: Massachusetts, North Carolina, Virginia, Washington • 🇺🇸 United States

Salary

💰 $127,890 - $211,180 per year

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Grafana, Kubernetes, Linux, OpenShift, Open Source, Prometheus, Python, PyTorch

About the role

  • Lead fault-injection and resilience-at-scale efforts for AI workloads running on vLLM and llm-d on Kubernetes/OpenShift
  • Design and automate failure and resiliency experiments across vLLM, llm-d, and heterogeneous AI accelerators
  • Define resilience indicators, prioritize fault scenarios, and establish go/no-go gates for releases and CI/CD
  • Build an automated harness (preferably extending krkn-chaos) to run controlled experiments with scoped blast radius and evidence capture
  • Integrate fault signals into pipelines as resilience gates alongside performance gates
  • Develop detection and diagnostics: dashboards and alerts for pre-fault signals (queue depth, GPU throttling, P2P downgrades, KV-cache pressure, allocator fragmentation)
  • Triage and root-cause resilience regressions; upstream bugs and fixes to vLLM and llm-d
  • Publish failure patterns, playbooks, SLO templates, experiment libraries, and reference architectures; present findings internally/externally

Requirements

  • 3+ years in reliability and/or performance engineering on large-scale distributed systems
  • Expertise in systems‑level software design
  • Expertise with Kubernetes and modern LLM inference server stacks (e.g., vLLM, TensorRT-LLM, TGI)
  • Observability and forensics skills with Prometheus/Grafana, OpenTelemetry tracing, eBPF/bpftrace/perf, Nsight Systems, and PyTorch Profiler
  • Fluency in Python (data & ML) and strong Bash/Linux skills
  • Exceptional communication skills to translate raw data into customer and executive narratives
  • Commitment to open‑source values and upstream collaboration
  • (Plus) Master’s or PhD in Computer Science, AI, or related field
  • (Plus) History of upstream contributions, or of public talks or blogs on resilience or chaos engineering
  • (Plus) Competitive benchmarking and failure characterization at scale