FluidStack

Production Engineer, Compute

Production Engineer responsible for compute fleet health and automation for AI infrastructure at Fluidstack. The role covers designing efficient metrics pipelines and building recovery automation for production failures.

Posted 6/9/2026 | full-time | San Francisco • California • 🇺🇸 United States | Mid-Level, Senior | 💰 $175,000 - $300,000 per year

Tech Stack

Tools & technologies
Go, Grafana, Kubernetes, Prometheus, Python

About the role

Key responsibilities & impact
  • Own compute fleet health end to end. Build the metrics pipelines, alerting, and unified health view that tell you the true state of every GPU and TPU in production — across Kubernetes-orchestrated workloads and bare metal, at scale.
  • Turn repair into a pipeline, not a procedure. Build and own the automation that takes a compute failure from detection through triage, parts management, and return to service. No one-off scripts, no heroics.
  • Design and expand the XPU qualification platform. Burn-in, performance baselining, and NPI execution for every new GPU and TPU generation. You define what "good" looks like before hardware goes into production.
  • Own Redfish and BMC tooling. Firmware-level telemetry, log collection at fleet scale, and the low-level access layer that repair automation and health tooling depend on.
  • Own end-to-end reliability, scalability, and operation of the compute fleet at scale. Fluidstack is building one of the largest XPU fleets in the world and that can only be accomplished with aggressive automation, tooling, and incident discipline.

Requirements

What you’ll need
  • You treat toil as a bug. Manual steps in a repair workflow are a backlog item, not a job description.
  • You have an instinct for hardware. You're comfortable reasoning about failure modes at the firmware and silicon level, not just the software stack above it.
  • You move toward ambiguity, not away from it. You walk into the fog, build the map, and explain it to everyone else.
  • You learn at a steep slope. You reach real competence in an unfamiliar domain fast. We value this over existing expertise.
  • You carry a pager without flinching. You run the incident, write the postmortem, fix the systemic cause, and move on.
  • You're fluent with AI tooling: LLM APIs, MCP servers, and agentic frameworks. You drive Claude Code, Cursor, or similar every day.
  • You've shipped production automation that other teams depend on, and you're comfortable in any language using AI coding tools.
  • Bonus: Hardware lifecycle management and RMA automation. BMC/Redfish or IPMI tooling. GPU/TPU qualification or burn-in frameworks. Workflow and orchestration engines (Temporal, Cadence). Metrics and alerting pipelines (Prometheus, Grafana). Go or Python.

Benefits

Comp & perks
  • Competitive total compensation package (salary + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
automation, metrics pipelines, repair automation, GPU qualification, TPU qualification, firmware-level telemetry, log collection, production automation, Go, Python
Soft Skills
problem-solving, adaptability, learning agility, incident management, communication, ownership, attention to detail, collaboration, critical thinking, resilience