Production Engineer, Compute

Fluidstack

Production Engineer managing compute fleet health and automation for AI infrastructure at Fluidstack. The role covers designing efficient metrics pipelines and handling recovery automation for production failures.

Posted 6/9/2026 | full-time | San Francisco • California • 🇺🇸 United States | Mid-Level, Senior | 💰 $175,000 - $300,000 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

automation, metrics pipelines, repair automation, GPU qualification, TPU qualification, firmware-level telemetry, log collection, production automation, Go, Python

Soft Skills

problem-solving, adaptability, learning agility, incident management, communication, ownership, attention to detail, collaboration, critical thinking, resilience

Tools & Technologies

Kubernetes, Redfish, BMC tooling, AI tooling, LLM APIs, MCP servers, Claude Code, Cursor, Prometheus, Grafana

Industry Keywords

compute fleet, XPU, burn-in, performance baselining, NPI execution, hardware lifecycle management, RMA automation, IPMI tooling, workflow engines, orchestration engines

Tech Stack

Tools & technologies

Go, Grafana, Kubernetes, Prometheus, Python

About the role

Key responsibilities & impact

Own compute fleet health end to end. Build the metrics pipelines, alerting, and unified health view that tell you the true state of every GPU and TPU in production — across Kubernetes-orchestrated workloads and bare metal, at scale.
Turn repair into a pipeline, not a procedure. Build and own the automation that takes a compute failure from detection through triage, parts management, and return to service. No one-off scripts, no heroics.
Design and expand the XPU qualification platform. Burn-in, performance baselining, and NPI execution for every new GPU and TPU generation. You define what "good" looks like before hardware goes into production.
Own Redfish and BMC tooling. Firmware-level telemetry, log collection at fleet scale, and the low-level access layer that repair automation and health tooling depend on.
Own end-to-end reliability, scalability, and operation of the compute fleet. Fluidstack is building one of the largest XPU fleets in the world, and that can only be accomplished with aggressive automation, tooling, and incident discipline.

Requirements

What you’ll need

You treat toil as a bug. Manual steps in a repair workflow are a backlog item, not a job description.
You have an instinct for hardware. You're comfortable reasoning about failure modes at the firmware and silicon level, not just the software stack above it.
You move toward ambiguity, not away from it. You walk into the fog, build the map, and explain it to everyone else.
You learn at a steep slope. You reach real competence in an unfamiliar domain fast. We value this over existing expertise.
You carry a pager without flinching. You run the incident, write the postmortem, fix the systemic cause, and move on.
You're fluent with AI tooling: LLM APIs, MCP servers, and agentic frameworks. You drive Claude Code, Cursor, or similar every day.
You've shipped production automation that other teams depend on, and you're comfortable in any language using AI coding tools.
Bonus: Hardware lifecycle management and RMA automation. BMC/Redfish or IPMI tooling. GPU/TPU qualification or burn-in frameworks. Workflow and orchestration engines (Temporal, Cadence). Metrics and alerting pipelines (Prometheus, Grafana). Go or Python.

Benefits

Comp & perks

Competitive total compensation package (salary + equity).
Retirement or pension plan, in line with local norms.
Health, dental, and vision insurance.
Generous PTO policy, in line with local norms.