Baseten

Customer Engineer

full-time

Location Type: Hybrid

Location: San Francisco, California, United States

Salary

💰 $165,000 - $330,000 per year

About the role

  • Serve as the first responder to all post-sales customer issues via ticketing (Pylon) and Slack, triaging and resolving Tier 1 and Tier 2 issues independently.
  • Diagnose runtime issues related to latency, memory behavior, GPU utilization, concurrency, and model lifecycle management.
  • Debug infrastructure problems across Kubernetes (pods, controllers), networking, observability, and alerting systems.
  • Pull logs, read error traces, and correlate signals across Grafana, Loki, and Prometheus to pinpoint root causes — even when the real issue is buried layers deep.
  • Lead incident response during outages and escalations, coordinating across Product, SRE, Sales, and Engineering.
  • Own customer communication through resolution — even when the fix is handed off to SRE or Infra — including delivering root-cause analyses after every P0/P1.
  • Escalate to SRE or other engineering teams with structured context (customer, affected models, what you've already ruled out, specific ask) so nothing gets lost in translation.
  • Drive post-incident alerting reviews: why did the customer find this before we did, and what instrumentation or process change prevents it next time?
  • Serve as the technical owner for top enterprise accounts with strict SLAs and high responsiveness expectations.
  • Set up and maintain proactive monitoring and alerts for all customer production models within 24 hours of handoff from the SA (Solution Architect).
  • Drive the QBR process and proactive reengagement for expansion opportunities.
  • Track recurring failure patterns across accounts and push for durable fixes — not just incident closure.
  • Monitor internal feedback channels and route product-level issues to the right teams.
  • Own the SA-to-CE handoff for new customers: validate architecture, confirm production-readiness milestones, and establish escalation paths.
  • Maintain and improve runbooks, knowledge bases, and diagnostic best practices so the team scales with the customer base.
  • Translate user feedback into roadmap signals, documentation improvements, and product enhancements.
  • Coordinate end-to-end on projects spanning feature requests, new deployments, and operational debugging — scoping, execution, communication, and stakeholder alignment.

Requirements

  • Deep Kubernetes troubleshooting expertise, including resource debugging, pod/runtime analysis, and log-based diagnostics with observability tooling (Grafana, Loki, Prometheus).
  • Strong infrastructure debugging across container orchestration, networking, and service dependencies, with hands-on production cluster experience.
  • Experience managing high-severity incidents with major customers — SLAs, war rooms, post-incident reviews, and clear executive-level communication throughout.
  • Proven project management skills with an ownership mindset: you can run multiple complex, multi-stakeholder initiatives in parallel without dropping threads.
  • Ability to translate recurring technical pain points into roadmap-level insights and product improvements.
  • Strong communication skills and executive presence during high-visibility situations, ensuring both technical clarity and customer confidence.
  • 3+ years of experience in a fast-paced, high-growth, or customer-facing engineering environment.

Benefits

  • Competitive compensation, including meaningful equity.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!).
  • Paid parental leave.
  • Company-facilitated 401(k).
  • Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Kubernetes, troubleshooting, resource debugging, pod analysis, log-based diagnostics, observability tooling, incident management, project management, monitoring, alerting
Soft Skills
communication, executive presence, ownership mindset, stakeholder alignment, customer confidence, problem-solving, incident response leadership, feedback translation, multi-stakeholder initiative management, technical clarity