MeshyAI

AI Infrastructure Engineer

Full-time

Location Type: Hybrid

Location: Sunnyvale, California, United States

About the role

  • Own production reliability: availability, latency, error budgets, incident response, postmortems, and follow-ups
  • Build/maintain observability: metrics, logs, traces, alerting, SLOs/SLIs, dashboards
  • Improve deployment safety: CI/CD, rollout strategies (canary/blue-green), automated rollback, runbooks
  • Capacity planning + cost control: GPU/CPU sizing, autoscaling, queue/backpressure management, cost attribution
  • Security + compliance: secrets management, least privilege, patching, vulnerability response
  • Disaster recovery + operational readiness: backups, failover plans, game days
  • Develop and maintain the GPU inference serving stack (APIs, schedulers, workers, batching, caching)

Requirements

  • Linux fundamentals
  • Networking fundamentals
  • Experience with Kubernetes
  • Experience with incident response
  • Experience with observability tools
  • Strong software engineering ability in at least one of Go or Python
  • Ability to reason about performance tradeoffs and measure before optimizing

Benefits

  • Stock options available for core team members.
  • 401(k) plan for employees.
  • Comprehensive health, dental, and vision insurance.
  • The latest and best office equipment.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Go, Python, Kubernetes, CI/CD, observability tools, GPU inference serving, metrics, logs, traces, incident response
Soft Skills
Performance tradeoffs, problem solving, analytical thinking