
AI Infrastructure Engineer
MeshyAI
Full-time
Location Type: Hybrid
Location: Sunnyvale • California • United States
About the role
- Own production reliability: availability, latency, error budgets, incident response, postmortems, and follow-ups
- Build and maintain observability: metrics, logs, traces, alerting, SLOs/SLIs, dashboards
- Improve deployment safety: CI/CD, rollout strategies (canary/blue-green), automated rollback, runbooks
- Capacity planning and cost control: GPU/CPU sizing, autoscaling, queue/backpressure management, cost attribution
- Security and compliance: secrets management, least privilege, patching, vulnerability response
- Disaster recovery and operational readiness: backups, failover plans, game days
- Develop and maintain the GPU inference serving stack (APIs, schedulers, workers, batching, caching)
Requirements
- Linux fundamentals
- Networking fundamentals
- Experience with Kubernetes
- Experience with incident response
- Experience with observability tools
- Strong software engineering ability in at least one of Go or Python
- Ability to reason about performance tradeoffs and measure before optimizing
Benefits
- Stock options available for core team members.
- 401(k) plan for employees.
- Comprehensive health, dental, and vision insurance.
- The latest and best office equipment.
Applicant Tracking System Keywords
Hard Skills & Tools
Go, Python, Kubernetes, CI/CD, observability tools, GPU inference serving, metrics, logs, traces, incident response
Soft Skills
performance tradeoffs, problem solving, analytical thinking