
Staff Software Engineer, ML Platform
CakeAI
full-time
Location Type: Remote
Location: United States
About the role
- Build Enterprise-Scale Infrastructure
  - Leverage infrastructure-as-code to manage complex cloud environments supporting critical ML and AI initiatives.
  - Design Kubernetes-native systems, including controllers/operators where appropriate.
  - Improve platform networking, security, and observability.
- Sustain Platform Health and Performance
  - Own critical systems in production, including reliability, scalability, security, and cost efficiency.
  - Proactively identify and address technical debt, operational risk, and platform bottlenecks.
  - “Learn by doing”: quickly ramp up on a complex tech stack (Terraform, Kubernetes, Istio, Crossplane, Go, TypeScript).
- Enable Teams and Customers to Move Faster
  - Create abstractions and tooling that make it easier for teams and customers to deploy, run, and scale AI/ML workloads.
  - Collaborate directly with customers to understand their ML infrastructure challenges and translate them into platform improvements.
  - Balance speed and rigor, shipping quickly while maintaining a high bar for quality and safety.
- Lead Through Influence
  - Act as a technical leader and mentor across the engineering organization.
  - Write clear documentation and design proposals that align stakeholders and drive decisions.
  - Partner closely with product and leadership to shape platform direction and priorities.
Requirements
- 10+ years of engineering experience, with significant time spent on infrastructure, platform, or distributed systems.
- Deep hands-on experience with Kubernetes in production environments.
- Strong cloud experience across AWS, GCP, and/or Azure.
- Proven track record of building and operating secure, scalable MLOps platforms.
- Deep understanding of infrastructure-as-code (e.g., Terraform, Pulumi, CDK).
- Strong programming skills in at least one backend language (Go preferred; TypeScript also welcome).
- Experience diagnosing and debugging complex production issues.
- Familiarity with modern CI/CD, test-driven development, and DevSecOps practices.
- Bonus: experience building Kubernetes operators and/or working with service meshes (e.g., Istio).
- Comfortable owning large, ambiguous problems from inception to production.
- Excellent communicator, able to clearly explain complex systems to both technical and non-technical audiences.
- Experience working directly with customers and incorporating feedback into technical decisions.
- Ability to operate autonomously while keeping stakeholders informed and aligned.
- Customer-first and product-oriented.
- Curious, adaptable, and eager to learn new systems and domains.
- Collaborative, respectful, and willing to lean into hard conversations.
- Energized by fast-paced environments and meaningful responsibility.
Benefits
- Competitive cash compensation alongside above-market equity upside
- Top-tier fully covered medical, dental, and vision insurance
- Life insurance
- 401k program
- Unlimited PTO
- Monthly half day
- Citi Bike membership
- Monthly wellness stipend
- Office equipment stipend, including reimbursement for approved disability-related accommodations
- Investment in employee learning and growth opportunities
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Kubernetes, Terraform, Go, TypeScript, MLOps, infrastructure-as-code, CI/CD, DevSecOps, debugging, cloud computing
Soft skills
communication, collaboration, leadership, problem-solving, adaptability, customer-oriented, autonomy, mentorship, documentation, influence