CakeAI

Staff Software Engineer, ML Platform

Full-time

Location Type: Remote

Location: United States

About the role

Build Enterprise-Scale Infrastructure
  • Leverage infrastructure-as-code to manage complex cloud environments supporting critical ML and AI initiatives.
  • Design Kubernetes-native systems, including controllers/operators where appropriate.
  • Improve platform networking, security, and observability.

Sustain Platform Health and Performance
  • Own critical systems in production, including reliability, scalability, security, and cost efficiency.
  • Identify and proactively address technical debt, operational risk, and platform bottlenecks.
  • “Learn by doing”: ramp up quickly on a complex tech stack (Terraform, Kubernetes, Istio, Crossplane, Go, TypeScript).

Enable Teams and Customers to Move Faster
  • Create abstractions and tooling that make it easier for teams and customers to deploy, run, and scale AI/ML workloads.
  • Collaborate directly with customers to understand their ML infrastructure challenges and translate them into platform improvements.
  • Balance speed and rigor, shipping quickly while maintaining a high bar for quality and safety.

Lead Through Influence
  • Act as a technical leader and mentor across the engineering organization.
  • Write clear documentation and design proposals that align stakeholders and drive decisions.
  • Partner closely with product and leadership to shape platform direction and priorities.

Requirements

  • 10+ years of engineering experience, with significant time spent on infrastructure, platform, or distributed systems.
  • Deep hands-on experience with Kubernetes in production environments.
  • Strong cloud experience across AWS, GCP, and/or Azure.
  • Proven track record of building and operating secure, scalable MLOps platforms.
  • Deep understanding of infrastructure-as-code (e.g., Terraform, Pulumi, CDK).
  • Strong programming skills in at least one backend language (Go preferred; TypeScript also welcome).
  • Experience diagnosing and debugging complex production issues.
  • Familiarity with modern CI/CD, test-driven development, and DevSecOps practices.
  • Bonus: experience building Kubernetes operators and/or working with service meshes (e.g., Istio).
  • Comfortable owning large, ambiguous problems from inception to production.
  • Excellent communicator, able to clearly explain complex systems to both technical and non-technical audiences.
  • Experience working directly with customers and incorporating feedback into technical decisions.
  • Ability to operate autonomously while keeping stakeholders informed and aligned.
  • Customer-first and product-oriented.
  • Curious, adaptable, and eager to learn new systems and domains.
  • Collaborative, respectful, and willing to lean into hard conversations.
  • Energized by fast-paced environments and meaningful responsibility.

Benefits

  • Competitive cash compensation alongside above-market equity upside
  • Top-tier fully covered medical, dental, and vision insurance
  • Life insurance
  • 401(k) program
  • Unlimited PTO
  • Monthly half day
  • Citi Bike membership
  • Monthly wellness stipend
  • Office equipment stipend, including reimbursement for approved disability-related accommodations
  • Investment in employee learning and growth opportunities

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Kubernetes, Terraform, Go, TypeScript, MLOps, infrastructure-as-code, CI/CD, DevSecOps, debugging, cloud computing
Soft skills
communication, collaboration, leadership, problem-solving, adaptability, customer-oriented, autonomy, mentorship, documentation, influence