Staff Platform Engineer

EarnIn

As Staff Platform Engineer at EarnIn, you will lead AI-driven workflows for cloud infrastructure, mentor engineers, and shape a developer self-service platform with a focus on operational efficiency.

Posted 6/4/2026 • Full-time • Remote • 🇲🇽 Mexico • Lead

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

cloud infrastructure, AI-driven systems, infrastructure-as-code, AWS, Kubernetes, Terraform, Ansible, Python, Go, MLOps

Soft Skills

mentoring, leadership, cross-functional collaboration, communication, problem-solving, documentation, governance, initiative driving, feedback incorporation, stakeholder liaison

Tools & Technologies

Datadog, GitHub Copilot, Cursor, ChatGPT, Linkerd, Istio, Argo CD, Flux CD, GitHub Actions, Argo Workflows

Certifications & Qualifications

Bachelor's degree in Computer Science, Master's degree in Engineering

Industry Keywords

agentic systems, high-availability distributed systems, observability, AI governance, security best practices, LLM orchestration, production monitoring, data governance, model safety, prompt engineering

Tech Stack

Tools & technologies

Ansible, AWS, Cloud, Distributed Systems, Flux, Go, Kubernetes, Python, Terraform

About the role

Key responsibilities & impact

Design foundational patterns and guardrails for how EarnIn builds, evaluates, monitors, and deploys AI agents in production.
Own agent governance, including model selection, evaluation frameworks, safety guidelines, and production observability.
Establish infrastructure-as-code best practices for agentic systems, ensuring prompts, tools, and evaluation criteria are versioned, reviewed, and tested like critical components.
Serve as the architect for agentic cloud infrastructure, establishing best practices for production AI agents.
Mentor senior engineers in advanced agentic patterns, LLM integration, and production prompt engineering.
Lead cross-functional initiatives with engineering, product, security, and business teams to align agentic AI adoption with company objectives.
Oversee large-scale, high-availability distributed systems on AWS, identifying and solving critical performance, scalability, and stability challenges.
Use AI-driven observability and anomaly detection to anticipate failures.
Lead the evolution of infrastructure-as-code and automation standards, incorporating agentic pattern recognition and automated remediation into operations.
Shape the evolution of our developer control plane (Cortex) as an AI-augmented self-service platform where engineers interact with intelligent assistants.
Drive AI-powered golden paths that encode platform standards, security policies, and best practices.
Act as liaison between cloud operations, AI infrastructure, and business stakeholders.
Develop documentation on agentic architecture, best practices, and operational procedures.
Participate in and lead on-call rotations, using post-mortems as feedback loops for improving system reliability and agentic automation.

Requirements

What you’ll need

Bachelor's or Master's degree in Computer Science, Engineering, or related field.
7+ years of experience in cloud infrastructure, managing large-scale, high-availability, customer-facing distributed systems.
Proven experience mentoring senior engineers and leading company-wide platform initiatives across multiple teams and functions.
Demonstrated experience architecting and scaling AI-driven systems in production, designing multi-step agentic workflows that autonomously perform complex operational tasks.
Track record of eliminating high-friction operational workflows through agentic AI, with measurable reduction in toil and increased platform leverage (e.g., LLM-powered incident diagnosis, intelligent CI/CD with test selection and deployment risk scoring, self-service assistants).
Mastery of AWS (EKS, Lambda, Bedrock, etc.) and deep expertise in containerized and serverless architectures.
Strong expertise in Kubernetes at scale and ability to guide implementation of complex, resilient solutions.
Deep knowledge of infrastructure-as-code tools (Terraform, Ansible) and ability to lead initiatives incorporating both traditional IaC and agentic automation.
Mastery of Datadog and advanced observability, driving metrics-driven decisions and agentic automation. Experience building AI-driven alerting and root-cause analysis systems is a plus.
Strong adherence to security, privacy, and compliance best practices, with the ability to lead governance for production AI systems (model safety, prompt injection prevention, data isolation).
Experience with LLM orchestration frameworks (LangChain, LlamaIndex, CrewAI, or custom agentic architectures) and production prompt engineering at scale.
Strong coding expertise in Python and/or Go, with the ability to guide teams in treating infrastructure and agentic systems as software.
Proven ability to drive cross-functional initiatives across engineering, product, security, and business, translating between technical depth and business impact.
Experience using AI-assisted development tools (e.g., GitHub Copilot, Cursor, ChatGPT, or similar) as part of your software development workflow.
Experience with service mesh (Linkerd, Istio) and traffic management at scale is a plus.
Proficiency with GitOps (Argo CD, Flux CD) and CI/CD orchestration (GitHub Actions, Argo Workflows) is a plus.
Experience with MLOps or LLMOps concepts (model versioning, evaluation frameworks, production monitoring for AI systems) is a plus.
Familiarity with security frameworks relevant to AI systems (e.g., guardrails, audit logging, and data governance for LLMs) is a plus.

Benefits

Comp & perks

healthcare
internet and cell phone reimbursement
learning and development stipend
potential opportunities to travel to our Mountain View headquarters