Tech Stack
Tools & technologies: Ansible, AWS, Cloud, Distributed Systems, Flux, Go, Kubernetes, Python, Terraform
About the role
Key responsibilities & impact
- Design foundational patterns and guardrails for how EarnIn builds, evaluates, monitors, and deploys AI agents in production.
- Own agent governance, including model selection, evaluation frameworks, safety guidelines, and production observability.
- Establish infrastructure-as-code best practices for agentic systems, ensuring prompts, tools, and evaluation criteria are versioned, reviewed, and tested like critical components.
- Serve as the architect for agentic cloud infrastructure, establishing best practices for production AI agents.
- Mentor senior engineers in advanced agentic patterns, LLM integration, and production prompt engineering.
- Lead cross-functional initiatives with engineering, product, security, and business teams to align agentic AI adoption with company objectives.
- Oversee large-scale, high-availability distributed systems on AWS, identifying and solving critical performance, scalability, and stability challenges.
- Use AI-driven observability and anomaly detection to anticipate failures.
- Lead the evolution of infrastructure-as-code and automation standards, incorporating agentic pattern recognition and automated remediation into operations.
- Shape the evolution of our developer control plane (Cortex) as an AI-augmented self-service platform where engineers interact with intelligent assistants.
- Drive AI-powered golden paths that encode platform standards, security policies, and best practices.
- Act as liaison between cloud operations, AI infrastructure, and business stakeholders.
- Develop documentation on agentic architecture, best practices, and operational procedures.
- Participate in and lead on-call rotations, using post-mortems as feedback loops for improving system reliability and agentic automation.
Requirements
What you’ll need
- Bachelor's or Master's degree in Computer Science, Engineering, or related field.
- 7+ years of experience in cloud infrastructure, managing large-scale, high-availability, customer-facing distributed systems.
- Proven experience mentoring senior engineers and leading company-wide platform initiatives across multiple teams and functions.
- Demonstrated experience architecting and scaling AI-driven systems in production, designing multi-step agentic workflows that autonomously perform complex operational tasks.
- Track record of eliminating high-friction operational workflows through agentic AI, with measurable reduction in toil and increased platform leverage (e.g., LLM-powered incident diagnosis, intelligent CI/CD with test selection and deployment risk scoring, self-service assistants).
- Mastery of AWS (EKS, Lambda, Bedrock, etc.) and deep expertise in containerized and serverless architectures.
- Strong expertise in Kubernetes at scale and ability to guide implementation of complex, resilient solutions.
- Deep knowledge of infrastructure-as-code tools (Terraform, Ansible) and ability to lead initiatives incorporating both traditional IaC and agentic automation.
- Mastery of Datadog and advanced observability, driving metrics-driven decisions and agentic automation. Experience building AI-driven alerting and root-cause analysis systems is a plus.
- Strong adherence to security, privacy, and compliance best practices, with the ability to lead governance for production AI systems (model safety, prompt injection prevention, data isolation).
- Experience with LLM orchestration frameworks (LangChain, LlamaIndex, CrewAI, or custom agentic architectures) and production prompt engineering at scale.
- Strong coding expertise in Python and/or Go, with the ability to guide teams in treating infrastructure and agentic systems as software.
- Proven ability to drive cross-functional initiatives across engineering, product, security, and business, translating between technical depth and business impact.
- Experience using AI-assisted development tools (e.g., GitHub Copilot, Cursor, or ChatGPT) as part of your software development workflow.
- Experience with service mesh (Linkerd, Istio) and traffic management at scale is a plus.
- Proficiency with GitOps (Argo CD, Flux CD) and CI/CD orchestration (GitHub Actions, Argo Workflows) is a plus.
- Experience with MLOps or LLMOps concepts (model versioning, evaluation frameworks, production monitoring for AI systems) is a plus.
- Familiarity with security frameworks relevant to AI systems (e.g., guardrails, audit logging, and data governance for LLMs) is a plus.
Benefits
Comp & perks
- healthcare
- internet and cell phone reimbursement
- learning and development stipend
- potential opportunities to travel to our Mountain View headquarters
ATS Keywords
Hard Skills & Tools
cloud infrastructure, AI-driven systems, infrastructure-as-code, AWS, Kubernetes, Terraform, Ansible, Python, Go, MLOps
Soft Skills
mentoring, leadership, cross-functional collaboration, communication, problem-solving, documentation, governance, initiative driving, feedback incorporation, stakeholder liaison
Certifications
Bachelor's degree in Computer Science, Master's degree in Engineering
