
Lead Software Engineer – AI Operations and Tooling
The Walt Disney Company
full-time
Location Type: Hybrid
Location: Glendale • California • 🇺🇸 United States
Salary
💰 $141,900 - $190,300 per year
Job Level
Senior
Tech Stack
AWS, Azure, Cloud, Go, Google Cloud Platform, Grafana, Java, Prometheus, Python
About the role
- Define frameworks for AI-specific operations: hallucination/quality testing, evaluation pipelines, and continuous validation.
- Establish reference patterns for scaling LLM services, prompt orchestration, and multi-agent workloads.
- Build automation for safe rollout, monitoring, and incident response.
- Implement end-to-end observability: latency, drift, failure modes, hallucination rates, and GPU/compute utilization.
- Drive cost optimization and efficiency across AI cloud usage (AWS, Azure, GCP).
- Define SLOs, dashboards, and runbooks for AI/LLM production systems.
- Embed compliance, safety checks, and prompt-injection defenses into operational frameworks.
- Mentor engineers in DevOps, infrastructure, and AI operations.
- Drive adoption of best practices for AI reliability, test automation, and incident management.
- Collaborate across AI Core, Data Foundations, Security, and Product teams to ensure operational safety and scale.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or related technical field (Master’s preferred), or equivalent experience.
- 7+ years of experience in software engineering, DevOps, or infrastructure, with at least 2 years in a lead role.
- Expert in at least one foundational language (Python, Java, or Go) with production-grade system experience.
- Hands-on experience with cloud-native infrastructure (AWS preferred; Azure/GCP a plus) and modern orchestration platforms.
- Proven experience with observability stacks (Datadog, Prometheus, Grafana) and incident response automation.
- Familiarity with AI/LLM APIs (OpenAI, Anthropic, Bedrock, Azure AI Foundry) and orchestration frameworks (LangChain, LangGraph).
- Strong knowledge of operational AI testing (A/B evaluation, regression, red-teaming) and guardrail enforcement.
- Demonstrated ability to optimize cloud/GPU usage and manage costs at scale.
- Excellent communication skills and proven ability to lead design reviews, mentor engineers, and influence cross-functional teams.
Benefits
- Health insurance
- 401(k) matching
- Flexible work hours
- Paid time off
- Remote work options
- Bonuses
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python, Java, Go, cloud-native infrastructure, observability stacks, incident response automation, AI/LLM APIs, operational AI testing, cost optimization, test automation
Soft skills
communication skills, leadership, mentoring, collaboration, influence, design reviews