The Walt Disney Company

Lead Software Engineer – AI Operations and Tooling

Full-time

Location Type: Hybrid

Location: Glendale • California • 🇺🇸 United States

Salary

💰 $141,900 - $190,300 per year

Job Level

Senior

Tech Stack

AWS • Azure • Cloud • Go • Google Cloud Platform • Grafana • Java • Prometheus • Python

About the role

  • Define frameworks for AI-specific operations: hallucination/quality testing, evaluation pipelines, and continuous validation.
  • Establish reference patterns for scaling LLM services, prompt orchestration, and multi-agent workloads.
  • Build automation for safe rollout, monitoring, and incident response.
  • Implement end-to-end observability: latency, drift, failure modes, hallucination rates, and GPU/compute utilization.
  • Drive cost optimization and efficiency across AI cloud usage (AWS, Azure, GCP).
  • Define SLOs, dashboards, and runbooks for AI/LLM production systems.
  • Embed compliance, safety checks, and prompt-injection defenses into operational frameworks.
  • Mentor engineers in DevOps, infra, and AI operations.
  • Drive adoption of best practices for AI reliability, test automation, and incident management.
  • Collaborate across AI Core, Data Foundations, Security, and Product teams to ensure operational safety and scale.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related technical field (Master’s preferred), or equivalent experience.
  • 7+ years of experience in software engineering, DevOps, or infrastructure, with at least 2 years in a lead role.
  • Expert in at least one foundational language (Python, Java, or Go) with production-grade system experience.
  • Hands-on experience with cloud-native infrastructure (AWS preferred; Azure/GCP a plus) and modern orchestration platforms.
  • Proven experience with observability stacks (Datadog, Prometheus, Grafana) and incident response automation.
  • Familiarity with AI/LLM APIs (OpenAI, Anthropic, Bedrock, Azure AI Foundry) and orchestration frameworks (LangChain, LangGraph).
  • Strong knowledge of operational AI testing (A/B evaluation, regression, red-teaming) and guardrail enforcement.
  • Demonstrated ability to optimize cloud/GPU usage and manage costs at scale.
  • Excellent communication skills and proven ability to lead design reviews, mentor engineers, and influence cross-functional teams.

Benefits

  • Health insurance
  • 401(k) matching
  • Flexible work hours
  • Paid time off
  • Remote work options
  • Bonuses

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Python • Java • Go • cloud-native infrastructure • observability stacks • incident response automation • AI/LLM APIs • operational AI testing • cost optimization • test automation
Soft skills
communication skills • leadership • mentoring • collaboration • influence • design reviews