Leidos

Site Reliability Engineer, Artificial Intelligence Engineer

full-time

Location Type: Remote

Location: California, Hawaii, United States

Salary

💰 $131,300 - $237,350 per year

About the role

  • Design, develop, and maintain AI/ML models for anomaly detection, trend analysis, and signal correlation across metrics, logs, traces, and events.
  • Reduce alert noise through intelligent alert grouping, suppression, and prioritization.
  • Enhance observability platforms with AI-generated insights supporting SLO and error-budget management.
  • Implement AI-driven incident classification, enrichment, and summarization.
  • Provide probable root-cause analysis recommendations based on historical and real-time telemetry.
  • Support on-call and incident response teams with AI-guided remediation suggestions.
  • Contribute AI insights to post-incident reviews and reliability improvement plans.
  • Apply AI techniques to identify repetitive operational tasks and automation opportunities.
  • Assist in generating, validating, and optimizing automation playbooks and workflows.
  • Analyze automation execution data to improve success rates, resiliency, and reuse.
  • Build and maintain AI-searchable knowledge repositories containing runbooks, SOPs, lessons learned, and historical incident data.
  • Enable natural-language access to operational knowledge for SREs and operations staff.
  • Develop predictive models for capacity planning, failure forecasting, configuration risk, and reliability debt identification.
  • Support proactive remediation strategies to prevent incidents before customer impact.
  • Assist SRE leadership in data-driven prioritization of reliability investments.
  • Ensure AI solutions adhere to organizational security, compliance, and data-handling policies.
  • Establish guardrails for AI recommendations and automation execution.
  • Promote transparency, explainability, and auditability of AI-driven operational decisions.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, Information Systems, Data Science, or a related discipline
  • 5+ years in Site Reliability Engineering, DevOps, IT Operations, or Systems Engineering
  • 2+ years applying AI/ML techniques in operational, analytics, or automation contexts
  • Demonstrated experience supporting production systems in high-availability environments
  • Active Secret Clearance (required to be considered for the position)
  • Proficiency in data analysis tooling
  • Experience with machine learning fundamentals (anomaly detection, clustering, time-series analysis, NLP)
  • Familiarity with observability platforms (metrics, logs, traces, events)
  • Experience with automation frameworks and infrastructure-as-code concepts
  • Strong understanding of distributed systems and operational telemetry

Benefits

  • Competitive compensation
  • Health and Wellness programs
  • Income Protection
  • Paid Leave
  • Retirement

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
AI/ML models, anomaly detection, trend analysis, signal correlation, incident classification, automation playbooks, predictive models, data analysis, machine learning fundamentals, infrastructure-as-code
Soft Skills
data-driven prioritization, communication, collaboration, problem-solving, transparency, explainability, auditability, root-cause analysis, incident response, reliability improvement
Certifications
Bachelor’s degree, Secret Clearance