
Site Reliability Engineer, Artificial Intelligence Engineer
Leidos
full-time
Location Type: Remote
Location: California • Hawaii • United States
Salary
💰 $131,300 - $237,350 per year
About the role
- Design, develop, and maintain AI/ML models for anomaly detection, trend analysis, and signal correlation across metrics, logs, traces, and events.
- Reduce alert noise through intelligent alert grouping, suppression, and prioritization.
- Enhance observability platforms with AI-generated insights supporting SLO and error-budget management.
- Implement AI-driven incident classification, enrichment, and summarization.
- Recommend probable root causes based on historical and real-time telemetry.
- Support on-call and incident response teams with AI-guided remediation suggestions.
- Contribute AI insights to post-incident reviews and reliability improvement plans.
- Apply AI techniques to identify repetitive operational tasks and automation opportunities.
- Assist in generating, validating, and optimizing automation playbooks and workflows.
- Analyze automation execution data to improve success rates, resiliency, and reuse.
- Build and maintain AI-searchable knowledge repositories containing runbooks, SOPs, lessons learned, and historical incident data.
- Enable natural-language access to operational knowledge for SREs and operations staff.
- Develop predictive models for capacity planning, failure forecasting, configuration risk, and reliability debt identification.
- Support proactive remediation strategies to prevent incidents before customer impact.
- Assist SRE leadership in data-driven prioritization of reliability investments.
- Ensure AI solutions adhere to organizational security, compliance, and data-handling policies.
- Establish guardrails for AI recommendations and automation execution.
- Promote transparency, explainability, and auditability of AI-driven operational decisions.
Requirements
- Bachelor’s degree in Computer Science, Engineering, Information Systems, Data Science, or a related discipline
- 5+ years in Site Reliability Engineering, DevOps, IT Operations, or Systems Engineering
- 2+ years applying AI/ML techniques in operational, analytics, or automation contexts
- Demonstrated experience supporting production systems in high-availability environments
- Active Secret clearance (required to be considered for the position)
- Proficiency in data analysis tooling
- Experience with machine learning fundamentals (anomaly detection, clustering, time-series analysis, NLP)
- Familiarity with observability platforms (metrics, logs, traces, events)
- Experience with automation frameworks and infrastructure-as-code concepts
- Strong understanding of distributed systems and operational telemetry
Benefits
- Competitive compensation
- Health and Wellness programs
- Income Protection
- Paid Leave
- Retirement
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AI/ML models, anomaly detection, trend analysis, signal correlation, incident classification, automation playbooks, predictive models, data analysis, machine learning fundamentals, infrastructure-as-code
Soft Skills
data-driven prioritization, communication, collaboration, problem-solving, transparency, explainability, auditability, root-cause analysis, incident response, reliability improvement
Certifications
Bachelor’s degree, Secret Clearance