
Senior Site Reliability Engineer
ARA
full-time
Location Type: Remote
Location: New Mexico • United States
About the role
- Partner with software developers, platform engineers, and IT staff to improve system design, operability, deployment safety, and production support readiness.
- Define and maintain operational standards, runbooks, support procedures, escalation paths, and service-level objectives.
- Evaluate system architecture and changes to ensure they balance functional requirements, service quality, reliability, security, and compliance needs.
- Drive continuous improvement in platform stability, maintenance, and availability.
- Provide advanced technical support and troubleshooting for complex platform and service issues affecting internal users and stakeholders.
Requirements
- 8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, or related infrastructure roles supporting production services.
- Strong experience with Linux systems administration and troubleshooting in enterprise environments.
- Strong experience operating and maintaining on-prem Kubernetes platforms and all related components including CRI, CNI, and CSI plugins.
- Experience deploying and maintaining applications on Kubernetes using Helm, Kustomize, and similar tooling.
- Experience supporting DevOps tooling such as GitLab, Artifactory, Jira, and Confluence.
- Experience with GitOps tools such as FluxCD or ArgoCD.
- Proficiency scripting with at least one of Python, Go, or Bash.
- Strong experience designing, maintaining, and maturing observability tooling including monitoring, dashboards, logging and tracing, and supporting SLOs.
- Strong understanding of reliability engineering concepts:
  - Service health indicators
  - High availability design, failure reduction, and testing
  - Operational readiness practices, including developing documentation, runbooks, and architectural descriptions
  - Incident response, root cause analysis, remediation/recovery
- Ability to obtain a security clearance, which includes U.S. citizenship.
Preferred
- Experience with multiple Linux distributions including Ubuntu.
- Experience with at least one of the following: Tanzu Kubernetes, Nutanix Kubernetes Platform, Canonical Kubernetes.
- Experience with cloud platforms such as AWS and Azure.
- Experience with infrastructure automation and configuration management.
- Experience managing AI tooling on Kubernetes including MCP Servers, LLM platforms (vLLM, Ollama), Kubeflow.
- Experience with security and compliance considerations in regulated environments.
- DoD experience.
- Active or inactive Secret Security Clearance.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, Linux systems administration, Kubernetes, GitOps, Python, Go, Bash
Soft Skills
technical support, troubleshooting, continuous improvement, operational readiness, incident response, root cause analysis, documentation, communication
Certifications
Secret Security Clearance