ARA

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: New Mexico, United States

About the role

  • Partner with software developers, platform engineers, and IT staff to improve system design, operability, deployment safety, and production support readiness.
  • Define and maintain operational standards, runbooks, support procedures, escalation paths, and service-level objectives.
  • Evaluate system architecture and changes to ensure they balance functional requirements, service quality, reliability, security, and compliance needs.
  • Drive continuous improvement in platform stability, maintenance, and availability.
  • Provide advanced technical support and troubleshooting for complex platform and service issues affecting internal users and stakeholders.

Requirements

  • 8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, or related infrastructure roles supporting production services.
  • Strong experience with Linux systems administration and troubleshooting in enterprise environments.
  • Strong experience operating and maintaining on-prem Kubernetes platforms and all related components including CRI, CNI, and CSI plugins.
  • Experience deploying and maintaining applications on Kubernetes using Helm, Kustomize, and similar tooling.
  • Experience supporting DevOps tooling such as GitLab, Artifactory, Jira, and Confluence.
  • Experience with GitOps tools such as FluxCD or ArgoCD.
  • Proficiency scripting with at least one of Python, Go, or Bash.
  • Strong experience designing, maintaining, and maturing observability tooling including monitoring, dashboards, logging and tracing, and supporting SLOs.
  • Strong understanding of reliability engineering concepts: service health indicators; high availability design, failure reduction, and testing; operational readiness practices, including developing documentation, runbooks, and architectural descriptions; and incident response, root cause analysis, and remediation/recovery.
  • Ability to obtain a security clearance, which requires U.S. citizenship.

Preferred

  • Experience with multiple Linux distributions including Ubuntu.
  • Experience with at least one of the following: Tanzu Kubernetes, Nutanix Kubernetes Platform, Canonical Kubernetes.
  • Experience with cloud platforms such as AWS and Azure.
  • Experience with infrastructure automation and configuration management.
  • Experience managing AI tooling on Kubernetes, including MCP servers, LLM platforms (vLLM, Ollama), and Kubeflow.
  • Experience with security and compliance considerations in regulated environments.
  • DoD experience.
  • Active or inactive Secret Security Clearance.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, Linux systems administration, Kubernetes, GitOps, Python, Go, Bash
Soft Skills
technical support, troubleshooting, continuous improvement, operational readiness, incident response, root cause analysis, documentation, communication
Certifications
Secret Security Clearance