Wand AI

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: United States

About the role

  • Build, maintain, and operate scalable production infrastructure.
  • Own reliability and availability for key services and environments.
  • Contribute to the design and operation of Kubernetes-based infrastructure.
  • Develop and maintain Infrastructure-as-Code frameworks (e.g., Terraform).
  • Improve monitoring, alerting, and observability across systems.
  • Participate in on-call rotations and respond to production incidents.
  • Investigate root causes of incidents and contribute to postmortems and reliability improvements.
  • Improve system performance, availability, and fault tolerance.
  • Contribute to CI/CD pipeline improvements to increase release safety and predictability.
  • Support the deployment and operation of data platforms and ML workloads.
  • Help standardize environments and infrastructure across internal systems and customer deployments.
  • Troubleshoot issues across infrastructure, services, and deployment pipelines.
  • Work closely with QA and engineering teams to improve production readiness and release stability.
  • Contribute to automation efforts that reduce operational toil.

Requirements

  • Strong hands-on experience in Site Reliability Engineering or DevOps roles.
  • Experience working with cloud infrastructure (AWS preferred).
  • Experience operating production systems and responding to incidents.
  • Experience with Kubernetes in production environments.
  • Strong experience with Infrastructure-as-Code (Terraform or similar).
  • Experience working with CI/CD pipelines and deployment automation.
  • Experience with monitoring, logging, and observability tooling.
  • Strong troubleshooting and debugging skills in distributed systems.
  • Experience supporting data platforms or ML workloads in production environments.
  • Strong collaboration and communication skills.

Benefits

  • Flexible working hours
  • Professional development opportunities

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Site Reliability Engineering, DevOps, Kubernetes, Infrastructure-as-Code, Terraform, CI/CD pipelines, monitoring, logging, observability, troubleshooting
Soft Skills
collaboration, communication, problem-solving, incident response, root cause analysis, reliability improvement, production readiness, release stability, automation, teamwork