
Senior Site Reliability Engineer
Wand AI
full-time
Location Type: Remote
Location: United States
About the role
- Build, maintain, and operate scalable production infrastructure.
- Own reliability and availability for key services and environments.
- Contribute to the design and operation of Kubernetes-based infrastructure.
- Develop and maintain Infrastructure-as-Code frameworks (e.g., Terraform).
- Improve monitoring, alerting, and observability across systems.
- Participate in on-call rotations and respond to production incidents.
- Investigate root causes of incidents and contribute to postmortems and reliability improvements.
- Improve system performance, availability, and fault tolerance.
- Contribute to CI/CD pipeline improvements to increase release safety and predictability.
- Support the deployment and operation of data platforms and ML workloads.
- Help standardize environments and infrastructure across internal systems and customer deployments.
- Troubleshoot issues across infrastructure, services, and deployment pipelines.
- Work closely with QA and engineering teams to improve production readiness and release stability.
- Contribute to automation efforts that reduce operational toil.
Requirements
- Strong hands-on experience in Site Reliability Engineering or DevOps roles.
- Experience working with cloud infrastructure (AWS preferred).
- Experience operating production systems and responding to incidents.
- Experience with Kubernetes in production environments.
- Strong experience with Infrastructure-as-Code (Terraform or similar).
- Experience working with CI/CD pipelines and deployment automation.
- Experience with monitoring, logging, and observability tooling.
- Strong troubleshooting and debugging skills in distributed systems.
- Experience supporting data platforms or ML workloads in production environments.
- Strong collaboration and communication skills.
Benefits
- Flexible working hours
- Professional development opportunities
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering, DevOps, Kubernetes, Infrastructure-as-Code, Terraform, CI/CD pipelines, monitoring, logging, observability, troubleshooting
Soft Skills
collaboration, communication, problem-solving, incident response, root cause analysis, reliability improvement, production readiness, release stability, automation, teamwork