Ensure reliability, scalability, and performance of services through SLIs/SLOs, capacity planning, and incident response
Drive automation of infrastructure operations to minimize toil
Develop and maintain monitoring, alerting, and observability systems that enable proactive issue detection
Partner with internal engineering teams to define service-level objectives, improve deployment workflows, and integrate infrastructure with development needs
Contribute to on-call rotations and incident management, helping ensure high availability of services
Drive post-incident reviews and blameless retrospectives to improve reliability
Stay current with emerging technologies and recommend improvements to existing systems and practices

Requirements

3+ years of experience as an SRE, DevOps Engineer, or Infrastructure Engineer
Solid experience with Kubernetes administration and tooling (e.g., Helm, ArgoCD, Kustomize)
Strong expertise in cloud platforms (e.g., AWS, GCP, or Azure)
Experience managing databases in production environments (e.g., backups, replication, tuning)
Proficiency in programming or scripting (e.g., Go, Python, Bash)
Deep understanding of CI/CD pipelines and infrastructure automation
Familiarity with monitoring/observability tools (e.g., Prometheus, Grafana)
Strong communication skills and ability to collaborate with software engineering teams

Benefits

Health insurance
Dental insurance
401(k) plan
Tuition reimbursement
Professional development reimbursement
Flexible work arrangements

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

Kubernetes, AWS, GCP, Azure, Go, Python, Bash, CI/CD, infrastructure automation, database management

Soft skills

communication, collaboration, incident management, reliability improvement, proactive issue detection, capacity planning, blameless retrospectives, team partnership, automation drive, emerging technology awareness