
Staff Site Reliability Engineer
Zefr
full-time
Location Type: Hybrid
Location: Marina del Rey • California • United States
Salary
💰 $190,000 - $210,000 per year
About the role
- Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.
- Deploy and support a multi-cloud microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.
- Collaborate with other engineers, particularly the Machine Learning team, to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.
- Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
- Proactively maintain the health of production environments, including monitoring application performance and resource utilization.
- Participate in a 24/7 on-call rotation and respond to system performance issues and outages.
- Debug code at the application and infrastructure level.
- Mature our CI/CD workflows and release process.
- Maintain a forward-thinking approach, actively researching and proposing new solutions.
- Propose and review Engineering Requests for Comments (RFCs) to drive Engineering architecture and practices.
Requirements
- 7+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (GCP experience is a huge bonus)
- Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
- Proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi)
- Production experience architecting, managing, deploying, and supporting container-based workloads on Kubernetes clusters
- Strong problem-solving experience, focusing on automation
- Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning
- Extensive production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry); ability to design monitoring strategies for complex distributed systems
- Knowledge of cloud networking (mesh, NAT, load balancers, API gateways, proxies, etc.), cloud security, and cost optimization strategies
- Strong written and verbal communication, organization, and documentation skills
Benefits
- Flexible PTO
- Medical, dental, and vision insurance with FSA options
- Company-paid life insurance
- Paid parental leave
- 401(k) with company match
- Professional development opportunities
- 10+ paid holidays
- Summer Fridays (we leave early)
- In-office, hybrid, and fully remote work options available
- In-office lunches and lots of free food
- Optional in-person and virtual events (we like to celebrate!)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Cloud Infrastructure, CI/CD, GitOps, IaC, Kubernetes, Terraform, Prometheus, Grafana, AWS, GCP
Soft Skills
problem-solving, communication, organization, documentation, collaboration, continuous improvement, proactive maintenance, incident management, capacity planning, forward-thinking