
Principal Site Reliability Engineer
Zefr
full-time
Location Type: Hybrid
Location: Marina del Rey • California • United States
Salary
💰 $210,000 - $235,000 per year
About the role
- Support and build systems and tools that enable other engineers to generate, deploy, and manage product features and models both quickly and safely.
- Deploy and support a multi-cloud microservice architecture, including infrastructure tailored for ML workloads, deployed via GitHub Actions, Argo CD, and Kubernetes.
- Collaborate with other engineers to architect secure, resilient, scalable, and cost-efficient applications and ML systems/pipelines in AWS and GCP.
- Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams.
- Proactively maintain the health of production environments, including monitoring application performance and resource utilization.
- Participate in a 24/7 on-call rotation and respond to system performance issues and outages.
- Debug code at the application and infrastructure level.
- Mature our CI/CD workflows and release process.
- Maintain a forward-thinking approach by actively researching and proposing new solutions.
- Propose and review Engineering Requests for Comments (RFCs) to drive engineering architecture and practices.
Requirements
- 10+ years of experience designing, managing, deploying, and supporting cloud infrastructure in a production environment using major public cloud providers (GCP experience is a huge bonus)
- Experience in Advertising or AdTech
- Demonstrated technical leadership experience, including mentoring engineers, driving cross-functional projects, and influencing architectural decisions at an organizational level
- Knowledge of GitOps, including an understanding of modern CI/CD pipelines, techniques, and technologies (GitHub Actions, GitLab, CircleCI, Argo CD, Flux)
- Advanced proficiency with IaC and configuration management tools (Terraform, Terragrunt, OpenTofu, Crossplane, Pulumi)
- Deep production experience architecting, managing, deploying, and supporting container-based workloads on Kubernetes clusters
- Proven track record of building and scaling reliability practices, including SLO/SLI frameworks, incident management, and capacity planning.
- Extensive production experience with observability platforms and practices (Prometheus, Grafana, Chronosphere, Datadog, OpenTelemetry); ability to design monitoring strategies for complex distributed systems
- Strong knowledge of cloud networking (mesh, NAT, load balancers, API gateways, proxies, etc.), cloud security, and cost optimization strategies
- Exceptional written and verbal communication skills; ability to translate complex technical concepts for diverse audiences and build consensus across teams.
- Experience authoring technical strategy documents, RFCs, and architectural proposals.
Benefits
- Flexible PTO
- Medical, dental, and vision insurance with FSA options
- Company-paid life insurance
- Paid parental leave
- 401(k) with company match
- Professional development opportunities
- 13 paid holidays
- Summer Fridays (we leave early)
- In-office, hybrid, and fully remote work options available
- In-office lunches and lots of free food
- Optional in-person and virtual events (we like to celebrate!)
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Cloud Infrastructure, CI/CD, GitOps, IaC, Kubernetes, Terraform, Prometheus, Grafana, Incident Management, Capacity Planning
Soft Skills
Technical Leadership, Mentoring, Communication, Collaboration, Problem Solving, Continuous Improvement, Consensus Building, Research, Proactive Maintenance, Architectural Decision Making