
Site Reliability Engineer – Platform
CodeRabbit
Full-time
Location Type: Hybrid
Location: Bay Area • California • United States
About the role
- Design, implement, and maintain scalable infrastructure on Google Cloud Platform to support CodeRabbit's growing user base and processing demands
- Own and operate critical platform services
- Build and maintain Infrastructure as Code using Terraform to ensure consistent, reproducible, and version-controlled infrastructure deployments
- Establish and maintain SLI/SLO frameworks for all critical services, ensuring we meet our reliability commitments to users
- Implement comprehensive monitoring, alerting, and observability solutions using Datadog and custom instrumentation
- Lead incident response, root cause analysis, and post-mortems to continuously improve system reliability
- Optimize application and infrastructure performance to handle millions of pull request analyses with minimal latency
- Develop self-service platforms and tooling that empower engineering teams to deploy, monitor, and troubleshoot their services independently
- Integrate security best practices into all infrastructure and platform services
- Establish and maintain disaster recovery procedures and business continuity planning
Requirements
- 6-8 years of hands-on experience in Site Reliability Engineering, Platform Engineering, or DevOps Engineering roles
- Proven track record of managing production systems at scale, preferably in high-growth technology companies
- Experience with cloud platforms, particularly AWS or Google Cloud Platform (GCP), including compute, storage, networking, and managed services
- Strong background in containerization and orchestration platforms (Kubernetes, Docker)
- Proficiency in Node.js and TypeScript for building automation tools, monitoring solutions, and platform services
- Advanced experience with Terraform for infrastructure provisioning and management
- Hands-on experience with Datadog or similar platforms (Prometheus, Grafana, ELK stack) for observability
- Comprehensive experience with GCP services including Compute Engine, GKE, Cloud Run, Cloud SQL, Cloud Storage, Load Balancing, and IAM
- Strong Linux/Unix systems skills
- Experience with network protocols, load balancing, and CDN technologies
- Knowledge of security principles and best practices for cloud infrastructure
- Familiarity with CI/CD tools and practices (Jenkins, GitLab CI, GitHub Actions)
- Understanding of microservices architecture and distributed systems principles
Benefits
- Competitive salary, equity, and benefits
- Professional development opportunities
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering, Platform Engineering, DevOps Engineering, Google Cloud Platform, AWS, Terraform, Node.js, TypeScript, Kubernetes, Docker
Soft Skills
incident response, root cause analysis, post-mortem processes, system reliability improvement, disaster recovery planning, business continuity planning