CodeRabbit

Site Reliability Engineer – Platform

Employment Type: Full-time

Location Type: Hybrid

Location: Bay Area, California, United States

About the role

  • Design, implement, and maintain scalable infrastructure on Google Cloud Platform to support CodeRabbit's growing user base and processing demands
  • Own and operate critical platform services
  • Build and maintain Infrastructure as Code using Terraform to ensure consistent, reproducible, and version-controlled infrastructure deployments
  • Establish and maintain SLI/SLO frameworks for all critical services, ensuring we meet our reliability commitments to users
  • Implement comprehensive monitoring, alerting, and observability solutions using Datadog and custom instrumentation
  • Lead incident response, root cause analysis, and post-mortems to continuously improve system reliability
  • Optimize application and infrastructure performance to handle millions of pull request analyses with minimal latency
  • Develop self-service platforms and tooling that empower engineering teams to deploy, monitor, and troubleshoot their services independently
  • Integrate security best practices into all infrastructure and platform services
  • Establish and maintain disaster recovery procedures and business continuity planning

Requirements

  • 6-8 years of hands-on experience in Site Reliability Engineering, Platform Engineering, or DevOps Engineering roles
  • Proven track record of managing production systems at scale, preferably in high-growth technology companies
  • Experience with cloud platforms, particularly AWS or Google Cloud Platform (GCP), including compute, storage, networking, and managed services
  • Strong background in containerization and orchestration platforms (Kubernetes, Docker)
  • Proficiency in Node.js and TypeScript for building automation tools, monitoring solutions, and platform services
  • Advanced experience with Terraform for infrastructure provisioning and management
  • Hands-on experience with Datadog or similar platforms (Prometheus, Grafana, ELK stack) for observability
  • Comprehensive experience with GCP services including Compute Engine, GKE, Cloud Run, Cloud SQL, Cloud Storage, Load Balancing, and IAM
  • Strong Linux/Unix systems skills
  • Experience with network protocols, load balancing, and CDN technologies
  • Knowledge of security principles and best practices for cloud infrastructure
  • Familiarity with CI/CD tools and practices (Jenkins, GitLab CI, GitHub Actions)
  • Understanding of microservices architecture and distributed systems principles

Benefits

  • Competitive salary, equity, and benefits
  • Professional development opportunities