
Staff Software Engineer, Site Reliability (SRE)
Character.AI
full-time
Location Type: Hybrid
Location: San Francisco • California • 🇺🇸 United States
Salary
💰 $150,000 - $300,000 per year
Job Level
Lead
Tech Stack
Cloud, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Node.js, Prometheus, Python, SQL, Terraform
About the role
- Maintain production services and keep them operational.
- Develop tools, instrumentation, and automation to monitor and optimize the performance and reliability of our service.
- Develop, implement and maintain automation tools and processes to prevent and mitigate service disruptions.
- Collaborate with development teams to design and implement scalable, reliable systems and CI/CD processes for deployment.
- Establish and support SLAs and SLOs for our site.
- Provide system monitoring and incident alerts.
- Participate in on-call rotations to provide support for critical incidents and outages.
- Develop plans for site reliability and disaster recovery.
Requirements
- 5+ years of experience in a development-focused DevOps/SRE role within a technology organization operating at significant scale
- Deep experience with and proven success in developing software tools and automation wherever needed using Python and Golang
- Expertise with SQL, Linux, CI/CD, Kubernetes, and Terraform to support a site/application within a large multi-node infrastructure and a growing user base
- Experience working with multiple cloud computing platforms such as GCP is also a must
- Demonstrated ability to troubleshoot technical issues successfully and reliably across a range of platforms and systems
- Experience with incident management and event postmortems
- Outstanding candidates will have one or more of the following:
- Familiarity with GPU clusters and/or HPC environments
- Experience with monitoring and logging tools such as Prometheus and Grafana
- Hands-on experience scaling a consumer product from early days into hypergrowth
Benefits
- 🩺 Top-notch health coverage for you & your family, with the majority of the premium covered
- 💰 We invest in your future with a generous 401(k) contribution
- 🍼 New parents, we've got you covered with incredible paid leave of up to 20 weeks
- 🌴 4 weeks of PTO to explore, unwind & come back recharged
- 🍽️ Daily in-office catering plus a monthly DoorDash stipend to help keep you fueled no matter where you are
- ✨ Monthly wellness stipend to support you in your health journey
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python, Golang, SQL, Linux, CI/CD, Kubernetes, Terraform, site reliability, disaster recovery, automation
Soft skills
collaboration, troubleshooting, incident management, communication, problem-solving