
Senior Site Reliability Engineer
Barti
full-time
Location Type: Remote
Location: Remote • 🇺🇸 United States
Salary
💰 $150,000 - $200,000 per year
Job Level
Senior
Tech Stack
Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform
About the role
- Lead and participate in the design, implementation, and maintenance of highly available and scalable infrastructure.
- Monitor system health and performance metrics, and plan capacity to ensure optimal performance.
- Establish and track SLIs, SLOs, and error budgets to measure and improve system reliability.
- Design and implement Infrastructure as Code (IaC) solutions using tools like Terraform, Pulumi, or CloudFormation.
- Build and maintain CI/CD pipelines to enable rapid, safe deployments.
- Automate operational tasks and eliminate toil through scripting and tooling.
- Lead incident response efforts, including on-call rotation, post-mortem analysis, and remediation.
- Debug and resolve complex production issues across the entire stack.
- Implement monitoring, alerting, and observability solutions to detect and prevent issues proactively.
- Provide technical leadership and mentorship to engineers on reliability and infrastructure best practices.
- Collaborate with cross-functional teams, including Engineering and Product, to ensure reliable product delivery.
- Lead the technical design of infrastructure solutions, ensuring alignment with architectural principles and business goals.
- Stay updated with emerging technologies and industry trends in SRE, DevOps, and cloud infrastructure.
- Propose and drive the adoption of best practices, tools, and processes to enhance system reliability and developer productivity.
- Conduct chaos engineering experiments and disaster recovery drills to validate system resilience.
- Implement and maintain security best practices across infrastructure and applications.
- Manage secrets, access controls, and security monitoring systems.
- Foster a collaborative environment within the engineering team and across departments.
- Clearly communicate technical concepts and system health to both technical and non-technical stakeholders.
- Work closely with engineering teams to define reliability requirements and ensure operational excellence.
Requirements
- 5+ years (ideally 7+) of relevant work experience in Site Reliability Engineering, DevOps, or Infrastructure roles
- 1+ years of hands-on experience with Python, Go, or Bash scripting
- Experience with cloud platforms (ideally GCP) and container orchestration (Kubernetes, Docker)
- Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, or similar)
- Strong understanding of Linux systems, networking, and distributed systems
- Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, or similar)
- Excellent problem-solving and communication skills
- Able to work independently and as part of a team
Benefits
- Be part of a mission-driven, rapidly scaling company changing the future of eye care
- Work remotely from anywhere in the U.S.
- Collaborate with a passionate, fun, and supportive team
- Competitive salary: $150,000 - $200,000 per year
- Equity in a fast-growing startup
- Health, vision, and dental benefits
- Unlimited PTO
- Annual professional development stipend
- A high-impact role with plenty of room for growth, ownership, and creativity
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Infrastructure as Code, Python, Go, Bash scripting, Kubernetes, Docker, Terraform, CloudFormation, Linux systems, monitoring and observability
Soft skills
problem-solving, communication, technical leadership, mentorship, collaboration, independence, teamwork