Barti

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • 🇺🇸 United States

Salary

💰 $150,000 - $200,000 per year

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform

About the role

  • Lead and participate in the design, implementation, and maintenance of highly available and scalable infrastructure.
  • Monitor system health and performance metrics, and plan capacity to ensure optimal performance.
  • Establish and track SLIs, SLOs, and error budgets to measure and improve system reliability.
  • Design and implement Infrastructure as Code (IaC) solutions using tools like Terraform, Pulumi, or CloudFormation.
  • Build and maintain CI/CD pipelines to enable rapid, safe deployments.
  • Automate operational tasks and eliminate toil through scripting and tooling.
  • Lead incident response efforts, including on-call rotation, post-mortem analysis, and remediation.
  • Debug and resolve complex production issues across the entire stack.
  • Implement monitoring, alerting, and observability solutions to detect and prevent issues proactively.
  • Provide technical leadership and mentorship to engineers on reliability and infrastructure best practices.
  • Collaborate with cross-functional teams, including Engineering and Product, to ensure reliable product delivery.
  • Lead the technical design of infrastructure solutions, ensuring alignment with architectural principles and business goals.
  • Stay updated with emerging technologies and industry trends in SRE, DevOps, and cloud infrastructure.
  • Propose and drive the adoption of best practices, tools, and processes to enhance system reliability and developer productivity.
  • Conduct chaos engineering experiments and disaster recovery drills to validate system resilience.
  • Implement and maintain security best practices across infrastructure and applications.
  • Manage secrets, access controls, and security monitoring systems.
  • Foster a collaborative environment within the engineering team and across departments.
  • Clearly communicate technical concepts and system health to both technical and non-technical stakeholders.
  • Work closely with engineering teams to define reliability requirements and ensure operational excellence.

Requirements

  • 5+ years (ideally 7+) of relevant work experience in Site Reliability Engineering, DevOps, or Infrastructure roles
  • 1+ years of hands-on experience with Python, Go, or Bash scripting
  • Experience with cloud platforms (ideally GCP) and container orchestration (Kubernetes, Docker)
  • Proficiency with Infrastructure as Code tools (Terraform, CloudFormation, or similar)
  • Strong understanding of Linux systems, networking, and distributed systems
  • Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, or similar)
  • Excellent problem-solving and communication skills
  • Able to work independently and as part of a team

Benefits

  • Be part of a mission-driven, rapidly scaling company changing the future of eye care
  • Work remotely from anywhere in the U.S.
  • Collaborate with a passionate, fun, and supportive team
  • Competitive salary: $150,000 - $200,000 per year
  • Equity in a fast-growing startup
  • Health, vision, and dental benefits
  • Unlimited PTO
  • Annual professional development stipend
  • A high-impact role with plenty of room for growth, ownership, and creativity

Applicant Tracking System Keywords

Hard skills
Infrastructure as Code, Python, Go, Bash scripting, Kubernetes, Docker, Terraform, CloudFormation, Linux systems, monitoring and observability
Soft skills
problem-solving, communication, technical leadership, mentorship, collaboration, independence, teamwork