Senior Site Reliability Engineer

Instacart

Posted 3/31/2026 • Full-time • Remote • California, Colorado, Connecticut, District of Columbia, Hawaii, Illinois, Maine, Maryland, Massachusetts, New Hampshire, New Jersey, New York, Oregon, Pennsylvania, Rhode Island, Texas, Vermont, Virginia, Washington • 🇺🇸 United States • Senior • 💰 $155,000 - $195,500 per year

Tech Stack

Tools & technologies

AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Kubernetes, Ruby

About the role

Key responsibilities & impact

Develop scalable infrastructure strategies to ensure high availability, align infrastructure planning with product roadmaps, and optimize cost, risk, and performance with cloud providers.
Establish and lead incident management protocols and response plans to coordinate rapid responses, investigate root causes, prevent recurrence, and collaborate with security teams to test response readiness and address security risks.
Continuously monitor performance metrics and trends to proactively identify reliability risks. Regularly refine SLOs, SLIs, and error budgets to align with evolving standards, and use data insights to propose improvement plans and architectural updates that enhance system reliability.
Oversee regular system evaluations to identify and correct process shortcomings, and lead cross-functional projects that promote system optimization and minimize technical debt. Collaborate with product and engineering teams to ensure system enhancements align with user requirements.
Design and deploy automation tools to streamline deployment and operations. Oversee the continuous improvement of automation scripts and frameworks, and rigorously monitor automated systems for performance and reliability. Resolve issues in automated environments promptly to reduce disruptions.
Provide technical guidance to junior colleagues, fostering a collaborative culture for problem-solving and innovation. Organize and lead knowledge-sharing sessions and coordinate training in site reliability best practices to enhance team proficiency.

Requirements

What you’ll need

Proven experience in programming
Robust knowledge of incident management processes and tools
Exemplary troubleshooting and problem-solving skills
Ability to work under pressure and prioritize tasks during high-stress situations
Expertise in scaling application infrastructure for high availability
Proficiency in Ruby or Go
Experience with cloud platforms (e.g., AWS, GCP, Azure) and containerization (e.g., Docker, Kubernetes)
Skill in risk assessment for foundational infrastructure changes
Experience in monitoring system performance and trend analysis

Benefits

Comp & perks

Highly market-competitive compensation
New hire equity grant
Annual refresh grants

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

programming, incident management, troubleshooting, problem-solving, scaling application infrastructure, Ruby, Go, cloud platforms, containerization, monitoring system performance

Soft Skills

ability to work under pressure, prioritization, collaboration, knowledge sharing, training coordination