
Site Reliability Engineer
SafeRide Health
full-time
Location Type: Remote
Location: United States
About the role
- Keeping systems and services running smoothly with minimal downtime by focusing on availability, reliability, and scalability.
- Developing and maintaining tools and scripts to automate repetitive tasks such as deployments, configuration management, and monitoring.
- Implementing and managing monitoring and alerting systems to provide visibility into system performance and quickly detect potential issues.
- Responding to, diagnosing, and resolving system incidents, including conducting post-mortems to prevent future occurrences.
- Monitoring system resource usage to forecast future needs and scale systems accordingly to handle increasing user load.
- Collaborating with stakeholders to identify operational risks and implementing strategies to reduce their likelihood and impact.
- Analyzing metrics from operating systems and applications to identify areas for performance improvement.
Requirements
- Minimum of 5 years of progressive experience in an IT, Software Engineering, Technology Operations, or Business Continuity role.
- Minimum of 2 years of hands-on experience in a Site Reliability, DevOps, or IT Observability role.
- Demonstrated proficiency with production monitoring and alerting tools (Datadog is a major plus!).
- Basic proficiency working in a containerized AWS environment managed with infrastructure as code.
Benefits
- Competitive compensation and performance-based bonus potential
- Full medical, dental, and vision coverage
- Generous PTO and paid company holidays
- 401(k) with employer match
- Paid parental leave and family support benefits
Applicant Tracking System Keywords
Hard Skills & Tools
automation, configuration management, monitoring, incident response, performance analysis, infrastructure as code, site reliability, DevOps, IT observability, production monitoring
Soft Skills
collaboration, problem-solving, risk management, communication, analytical thinking